If you have a multi-core desktop machine, access to a server with
multiple CPUs, or perhaps even a supercomputer, this note may be of
interest to you in your Chandra analysis. If not, feel free to hit
DELETE now.

In a previous note and paper

   http://asc.harvard.edu/chandra-users/0428.html
   http://arxiv.org/abs/astro-ph/0510688

we described how PVM has been used on a network of workstations to
parallelize X-ray modeling and analysis in ISIS. While successful, to
date that approach requires a fair amount of coding on the part of the
end user. It also leaves unexplored the fact that more and more desktop
machines today contain multiple CPUs or cores, and that many data
centers operate multi-CPU servers or Beowulf clusters, while at the
same time very few widely used astronomy software packages exploit
these extra processors for common analysis tasks.

We are therefore pleased to announce a new method by which users may,
with relative ease, exploit the emerging multicore reality for
S-Lang-based modeling and analysis in ISIS. This capability comes from
the -openmp switch in version 1.9.2 of the SLIRP code generator, now
available at

   http://space.mit.edu/cxc/slirp

For example, consider

   unix% slirp -openmp par.h

where par.h contains

   double cos(double x);
   double sin(double x);
   double log(double x);
   double atof(const char *s);

The generated module will be vectorized for parallelization by
OpenMP-aware compilers such as Intel 9.1, Sun Studio 9 or later, and
prerelease versions of GCC 4.2 or 4.3. Here's a sample run of ISIS
utilizing 4 of 8 750 MHz CPUs on a Solaris 5.9 machine:

   isis> x = [-PI*100000: PI*100000: .05]
   isis> tic; c = cos(x); toc
   6.43994
   isis> import("par")

This import replaces the built-in cos(), sin(), etc. functions with
parallelized versions. Keeping the same names lets the body of
analysis scripts remain unchanged, regardless of whether they execute
in a sequential or parallel context. Repeating the test from above

   isis> tic; pc = cos(x); toc
   1.62847

reveals that on these CPUs our parallelized cos() is nearly 4X faster
for the given array size, while yielding the same numerical result:

   isis> length( where(c != pc) )
   0

The benefits extend to more involved expressions, too, such as

   isis> tic; () = sin(x) + cos(x); toc
   14.0862
   isis> import("par")
   isis> tic; () = sin(x) + cos(x); toc
   4.13241

Similar constructs are relatively common in analysis scripts, such as
ISIS models implemented in pure S-Lang, where the core of the model
may be computed scores, hundreds, or even thousands of times (e.g.
during confidence analysis). Speedups like those shown above would
accumulate into significant savings over the life of such a
long-running computation; a sketch of such a model appears below, and
a more detailed and topical example, taken directly from experience at
our institute, is appended at the end of this note.

What makes this new capability exciting is not that it guarantees some
amazing factor of N speedup on N CPUs, because it doesn't: speedup is
highly dependent upon the structure and size of the problem, as well
as the speed of the CPUs utilized, and optimal (i.e. linear) speedups
are not the norm. Rather, the importance is in how little the end user
needs to do, in terms of learning about threading or other forms of
parallel programming or rewriting algorithms and scripts, in order to
gain at least *some* speedup.
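To make the model-function case concrete, here is a minimal sketch of
a hypothetical ISIS model written in pure S-Lang (the model, its name,
and its parameters are invented purely for illustration). Its body
calls sin(), cos(), and log(); after import("par") those names resolve
to the parallelized wrappers, so every evaluation during a fit or
confidence run speeds up without changing a line of the model:

   % Hypothetical pure-S-Lang model, for illustration only.  ISIS will
   % call toy_model_fit() each time the fit function is evaluated.
   define toy_model_fit (lo, hi, par)
   {
      variable x = (lo + hi) / 2.0;   % bin midpoints on the grid
      return par[0] * sin (par[1]*x) * cos (par[2]*x)
             + par[3] * log (1 + x);
   }
   add_slang_function ("toy_model", ["norm", "f1", "f2", "bg"]);
   fit_fun ("toy_model(1)");

Nothing in this script mentions threads or OpenMP; whether the trig
and log calls inside it run sequentially or in parallel is decided
solely by which module was imported beforehand.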
Plots of speedup as a function of array size, for several arithmetic
operations on the two machines used in the examples here, are given in

   http://space.mit.edu/cxc/slirp/multicore.pdf

They suggest relatively small inflection points (array sizes beyond
which threading to multiple CPUs yields at least some speedup), and
that faster processors tend to require larger arrays. The presentation
also discusses several limitations of the approach.

Although you can use the parallelization features of SLIRP right now,
we're in the process of developing a module much like that outlined
above. The aim is to make it possible for users to parallelize their
analysis for multicore use simply by adding something like

   require("parallel")

to the beginning of their scripts. Please contact me if you think you
might benefit from this work, or have any thoughts, criticisms, or
even offers of help!

Regards,
Michael S. Noble

----------------------------------------------------------------------
Vector Parallelization Example #2

Consider a volume of 320*320*320 real-valued voxels representing
Doppler velocity mappings of Si II infrared emission observed with the
Spitzer IRS. The data were stored in ASCII format and given to me by a
colleague so we could view them in the volview() 3D visualizer
(space.mit.edu/cxc/software/slang/modules/volview). Since I/O on a
130 MB ASCII dataset can be cumbersome and slow, and because we would
undoubtedly be repeating the visualization task a number of times, I
first converted the volume to the high-performance HDF5 binary format.
This involved some 320^3 calls to the atof() function, which converts
string data to double.

Now, unlike the trig functions used above, the atof() function in
S-Lang is not vectorized; the only way it can be used to convert an
array of strings is by looping over each element, and normally the
best way to do that in S-Lang is with array_map(). For example, with a
faked and much smaller 3D volume (100^3 voxels)

   linux% isis
   isis> avol = array_map(String_Type, &sprintf, "%d", [1:100*100*100])

the time to convert it to Double_Type using array_map() is

   isis> tic; dvol = array_map(Double_Type, &atof, avol); toc
   13.7544

This was executed on my dual 1.8 GHz Athlon desktop, running Debian
3.1 with 2 GB RAM. Importing our vector-parallel module as above and
repeating the test

   isis> tic; pdvol = atof(avol); toc
   0.144219

shows an astounding speedup of 95X, while again yielding the same
result:

   isis> length(where(dvol != pdvol))
   0

The reason for the vastly superlinear speedup is that, in addition to
utilizing both CPUs on my desktop, the SLIRP version of atof() is
vectorized to operate directly upon arrays, at the speed of compiled
C. All OpenMP-enabled wrappers generated by SLIRP are vectorized in
this manner, so even without multiple CPUs the vectorized atof() is
considerably faster than the non-vectorized version in S-Lang.

Finally, suppose you wanted to log-scale the Si II Doppler velocities:

   isis> si2 = h5_read("si2vel.h5")
   isis> tic; () = log(si2); toc
   3.82157
   isis> import("par")
   isis> tic; () = log(si2); toc
   2.09266

Here the parallelized log() is nearly 2X faster on this dual-CPU
machine.
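For completeness, here is a minimal sketch of the conversion step
described above. The file names are illustrative, the one-value-per-
line layout is an assumption about the ASCII format, and h5_write() is
assumed by symmetry with the h5_read() call shown earlier; check the
hdf5 module documentation for the actual writer interface:

   isis> require("hdf5")               % for h5_read()/h5_write() (assumed)
   isis> import("par")                 % vectorized, OpenMP-parallel atof()
   isis> fp = fopen("si2vel.dat", "r")
   isis> svol = fgetslines(fp)         % assumes one ASCII value per line
   isis> () = fclose(fp)
   isis> dvol = atof(svol)             % whole-array conversion in compiled C
   isis> reshape(dvol, [320, 320, 320])
   isis> h5_write("si2vel.h5", dvol)   % writer name assumed, see module docs

The point of the sketch is the single atof(svol) call: because the
SLIRP wrapper accepts the entire string array at once, the 320^3
element loop happens in threaded, compiled C rather than interpreted
S-Lang.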