getting more from your multicore

From: Michael Noble <mnoble_at_email.domain.hidden>
Date: Mon, 26 Mar 2007 13:50:41 -0400
If you have a multicore desktop machine, access to a server with
multiple CPUs, or perhaps even a supercomputer, this note may be of
interest to you in your Chandra analysis.  If not, feel free to hit
DELETE now.

In a previous note and paper

	http://asc.harvard.edu/chandra-users/0428.html
	http://arxiv.org/abs/astro-ph/0510688

we described how PVM has been used on a network of workstations to
parallelize X-ray modeling and analysis in ISIS.  While successful,
to date that approach requires a fair amount of coding on the part
of the end user.  It also leaves an opportunity unexplored: more and
more desktop machines today contain multiple CPUs or cores, and many
data centers operate multi-CPU servers or Beowulf clusters, yet very
few widely used astronomy software packages exploit these extra
processors for common analysis tasks.

We are therefore pleased to announce a new method by which users
may, with relative ease, exploit the emerging multicore reality
for S-Lang-based modeling and analysis in ISIS.  This capability
comes from the -openmp switch in version 1.9.2 of the SLIRP code
generator, now available at

	http://space.mit.edu/cxc/slirp

For example, consider

	unix%  slirp -openmp par.h

where par.h contains

	double cos(double x);
	double sin(double x);
	double log(double x);
	double atof(const char *s);

The generated module will be vectorized for parallelization by
OpenMP-aware compilers such as Intel 9.1, Sun Studio 9 or later,
and prerelease versions of GCC 4.2 or 4.3.  Here's a sample run
of ISIS utilizing 4 of the 8 750 MHz CPUs on a Solaris 5.9 machine:

   solaris%  isis
   isis>   x = [-PI*100000: PI*100000: .05]

   isis>   tic; c = cos(x); toc
   6.43994

   isis>   import("par")

This import replaces the built-in cos(), sin(), etc. functions
with parallelized versions.  Keeping the same names lets the body
of analysis scripts remain the same, regardless of whether they
execute in a sequential or parallel context.  Repeating the test
from above

   isis>   tic; pc = cos(x); toc
   1.62847

reveals that on these CPUs our parallelized cos() is nearly 4X faster
for the given array size, while yielding the same numerical result:

   isis>   length( where(c != pc) )
   0
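
To see where this speedup comes from, it helps to picture the shape
of the code SLIRP emits: each OpenMP-enabled wrapper loops over its
input array in compiled C, with a pragma that lets the compiler split
the iterations across threads.  The following is only an illustrative
sketch (the names and details are mine, not the actual generated
source):

	/* Illustrative sketch of an OpenMP-enabled vectorized wrapper;
	 * not the actual code emitted by SLIRP. */
	#include <math.h>

	void cos_wrapper (const double *x, double *result, unsigned int n)
	{
	   int i;
	   #pragma omp parallel for  /* divide iterations among threads */
	   for (i = 0; i < (int) n; i++)
	      result[i] = cos(x[i]);
	}

When compiled without OpenMP support the pragma is simply ignored and
the loop runs sequentially; when it is enabled, the thread count is
governed by the usual OpenMP mechanisms, such as the OMP_NUM_THREADS
environment variable.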

The benefits extend to more involved expressions, too, such as

   isis> tic; () = sin(x) + cos(x); toc
   14.0862

   isis> import("par")

   isis> tic; () = sin(x) + cos(x); toc
   4.13241

Similar constructs are relatively common in analysis scripts, such
as ISIS models implemented in pure S-Lang, where the core of the
model may be computed scores, hundreds, or even thousands of times
(e.g. during confidence analysis).  Speedups like those shown above
would accumulate into significant differences over the life of such
a long-running computation: the roughly 10-second saving in the
sin() + cos() test above, repeated over 1000 model evaluations,
amounts to nearly 3 hours.  A more detailed and topical example,
taken directly from experience at our institute, is appended below.

What makes this new capability exciting is not that it guarantees
some amazing factor of N speedup on N CPUs, because it doesn't.
Speedup is highly dependent upon the structure and size of the
problem, as well as the speed of the CPUs utilized; optimal (i.e.
linear) speedups are not the norm.

Rather, its importance lies in how little the end user needs to do,
in terms of learning about threading or other forms of parallel
programming, or rewriting algorithms and scripts, in order to gain
at least *some* speedup.  Plots of speedup as a function of array
size for several arithmetic operations, on the 2 machines used in
the examples here, are given in

	http://space.mit.edu/cxc/slirp/multicore.pdf

They suggest relatively small inflection points -- where arrays are
big enough to gain at least some speedup from threading to multiple
CPUs -- and that faster processors tend to require larger arrays.
The presentation also discusses several limitations in the approach.

Although you can use the parallelization features of SLIRP right now,
we're in the process of developing a module much like that outlined
above.  The aim is to make it possible for users to parallelize their
analysis for multicore use simply by adding something like

	require("parallel")

to the beginning of their scripts.  Please contact me if you think
you might benefit from this work, or have any thoughts, criticisms,
or even offers of help!

Regards,
Michael S. Noble

----------------------------------------------------------------------

Vector Parallelization Example #2

Consider a volume of 320*320*320 real-valued voxels representing
Doppler velocity mappings of silicon II infrared emission observed
with the Spitzer IRS.  The data was stored in ASCII format and was
given to me by a colleague so we could view it in the volview() 3D
visualizer (space.mit.edu/cxc/software/slang/modules/volview).

Since I/O on 130 MB ASCII datasets can be cumbersome and slow, and
because we would undoubtedly be repeating the visualization task a
number of times, I first converted the volume to the high-performance
HDF5 binary format.  This involved some 320^3 (roughly 33 million)
calls to the atof() function, which converts string data to double.

Now, unlike the trig functions used above, the atof() function in
S-Lang is not vectorized; the only way it can be used to convert an
array of strings is by looping over each element, and normally the
best way to do that in S-Lang is with array_map().  For example,
with a fake and much smaller 3D volume (100^3 voxels)

    linux%  isis
    isis> avol = array_map(String_Type, &sprintf, "%d", [1:100*100*100])

the time to convert it to Double_Type using array_map() is

    isis> tic; dvol = array_map(Double_Type, &atof, avol); toc
    13.7544

This was executed on my dual 1.8 GHz Athlon desktop, running Debian
3.1 with 2 GB of RAM.  Importing our vector-parallel module as above
and repeating the test

    isis> tic; pdvol = atof(avol); toc
    0.144219

shows an astounding speedup of 95X, while again

    isis> length(where(dvol != pdvol))
    0

yielding the same result.  The reason for the vastly superlinear speedup
is that, in addition to utilizing both CPUs on my desktop, the SLIRP
version of atof() is vectorized to operate directly upon arrays, at the
speed of compiled C.  All OpenMP-enabled wrappers generated by SLIRP are
vectorized in this manner.  Even without multiple CPUs the vectorized
atof() is considerably faster than the non-vectorized version in S-Lang.
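
For illustration, an OpenMP-enabled vectorized wrapper for atof()
has roughly the shape sketched below (again with names of my own
choosing, not the actual generated code): the per-element loop that
array_map() would perform in the interpreter moves into compiled C,
and the pragma then divides that loop across threads.

	/* Illustrative sketch: the element loop runs in compiled C
	 * rather than via interpreted array_map() calls, and OpenMP
	 * spreads the iterations across the available CPUs. */
	#include <stdlib.h>

	void atof_wrapper (const char **s, double *result, unsigned int n)
	{
	   int i;
	   #pragma omp parallel for
	   for (i = 0; i < (int) n; i++)
	      result[i] = atof(s[i]);
	}
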
Finally, suppose you wanted to log-scale the Si II Doppler velocities:

    isis>  si2 = h5_read("si2vel.h5")

    isis> tic; () = log(si2); toc
    3.82157

    isis> import("par")
    isis> tic; () = log(si2); toc
    2.09266