SLIRP Reference Manual, Version 1.9.8: Vectorization

7. Vectorization

One of the most powerful features of S-Lang is the transparent manner in which scalars and arrays may be used in the same context. For example,


                        y = sin(x)

produces a scalar result when x is a scalar, or a vector when x is a 1D array, and so on. This feature is referred to as vectorization, since it obviates the need to explicitly loop over array indices in S-Lang scope; instead, the loop is implied and executes in compiled C scope, yielding significantly greater performance. Vectorization encompasses more than the simple promotion of scalar arguments to arrays, however. More generally, we say a function is vectorized when any of its arguments may be of multiple ranks; no distinction is made between the promotion of a scalar (rank 0) argument to 1D or higher, a 2D array to 3D, and so forth. When a vectorized function is invoked with an argument whose rank exceeds that of its default usage (e.g. as specified in a C function prototype), we say that both the argument and the invocation itself are vectored.

Vectorization naturally complements the strong array handling facilities within S-Lang. It provides powerful multidimensional mathematical capability, nearly analytic in its expressiveness, and with performance on par with -- or exceeding -- that of compiled C code and commercial analysis packages; together these make S-Lang a very strong platform for numerical analysis in science and engineering. The fact that vectorization is native to S-Lang distinguishes it from other popular scripting languages like Perl, Python, or Tcl, which natively lack high-performance multidimensional numerical capability. Array handling within S-Lang is typically much faster than its counterparts in these languages, as much as an order of magnitude in some cases, as well as more concise because it's built into the very fabric of the language.

SLIRP extends this capability, with commensurate performance and economy of codesize, to a wide range of external C, C++, and Fortran codes; this enables wrappers to be called with either vectored or non-vectored semantics, and with few restrictions on the quantity, datatype, or rank of the arguments to functions that may be automatically vectorized. The -openmp option may be used to tune vectorized code so that OpenMP-aware compilers will automatically parallelize the wrappers, boosting performance even further in the presence of multiple cpus (or cores).

As a concrete example of the advantage of vectorization, here are performance ratios of the S-Lang strlen intrinsic versus a vectorized wrapper for strlen generated by SLIRP :

The vectorized strlen is roughly 1 to 2 orders of magnitude faster than the optimal native S-Lang usage (employing array_map to iterate over the string array); other S-Lang looping constructs -- such as for, while, or even loop -- yield an even greater performance advantage for the vectorized strlen. The quality of SLIRP vectorization is evident in the following plots

which demonstrate that the generated wrappers for cos and sin have vectorized performance equivalent to the hand-crafted cos and sin intrinsics native to S-Lang (see the examples/vec directory for examples and more performance details).

7.1 Making and Using a Vectorized Function

Vectorization is a transformation of the entire function semantics, in the sense that every argument received by the wrapper is tagged to optionally accept a vectored value. A vectorized function may be called with both vectored and non-vectored arguments at runtime, but at code generation time SLIRP will either vectorize every argument accepted by the wrapper or none at all. When SLIRP cannot vectorize one or more arguments of a function (see below) it will instead generate a standard, non-vectorized wrapper. There are several ways to request that a function be vectorized:

use the -vec option, which causes SLIRP to attempt to vectorize every function in the interface being wrapped
add the function name to a #vectorize block
prototype the function within a #vectorize block
use the -openmp option, in which case vectorized wrappers may be parallelized with OpenMP

For example, given two functions with C prototypes of


        void     vmult          (double *x, double *y, double *result, int len);
        size_t   strlen         (const char *s);

an interface file containing


        #vectorize
           strlen
           void vmult(double *x, double *y, double *OUT, int DIM1);
        #end

would yield a module with vectorized wrappers for the strlen and vmult functions (these and other examples are included within the examples/vec subdirectory of SLIRP). While the types in the vectorized prototype for vmult match, of course, that of the underlying function being wrapped, the parameter names chosen change how the function will be called from S-Lang scope. First, as discussed in the previous chapter, the double *OUT parameter is an output #argmap, which transforms the result argument into an output value of the wrapper. Second, DIM1 denotes that the final argument describes the length of each array and that it, too, may be omitted when calling the wrapper. S-Lang calls to vmult therefore operate as if it accepts two arrays and returns 1, all of Double_Type. If we were to load the module into slsh, for example, and invoke vmult with no arguments


        slsh> import("vec")
        slsh> vmult
        Usage:  double[] = vmult(double[],double[])
                This function has been vectorized.
        slsh>

(these examples assume a S-Lang 2 shell, with semicolon appending turned on), the usage statement reflects this, as well as the fact that the function has been vectorized. The DIM1 annotation also plays another important role, in that it signifies to SLIRP that before the wrapper calls vmult proper it must ensure that the array arguments are all of equal length:


        slsh> vmult([1,2,3], [3,4])
        Array shape or length mismatch

        slsh> vmult([1,2,3], 4)
        Scalar cannot be used here

Without such a precaution the underlying vmult function would likely yield undefined results or memory corruption and an application crash. Two examples of calling vmult correctly, then, are


        slsh> print( vmult([1,2,3], [5,5,5]) )
        5
        10
        15

        slsh>  Arr = Int_Type[2,3]
        slsh>  Arr[0,*] = 5
        slsh>  Arr[1,*] = 100
        slsh>  print( vmult(Arr, [3, 4, 5]) )
        15 20 25
        300 400 500

Observe that the first call is not vectored, because vmult is prototyped to expect an array, assumed to be of dimension 1, and that's what was passed in; vmult is called only once by the wrapper. The second call is vectored, because one of its arguments is a 2D array; now the wrapper calls vmult twice, once as


                vmult([5, 5, 5], [3, 4, 5], retval, 3)

and again as


                vmult([100, 100, 100], [3, 4, 5], retval+3, 3)

(where retval is a double* dynamically allocated to store 6 values) and yields the 2D array


                        | 15     20     25  |
                        | 300    400    500 |

To convince yourself that the latter is actually true, consider:


        slsh>  vmult(Arr, [3, 4, 5])
        Double_Type[2, 3]

Note that the vectorization behavior in the second case resembles that of the array_map() intrinsic when passed arguments of different lengths: only the first element of shorter arrays will be (re)used in each iteration. Here the vector [3,4,5] is treated as the first element of a vacuously 2D array of dimension 1x3. This flexibility can yield shorter and cleaner code, e.g. by avoiding the creation of temporary container arrays simply to satisfy dimensionality constraints. Finally, it's worth noting that even though Arr and [3,4,5] are arrays of Integer_Type, S-Lang transparently casts them to Double_Type when the arguments are transferred to the wrapper.

7.2 Dimensionality Theory

At runtime SLIRP uses a few simple metrics to decide

whether a wrapper has been called with vectored semantics
and, if so, how many times to call the wrapped function
and, with what inputs

Collectively these are referred to as the parameters of vectorization, or simply the vectorization.

Expected Rank

To begin, each argument passed to a wrapper has an expected rank: a non-negative integer representing its dimensionality, or the number of indices required to uniquely identify a single element within the argument. SLIRP infers the expected rank of each function argument directly from its prototype, at code generation time. For example, the primary arguments of vmult and strlen from the preceding section are of ranks 1 and 0: when invoked as prototyped the primary inputs to vmult are 1D arrays, while strlen expects a scalar (one C-style string), which for generality is considered an array of dimension zero. An N-dimensional C array such as


                double matrix[3][3][5];

has a rank of N (3 in this case), with the proviso that arguments such as


                int something(double x[3][5])
                int something_else(double **x)

are not commensurate. The first function will be vectorized with rank 2; the second, however, will be vectorized with rank 1, because double** describes a 1D array of pointers, represented as opaque types in S-Lang scope. A pointer argument may be tagged as multidimensional, though, through the use of multiple DIM annotations. For example, a C prototype of


           void   sub1_2d(int *matrix, int numrows, int numcols);

vectorized with


        #vectorize
           void   sub1_2d(int *matrix, int DIM1, int DIM2);
        #end

yields a wrapper which accepts 1 argument, an integer array of rank 2. The value of the second and third arguments to the underlying C function will be calculated automatically from the input S-Lang array.

Number of Iterations

The actual rank of an argument is its dimensionality as passed at runtime. When the actual rank of any argument exceeds its expected rank, SLIRP needs to determine how many times the wrapped function should be called, or the number of iterations of the vectorization. In the preceding vmult example this value is 2; extended to 3D


        slsh>  Arr3D = Double_Type[2,2,3]
        slsh>  Arr3D[0, *, *] = Arr
        slsh>  Arr3D[1, *, *] = 2 * Arr
        slsh>  result = vmult(Arr3D, [7,8,9] )

vmult proper will be called 4 times, and the wrapper will yield the following 2x2x3 array


                        |  | 35   40   45   |  |
                        |  | 700  800  900  |  |
                        |                      |
                        |  | 70   80   90   |  |
                        |  | 1400 1600 1800 |  |

SLIRP determines the number of iterations by selecting a master array M, simply the input argument of highest rank, and then computing the product of its excess dimensions. Formally, if A and E respectively represent the actual and expected ranks of M, and D is a vector of length A describing the size of each dimension of M (in row-major form), then

with the expected proviso that

is by definition when K < J.

Stride

Finally, SLIRP determines what to pass to each wrapped function invocation by calculating a stride for each input argument; this indicates by how much the index into the given argument --- viewed as a linear sequence of contiguous elements --- should be advanced after each call. SLIRP aims for flexibility by allowing arrays of different shapes to be used within a single vectored call. For instance, in the 3D vmult call above the strides of the first and second arguments are 3 and 0, respectively; within the wrapper Arr3D and [7, 8, 9] are effectively represented as


        double  *arg1 = {5, 5, 5, 100, 100, 100, 10, 10, 10, 200, 200, 200};
        double  *arg2 = {7, 8, 9};

the return value is allocated as


        double  *retval = malloc( sizeof(double) * 12);

and the 4 calls to vmult proper are executed as


        vmult(arg1, arg2, retval, 3);
        vmult(arg1+3, arg2+0, retval+3, 3);
        vmult(arg1+6, arg2+0, retval+6, 3);
        vmult(arg1+9, arg2+0, retval+9, 3);

The stride of the master array M, and all isomorphic arguments, is thus the number of elements in M not covered by its excess dimensions; that is, the product of its expected dimensions (rank)

To see why, recall that the product of all dimensions of M, i.e. the total number of elements, is

and must be equal to the product of the excess and expected dimensions

Because the first term here is the number of iterations, the stride may be expressed concisely as

An error will be signaled if arguments not isomorphic to M have number of elements not equal to the stride of M; such arguments will otherwise be assigned a stride of 0.

7.3 When Functions Will Not be Vectorized

In the majority of cases SLIRP attempts to faithfully vectorize every function requested. However, it will refuse to vectorize functions

in which an annotation has already been applied to one or more arguments, because vectorization is itself an annotation.
with an empty argument list, because an argument would have to be introduced in order to specify the number of iterations of the vectorization.
with more than 10 input arguments.
listed within a #novectorize block.
coded in Fortran and accepting an array argument of dimensionality 2 or greater, because of the constraints of transposition.

The constraint on the number of vectorized input arguments may be changed by re-configuring the distribution with --with-nvec=NEW_VALUE, although you should consider that functions with too many input arguments may indicate questionable design.

7.4 Parallelization

As noted earlier, the -openmp option will tune the emission of vectorized wrappers for parallelization with a suitable OpenMP-aware compiler, making it very easy to gain potentially significant performance increases on multiprocessor (or multicore) machines. Consider the plots below, which indicate the performance of SLIRP OpenMP-aware wrappers for a selection of commonly used ANSI C math functions, over a range of real-valued arrays. The performance numbers for the first plot were generated with 2 1.8 GHz AMD Athlon MP 2200 processors, on Debian Gnu/Linux 3.1 (Sarge) using the 20070221 pre-release version of GCC 4.2; the second set of numbers was generated with 4 750Mhz sparcv9 processors, on Solaris 5.9 with Sun Studio 9. The module and scripts used to generate and plot these performance data, as well as the data themselves, are located in the examples/openmp subdirectory of the SLIRP distribution. The plots were generated within the ISIS modeling and analysis tool ( http://space.mit.edu/cxc/isis/).

Limitations

At present the -openmp option cannot be used to wrap Fortran functions. It also reduces the flexibility of striding, in the sense that each array passed to a parallelized wrapper must have a stride of 1.

Next Previous Contents