					Scientific Computing: C++ versus Fortran
      Todd Veldhuizen (tveldhui@monet.uwaterloo.ca)
                                          July 1997




Author biography

Todd is a grad student in Computer Science at Indiana University, and the author of the Blitz++
class library. He can be reached at tveldhui@extreme.indiana.edu.


1: Introduction
The language C++ has features which make it attractive for scientific computing: templates for
generic programming, operator overloading for expressiveness, and object-orientation for
abstraction and code reuse. Despite these features, the scientific computing community has been
reluctant to adopt C++, partly due to performance problems. In the early 1990s, C++ programs
were much slower than their Fortran counterparts -- typical benchmarks showed C++ lagging
behind Fortran's performance by anywhere from 20% to a factor of ten. Users of C++ often
rewrote the number crunching parts of their programs in Fortran to get acceptable performance.

In the past five years, the performance of C++ programs has improved markedly due to a
combination of better optimizing C++ compilers and new library techniques. In some recent
benchmarks, C++ is even faster than Fortran. In this article, I'll explain these techniques and
show benchmarks comparing a C++ class library (Blitz++) to Fortran.


2: Why C++ programs are much faster than they used to be
Expression templates

Operations on vectors, matrices, and arrays form the foundation of most scientific computing
codes. In C++, classes can customize the meaning of operators such as +, -, *, and /. This
capability, called operator overloading, enables C++ matrix/vector class libraries to provide a very
natural notation. For example, this code might represent the summation of three vectors:

w = x + y + z;


Operator overloading in C++ is pairwise: to evaluate this expression, C++ forces you to first
evaluate t1=x+y, then t2=t1+z, and finally assign w=t2. The t1 and t2 objects are temporaries
used to store intermediate results.
Pairwise expression evaluation for vectors is slow: instead of a single loop to evaluate an
expression, multiple loops and temporary vectors are generated. These temporary vectors are
usually allocated dynamically, which causes a severe performance loss for small vectors. For
example, suppose we evaluated w=x+y+z for vectors of length 3. The floating point arithmetic
takes only a handful of clock cycles; dynamically allocating and deleting the temporary vectors
consumes hundreds (or even thousands) of cycles. Consequently, C++ matrix/vector libraries
tended to be horribly slow for operations on small objects.
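
To make the source of these temporaries concrete, here is a minimal sketch of the
conventional operator-overloading approach. The class and member names are illustrative
only, not taken from any particular library:

// A bare-bones vector class with a conventional overloaded operator+.
class Vec {
public:
    explicit Vec(int n) : length_(n), data_(new double[n]) { }
    Vec(const Vec& v) : length_(v.length_), data_(new double[v.length_]) {
        for (int i = 0; i < length_; ++i) data_[i] = v.data_[i];
    }
    ~Vec() { delete [] data_; }
    Vec& operator=(const Vec& v) {
        for (int i = 0; i < length_; ++i) data_[i] = v.data_[i];
        return *this;
    }
    double& operator[](int i)       { return data_[i]; }
    double  operator[](int i) const { return data_[i]; }
    int length() const { return length_; }
private:
    int length_;
    double* data_;
};

// Each + allocates, fills, and returns a temporary vector.  For w=x+y+z the
// compiler generates t1=x+y, t2=t1+z, w=t2: three loops and two temporaries.
Vec operator+(const Vec& a, const Vec& b)
{
    Vec result(a.length());
    for (int i = 0; i < a.length(); ++i)
        result[i] = a[i] + b[i];
    return result;
}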

For large vectors, the main performance issue becomes memory bandwidth. Many scientific
computing codes are constrained in performance not by floating point arithmetic, but by the time
required to ship data between main memory and the CPU. Suppose that each vector contained 1
MB of data. The operator-overloaded version of w=x+y+z requires 8 MB of data to be transferred
between the CPU and main memory: t1=x+y reads 2 MB and writes 1 MB, t2=t1+z reads 2 MB and
writes 1 MB, and w=t2 reads 1 MB and writes 1 MB. A single fused loop needs only 4 MB (read x,
y, and z; write w), so the pairwise version moves exactly twice the data actually needed, and
as a consequence the speed is roughly half what could be attained.

The pairwise evaluation problem has been solved by the expression templates technique. In an
expression templates implementation, operators like + don't return intermediate results. From the
compiler's point of view, the expression is still evaluated pairwise, but operators like + return
partial parse trees of the expression (rather than intermediate results). The parse tree is encoded
as a C++ class type using templates. When the parse tree is assigned to an object, the expression
can be evaluated in one pass without having to generate intermediate results.
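
Here is a stripped-down sketch of the idea, using a simple vector class like the one
sketched earlier; the real Blitz++ machinery is far more elaborate, but the essential
trick is the same:

// A parse-tree node representing "left + right".  It holds references to
// its operands and computes element i on demand, without storing a result.
template<class L, class R>
class VecSum {
public:
    VecSum(const L& l, const R& r) : l_(l), r_(r) { }
    double operator[](int i) const { return l_[i] + r_[i]; }
private:
    const L& l_;
    const R& r_;
};

// operator+ returns a parse-tree node instead of an intermediate vector.
template<class L, class R>
inline VecSum<L,R> operator+(const L& l, const R& r)
{
    return VecSum<L,R>(l, r);
}

class Vec {
public:
    explicit Vec(int n) : length_(n), data_(new double[n]) { }
    ~Vec() { delete [] data_; }
    double& operator[](int i)       { return data_[i]; }
    double  operator[](int i) const { return data_[i]; }

    // Assignment walks the parse tree once per element: for w=x+y+z, e[i]
    // expands to x[i]+y[i]+z[i].  One loop, no temporary vectors.
    template<class Expr>
    Vec& operator=(const Expr& e) {
        for (int i = 0; i < length_; ++i) data_[i] = e[i];
        return *this;
    }
private:
    int length_;
    double* data_;
};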

The technique has some overhead associated with the expression parsing, but the performance
for long vectors is just as good as hand-coded C. A good optimizing C++ compiler is able to
eliminate this overhead completely, so that performance is reasonable for small vectors too. (To
get optimal performance, you have to unroll the loops completely for small vectors, as I'll describe
next.)

Optimizations for small objects
In a previous article (DDJ August '96), I described special optimizations for small vectors and
matrices using the template metaprogram technique. Briefly summarized, it turns out that C++
compilers can be persuaded to do arbitrary computations at compile time by metaprograms which
exploit the template instantiation mechanism. One good use of template metaprograms is creating
specialized algorithms. As an example, let's consider this little bit of code, which calculates a 3x3
matrix-vector product:

Matrix A(3,3);
Vector b(3), c(3);

// ...

c = product(A,b);


Calculating the result requires 15 floating-point operations, but this bit of code will consume
hundreds of clock cycles. Why? The Matrix and Vector objects must allocate their memory
dynamically, which takes a lot of time. The product() function resides in a library somewhere,
and likely consists of two nested loops to compute the matrix-vector product. When compilers
optimize loop code, they often assume that the loop will be executed many times. The result is
code which is very fast for large matrices, but mediocre for small matrices.

The solution I've adopted in the class library Blitz++ is to provide two versions of objects like
Matrix and Vector. One version is optimized for large objects (e.g. Vector), and the other is
optimized for small objects (e.g. TinyVector). The Tiny... objects use template metaprograms
to create very fast code. Here's the matrix-vector product implemented with Tiny... objects:

TinyMatrix<double,3,3> A;
TinyVector<double,3> b, c;

c = product(A,b);


Here's a skeletal declaration of TinyVector:

template<class T, int N>
class TinyVector {
    // ...
private:
    T data[N];    // e.g. float data[3] for a TinyVector<float,3>
};

Putting the vector data inside the object allows it to reside on the stack, so the overhead of
allocating memory is avoided. The function product() invokes a template metaprogram which
creates a specialized matrix-vector multiplication algorithm. The metaprogram produces this code:

c[0] = A[0][0]*b[0] + A[0][1]*b[1] + A[0][2]*b[2];
c[1] = A[1][0]*b[0] + A[1][1]*b[1] + A[1][2]*b[2];
c[2] = A[2][0]*b[0] + A[2][1]*b[1] + A[2][2]*b[2];


Big blocks of uninterrupted floating point code like this are exactly what's needed to get peak
performance on modern pipelined, superscalar CPUs. The metaprogram version of the 3x3 matrix-
vector product takes 13 clock cycles in sustained execution on an RS/6000 Model 43P. (It's this
fast because the matrix A is kept in registers and combined multiply-add instructions are used.)
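
To give a flavour of how such specialized code comes about, here is a minimal template
metaprogram that unrolls a fixed-length dot product at compile time. It is a sketch of the
technique only, not the Blitz++ source; the matrix-vector product uses the same recursive
idea applied over two indices:

// Recursive template: DotProduct<I>::eval computes the first I terms.
template<int I>
struct DotProduct {
    static double eval(const double* a, const double* b) {
        return a[I-1] * b[I-1] + DotProduct<I-1>::eval(a, b);
    }
};

// Full specialization terminates the compile-time recursion.
template<>
struct DotProduct<1> {
    static double eval(const double* a, const double* b) {
        return a[0] * b[0];
    }
};

// DotProduct<3>::eval(a, b) is inlined into a[2]*b[2] + a[1]*b[1] + a[0]*b[0]
// -- straight-line code with no loop and no branches.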


Better compilers
A large part of C++'s performance problem was inadequate compilers. Happily, C++ compilers
have come a long way in the past decade. To demonstrate this, let's look at KAI C++ from Kuck
and Associates Inc., which has incredible optimization abilities. As an example, we'll examine a
little piece of code which reflects a ray of light off a surface. We'll call the incident ray vector x,
the reflected ray y, and the surface normal vector n (Figure 1).


                                       Figure 1: Ray reflection
We'll work in a three dimensional space, so it makes sense to use the class TinyVector, which
does special optimizations for small vectors. Here's the reflection function:

typedef TinyVector<double,3> ray;

inline void reflect(ray& y, const ray& x,
  const ray& n)
{
  y = x - 2 * dot(x,n) * n;
}


The reflect() code looks very similar to the mathematical equation which it implements:
y = x - 2(x·n)n. Behind this simplicity lies an elaborate implementation: first, a template metaprogram
unrolls the dot product dot(x,n) into the code x[0]*n[0]+x[1]*n[1]+x[2]*n[2]. The vector
arithmetic is then parsed using expression templates, which produces temporary objects
corresponding to partial parse trees. Evaluation of the expression for y[0], y[1] and y[2] is
unrolled using another template metaprogram. (KAI C++ is smart enough to unroll the loops itself
without metaprograms, but other C++ compilers aren't). KAI C++ is able to optimize away all the
expression templates overhead using copy propagation and dead code elimination, and produce
code which is equivalent to this:

double _t1 = x[0]*n[0] + x[1]*n[1]
    + x[2]*n[2];
double _t2 = _t1 + _t1;
y[0] = x[0] - _t2 * n[0];
y[1] = x[1] - _t2 * n[1];
y[2] = x[2] - _t2 * n[2];


Examining this code, we see 6 multiplies, 3 additions, 3 subtractions, 6 loads and 3 stores for a
total of 21 operations. On an RS/6000, the assembly code generated by KAI C++ and the back
end compiler consists of only 20 opcodes. The data are kept in registers, and operated on by
combined multiply-add instructions. This is very impressive considering how complex the
expression templates machinery is. But it gets better! Suppose we want to calculate the reflection
of a particular ray. Overloading the comma operator gives us a concise way to assign vector
literals:

void test(ray& y)
{
  ray x, n;

  x = 1.00,  0.40, -1.00;
  n = 0.31,  0.20,  0.93;

  reflect(y, x, n);
}
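
The mechanism behind this notation is small. Here is a simplified sketch with hypothetical
names; the details differ from the actual Blitz++ implementation:

// Hypothetical helper returned by the vector's operator=(double): each
// overloaded comma stores one element and hands back an initializer
// positioned at the next one.
class ListInit {
public:
    explicit ListInit(double* p) : p_(p) { }
    ListInit operator,(double x) { *p_ = x; return ListInit(p_ + 1); }
private:
    double* p_;
};

// In the vector class itself:
//     ListInit operator=(double x) { data_[0] = x; return ListInit(data_ + 1); }
// so the statement  x = 1.00, 0.40, -1.00;  fills x[0], x[1], and x[2] in turn,
// because = binds more tightly than the comma operator.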


When this code is compiled, KAI C++ discards the temporary objects created by the overloaded
comma operator, does the optimizations mentioned before for reflect(), and performs constant
folding to evaluate arithmetic on constant values. The final code is equivalent to:

void test(ray& y)
{
  y[0] = 1.3348;
  y[1] = 0.6160;
  y[2] = 0.0044;
}


The compiler calculates the reflection at compile time, and ditches the x and n vectors, since
they're not needed anymore. Now that's an optimizing compiler!

So C++ compilers have come a long way in the past decade. When compilers like KAI C++ are
combined with library techniques such as expression templates and template metaprograms, the
resulting performance is very impressive, as the benchmarks will illustrate.


3: Benchmark results
All benchmarks were run on a 100 MHz IBM RS/6000 Model 43P (PowerPC) with 16 KB of L1 cache
and 256 KB of L2 cache. The C++ programs were compiled using KAI C++ at +K3 -O2, and
Fortran 77 code was compiled with XL Fortran at -O3.

Linear algebra

The first benchmark measures the performance of a DAXPY operation, short for "Double precision
A times X Plus Y". This routine is the workhorse of many linear algebra operations, such as
Gaussian elimination. Here's a DAXPY operation using the Blitz++ class library:

Vector<double> x(N), y(N);
double a;
   .
   .
y += a * x;
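
Under the hood, the expression templates technique is intended to reduce this statement to
a single fused loop, roughly the following:

// Roughly the loop the expression templates machinery should generate for
// y += a * x (a sketch, not actual compiler output).
for (int i = 0; i < N; ++i)
    y[i] += a * x[i];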


Figure 2 shows benchmark results for four versions: the Blitz++ classes TinyVector and Vector,
Fortran 77, and the class valarray (Modena implementation) from the draft ISO/ANSI C++
standard. The class TinyVector is optimized for very small vectors, so its performance is much
better than the other implementations for vectors of length 1..10. The Vector class and the
Fortran 77 implementation have some loop overhead which makes their performance poorer for
small vectors. As you can see, for longer vectors their performance is very similar. The sudden
performance drop around N=1000 occurs because the vectors are no longer small enough to fit in
the L1 cache.

The performance of valarray is typical of older C++ class libraries which use pairwise evaluation
and dynamically allocated temporaries. Notice that for very small vectors, memory allocation
overhead causes the performance to be awful -- about a tenth that of Vector and Fortran 77. Its
peak performance is also terrible, since pairwise evaluation forces it to move twice as much
memory around as necessary. The extra memory use causes valarray to experience the L1 cache
crunch much sooner than Vector or Fortran 77.

Fortran 90, the successor to Fortran 77, was also benchmarked, but on this platform the compiler
is not very good -- the performance was much worse than Fortran 77's.

                                   Figure 2: DAXPY Benchmark

Array stenciling
Array stencils are common in signal processing and the solution of partial differential equations. A
stencil updates an array element using a function of the elements in a local neighbourhood. Here's
a Fortran 77 stencil operation which smoothes a 3D array:

      DO k=2, N-1
        DO j=2, N-1
          DO i=2, N-1
            A(i,j,k) = (B(i,j,k)    + B(i+1,j,k) + B(i-1,j,k)
     .                + B(i,j+1,k) + B(i,j-1,k) + B(i,j,k+1)
     .                + B(i,j,k-1)) / 7.0
          END DO
        END DO
      END DO

The performance of stencil operations is heavily constrained by memory accesses, so efficient
cache use is critical. Most computers have (at least) two levels of cache: the L1 cache, which is
part of the CPU, and the external L2 cache. Of these, the L1 cache is faster, so for good
performance the L1 cache must be used as much as possible. In the straightforward Fortran
implementation above, the L1 cache hit rate is typically about 54% for a large array (this means
that 54% of the data is found in the L1 cache; the remaining data has to be pulled in from the L2
cache or main memory). To get a higher L1 cache hit rate, it's necessary to traverse the array in a
different order. One approach is dividing the array into smaller subarrays, and applying the stencil
to each subarray in turn. This is commonly called tiling or blocking. Converting the above stencil
to a blocked version would require five (or six) nested loops. In some cases, Fortran compilers can
transform an algorithm into a blocked version automatically, but often not for complicated
stencils.
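
For illustration, a blocked version of the smoothing stencil looks roughly like this
(written here in C++, using the A(i,j,k) indexing of the Blitz++ examples below; BS is an
assumed block edge length that must be tuned to the cache):

const int BS = 16;   // assumed block size -- must be tuned per machine
for (int kb = 1; kb < N-1; kb += BS)
  for (int jb = 1; jb < N-1; jb += BS)
    for (int ib = 1; ib < N-1; ib += BS)
      // apply the stencil within one cache-sized block
      for (int k = kb; k < kb+BS && k < N-1; ++k)
        for (int j = jb; j < jb+BS && j < N-1; ++j)
          for (int i = ib; i < ib+BS && i < N-1; ++i)
            A(i,j,k) = (B(i,j,k)   + B(i+1,j,k) + B(i-1,j,k)
                      + B(i,j+1,k) + B(i,j-1,k) + B(i,j,k+1)
                      + B(i,j,k-1)) / 7.0;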

One problem with blocking is selecting the block size: it has to be chosen carefully with the
particular architecture and cache sizes in mind to get optimal performance. In the Blitz++ library,
I decided to adopt a slightly different approach, based on the Hilbert space filling curve (Figure 3).
In this approach, there are two loops: an outer loop which follows an approximation to the Hilbert
space filling curve (1), and an inner loop (2) which traverses the corresponding column of the
array. The space filling curve ensures good cache use, and the inner loop allows the CPU pipelines
to pick up steam as the column is traversed.




                               Figure 3: Hilbert curve array traversal
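
Schematically, the traversal has the following two-loop structure; hilbertColumn() is a
hypothetical routine standing in for the roughly hundred lines of curve-generation code
hidden inside the library:

// Illustrative sketch only -- not the Blitz++ internals.
for (int c = 0; c < (N-2)*(N-2); ++c) {    // (1) columns in Hilbert-curve order
    int i, j;
    hilbertColumn(c, i, j);                // hypothetical: c-th (i,j) position
    for (int k = 1; k < N-1; ++k)          // (2) traverse the column
        A(i,j,k) = (B(i,j,k)   + B(i+1,j,k) + B(i-1,j,k)
                  + B(i,j+1,k) + B(i,j-1,k) + B(i,j,k+1)
                  + B(i,j,k-1)) / 7.0;
}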

I've found this traversal ordering has a very nice property: performance increases steadily for
every extra bit of cache you have, resulting in good performance on all platforms. For most
platforms, the L1 cache hit rate is 70% to 76% (compared to only 54% for the Fortran 77
program described above). This isn't the case for blocked algorithms, where performance makes
abrupt jumps at critical cache sizes, making tuning very platform-dependent.

Implementing the Hilbert curve traversal is complicated: it requires about a hundred lines of code.
In Blitz++, this complexity is completely hidden inside the library. The user level code looks like
this:

const int N = 64;
Array<double,3> A(N,N,N), B(N,N,N);
 .
 .
Range I(1,N-2), J(1,N-2), K(1,N-2);
A(I,J,K) = (B(I,J,K) + B(I+1,J,K) + B(I-1,J,K)
   + B(I,J-1,K) + B(I,J+1,K) + B(I,J,K-1)
   + B(I,J,K+1)) / 7.0;


Using the expression templates technique, this code is transformed into a Hilbert curve traversal.
To implement a similar traversal in Fortran, you'd have to write a hundred lines of code, or rewrite
the compiler.

Figure 4 shows a benchmark comparing the Blitz++ version of the 3D array stencil with the
Fortran 77 version for a variety of array sizes. For very small arrays, Fortran 77 has the upper
hand, because its code is much simpler. For larger arrays, the extra complexity of the Hilbert
curve traversal pays off: the higher L1 cache hit rate translates into better performance.




                              Figure 4: 3D Array Stencil Benchmark

Lattice Quantum Chromodynamics
The last benchmark I'd like to describe is drawn from Lattice Quantum Chromodynamics (QCD).
Lattice QCD is a computational version of the Standard Model of particle physics. It's used to
calculate particle properties (such as mass) predicted by current theory. Lattice QCD is very
demanding: these programs engage supercomputers for months or years at a time.

This benchmark is based on a Lattice QCD implementation from the Edinburgh Parallel Computing
Centre in the UK. In the EPCC implementation, the bulk of CPU time is spent multiplying 3x3 and
3x2 matrices of complex numbers. This is an ideal application for the Blitz++ TinyMatrix class,
which uses template metaprograms to create specialized algorithms for matrix-matrix
multiplication. Here's an initial version of the Blitz++ benchmark:

typedef TinyMatrix<complex<double>,3,3> gaugeElement;
typedef TinyMatrix<complex<double>,3,2> spinor;

void qcdCalc(Vector<spinor>& res, Vector<gaugeElement>& M,
  Vector<spinor>& src)
{
    for (int i=0; i < res.length(); ++i)
      res[i] = product(M[i], src[i]);
}


The arguments res and src are vectors of 3x2 complex matrices; M is a vector of 3x3 complex
matrices. Here's the initial Fortran 77 version:

      subroutine qcdf(M, res, src, V)
      integer V, site, spin, col
      double complex M(V,3,3), res(V,3,2), src(V,3,2)

      DO site=1,V
        DO spin=1,2
          DO col=1,3
            res(site,col,spin) = M(site,col,1) * src(site,1,spin)
     .        + M(site,col,2) * src(site,2,spin)
     .        + M(site,col,3) * src(site,3,spin)
          ENDDO
        ENDDO
      ENDDO

      return
      end


Figure 5 shows the benchmark results for these two codes. The performance of the initial Fortran
77 version was disappointing due to poor memory layout of the data, and the failure of the
compiler to unroll the inner loops. The performance doubled after optimizing the memory layout
and manually unrolling the inner loops (see the series "Fortran 77 tuned" in Figure 5), but still
wasn't as good as Blitz++. Examination of the assembly code revealed that the Fortran compiler
was suffering from "register spillage" -- it was unable to fit all the data in registers, so their
contents had to be spilled out onto the stack. This is a compiler problem; with a better code
generator, the Fortran version ought to be as fast as the initial Blitz++ version.

To tune the Blitz++ version, I took a step that is second nature to C++ programmers. The related
objects were encapsulated in a single class:

class latticeUnit {
  ...
protected:
  spinor one;
  gaugeElement ge;
  spinor two;
};


The revised qcdCalc() routine operated on a single vector of latticeUnit objects, rather than
three separate vectors. This encapsulation ensured that related objects were stored nearby in
memory, a property known as locality of reference. Cache designs assume programs will exhibit
locality of reference, so encapsulation tends to improve performance. In the QCD benchmark,
encapsulation boosted performance by 15-20% (see the "Blitz++ tuned" series in Figure 5).
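
A sketch of the revised routine is shown below; it assumes latticeUnit grants the routine
access to its members, and the exact interface in the benchmark code may differ:

// Sketch of the tuned version: one vector of latticeUnit objects, so the
// gauge element and the two spinors for each site sit next to each other
// in memory.  Direct member access is assumed here for brevity.
void qcdCalc(Vector<latticeUnit>& lattice)
{
    for (int i = 0; i < lattice.length(); ++i)
        lattice[i].two = product(lattice[i].ge, lattice[i].one);
}
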
                                Figure 5: Lattice QCD Benchmark


4: Conclusion
After nearly a decade of being dismissed as too slow for scientific computing, C++ has caught up
with Fortran and is giving it stiff competition. The combination of new library design techniques
(expression templates, template metaprograms) and better compilers means that programmers
can enjoy the expressiveness and ease of C++ without paying a performance price.


5: For more information
For references and more information about the techniques described in this article, see the
Blitz++ web page at http://monet.uwaterloo.ca/blitz/