Middleware for programming NVIDIA GPUs from Fortran 9X

Nail A. Gumerov, Fantalgo, LLC, 7496 Merrymaker Way, Elkridge, MD 21075
Ramani Duraiswami, Fantalgo, LLC, Elkridge, MD 21075
William D. Dorland, University of Maryland, Physics & IREAP, College Park, MD 20742

1. INTRODUCTION
There is a revolution taking place in computing with the availability of multicore CPUs and of
commodity graphics processors capable of several hundred gigaflops. Using these, it should be
possible to build relatively inexpensive teraflop workstations. These machines, however, are not
easily programmed, and scientific computing may not map onto them. Over the past year, the major
graphics vendors have realized there is a market for these processors in HPC and have started
producing tools for programming them from high-level languages: ATI’s Close-To-Machine (CTM) and
NVIDIA’s CUDA [1]. We are more familiar with the latter architecture, for which both a beta
programming environment and graphics boards with impressive speedups over the previous generation
are publicly available, and we restrict ourselves to it.

2. GPU PROGRAMMING MIDDLEWARE
CUDA views the GPU as a set of multiprocessors, each with some local cache memory (of various
types), and all able to access global device memory. Because of this architecture, applications
that map to the shared multiprocessor parallel environment, effectively use the available
processors, and mostly use cache memory can see significant speedups.

While the NVCC compiler provided with CUDA does compile host C code, it is mostly focused on
producing software that runs on the GPU. This is useful for developing small programs that run on
the GPU, but when GPUs are used for high performance computing they are more properly viewed as
compute coprocessors, to which data from a large program running on the CPU host/cluster is farmed
out. Of course, since host-GPU communication is relatively slow, back-and-forth data exchange
should be avoided. Instead, our approach to GPU programming is to provide a high-level language
such as Fortran 9X with a set of functions that let it manipulate data on the GPU via a middleware
library, and to augment the middleware functions with a small number of problem-specific functions
written in CU.

Because the performance of the GPU depends substantially on the thread block size, the number of
multiprocessors employed, and even the position of the elements in the arrays, we introduce the
concept of device variables, whose sizes and allocations on the GPU are chosen to provide high
performance operations. We implement device variables as structures that encapsulate information
about the pointer, size, and other parameters, e.g., the type, dimension, leading dimensions, and
allocation status. Fortran modules allow wrapping of function calls. Overloaded functions suitable
for use with different types, shapes, and optional parameters are developed. Several device
functions, callable via wrappers, are also provided. These are for initializing variables, copying
them, and performing other operations. The NVIDIA-provided CUBLAS/CUFFT functions are also
encapsulated in a convenient overloaded syntax, which avoids bugs due to calling errors.

3. RESULTS & CONCLUSIONS
We accelerated two sample scientific computing applications. In each case, we had the original
Fortran 90 code available, and we translated this original code to run on the GPU. The first
application was from plasma turbulence, using a simplified but relevant 2D model [3]. This is a
pseudospectral code that makes use of the wrapped CUFFT library. Computationally, the most
important part of this code is the evaluation of the nonlinear evolution terms in a time-stepping
loop. Fig. 2 shows the translation of this routine using middleware function calls. Fig. 3 shows
the speedup on this code for N=16, …, 1024. A speedup of about 25 is achieved vis-à-vis the serial
CPU code, executed on an Intel QX6700 CPU, and the expected N^2 log N scaling is seen.

The second application is from the fitting of radial basis functions to scattered data, using an
iterative algorithm [2]. This is representative of many applications in iterative methods that
should see significant speedups. Here a speedup of 662 times over a serial CPU code is seen.

We are continuing to extend this environment, and have recently used it to develop a version of
the FMM on the GPU [4]. In the future, programming of multiple GPUs, and shared computation on
distributed CPUs and GPUs, will be developed.

[1] NVIDIA CUDA Compute Unified Device Architecture Programming Guide, V1.0, NVIDIA Corp. 06/2007.
[2] A. Faul, G. Goodsell, M.J. Powell, “A Krylov subspace algorithm for multiquadric interpolation in many dimensions,” IMA J. Numer. Anal., 25, 1-24, 2005.
[3] B.N. Rogers, W. Dorland, and M. Kotschenreuther, “The Generation and Stability of Zonal Flows in Ion Temperature Gradient Mode Turbulence,” Phys. Review Letters 85, 5536 (2000).
[4] N.A. Gumerov and R. Duraiswami, “Fast Multipole Methods on Graphics Processors,” submitted.

Fig. 1: GPU and CPU growth in speed over the last 6 years.
           Fig. 2: Example of code conversion using Fantalgo’s middleware: Left: original Fortran-90 code; Right: ported code.
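
As a rough, host-only illustration of the device variables described in Section 2, the following Python sketch models a descriptor that records type, shape, leading dimension, and allocation status, together with a single copy routine that dispatches on its argument types in the spirit of a Fortran generic interface. All names (DeviceVariable, dev_alloc, dev_copy) and the padding multiple of 16 are assumptions made for illustration; they are not Fantalgo's API, and the Python list only stands in for GPU memory.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DeviceVariable:
    """Hypothetical descriptor; field names are illustrative only."""
    dtype: str               # element type tag, e.g. "real4"
    shape: Tuple[int, ...]   # logical dimensions (column-major, as in Fortran)
    leading_dim: int         # padded length of the first dimension
    allocated: bool          # allocation status
    buffer: List[float]      # stand-in for the raw device pointer


def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def dev_alloc(dtype: str, shape: Tuple[int, ...], align: int = 16) -> DeviceVariable:
    """Allocate a padded buffer; align=16 is an assumed value, not a measured optimum."""
    lead = round_up(shape[0], align)
    cols = 1
    for extent in shape[1:]:
        cols *= extent
    return DeviceVariable(dtype, shape, lead, True, [0.0] * (lead * cols))


def dev_copy(dst, src):
    """Single entry point for host-to-device and device-to-host copies.

    Host arrays are flat column-major lists of length prod(shape).
    """
    if isinstance(dst, DeviceVariable) and isinstance(src, list):
        rows = dst.shape[0]
        for i, value in enumerate(src):
            col, row = divmod(i, rows)
            dst.buffer[col * dst.leading_dim + row] = value
        return dst
    if isinstance(dst, list) and isinstance(src, DeviceVariable):
        rows = src.shape[0]
        for i in range(len(dst)):
            col, row = divmod(i, rows)
            dst[i] = src.buffer[col * src.leading_dim + row]
        return dst
    raise TypeError("unsupported combination of arguments")
```

The caller never handles the padded layout directly: it passes a logical 3x2 array, and the descriptor carries the leading dimension needed to place each column. Keeping that bookkeeping inside the wrapper is one way such a layer can reduce calling errors.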

Fig. 3: Acceleration of a 2D plasma turbulence code on the GPU using the developed middleware. [Log-log plot of time (s) versus N (equivalent grid N x N) for 2D plasma turbulence simulations (100 time steps of the pseudospectral method), with reference curves y=ax^2 and y=bx^2.]

Fig. 4: Acceleration of a RBF fitting code using the developed middleware. [Log-log plot of time (s) versus number of sources, N, for the FGP'05 algorithm (iterative solution of dense RBF system), with reference curve y=ax^2 and annotation a/b=662.]
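
The quadratic reference curves in the figures and the N^2 log N scaling quoted in Section 3 can be related with a short calculation. The sketch below is illustrative only. The timings are synthetic, generated from assumed prefactors chosen so that their ratio matches the reported a/b = 662, and the function names are hypothetical. It estimates the prefactor c in t ≈ c·model(N) by a log-space average and takes the speedup as the ratio of the CPU and GPU prefactors.

```python
import math


def n2_log_n(n: int) -> float:
    """Cost model N^2 log N for a pseudospectral step on an N x N grid."""
    return n * n * math.log2(n)


def quadratic(n: int) -> float:
    """Cost model N^2, the form of the reference curves y = a x^2."""
    return float(n * n)


def fit_prefactor(sizes, times, model) -> float:
    """Log-space least-squares (geometric mean) estimate of c in times ~ c * model(size)."""
    logs = [math.log(t / model(n)) for n, t in zip(sizes, times)]
    return math.exp(sum(logs) / len(logs))


# Synthetic timings (NOT measured data): the prefactors are assumed values
# chosen so that their ratio equals the a/b = 662 annotated in Fig. 4.
sizes = [1_000, 10_000, 100_000]
cpu_times = [6.62e-7 * quadratic(n) for n in sizes]
gpu_times = [1.0e-9 * quadratic(n) for n in sizes]

a = fit_prefactor(sizes, cpu_times, quadratic)
b = fit_prefactor(sizes, gpu_times, quadratic)
speedup = a / b  # ratio of prefactors on parallel log-log lines

# Under the N^2 log N model, doubling N from 512 to 1024 costs 40/9 times more,
# slightly above the factor of 4 that a pure N^2 law would give.
growth = n2_log_n(1024) / n2_log_n(512)
```

On a log-log plot, two curves with the same power law are parallel lines, so the vertical offset between them is the speedup and does not depend on N.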
