Middleware for programming NVIDIA GPUs from Fortran 9X

Nail A. Gumerov, UMIACS, University of Maryland & Fantalgo, LLC, Elkridge, MD 21075
Ramani Duraiswami, CS & UMIACS, University of Maryland, College Park & Fantalgo, LLC, Elkridge, MD 21075
William D. Dorland, Physics & IREAP, University of Maryland, College Park, MD 20742
1. INTRODUCTION
There is a revolution taking place in computing, with commodity graphics processors capable of several hundred gigaflops of performance becoming available alongside multicore CPUs. Using these, it should be possible to build relatively inexpensive teraflop workstations. These machines, however, are not easily programmed, and scientific computations may not map naturally onto them. Over the past year the major graphics vendors have realized that there is a market for these devices in HPC and have started producing tools that allow them to be programmed from high-level languages: ATI's Close-To-Metal (CTM) and NVIDIA's CUDA [1]. We are more familiar with the latter architecture, for which both a beta programming environment and publicly available graphics boards with impressive speedups over the previous generation exist, and we restrict ourselves to it here.

2. GPU PROGRAMMING MIDDLEWARE
CUDA views the GPU as a set of multiprocessors, each with some local cache memory (of various types), all able to access global device memory. Because of this architecture, applications that map well onto the shared multiprocessor parallel environment, effectively use the available number of processors, and work mostly out of cache memory can see significant speedups.

While the NVCC compiler provided with CUDA does compile host C code, it is mostly aimed at producing software that runs on the GPU. This is useful for developing small programs to run on the GPU, but when GPUs are used for high performance computing they are more properly viewed as compute coprocessors, to which data from a large program running on the CPU host/cluster is farmed out. Since host-GPU communication is relatively slow, back-and-forth data exchange should be avoided. Our viewpoint of GPU programming is therefore to provide a high-level language such as Fortran 9X with a set of functions that give it the ability to manipulate data on the GPU via a middleware library, and to augment the middleware functions with a small number of problem-specific functions written in CU.

Because the performance of the GPU depends substantially on the thread block size, the number of multiprocessors employed, and even the position of elements within arrays, we introduce the concept of device variables, whose sizes and allocations on the GPU are chosen to provide high-performance operations. We implement device variables as structures that encapsulate information about the pointer, size, and other parameters, e.g., the type, dimension, leading dimensions, and allocation status. Fortran modules allow wrapping of the function calls, and overloaded functions suitable for use with different types, shapes, and optional parameters have been developed. Several device functions, callable via wrappers, are also provided, e.g., for initializing variables, copying them, and performing other operations. The NVIDIA-provided CUBLAS/CUFFT functions are also encapsulated in a convenient overloaded syntax, which avoids bugs due to calling errors.

3. RESULTS & CONCLUSIONS
We accelerated two sample scientific computing applications. In each case we had original Fortran 90 code available, and we translated this code to run on the GPU. The first application is from plasma turbulence, using a simplified but relevant 2D model [3]. This is a pseudospectral code that makes use of the wrapped CUFFT library. Computationally, the most important part of this code is the evaluation of the nonlinear evolution terms in a time-stepping loop. Fig. 2 shows the translation of this routine using middleware function calls, and Fig. 3 shows the speedup of this code for N = 16, ..., 1024. A speedup of about 25 is achieved vis-à-vis the serial CPU code executed on an Intel QX6700 CPU, and the expected N² log N scaling is observed.

The second application is the fitting of radial basis functions to scattered data using an iterative algorithm [2]. It is representative of many applications of iterative methods that should see significant speedups. Here a speedup of 662 times over the serial CPU code is observed.

We are continuing to extend this environment and have recently used it to develop a version of the FMM on the GPU [4]. In the future, programming of multiple GPUs and shared computation on distributed CPUs and GPUs will be developed.

REFERENCES
[1] NVIDIA CUDA Compute Unified Device Architecture Programming Guide, V1.0, NVIDIA Corp., 06/2007.
[2] A. Faul, G. Goodsell, and M.J. Powell, "A Krylov subspace algorithm for multiquadric interpolation in many dimensions," IMA J. Numer. Anal., 25, 1-24, 2005.
[3] B.N. Rogers, W. Dorland, and M. Kotschenreuther, "The Generation and Stability of Zonal Flows in Ion Temperature Gradient Mode Turbulence," Phys. Rev. Lett. 85, 5536 (2000).
[4] N.A. Gumerov and R. Duraiswami, "Fast Multipole Methods on Graphics Processors," submitted.

Fig. 1: GPU and CPU growth in speed over the last 6 years.
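The device-variable concept of Section 2 can be sketched in C, the host-side analogue of the Fortran derived type: a descriptor that carries the device pointer together with the metadata the middleware needs. All names here are illustrative, not the middleware's actual API, and plain malloc stands in for a CUDA device allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch of a "device variable" descriptor: the pointer
   plus the type, rank, extents, leading dimension, and allocation
   status that the middleware uses to pick efficient layouts.
   malloc stands in for cudaMalloc in this host-only illustration. */
typedef enum { DT_REAL4, DT_REAL8, DT_COMPLEX8 } dtype_t;

typedef struct {
    void   *ptr;        /* device memory pointer                 */
    dtype_t type;       /* element type                          */
    int     rank;       /* number of dimensions (2 in this sketch)*/
    size_t  n[2];       /* logical extents                       */
    size_t  ld;         /* leading dimension (may be padded)     */
    int     allocated;  /* allocation status                     */
} devvar_t;

static size_t elem_size(dtype_t t) {
    switch (t) {
    case DT_REAL4:    return 4;
    case DT_REAL8:    return 8;
    case DT_COMPLEX8: return 16;
    }
    return 0;
}

/* Allocate a 2-D device variable, padding the leading dimension to a
   multiple of 16 elements so rows stay aligned for coalesced access. */
static int devvar_alloc2(devvar_t *v, dtype_t t, size_t n1, size_t n2) {
    v->type = t; v->rank = 2; v->n[0] = n1; v->n[1] = n2;
    v->ld = (n1 + 15) / 16 * 16;               /* padded leading dim */
    v->ptr = malloc(v->ld * n2 * elem_size(t));
    v->allocated = (v->ptr != NULL);
    return v->allocated ? 0 : -1;
}

static void devvar_free(devvar_t *v) {
    if (v->allocated) { free(v->ptr); v->allocated = 0; v->ptr = NULL; }
}
```

In the Fortran middleware the same bookkeeping lives in a derived type inside a module, and generic interfaces dispatch on the descriptor's type and shape so that a single overloaded call name covers all variants.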
Fig. 2: Example of code conversion using Fantalgo’s middleware: Left: original Fortran-90 code; Right: ported code.
Fig. 3: Acceleration of a 2D plasma turbulence code (100 time steps of the pseudospectral method) on the GPU using the developed middleware. Horizontal axis: N (equivalent grid N×N); the y = ax² and y = bx² reference lines show the expected N² trend.
Fig. 4: Acceleration of an RBF fitting code (FGP'05 algorithm: iterative solution of a dense RBF system) using the developed middleware. Horizontal axis: number of sources, N.