Middleware for Programming NVIDIA GPUs from Fortran 9X

Nail A. Gumerov, Fantalgo, LLC, 7496 Merrymaker Way, Elkridge, MD 21075, email@example.com
Ramani Duraiswami, Fantalgo, LLC, Elkridge, MD 21075, firstname.lastname@example.org
William D. Dorland, University of Maryland, Physics & IREAP, College Park, MD 20742, email@example.com
www.fantalgo.com

1. INTRODUCTION
A revolution is taking place in computing, with commodity graphics processors capable of several hundred gigaflops and multicore CPUs becoming widely available. Using these, it should be possible to build relatively inexpensive teraflop workstations. These machines, however, are not easily programmed, and scientific computations may not map well onto them. Over the past year, the major graphics vendors have recognized a market for their hardware in high-performance computing and have begun producing tools for programming it from high-level languages: ATI's Close-To-Metal (CTM) and NVIDIA's CUDA [1]. We are more familiar with the latter architecture, for which both a beta programming environment and publicly available graphics boards with impressive speedups over the previous generation exist, and we restrict ourselves to it.

2. GPU PROGRAMMING MIDDLEWARE
CUDA views the GPU as a set of multiprocessors, each with some local cache memory (of various types), all able to access global device memory. Because of this architecture, applications that map onto the shared-memory multiprocessor environment, use the available processors effectively, and work mostly out of cache memory can see significant speedups.

While the NVCC compiler provided with CUDA does compile host C code, it is mostly aimed at producing software that runs on the GPU. This is useful for developing small programs, but when GPUs are used for high-performance computing they are more properly viewed as compute coprocessors, to which data from a large program running on the CPU host or cluster is farmed out. Since host-GPU communication is relatively slow, back-and-forth data exchange should be avoided. Our viewpoint of GPU programming is instead to equip a high-level language such as Fortran 9X with a set of functions that manipulate data on the GPU via a middleware library, augmented by a small number of problem-specific functions written in CU.

Because GPU performance depends substantially on the thread-block size, the number of multiprocessors employed, and even the position of elements in arrays, we introduce the concept of device variables, which are sized and allocated on the GPU so as to provide high-performance operations. Device variables are implemented as structures that encapsulate the pointer, the size, and other parameters, e.g., the type, dimension, leading dimensions, and allocation status. Fortran modules allow wrapping of the function calls, and overloaded functions suitable for use with different types, shapes, and optional parameters have been developed. Several device functions, callable via wrappers, are also provided for initializing variables, copying them, and performing other operations. The NVIDIA-provided CUBLAS/CUFFT functions are likewise encapsulated in a convenient overloaded syntax, which avoids bugs due to calling errors.

3. RESULTS & CONCLUSIONS
We accelerated two sample scientific computing applications. In each case we had original Fortran 90 code available and translated it to run on the GPU. The first application, from plasma turbulence, uses a simplified but relevant 2D model [3]. It is a pseudospectral code that makes use of the wrapped CUFFT library; computationally, its most important part evaluates the nonlinear evolution terms in a time-stepping loop. Fig. 2 shows the translation of this routine using middleware function calls, and Fig. 3 shows the speedup of this code for N = 16, …, 1024. A speedup of about 25 is achieved over the serial CPU code executed on an Intel QX6700 processor, and the expected N² log N scaling is seen.

The second application fits radial basis functions to scattered data using an iterative algorithm [2]. It is representative of many applications of iterative methods that should see significant speedups; here a striking speedup of 662 times over the serial CPU code is seen (Fig. 4).

We are continuing to extend this environment, and have recently used it to develop a version of the fast multipole method (FMM) on the GPU [4]. In the future, programming of multiple GPUs, and shared computation on distributed CPUs and GPUs, will be developed.

REFERENCES
[1] NVIDIA CUDA Compute Unified Device Architecture Programming Guide, V1.0, NVIDIA Corp., 06/2007.
[2] A. Faul, G. Goodsell, and M.J. Powell, "A Krylov subspace algorithm for multiquadric interpolation in many dimensions," IMA J. Numer. Anal., 25, 1-24, 2005.
[3] B.N. Rogers, W. Dorland, and M. Kotschenreuther, "The generation and stability of zonal flows in ion temperature gradient mode turbulence," Phys. Rev. Lett. 85, 5536 (2000).
[4] N.A. Gumerov and R. Duraiswami, "Fast multipole methods on graphics processors," submitted.

Fig. 1: GPU and CPU growth in speed over the last 6 years.
Fig. 2: Example of code conversion using Fantalgo's middleware. Left: original Fortran-90 code; right: ported code.
Fig. 3: Acceleration of a 2D plasma turbulence code on the GPU using the developed middleware (100 time steps of the pseudospectral method; CPU and GPU time in s vs. N for an equivalent N×N grid; speedup a/b = 25).
Fig. 4: Acceleration of an RBF fitting code using the developed middleware (iterative solution of the dense RBF system by the FGP'05 algorithm; CPU and GPU time in s vs. number of sources N; speedup a/b = 662).
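To make the device-variable idea concrete, the descriptor structure described in Section 2 might look, in spirit, like the following plain-C sketch. This is not the middleware's actual API (which is Fortran 90 with CUDA allocations); all names here are hypothetical, and the device allocation is replaced by host `malloc` so the sketch runs anywhere. The padded leading dimension illustrates the point that element placement in arrays affects GPU performance.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical element types for a device variable. */
typedef enum { DV_REAL4, DV_REAL8, DV_COMPLEX8 } dv_type;

/* Sketch of a device-variable descriptor: pointer, type, shape,
 * leading dimension, and allocation status, as in the text. */
typedef struct {
    void   *ptr;       /* pointer to (device) memory              */
    dv_type type;      /* element type                            */
    int     rank;      /* number of dimensions                    */
    size_t  dims[3];   /* extents; dims[0] is the logical extent  */
    size_t  lead;      /* padded leading dimension for alignment  */
    int     allocated; /* allocation status flag                  */
} device_var;

static size_t dv_elem_size(dv_type t) {
    switch (t) {
    case DV_REAL4: return 4;
    case DV_REAL8: return 8;
    default:       return 8;  /* complex = 2 x real4 */
    }
}

/* Allocate a rank-2 device variable, rounding the leading dimension
 * up to a multiple of 16 (a stand-in for a thread-block-friendly
 * pitch; cudaMalloc/cudaMallocPitch would be used in reality). */
device_var dv_alloc2(dv_type t, size_t n1, size_t n2) {
    device_var v;
    memset(&v, 0, sizeof v);
    v.type = t; v.rank = 2;
    v.dims[0] = n1; v.dims[1] = n2;
    v.lead = (n1 + 15) / 16 * 16;
    v.ptr = malloc(v.lead * n2 * dv_elem_size(t));
    v.allocated = (v.ptr != NULL);
    return v;
}

void dv_free(device_var *v) {
    if (v->allocated) { free(v->ptr); v->ptr = NULL; v->allocated = 0; }
}
```

Encapsulating the pointer together with its shape and status is what lets the Fortran wrappers check sizes and avoid the calling errors mentioned in the text.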
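The overloaded-wrapper mechanism can also be sketched. In Fortran 90 this is done with generic interfaces, which dispatch a single name to the type-correct specific routine; the C11 `_Generic` sketch below shows the same idea. The routine names and the `dv_copy` interface are illustrative only, not the actual middleware API.

```c
/* Type-specific copy routines; a generic name routes to the right one,
 * so a single entry point cannot be called with mismatched types. */
static const char *copy_r4(float *d, const float *s, int n) {
    for (int i = 0; i < n; ++i) d[i] = s[i];
    return "real4 path";
}

static const char *copy_r8(double *d, const double *s, int n) {
    for (int i = 0; i < n; ++i) d[i] = s[i];
    return "real8 path";
}

/* Generic dispatch on the destination pointer's type, analogous to a
 * Fortran INTERFACE block with two MODULE PROCEDUREs. */
#define dv_copy(dst, src, n) _Generic((dst), \
    float *:  copy_r4,                       \
    double *: copy_r8)((dst), (src), (n))
```

Calling `dv_copy` with a `float*` destination and a `double*` source fails to compile, which is exactly the class of calling error the wrapped CUBLAS/CUFFT syntax is meant to rule out.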
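Finally, the coprocessor usage pattern argued for in Section 2, with slow host-GPU exchange kept out of the inner loop, can be sketched as below. The transfers are mocked with counters so the pattern is checkable without a GPU; `host_to_dev`/`dev_to_host` stand in for `cudaMemcpy`, and the function names are hypothetical.

```c
/* Mocked transfer counter standing in for host<->device traffic. */
static int n_transfers = 0;

static void host_to_dev(void)    { ++n_transfers; } /* cudaMemcpyHostToDevice */
static void dev_to_host(void)    { ++n_transfers; } /* cudaMemcpyDeviceToHost */
static void step_on_device(void) { /* FFTs and nonlinear terms stay on GPU */ }

/* Upload once, run the whole time-stepping loop on device-resident
 * data, download once: the structure used for the pseudospectral code. */
int run_simulation(int nsteps) {
    host_to_dev();                 /* upload initial fields once       */
    for (int i = 0; i < nsteps; ++i)
        step_on_device();          /* no host round trips in the loop  */
    dev_to_host();                 /* download the final result once   */
    return n_transfers;
}
```

Keeping the per-step work entirely on device-resident variables is what makes speedups like the factor of 25 reported for the turbulence code attainable despite slow host-GPU links.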