ieee paper template in a4 (v1)

How to achieve a 47000x GPU/CUDA Speedup on a Matrix Multiplication
Moussa Taifi #1, Yuan Shi *2
Computer Science Department, Temple University, 324 Wachman Hall, 1805 N. Broad Street, Philadelphia, PA 19122, USA

Abstract— Graphics processors have delivered large performance gains over the past few years for computation-intensive scientific simulations. Speed up remains the central measure of improvement in current research, but it must be interpreted with care, because comparing a GPU to a CPU is different from comparing a GPU to a GPU. We show how a speed up of 47000x can be obtained with little effort on the matrix multiplication problem when the GPU-to-GPU comparison is used. Using the steady state timing model, we experiment with parallel matrix multiplication to clarify the distinction between absolute and relative speed up. We also use the timing model to discuss how the delivered memory bandwidth is affected by the number of threads running on the GPUs. The conclusions presented here can help the research community define better metrics for comparing and benchmarking larger parallel applications.

I. INTRODUCTION
The use of GPUs for computation-intensive parallel tasks has seen a sudden and so far sustained rise, driven by the capabilities of current graphics cards [1]. Many studies report considerable improvements over previous implementations of important scientific computations thanks to the CUDA programming model. Fields and applications such as genetic sequencing (3.5x) [2], computational fluid dynamics (8x) [3], list ranking (15x) [4], computational physics (60x) [6] and many others have benefited from this movement. These running time improvements are reported as the relative speed up of the GPU implementation compared to a CPU counterpart. In this paper we argue that care should be taken when reporting such speed ups, and we give an example of how the GPU vs GPU(s) comparison differs from the traditional GPU vs CPU(s) approach. Using the steady state timing model [7], we analyse and experiment with a widely studied parallel task, matrix multiplication [5]. We show how to achieve a 47000x speed up on a GPU, why such a large speed up is at the same time super-linear and sub-linear, and how it was obtained by applying the definition of the absolute speed up [8] against a single GPU thread. The steady state timing model also lets us determine what happens to the GPU memory bandwidth when the number of threads grows to match the size of large matrices.

II. BACKGROUND
In this section we give a brief description of the CUDA/GPU programming model and of the steady state timing model used to analyse the memory bandwidth barrier in our matrix multiplication experiments.

A. The GPU and CUDA
Earlier attempts to exploit graphics processing units faced a software complexity challenge, since developers had to use graphics APIs to convert sequential applications into their parallel counterparts. The CUDA framework [9] lets programmers use familiar constructs to build a set of kernels that run on the GPU. This eases the conversion and portability of GPU code and makes GPU research accessible to a much larger community of parallel computing researchers. The GPU/CUDA model allows massively threaded programs to run on a single graphics card. Through a combination of hardware and software constructs, the model manages a large number of threads, organized into grids, blocks and warps [9], running on a large number of cores.

B. The Steady State Timing Model
The main metric considered in this work is the speed up, which is the ratio of the sequential time to the parallel time. The steady state timing model offers a holistic approach to modeling the behavior of parallel applications [7]. For a shared memory parallel matrix multiplication, this model predicts that the speed up can be represented by:

Fig. 1. Speed up formula for shared memory parallel matrix computation

Figure 1 gives the steady state speed up formula for matrix multiplication on a shared memory architecture such as a GPU. In this formula, n is the data size, P is the number of processing units and Ws is the work done by a single processing unit. Bseq and Bdelivered are the memory bandwidths in the sequential and parallel settings respectively.

III. METHODOLOGY
We run the following two experiments: (1) matrix multiplication on CPU versus GPU, and (2) matrix multiplication on GPU versus GPU. The first is used to check the validity of the performance of the matrix multiplication implementation. The second is an often underrepresented experiment that sheds light on the quality of the reported speed up. According to [8], the absolute speed up is defined by:

Fig. 2. Absolute Speed up formula
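Taking speed up as the ratio of sequential time to parallel time, as in Section II-B, the absolute speed up on P processors can be written as follows. This is a reconstruction from the definitions given in the text, under the assumption that [8] uses the standard form; the symbol S_P is an assumed name.

```latex
S_{P} = \frac{T_{1}}{T_{p}}
```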

This formula assumes that T1 is the running time of the best sequential implementation and that Tp is the best running time of the parallel version on P processors. This contrasts with the relative speed up, where the sequential and parallel implementations run on a heterogeneous set of processors. We perform both experiments: we run a set of matrix multiplications on an Intel Xeon CPU at 2.53 GHz with 11 GB of RAM, and on an NVidia Tesla cluster of two Tesla C1060 cards with a total of 480 cores running at 1.3 GHz and a memory bandwidth of 102 GB/s [10].

IV. PRELIMINARY RESULTS

Fig. 3. The relative speed up (CPU vs GPU) of the matrix multiplication

In Figure 3, the parallel matrix multiplication was run on the Tesla GPU cluster and the sequential version was run on the CPU. The figure shows the speed up of the task on the GPU cards, which reaches 808x compared to its sequential CPU counterpart.

Fig. 4. The speed up (GPU vs GPU (1 thread)) of the matrix multiplication

On the other hand, when the sequential version is run on the GPU itself, the speed up reaches 47000x. This follows from the definition of the absolute speed up: the sequential baseline, a respectably performing sequential matrix multiplication, was run on a single thread in a single block of a single grid on the GPU. The speed up is therefore largely due to the low processing power of a single GPU thread. The 47000x speed up is super-linear when compared to the number of processing cores (480). However, it is not super-linear when compared to the number of threads, because the delivered memory bandwidth is well below the available 102 GB/s. Table 1 shows this: the speed up stays below the total number of threads used but above the constant number of processing cores. The delivered memory bandwidth is calculated using the steady state timing model and gives the following estimates:

Fig. 5. Predicted delivered memory bandwidth for the matrix multiplication

TABLE I
NUMBER OF THREADS USED AND THE DELIVERED MEMORY BANDWIDTH

Matrix size   GPU/GPU speed up   Number of threads used   Estimated delivered memory bandwidth (GB/s)
100           5529.96            9216                     5.23856
200           21695.42           36864                    4.95617
300           30998.07           82944                    2.14275
400           34431.67           160000                   0.98723
500           36110.69           246016                   0.63261
600           40591.58           350464                   0.48206
700           43166.90           473344                   0.36954
800           44860.32           640000                   0.27634
900           45075.00           802816                   0.21579
1000          47319.24           984064                   0.18477
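The claim that the speed up is super-linear in cores but sub-linear in threads can be checked directly against the values of Table 1. The Python sketch below does this with the table values copied verbatim. The names TABLE_ROWS, CORES and ASSUMED_BLOCK_EDGE are illustrative. The sketch also tests an inference that is not stated in the paper: that the reported thread counts match square launches of 16 x 16 thread blocks with floor(n / 16) blocks per side.

```python
# Rows of Table 1: (matrix size n, GPU/GPU speed up, threads used, est. delivered GB/s).
TABLE_ROWS = [
    (100, 5529.96, 9216, 5.23856),
    (200, 21695.42, 36864, 4.95617),
    (300, 30998.07, 82944, 2.14275),
    (400, 34431.67, 160000, 0.98723),
    (500, 36110.69, 246016, 0.63261),
    (600, 40591.58, 350464, 0.48206),
    (700, 43166.90, 473344, 0.36954),
    (800, 44860.32, 640000, 0.27634),
    (900, 45075.00, 802816, 0.21579),
    (1000, 47319.24, 984064, 0.18477),
]

CORES = 480              # two Tesla C1060 cards, as stated in the methodology
ASSUMED_BLOCK_EDGE = 16  # assumption inferred from the thread counts, not from the paper


def per_core(speed_up, cores=CORES):
    """Speed up divided by the core count; above 1 means super-linear in cores."""
    return speed_up / cores


def per_thread(speed_up, threads):
    """Speed up divided by the thread count; below 1 means sub-linear in threads."""
    return speed_up / threads


def assumed_thread_count(n, block_edge=ASSUMED_BLOCK_EDGE):
    """Threads in a square launch with floor(n / block_edge) blocks per side."""
    edge = (n // block_edge) * block_edge
    return edge * edge


if __name__ == "__main__":
    for n, speed_up, threads, _bandwidth in TABLE_ROWS:
        print(
            n,
            round(per_core(speed_up), 2),
            round(per_thread(speed_up, threads), 4),
            assumed_thread_count(n) == threads,
        )
```

For the 1000 x 1000 case, the speed up is about 98.6 times the core count but only about 0.048 times the thread count, which is the sense in which the result is both super-linear and sub-linear.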

Table 1 shows the delivered memory bandwidth obtained from the parallel runs and the number of threads in each set of experiments. The delivered memory bandwidth decreases as the number of threads increases. Memory access patterns and memory contention both contribute to this decline.

V. CONCLUSION
In this paper we contrasted the CPU versus GPU speed up with the GPU versus GPU speed up. The two measures differ in what a sequential run of the program means. We gave an example of a parallel task (e.g. matrix multiplication) that produces enough contrast between the two methods to make the point: because of the crucial architectural differences between the CPU and the GPU, more attention needs to be given to the speed up measure as a central metric in parallel computing research. Using the steady state timing model, we were also able to determine the effect that the total number of processing units has on the delivered memory bandwidth of the GPUs.

VI. FURTHER DIRECTIONS
As the GPU/CUDA computational model gains adoption in the coming years, comparing the performance of different algorithms becomes crucial to advancing the field. We would like to keep experimenting with the GPU/GPU speed up and to find out what issues, complications and insights arise from other important parallel applications (e.g. parallel tree search with backtracking on the GPU/CUDA).

ACKNOWLEDGMENT
We would like to thank Dr. Yuan Shi for his crucial insights on the steady state timing model, and William Leung, Reeann Zhang, Matt Thauberger, Bohryoung Tsao and James Huang from Amax Corporation for their gracious gift of computing time on the dedicated Tesla C1060 cluster.

REFERENCES
[1] (2009) Nvidia corporation. [Online]. Available: http://www.
[2] L. Ligowski, W. Rudnicki, "An efficient implementation of Smith Waterman algorithm on GPU using CUDA for massively parallel scanning of sequence databases," in Proc. HiCOMB 2009. In press.
[3] J. Cohen, M. J. Molemaker, "A fast double precision CFD code using CUDA," 2009. [Online]. Available: apers/DoublePrecision-CFD-Cohen-parCFD09.pdf
[4] M. Suhail Rehman, K. Kothapalli, P. J. Narayanan, "Fast and Scalable List Ranking on the GPU," in Proc. 23rd International Conference on Supercomputing (ICS), New York, USA, June 2009. In press.
[5] M. Marqués, G. Quintana-Ortí, E. S. Quintana-Ortí, and R. van de Geijn, "Using graphics processors to accelerate the solution of out of core linear systems," in Proc. 8th IEEE International Symposium on Parallel and Distributed Computing, Lisbon (Portugal), 2009. In press.
[6] T. Preis, P. Virnau, W. Paul, J. J. Schneider, "GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model," Journal of Computational Physics, vol. 228, no. 12, 1 July 2009, pp. 4468-4477.
[7] Y. Shi, "Parallel Program Scalability Analysis," IASTED International Conference on Parallel and Distributed Computing, October 1997, pp. 451-456.
[8] L. R. Scott, T. Clark, B. Bagheri, Scientific Parallel Computing. Princeton, USA: Princeton University Press, 2005.
[9] Nvidia Cuda Programming Guide. [Online]. Available: /NVIDIA_CUDA_Programming_Guide_2.2.1.pdf
[10] (2009) Nvidia Tesla card specification. [Online]. Available:
