CBench: Analyzing Compute Performance for Modern NVIDIA and AMD GPUs
Varun Sampath
CIS 565: GPU Programming and Architecture
Abstract
General-purpose GPU computation is a fast-growing field with a variety of applications. For maximum performance, though, mapping high-level parallel algorithms to vendor hardware requires a solid grasp of both the algorithm’s
computational requirements and the microarchitectural limitations of the GPU. This work explores the performance of high- and low-arithmetic-intensity workloads on the latest NVIDIA and AMD GPU hardware, codenamed Fermi
and Barts, respectively. A summed area table generator and a Black-Scholes option pricer were used as benchmarks to analyze performance for bandwidth-bound and compute-bound algorithms, respectively. The AMD Barts GPU provided a
50% performance boost over Fermi on the compute-bound Black-Scholes workload, whereas Fermi excelled at the more memory-bound summed area table computation.
Hardware: AMD Radeon HD 6870 (Barts) and NVIDIA Tesla C2070 (Fermi)
Method
1. Develop a benchmark with a low compute-to-memory-access ratio
• Summed area tables
• Inclusive scan and transpose operations on off-chip global memory
2. Develop a benchmark with a high compute-to-memory-access ratio
• Black-Scholes option pricing
• Embarrassingly parallel streaming computation
• Almost entirely single-precision floating-point operations
3. Execute on both CUDA and OpenCL platforms
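The two benchmark computations in the list above can be illustrated with a serial CPU reference. This is a minimal sketch, not the CUDA or OpenCL kernels measured on the poster: the function names are hypothetical, and the Black-Scholes parameters (spot `s`, strike `k`, time to expiry `t`, risk-free rate `r`, volatility `v`) follow the textbook closed form for European options, which is assumed to be what the benchmark evaluates.

```python
import math


def inclusive_scan(row):
    """Inclusive prefix sum: out[i] = row[0] + ... + row[i]."""
    out, total = [], 0
    for x in row:
        total += x
        out.append(total)
    return out


def transpose(m):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(col) for col in zip(*m)]


def summed_area_table(img):
    """Summed area table built as scan rows, transpose, scan rows, transpose.

    Nearly all the work is loads and stores with one add per element,
    which is why this pattern is bandwidth-bound on a GPU.
    """
    rows_scanned = [inclusive_scan(r) for r in img]
    cols_scanned = [inclusive_scan(r) for r in transpose(rows_scanned)]
    return transpose(cols_scanned)


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(s, k, t, r, v):
    """Return (call, put) prices for one European option.

    Each option is independent of every other and needs a handful of
    inputs but many floating-point operations, so the workload is
    embarrassingly parallel and compute-bound.
    """
    sqrt_t = math.sqrt(t)
    d1 = (math.log(s / k) + (r + 0.5 * v * v) * t) / (v * sqrt_t)
    d2 = d1 - v * sqrt_t
    discount = math.exp(-r * t)
    call = s * norm_cdf(d1) - k * discount * norm_cdf(d2)
    put = k * discount * norm_cdf(-d2) - s * norm_cdf(-d1)
    return call, put
```

On a GPU, each row scan and each option would map to parallel work-items; the serial loops here only pin down what the kernels are expected to compute.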
Barts Architecture Overview and Lessons:
• Barts contains 14 compute units and 16 stream cores per compute unit
• Each stream core contains 5 scalar processing units
• Each stream core executes a VLIW instruction bundle targeting these processing units
• Black-Scholes saturates 4 or 5 units for a majority of the ALU instruction bundles
• SAT leaves most idle
• VLIW enables 2016 GFLOPS peak performance for Barts
• Groups of 64 VLIW instructions with the same clause type are executed in SIMT bundles called wavefronts
• Global memory accesses can suffer from both channel and bank conflicts
• Local/Shared memory access can suffer from bank conflicts and bandwidth issues trying to satisfy VLIW processing units
Fermi Architecture Overview and Lessons:
• Fermi contains 14 compute units with 32 CUDA cores per unit
• Each compute unit can schedule two SIMT blocks (“warps”) concurrently
• Fermi has a theoretical performance of 1030 GFLOPS
• Fermi has virtual memory and hashes physical addresses, a performance boon against partition camping in SAT
• The addition of an L1 and L2 cache hierarchy relaxes memory coalescing restrictions compared to previous architectures
• OpenCL and CUDA performance is very dependent on GPU compiler optimizations. Both generate PTX files that are executed by the GPU driver
Results
[Figure: SAT OpenCL Performance with work-group size of 256. Execution Time (s) versus Problem Size (256x256 to 2048x2048). Series: Fermi, Barts.]
[Figure: SAT OpenCL and CUDA Performance with work-group size of 256. Execution Time (s) versus Problem Size (256x256 to 4096x4096). Series: CUDA, OpenCL.]
[Figure: Black-Scholes OpenCL Performance with work-group size of 256 and processing of 8 million options. Execution Time (s) versus Number of Work-Items (16384 to 65536). Series: Fermi, Barts.]
[Figure: Black-Scholes OpenCL and CUDA Performance with work-group size of 256 and processing of 8 million options. Execution Time (s) versus Number of Work-items/Threads (16384 to 65536). Series: CUDA, OpenCL.]
References
ADVANCED MICRO DEVICES, INC. 2011. AMD Accelerated Parallel Processing OpenCL, Apr.
HARRIS, M., SENGUPTA, S., AND OWENS, J. D. Parallel Prefix Sum (Scan) with CUDA, vol. 3 of GPU Gems.
NVIDIA CORPORATION. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi.
NVIDIA CORPORATION. 2010. NVIDIA CUDA C Programming Guide, Oct.
NVIDIA CORPORATION. 2010. NVIDIA Tesla Datasheet, July.
PODLOZHNYUK, V. 2007. Black-Scholes option pricing. Tech. rep., June.
RUETSCH, G., AND MICIKEVICIUS, P. 2010. Optimizing matrix transpose in CUDA. Tech. rep., June.
SMITH, R. 2010. AMD’s Radeon HD 6870 & 6850: Renewing competition in the mid-range market. AnandTech (Dec.).
Acknowledgements
Special thanks to Aleksandar Dimitrijevic for providing the AMD Radeon HD 6870 for testing, and to Patrick Cozzi and Jon McCaffrey for their advice and help.
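Note on peak throughput: the 2016 GFLOPS (Barts) and 1030 GFLOPS (Fermi) figures quoted in the architecture overviews can be reproduced from the unit counts given there. The sketch below is a back-of-the-envelope check under two assumptions that do not appear on the poster: each scalar lane retires one multiply-add (2 floating-point operations) per clock, and the clocks are 0.900 GHz for the Radeon HD 6870 and 1.15 GHz for the Tesla C2070. The function name is hypothetical.

```python
def peak_gflops(compute_units, lanes_per_unit, clock_ghz, flops_per_lane_per_clock=2):
    """Theoretical single-precision peak in GFLOPS.

    flops_per_lane_per_clock=2 assumes one multiply-add per lane per
    clock; this is an assumption, not a figure from the poster.
    """
    return compute_units * lanes_per_unit * flops_per_lane_per_clock * clock_ghz


# Barts: 14 compute units x 16 stream cores x 5 scalar units (VLIW5).
# Clock of 0.900 GHz is an assumed value for the Radeon HD 6870.
barts_peak = peak_gflops(14, 16 * 5, 0.900)

# Fermi: 14 compute units x 32 CUDA cores.
# Clock of 1.15 GHz is an assumed value for the Tesla C2070.
fermi_peak = peak_gflops(14, 32, 1.15)

print(f"Barts peak: {barts_peak:.0f} GFLOPS")
print(f"Fermi peak: {fermi_peak:.0f} GFLOPS")
```

The Barts figure assumes all 5 scalar units in every stream core are busy on every cycle, which is consistent with Black-Scholes (4 or 5 units saturated) benefiting from VLIW while SAT does not.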