CBench: Analyzing Compute Performance for Modern NVIDIA and AMD GPUs

Varun Sampath

CIS 565: GPU Programming and Architecture

Abstract

General-purpose GPU computation is a fast-growing field with a variety of applications. For maximum performance, though, mapping high-level parallel algorithms to vendor hardware requires a solid grasp of both the algorithm’s computational requirements and the microarchitectural limitations of the GPU. This work explores the performance of high and low arithmetic intensity workloads on the latest NVIDIA and AMD GPU hardware, codenamed Fermi and Barts, respectively. A summed area table generator and a Black-Scholes option pricer were used as benchmarks to analyze performance for bandwidth- and compute-bound algorithms, respectively. The AMD Barts GPU provided a 50% performance boost over Fermi on the compute-bound Black-Scholes workload, whereas Fermi excelled at the more memory-bound summed area table computation.





Hardware: AMD Radeon HD 6870 (Barts) and NVIDIA Tesla C2070 (Fermi)

Method

1. Develop a benchmark with low compute/memory access ratio
   • Summed Area Tables
   • inclusive scan and transpose operations on off-chip global memory
2. Develop a benchmark with high compute/memory access ratio
   • Black-Scholes option pricing
   • embarrassingly parallel streaming computation
   • Almost all single-precision floating point operations
3. Execute on both CUDA and OpenCL platforms
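The two benchmarks in the method above can be sketched on the CPU to show what each kernel computes. This is a minimal illustration in plain Python, not the CUDA or OpenCL code that was benchmarked. The function names and the scan, transpose, scan, transpose ordering are assumptions chosen for clarity. The pricer uses the standard closed-form Black-Scholes formula for European options.

```python
import math


def inclusive_scan_rows(matrix):
    """Replace every row with its inclusive prefix sum (running total)."""
    scanned = []
    for row in matrix:
        total = 0
        out = []
        for value in row:
            total += value
            out.append(total)
        scanned.append(out)
    return scanned


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def summed_area_table(image):
    """Scan the rows, transpose, scan again (the original columns), transpose back."""
    return transpose(inclusive_scan_rows(transpose(inclusive_scan_rows(image))))


def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(spot, strike, years, rate, volatility):
    """Return (call, put) prices of a European option; one independent task per option."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * volatility ** 2) * years) / (volatility * sqrt_t)
    d2 = d1 - volatility * sqrt_t
    discount = math.exp(-rate * years)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


if __name__ == "__main__":
    print(summed_area_table([[1, 2], [3, 4]]))
    print(black_scholes(100.0, 100.0, 1.0, 0.05, 0.2))
```

The summed area table does roughly one addition per element read or written, which is why it stresses memory bandwidth. Each option price needs a logarithm, an exponential, a square root and several error-function evaluations for a handful of inputs, so it stresses arithmetic throughput.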





Results

[Figure: SAT OpenCL Performance with work-group size of 256. Execution Time (s) versus Problem Size (256x256 to 2048x2048); series: Fermi, Barts.]

[Figure: SAT OpenCL and CUDA Performance with work-group size of 256. Execution Time (s) versus Problem Size (256x256 to 4096x4096); series: CUDA, OpenCL.]

[Figure: Black-Scholes OpenCL Performance with work-group size of 256 and processing of 8 million options. Execution Time (s) versus Number of Work-Items (16384 to 65536); series: Fermi, Barts.]

[Figure: Black-Scholes OpenCL and CUDA Performance with work-group size of 256 and processing of 8 million options. Execution Time (s) versus Number of Work-items/Threads (16384 to 65536); series: CUDA, OpenCL.]

Architecture Overview and Lessons: Barts

• Barts contains 14 compute units and 16 stream cores per compute unit
• Each stream core contains 5 scalar processing units
• Each stream core executes a VLIW instruction bundle targeting these processing units
   • Black-Scholes saturates 4 or 5 units for a majority of the ALU instruction bundles
   • SAT leaves most idle
• VLIW enables 2016 GFLOPS peak performance for Barts
• Groups of 64 VLIW instructions with the same clause type are executed in SIMT bundles called wavefronts
• Global memory accesses can suffer from both channel and bank conflicts
• Local/Shared memory access can suffer from bank conflicts and bandwidth issues trying to satisfy VLIW processing units

Architecture Overview and Lessons: Fermi

• Fermi contains 14 compute units with 32 CUDA cores per unit
• Each compute unit can schedule two SIMT blocks (“warps”) concurrently
• Fermi has a theoretical performance of 1030 GFLOPS
• Fermi has virtual memory and hashes physical addresses, a performance boon against partition camping in SAT
• The addition of an L1 and L2 cache hierarchy relaxes memory coalescing restrictions compared to previous architectures
• OpenCL and CUDA performance is very dependent on GPU compiler optimizations. Both generate PTX files that are executed by the GPU driver

References

ADVANCED MICRO DEVICES, INC. 2011. AMD Accelerated Parallel Processing OpenCL, Apr.
HARRIS, M., SENGUPTA, S., AND OWENS, J. D. Parallel Prefix Sum (Scan) with CUDA, vol. 3 of GPU Gems.
NVIDIA CORPORATION. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi.
NVIDIA CORPORATION. 2010. NVIDIA CUDA C Programming Guide, Oct.
NVIDIA CORPORATION. 2010. NVIDIA Tesla Datasheet, July.
PODLOZHNYUK, V. 2007. Black-Scholes option pricing. Tech. rep., June.
RUETSCH, G., AND MICIKEVICIUS, P. 2010. Optimizing matrix transpose in CUDA. Tech. rep., June.
SMITH, R. 2010. AMD’s Radeon HD 6870 & 6850: Renewing competition in the mid-range market. AnandTech (Dec.).

Acknowledgements

Special thanks to Aleksandar Dimitrijevic for providing the AMD Radeon HD 6870 for testing, and to Patrick Cozzi and Jon McCaffrey for their advice and help.
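As a cross-check on the architecture overview, the two peak throughput figures quoted there (2016 GFLOPS for Barts, 1030 GFLOPS for Fermi) can be reproduced from the unit counts on the poster. The sketch below rests on two assumptions that the poster does not state: each lane retires one multiply-add (2 floating-point operations) per cycle, and the clocks are 0.900 GHz for the Radeon HD 6870 and 1.150 GHz for the Tesla C2070, taken from the vendors' public specifications. The helper name `peak_gflops` is hypothetical.

```python
def peak_gflops(compute_units, lanes_per_unit, clock_ghz, flops_per_lane_per_cycle=2):
    """Peak single-precision GFLOPS = units * lanes * flops per lane per cycle * clock (GHz)."""
    return compute_units * lanes_per_unit * flops_per_lane_per_cycle * clock_ghz


# Barts: 14 compute units, 16 stream cores each, 5 scalar units per stream core.
# The 0.900 GHz clock is an assumption from public specifications.
barts = peak_gflops(compute_units=14, lanes_per_unit=16 * 5, clock_ghz=0.900)

# Fermi: 14 compute units with 32 CUDA cores each.
# The 1.150 GHz clock is an assumption from public specifications.
fermi = peak_gflops(compute_units=14, lanes_per_unit=32, clock_ghz=1.150)

if __name__ == "__main__":
    print(f"Barts peak: {barts:.0f} GFLOPS")
    print(f"Fermi peak: {fermi:.0f} GFLOPS")
```

The Barts figure is only reachable when all 5 scalar units of every stream core are kept busy. That is consistent with the observation that Black-Scholes fills 4 or 5 units per bundle while SAT leaves most of them idle.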


