An Experimental Approach to Performance
Measurement of Heterogeneous Parallel
Applications using CUDA
Allen D. Malony1, Scott Biersdorff2, Wyatt Spear2
1Department of Computer and Information Science
2Performance Research Laboratory
University of Oregon
Shangkar Mayanglambam3
3Qualcomm Corporation
Motivation
Heterogeneous parallel systems are highly relevant today
Heterogeneous hardware technology more accessible
Multicore processors (e.g., 4-core, 6-core, 8-core, ...)
Manycore (throughput) accelerators (e.g., Tesla, Fermi)
High-performance engines (e.g., Cell BE, Larrabee)
Special purpose components (e.g., FPGAs)
Performance is the main driving concern
Heterogeneity is an important (the?) path to extreme scale
Heterogeneous software technology required for performance
More sophisticated parallel programming environments
Integrated parallel performance tools
support heterogeneous performance model and perspectives
ICS 2010 Measurement of Heterogeneous Applications using CUDA 2
Implications for Parallel Performance Tools
Current status quo is somewhat comfortable
Mostly homogeneous parallel systems and software
Shared-memory multithreading – OpenMP
Distributed-memory message passing – MPI
Parallel computational models are relatively stable (simple)
Corresponding performance models are relatively tractable
Parallel performance tools can keep up and evolve
Heterogeneity creates richer computational potential
Results in greater performance diversity and complexity
Heterogeneous systems will utilize more sophisticated
programming and runtime environments
Performance tools have to support richer computation models
and more versatile performance perspectives
Heterogeneous Performance Views
Want to create performance views that capture heterogeneous
concurrency and execution behavior
Reflect interactions between heterogeneous components
Capture performance semantics relative to computation model
Assimilate performance for all execution paths for shared view
Existing parallel performance tools are CPU(host)-centric
Event-based sampling (not appropriate for accelerators)
Direct measurement (through instrumentation of events)
What perspective does the host have of other components?
Determines the semantics of the measurement data
Determines assumptions about behavior and interactions
Performance views may have to work with reduced data
Task-based Performance View
Consider the “task” abstraction for GPU accelerator scenario
Host regards external execution as a task
Tasks operate concurrently with
respect to the host
Requires support for tracking
asynchronous execution
Host creates measurement
perspective for external task
Maintains local and remote performance data
Tasks may have limited measurement support
May depend on host for performance data I/O
Performance data might be received from external task
How to create a view of heterogeneous external performance?
CUDA Performance Perspective
CUDA enables programming of kernels for GPU acceleration
GPU acceleration acts as an external task
Performance measurement appears straightforward
Execution model complicates performance measurement
Synchronous and asynchronous operation with respect to host
Overlapping of data transfer and kernel execution
Multiple GPU devices and multiple streams per device
Different acceleration kernels used in parallel application
Multiple application sections
Multiple application threads/processes
See performance in context:
temporal, spatial, (host) thread/process
TAU and TAUcuda
TAU Architecture
TAU performance system
Robust, scalable integrated performance
framework and toolkit
Parallel profiling and tracing
Shared and distributed parallel systems
Open source and portable
TAUcuda
Extension to support CUDA
performance measurement
Goal is to leverage TAU's infrastructure
and analysis capabilities in TAUcuda development
Deliver heterogeneous parallel performance support
TAUcuda Performance Measurement (Version 1)
Build on CUDA event interface
Allow “events” to be placed in streams and processed
events are timestamped by CUDA driver
CUDA driver reports GPU timing in event structure
Events are reported back to CPU when requested
use begin and end events to calculate intervals
CUDA kernel invocations are asynchronous
CPU does not see actual CUDA “end” event
Want to associate TAU event context with CUDA events
Get top of TAU event stack at begin (TAU context)
S. Mayanglambam, A. Malony, M. Sottile, "Performance Measurement of Applications with GPU
Acceleration using CUDA," ParCo 2009, Lyon, France, September 2009.
TAUcuda Performance Measurement (Version 2)
Overcome TAUcuda (v1) deficiencies
Required source code instrumentation
Perspective limited to the event interface
could not see memory transfers or CUDA system execution
CUDA system architecture
Implemented by CUDA libraries
driver and device (cuXXX) libraries
runtime (cudaYYY) library
Tools support (Parallel Nsight (Nexus), CUDA Profiler)
not intended to integrate with other HPC performance tools
TAUcuda (v2) built on experimental Linux CUDA driver
Linux CUDA driver R190.86 supports a callback interface!!!
TAUcuda Architecture
[Figure: TAUcuda architecture diagram, with TAUcuda events and TAU events labeled]
TAU and TAUcuda Performance Events
TAU measures events during execution
Events are made visible as a result of code instrumentation
Records event begin and end for profiling and tracing
TAU events are measured by the CPU when they happen
TAU cannot measure events on the GPU
TAUcuda events are measured by CUDA and the GPU device
TAUcuda events occur asynchronously to TAU events
TAUcuda is integrated with TAU measurement infrastructure
Must transform TAUcuda events into TAU events
Associate TAUcuda events with application CPU operation
samples the TAU context to link to application call site
TAUcuda Instrumentation
Normal application software composition
No performance measurement enabled
TAUcuda Instrumentation
TAU events
Includes only CPU-level instrumentation (TAU events)
TAUcuda Instrumentation
Includes both GPU-level (TAUcuda events) and CPU-level (TAU events) instrumentation
CUDA Linux Driver Library Tools API
Experimental CUDA driver library provides callback support
Exposes all driver routines through callback interface
subscribe to events via cuToolsApi_ETI_Core interface table
Exposes functions to retrieve GPU performance information
TAUcuda intercepts only events of interest in callback handler
API routines
cuToolsApi_CBID_EnterGeneric
cuToolsApi_CBID_ExitGeneric
Measurement (context synchronization, GPU buffer overflow)
cuToolsApi_CBID_ProfileLaunch
cuToolsApi_CBID_ProfileMemory
Call TAU event creation / measurement routines (enter, exit)
CUDA Driver Library Routines Intercepted
Launch
cuLaunch(); cuLaunchGrid();
cuLaunchGridAsync();
Memory transfer
cuMemcpyHtoD(); cuMemcpyHtoDAsync();
cuMemcpy2D(); cuMemcpy2DUnaligned();
cuMemcpy2DAsync(); cuMemcpy3D();
cuMemcpy3DAsync(); cuMemcpyAtoA();
cuMemcpyAtoD(); cuMemcpyAtoH();
cuMemcpyAtoHAsync(); cuMemcpyDtoA();
cuMemcpyDtoD(); cuMemcpyDtoH();
cuMemcpyDtoHAsync(); cuMemcpyHtoA();
cuMemcpyHtoAAsync();
CUDA Kernel Launch and Memory Transfer
cuToolsApi_CBID_EnterGeneric callback occurs for cuXXX()
routines that invoke GPU kernel launch and memory transfer
CUDA system manages these operations and makes
measurements in association with the GPU device
Keeps information in an internal buffer
How to associate "enter" with asynchronous future "exit"?
TAUcuda Event Handler creates a call record:
event name, call ID, operation type, API routine name,
TAU context, CUDA context, GPU device, GPU stream
TAUcuda Event Handler calls into the TAU system to retrieve
current TAU event stack (TAU context) during EnterGeneric
Profile callbacks will return performance data at later time
TAUcuda then generates TAU events (profile or trace)
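One way to picture the association of an "enter" with its asynchronous future "exit" is a table of call records keyed by call ID. The Python sketch below is illustrative only: the class and method names (`CallRecord`, `EventHandler`, `enter_generic`, `profile_callback`) are hypothetical, the record fields simply mirror the list on this slide, and the timestamps are made up. It is not the TAUcuda implementation or the driver's callback API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    # Fields follow the call record listed on the slide.
    event_name: str
    call_id: int
    operation_type: str
    api_routine: str
    tau_context: str
    cuda_context: int
    gpu_device: int
    gpu_stream: int


class EventHandler:
    """Hypothetical handler: stores records at enter, resolves them later."""

    def __init__(self):
        self.pending = {}  # call_id -> CallRecord awaiting performance data
        self.flat_profile = defaultdict(lambda: {"calls": 0, "gpu_time": 0.0})

    def enter_generic(self, record_fields, tau_stack):
        # Sample the current (simulated) TAU event stack at enter time.
        tau_context = tau_stack[-1] if tau_stack else "<none>"
        record = CallRecord(tau_context=tau_context, **record_fields)
        self.pending[record.call_id] = record
        return record

    def profile_callback(self, call_id, gpu_begin, gpu_end):
        # Performance data arrives later; look up the matching enter record.
        record = self.pending.pop(call_id)
        key = (record.gpu_device, record.gpu_stream,
               record.tau_context, record.event_name)
        entry = self.flat_profile[key]
        entry["calls"] += 1
        entry["gpu_time"] += gpu_end - gpu_begin
        return record
```

Because the record is looked up by call ID, profile data can arrive in any order relative to the enter callbacks and still be attributed to the TAU context sampled at enter time.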
CUDA Runtime Library Instrumentation
NVIDIA does not implement callbacks for runtime library
Only provides header files (no source) for the runtime library
Instrument with TAU's library wrapping tool, tau_wrap
Parses header files
Automatically generates a new library (Magic!)
Redefines the library routines of interest
Wrapped routines are instrumented with TAU entry/exit
Original routines called with the appropriate arguments
CUDA runtime library performance measured by TAU
TAU enter and exit events for all cudaYYY()
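The wrapping idea (redefine each routine of interest, record entry and exit, then call the original with the same arguments) can be illustrated by analogy in Python. This is not how tau_wrap works internally, which generates a new library from parsed header files; `wrap_library`, `_instrument`, and the event log format are hypothetical names chosen for this sketch.

```python
import functools
import time
import types


def _instrument(name, original, log):
    """Return a wrapper that logs enter/exit around the original routine."""
    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        log.append(("enter", name))
        try:
            # Original routine is called with the unmodified arguments.
            return original(*args, **kwargs)
        finally:
            log.append(("exit", name, time.perf_counter() - start))
    return wrapper


def wrap_library(library, prefix, log):
    """Build a namespace holding instrumented copies of matching routines."""
    wrapped = types.SimpleNamespace()
    for name in dir(library):
        target = getattr(library, name)
        if name.startswith(prefix) and callable(target):
            setattr(wrapped, name, _instrument(name, target, log))
    return wrapped


# Usage with a stand-in "library" (not the real CUDA runtime).
fake_runtime = types.SimpleNamespace(cudaMalloc=lambda nbytes: nbytes * 2)
events = []
runtime = wrap_library(fake_runtime, "cuda", events)
runtime.cudaMalloc(4)
print(events[0], events[1][:2])
```

Only routines whose names match the prefix are wrapped, which mirrors the slide's point that just the library routines of interest are redefined.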
TAUcuda Profiling and Tracing
Keep a profile or trace for every GPU device stream
Profiling
Calculate flat profile for each kernel and memory transfer
Done at time of Profile callback
Tracing
Must use TAU clock for timestamp
Kernel and memory timestamp reported with GPU clock
Must synchronize CPU and GPU clocks
Save a TAUcuda trace for every GPU device stream
cannot insert into TAU's runtime trace buffer (Why?)
Kernel / memory transfer start/stop are asynchronous
Offline trace merging, clock correction, and translation
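The slides do not say how the clock correction is computed. As one plausible illustration, the Python sketch below assumes a linear relationship between the GPU and CPU clocks, fixed by two synchronization points at which both clocks were read together, and then merges a corrected GPU trace with a CPU trace offline. The function names and the `(timestamp, label)` trace format are hypothetical.

```python
def make_gpu_to_cpu_clock(sync_begin, sync_end):
    """Build a GPU->CPU timestamp mapping from two (cpu_ts, gpu_ts) pairs.

    Assumes the two clocks differ by a constant offset and a constant rate.
    """
    (cpu0, gpu0), (cpu1, gpu1) = sync_begin, sync_end
    if gpu1 == gpu0:
        raise ValueError("synchronization points must differ in GPU time")
    rate = (cpu1 - cpu0) / (gpu1 - gpu0)
    return lambda gpu_ts: cpu0 + (gpu_ts - gpu0) * rate


def merge_traces(cpu_trace, gpu_trace, to_cpu):
    """Translate GPU timestamps to the CPU clock and merge by time."""
    corrected = [(to_cpu(ts), label) for ts, label in gpu_trace]
    return sorted(cpu_trace + corrected)


# Example: GPU clock starts at 0 and runs at half the CPU tick rate.
to_cpu = make_gpu_to_cpu_clock((1000.0, 0.0), (2000.0, 500.0))
cpu_trace = [(1100.0, "enter cuMemcpyHtoD"), (1900.0, "exit cuMemcpyHtoD")]
gpu_trace = [(100.0, "memcpy begin"), (200.0, "memcpy end")]
for timestamp, label in merge_traces(cpu_trace, gpu_trace, to_cpu):
    print(timestamp, label)
```

Doing the merge offline sidesteps the ordering problem noted above: GPU records arrive asynchronously, so they cannot be appended to a time-ordered runtime buffer as they come in.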
Running with TAU / TAUcuda
To run a CUDA application with TAUcuda, all of the
necessary libraries must be dynamically linked
TAUcuda works with unmodified CUDA application binaries
Use scripts for different scenarios:
taucuda profiler.sh / taucuda mpirun.sh (Profiling)
taucuda tracer.sh / taucuda mpirun tracer.sh (Tracing)
TAUcuda produces profiles or traces in the current working
directory in sub-folders to distinguish them from TAU
performance output
TAUcuda profiles are in different metric sub-folders:
gpu_elapsed_time
gpu_memory_transfer
gpu_shared_memory
TAUcuda Experimentation Environments
University of Oregon
Linux workstation
Dual quad core Intel Xeon
GTX 280
GPU cluster (Mist)
Four dual quad core Intel Xeon server nodes
Two NVIDIA S1070 Tesla servers (4 Tesla GPUs per S1070)
Argonne National Laboratory (Eureka)
100 dual quad core NVIDIA Quadro Plex S4
200 Quadro FX5600 (2 per S4)
University of Illinois at Urbana-Champaign
GPU cluster (AC cluster)
32 nodes with one S1070 (4 GPUs per node)
CUDA SDK Transpose (256 x 4096 matrix)
[Figure: CPU profile (cu and cuda events) and GPU profile (kernel)]
CUDA SDK OceanFFT (profile, trace)
[Figure: kernel profile and Jumpshot trace visualizer showing GPU and CPU timelines]
CUDA Linpack Profile (4 processes, 4 GPUs)
Measure performance of heterogeneous parallel applications
GPU-accelerated Linpack benchmark (M. Fatica, NVIDIA)
CUDA Linpack Trace
CUDA memory transfer (white), MPI communication (yellow)
NAMD and TAU / TAUcuda
Demonstrate TAUcuda with a scientific application
NAMD is a molecular dynamics application
Written using Charm++ parallel object-oriented language
Charm++ and NAMD run on large-scale HPC clusters
NAMD has been accelerated with CUDA
TAU integrated in Charm++ (ICPP 2009 paper)
Now apply TAUcuda to observe influence of GPU execution
Observe the effect of CUDA acceleration
Show scaling results for GPU cluster execution
NAMD Profile (4 processes, 4 GPUs)
NAMD GPU Scaling (4–64 GPUs)
Strong scaling experiments on Eureka cluster
Use TAU PerfExplorer to compare
SHOC Stencil2D (512 iterations, 4 CPUxGPU)
Scalable HeterOgeneous Computing benchmark suite
CUDA / OpenCL kernels and microbenchmarks (ORNL)
CUDA memory transfer (white)
HMPP SGEMM (CAPS Entreprise)
[Figure: trace timelines labeled Host Process, Transfer Kernel, and Compute Kernel]
Conclusions
Heterogeneous parallel systems will require parallel
performance tools that integrate performance perspectives
Need to rely on hardware and software support in
heterogeneous components to access performance data
Experimental Linux CUDA driver provided by NVIDIA
facilitates access to CUDA / GPU performance information
TAUcuda merges with TAU (CPU) performance data
TAU/TAUcuda provides powerful scalable heterogeneous
performance measurement and analysis
NVIDIA is incorporating performance tools requirements in
next-generation driver/device libraries
TAUopencl is in development (working prototype)
Support Acknowledgements
Department of Energy (DOE)
Office of Science
ASC/NNSA
Department of Defense (DoD)
HPC Modernization Office (HPCMO)
NSF Software Development for Cyberinfrastructure (SDCI)
Research Centre Juelich
Argonne National Laboratory
Technical University Dresden
ParaTools, Inc.
NVIDIA