An Experimental Approach to Performance Measurement of Heterogeneous Parallel Applications using CUDA

Allen D. Malony (1), Scott Biersdorff (2), Wyatt Spear (2), Shangkar Mayanglambam (3)

(1) Department of Computer and Information Science, University of Oregon
(2) Performance Research Laboratory, University of Oregon
(3) Qualcomm Corporation

Motivation

- Heterogeneous parallel systems are highly relevant today
- Heterogeneous hardware technology is more accessible
  - Multicore processors (e.g., 4-core, 6-core, 8-core, ...)
  - Manycore (throughput) accelerators (e.g., Tesla, Fermi)
  - High-performance engines (e.g., Cell BE, Larrabee)
  - Special-purpose components (e.g., FPGAs)
- Performance is the main driving concern
  - Heterogeneity is an important (the?) path to extreme scale
- Heterogeneous software technology is required for performance
  - More sophisticated parallel programming environments
  - Integrated parallel performance tools
    - Support heterogeneous performance models and perspectives

ICS 2010 Measurement of Heterogeneous Applications using CUDA 2

Implications for Parallel Performance Tools

- The status quo is somewhat comfortable
  - Mostly homogeneous parallel systems and software
    - Shared-memory multithreading – OpenMP
    - Distributed-memory message passing – MPI
  - Parallel computational models are relatively stable (simple)
  - Corresponding performance models are relatively tractable
  - Parallel performance tools can keep up and evolve
- Heterogeneity creates richer computational potential
  - Results in greater performance diversity and complexity
  - Heterogeneous systems will utilize more sophisticated programming and runtime environments
  - Performance tools have to support richer computation models and more versatile performance perspectives


Heterogeneous Performance Views

- Want to create performance views that capture heterogeneous concurrency and execution behavior
  - Reflect interactions between heterogeneous components
  - Capture performance semantics relative to computation model
  - Assimilate performance for all execution paths for shared view
- Existing parallel performance tools are CPU (host)-centric
  - Event-based sampling (not appropriate for accelerators)
  - Direct measurement (through instrumentation of events)
- What perspective does the host have of other components?
  - Determines the semantics of the measurement data
  - Determines assumptions about behavior and interactions
- Performance views may have to work with reduced data


Task-based Performance View

- Consider the “task” abstraction for the GPU accelerator scenario
- Host regards external execution as a task
  - Tasks operate concurrently with respect to the host
  - Requires support for tracking asynchronous execution
- Host creates measurement perspective for external task
  - Maintains local and remote performance data
- Tasks may have limited measurement support
  - May depend on host for performance data I/O
  - Performance data might be received from external task
- How to create a view of heterogeneous external performance?


CUDA Performance Perspective

- CUDA enables programming of kernels for GPU acceleration
  - GPU acceleration acts as an external task
  - Performance measurement appears straightforward
- Execution model complicates performance measurement
  - Synchronous and asynchronous operation with respect to host
  - Overlapping of data transfer and kernel execution
  - Multiple GPU devices and multiple streams per device
- Different acceleration kernels used in parallel application
  - Multiple application sections
  - Multiple application threads/processes
- See performance in context:
  - Temporal, spatial, (host) thread/process


TAU and TAUcuda

[Figure: TAU Architecture]

- TAU performance system
  - Robust, scalable integrated performance framework and toolkit
  - Parallel profiling and tracing
  - Shared and distributed parallel systems
  - Open source and portable
- TAUcuda
  - Extension to support CUDA performance measurement
  - Goal is to leverage TAU's infrastructure and analysis capabilities in TAUcuda development
  - Deliver heterogeneous parallel performance support




TAUcuda Performance Measurement (Version 1)

- Build on CUDA event interface
  - Allow “events” to be placed in streams and processed
    - Events are timestamped by CUDA driver
  - CUDA driver reports GPU timing in event structure
  - Events are reported back to CPU when requested
    - Use begin and end events to calculate intervals
- CUDA kernel invocations are asynchronous
  - CPU does not see actual CUDA “end” event
- Want to associate TAU event context with CUDA events
  - Get top of TAU event stack at begin (TAU context)

S. Mayanglambam, A. Malony, M. Sottile, "Performance Measurement of Applications with GPU Acceleration using CUDA," ParCo 2009, Lyon, France, September 2009.



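The begin/end event scheme can be illustrated with a small host-side model. This is a plain-Python sketch, not CUDA code: `Event`, `Stream`, `process`, and `interval` are hypothetical names standing in for driver behavior, and the timestamps are made up. It is only meant to show why an interval becomes available after the stream has been processed, and how the TAU context sampled at the begin event can travel with the measurement.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Event:
    """Hypothetical stand-in for an event placed in a stream."""
    label: str
    timestamp: Optional[float] = None  # filled in later by the modeled driver


@dataclass
class Stream:
    """Hypothetical stream model: events are queued, then timestamped in order."""
    pending: List[Event] = field(default_factory=list)

    def record(self, label: str) -> Event:
        # Returns immediately, like an asynchronous call on the host.
        event = Event(label)
        self.pending.append(event)
        return event

    def process(self, gpu_times: List[float]) -> None:
        # Model the driver timestamping each queued event as the GPU reaches it.
        for event, t in zip(self.pending, gpu_times):
            event.timestamp = t
        self.pending.clear()


def interval(begin: Event, end: Event) -> Optional[float]:
    """Elapsed GPU time, or None while either event is still unprocessed."""
    if begin.timestamp is None or end.timestamp is None:
        return None
    return end.timestamp - begin.timestamp


# The TAU context (top of the TAU event stack) is sampled at the begin event.
tau_stack = ["main", "solve"]
stream = Stream()
begin = stream.record("kernel begin")
context_at_begin = tau_stack[-1]
end = stream.record("kernel end")

before = interval(begin, end)      # events have not been processed yet
stream.process([10.0, 12.5])       # made-up GPU timestamps
after = interval(begin, end)
measurement: Tuple[str, Optional[float]] = (context_at_begin, after)
```

The host-side query is deliberately separate from `record`, mirroring the point that events are reported back to the CPU only when requested.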

TAUcuda Performance Measurement (Version 2)

- Overcome TAUcuda (v1) deficiencies
  - Required source code instrumentation
  - Perspective limited to the event interface
    - Could not see memory transfer or CUDA system execution
- CUDA system architecture
  - Implemented by CUDA libraries
    - Driver and device (cuXXX) libraries
    - Runtime (cudaYYY) library
  - Tools support (Parallel Nsight (Nexus), CUDA Profiler)
    - Not intended to integrate with other HPC performance tools
- TAUcuda (v2) built on experimental Linux CUDA driver
  - Linux CUDA driver R190.86 supports a callback interface!




TAUcuda Architecture

[Figure: TAUcuda architecture diagram, with labels "TAUcuda events" and "TAU events"]










TAU and TAUcuda Performance Events

- TAU measures events during execution
  - Events are made visible as a result of code instrumentation
  - Records event begin and end for profiling and tracing
  - TAU events are measured by the CPU when they happen
- TAU cannot measure events on the GPU
  - TAUcuda events are measured by CUDA and the GPU device
  - TAUcuda events occur asynchronously to TAU events
- TAUcuda is integrated with TAU measurement infrastructure
  - Must transform TAUcuda events into TAU events
  - Associate TAUcuda events with application CPU operation
    - Samples the TAU context to link to application call site






TAUcuda Instrumentation

- Normal application software composition
  - No performance measurement enabled








TAUcuda Instrumentation

[Figure: diagram labeled "TAU events"]

- Includes only CPU-level instrumentation (TAU events)




TAUcuda Instrumentation

[Figure: diagram labeled "TAUcuda events" and "TAU events"]


CUDA Linux Driver Library Tools API

- Experimental CUDA driver library provides callback support
  - Exposes all driver routines through callback interface
    - Subscribe to events via cuToolsApi_ETI_Core interface table
  - Exposes functions to retrieve GPU performance information
- TAUcuda intercepts only events of interest in callback handler
  - API routines
    - cuToolsApi_CBID_EnterGeneric
    - cuToolsApi_CBID_ExitGeneric
  - Measurement (context synchronization, GPU buffer overflow)
    - cuToolsApi_CBID_ProfileLaunch
    - cuToolsApi_CBID_ProfileMemory
  - Call TAU event creation / measurement routines (enter, exit)




CUDA Driver Library Routines Intercepted

- Launch
  - cuLaunch(), cuLaunchGrid(), cuLaunchGridAsync()
- Memory transfer
  - cuMemcpyHtoD(), cuMemcpyHtoDAsync(), cuMemcpyHtoA(), cuMemcpyHtoAAsync()
  - cuMemcpyDtoH(), cuMemcpyDtoHAsync(), cuMemcpyDtoD(), cuMemcpyDtoA()
  - cuMemcpyAtoH(), cuMemcpyAtoHAsync(), cuMemcpyAtoD(), cuMemcpyAtoA()
  - cuMemcpy2D(), cuMemcpy2DUnaligned(), cuMemcpy2DAsync()
  - cuMemcpy3D(), cuMemcpy3DAsync()

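One way to picture "intercepting only events of interest" is a lookup that sorts a driver routine name into launch or memory transfer and ignores everything else. The routine names below are the ones listed on this slide; the `classify` helper and the idea of deriving the asynchronous flag from the `Async` suffix are assumptions made for this sketch, not a description of how TAUcuda is implemented.

```python
from typing import Optional, Tuple

# Routine names taken from the slide.
LAUNCH_ROUTINES = {"cuLaunch", "cuLaunchGrid", "cuLaunchGridAsync"}

MEMORY_ROUTINES = {
    "cuMemcpyHtoD", "cuMemcpyHtoDAsync", "cuMemcpyHtoA", "cuMemcpyHtoAAsync",
    "cuMemcpyDtoH", "cuMemcpyDtoHAsync", "cuMemcpyDtoD", "cuMemcpyDtoA",
    "cuMemcpyAtoH", "cuMemcpyAtoHAsync", "cuMemcpyAtoD", "cuMemcpyAtoA",
    "cuMemcpy2D", "cuMemcpy2DUnaligned", "cuMemcpy2DAsync",
    "cuMemcpy3D", "cuMemcpy3DAsync",
}


def classify(routine: str) -> Optional[Tuple[str, bool]]:
    """Hypothetical filter: return (operation type, is_async) or None to ignore."""
    if routine in LAUNCH_ROUTINES:
        kind = "launch"
    elif routine in MEMORY_ROUTINES:
        kind = "memory transfer"
    else:
        return None  # not an event of interest
    # Assumption for this sketch: the "Async" suffix marks asynchronous variants.
    return kind, routine.endswith("Async")


launch_info = classify("cuLaunchGridAsync")
copy_info = classify("cuMemcpyDtoH")
ignored = classify("cuCtxCreate")  # a driver routine that is not on the list
```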

CUDA Kernel Launch and Memory Transfer

- cuToolsApi_CBID_EnterGeneric callback occurs for cuXXX() routines that invoke GPU kernel launch and memory transfer
- CUDA system manages these operations and makes measurements in association with the GPU device
  - Keeps information in an internal buffer
- How to associate "enter" with asynchronous future "exit"?
- TAUcuda Event Handler creates a call record with these fields:
  - Event name, call ID, operation type, API routine name
  - TAU context, CUDA context, GPU device, GPU stream
- TAUcuda Event Handler calls into the TAU system to retrieve the current TAU event stack (TAU context) during EnterGeneric
- Profile callbacks will return performance data at a later time
- TAUcuda then generates TAU events (profile or trace)



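The association between an "enter" and its later, asynchronous profile data can be sketched as a table of call records keyed by call ID. The record fields follow the slide; everything else (`EventHandler`, `on_enter`, `on_profile`, the dictionary layout of the generated event) is a hypothetical model in Python, not the TAUcuda source or the driver's callback signature.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CallRecord:
    """Fields named on the slide; types are assumptions for this sketch."""
    event_name: str
    call_id: int
    operation_type: str
    api_routine: str
    tau_context: Tuple[str, ...]
    cuda_context: int
    gpu_device: int
    gpu_stream: int


class EventHandler:
    """Hypothetical handler pairing enter callbacks with later profile data."""

    def __init__(self, tau_stack: List[str]) -> None:
        self.tau_stack = tau_stack  # stand-in for the live TAU event stack
        self.records: Dict[int, CallRecord] = {}
        self.tau_events: List[dict] = []

    def on_enter(self, call_id: int, event_name: str, operation_type: str,
                 api_routine: str, cuda_context: int, gpu_device: int,
                 gpu_stream: int) -> None:
        # Snapshot the TAU context now; the stack will have changed by the
        # time the profile data arrives.
        self.records[call_id] = CallRecord(
            event_name, call_id, operation_type, api_routine,
            tuple(self.tau_stack), cuda_context, gpu_device, gpu_stream)

    def on_profile(self, call_id: int, gpu_start: float, gpu_stop: float) -> None:
        # Look up the saved record and turn the measurement into a TAU-style event.
        record = self.records.pop(call_id)
        self.tau_events.append({
            "name": record.event_name,
            "tau_context": record.tau_context,
            "device": record.gpu_device,
            "stream": record.gpu_stream,
            "elapsed": gpu_stop - gpu_start,
        })


tau_stack = ["main", "transpose"]
handler = EventHandler(tau_stack)
handler.on_enter(call_id=1, event_name="transpose_kernel", operation_type="launch",
                 api_routine="cuLaunchGridAsync", cuda_context=0,
                 gpu_device=0, gpu_stream=0)
tau_stack[-1] = "report"             # the host has moved on
handler.on_profile(1, 100.0, 140.0)  # made-up GPU timestamps
event = handler.tau_events[0]
```

Copying the stack into a tuple at enter time is the design point: the event is attributed to the call site that issued the operation, not to whatever the host is doing when the data comes back.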

CUDA Runtime Library Instrumentation

- NVIDIA does not implement callbacks for runtime library
  - Only provides header files (no source) for the runtime library
- Instrument with TAU's library wrapping tool, tau_wrap
  - Parses header files
  - Automatically generates a new library (Magic!)
    - Redefines the library routines of interest
    - Wrapped routines are instrumented with TAU entry/exit
    - Original routines called with the appropriate arguments
- CUDA runtime library performance measured by TAU
  - TAU enter and exit events for all cudaYYY()







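The wrapping pattern (redefine the routine, record entry, call the original with the same arguments, record exit) can be shown with a Python decorator. This is only an analogy: tau_wrap generates a wrapper library from header files, and nothing here reflects its actual output. `wrap`, `tau_log`, and `cuda_example_routine` are hypothetical names.

```python
import functools
import time
from typing import Any, Callable, List, Tuple

# Hypothetical stand-in for TAU's measurement: (kind, routine name, timestamp).
tau_log: List[Tuple[str, str, float]] = []


def wrap(routine: Callable[..., Any]) -> Callable[..., Any]:
    """Return a replacement that logs entry/exit around the original routine."""
    @functools.wraps(routine)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        tau_log.append(("enter", routine.__name__, time.perf_counter()))
        try:
            # The original routine is called with the caller's arguments.
            return routine(*args, **kwargs)
        finally:
            # The exit is logged even if the routine raises.
            tau_log.append(("exit", routine.__name__, time.perf_counter()))
    return wrapper


def cuda_example_routine(nbytes: int) -> int:
    """Placeholder for a runtime library routine; not a real CUDA call."""
    return nbytes * 2


# Redefine the routine of interest so callers pick up the wrapped version.
cuda_example_routine = wrap(cuda_example_routine)
result = cuda_example_routine(21)
```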

TAUcuda Profiling and Tracing

- Keep a profile or trace for every GPU device stream
- Profiling
  - Calculate flat profile for each kernel and memory transfer
  - Done at time of Profile callback
- Tracing
  - Must use TAU clock for timestamp
    - Kernel and memory timestamp reported with GPU clock
    - Must synchronize CPU and GPU clocks
  - Save a TAUcuda trace for every GPU device stream
    - Cannot insert into TAU's runtime trace buffer (Why?)
    - Kernel / memory transfer start/stop are asynchronous
  - Offline trace merging, clock correction, and translation

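Offline clock correction and trace merging might look like the following. The slide states that the CPU and GPU clocks must be synchronized but does not say how; the two-point linear mapping used here is an assumption chosen for illustration, and `make_gpu_to_cpu`, `merge_traces`, and all timestamps are hypothetical.

```python
from typing import Callable, List, Tuple

TraceEvent = Tuple[float, str]  # (timestamp, event name)


def make_gpu_to_cpu(sync_a: Tuple[float, float],
                    sync_b: Tuple[float, float]) -> Callable[[float], float]:
    """Build a linear map from GPU clock to CPU clock.

    Each sync point is (gpu_time, cpu_time) observed at the same instant.
    Assumption: drift between the two clocks is linear over the run.
    """
    (g0, c0), (g1, c1) = sync_a, sync_b
    scale = (c1 - c0) / (g1 - g0)
    return lambda g: c0 + (g - g0) * scale


def merge_traces(cpu_trace: List[TraceEvent], gpu_trace: List[TraceEvent],
                 to_cpu: Callable[[float], float]) -> List[TraceEvent]:
    """Translate GPU timestamps to the CPU clock, then merge in time order."""
    corrected = [(to_cpu(t), name) for t, name in gpu_trace]
    return sorted(cpu_trace + corrected)


# Made-up sync points: GPU tick 0 is CPU time 4.0 s, GPU tick 1024 is 6.0 s.
to_cpu = make_gpu_to_cpu((0, 4.0), (1024, 6.0))

cpu_trace = [(4.25, "host enter"), (5.75, "host exit")]
gpu_trace = [(256, "kernel start"), (768, "kernel stop")]
merged = merge_traces(cpu_trace, gpu_trace, to_cpu)
```

Because the GPU start and stop times only become known after the fact, doing this step offline avoids having to place late-arriving events into a time-ordered runtime buffer.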

Running with TAU / TAUcuda

- To run a CUDA application with TAUcuda, all of the necessary libraries must be dynamically linked
  - TAUcuda works with unmodified CUDA application binaries
- Use scripts for different scenarios:
  - taucuda profiler.sh / taucuda mpirun.sh (Profiling)
  - taucuda tracer.sh / taucuda mpirun tracer.sh (Tracing)
- TAUcuda produces profiles or traces in the current working directory, in sub-folders to distinguish them from TAU performance output
- TAUcuda profiles are in different metric sub-folders:
  - gpu_elapsed_time
  - gpu_memory_transfer
  - gpu_shared_memory


TAUcuda Experimentation Environments

- University of Oregon
  - Linux workstation
    - Dual quad-core Intel Xeon
    - GTX 280
  - GPU cluster (Mist)
    - Four dual quad-core Intel Xeon server nodes
    - Two NVIDIA S1070 Tesla servers (4 Tesla GPUs per S1070)
- Argonne National Laboratory (Eureka)
  - 100 dual quad-core NVIDIA Quadro Plex S4
  - 200 Quadro FX5600 (2 per S4)
- University of Illinois at Urbana-Champaign
  - GPU cluster (AC cluster)
    - 32 nodes with one S1070 (4 GPUs per node)




CUDA SDK Transpose (256 x 4096 matrix)

[Figure: CPU profile (cu events, cuda events, ...) and GPU profile (kernel)]




CUDA SDK OceanFFT (profile, trace)

[Figure: profile (kernels) and Jumpshot trace visualizer view (GPU, CPU)]










CUDA Linpack Profile (4 processes, 4 GPUs)

- Measure performance of heterogeneous parallel applications
  - GPU-accelerated Linpack benchmark (M. Fatica, NVIDIA)










CUDA Linpack Trace

[Figure: trace view; CUDA memory transfer (white), MPI communication (yellow)]


NAMD and TAU / TAUcuda

- Demonstrate TAUcuda with scientific application
- NAMD is a molecular dynamics application
  - Written using Charm++ parallel object-oriented language
  - Charm++ and NAMD run on large-scale HPC clusters
  - NAMD has been accelerated with CUDA
- TAU integrated in Charm++ (ICPP 2009 paper)
- Now apply TAUcuda to observe influence of GPU execution
  - Observe the effect of CUDA acceleration
  - Show scaling results for GPU cluster execution










NAMD Profile (4 processes, 4 GPUs)










NAMD GPU Scaling (4–64 GPUs)

- Strong scaling experiments on Eureka cluster
- Use TAU PerfExplorer to compare









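For readers unfamiliar with strong scaling, the usual comparison is speedup and parallel efficiency relative to the smallest run, with the problem size held fixed. The helper below is a generic sketch of that arithmetic; `strong_scaling` is a hypothetical name, and the timings are placeholders chosen for easy arithmetic, not measurements from this talk.

```python
from typing import Dict, Tuple


def strong_scaling(times: Dict[int, float]) -> Dict[int, Tuple[float, float]]:
    """Map GPU count -> (speedup, efficiency) relative to the smallest count.

    times maps GPU count to wall-clock time for the same fixed problem size.
    """
    base_n = min(times)
    base_t = times[base_n]
    result = {}
    for n in sorted(times):
        speedup = base_t / times[n]
        # Ideal speedup from base_n to n GPUs is n / base_n.
        efficiency = speedup * base_n / n
        result[n] = (speedup, efficiency)
    return result


# Placeholder timings in seconds (NOT data from the experiments).
placeholder_times = {4: 64.0, 8: 32.0, 16: 32.0}
scaling = strong_scaling(placeholder_times)
```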

SHOC Stencil2D (512 iterations, 4 CPUxGPU)

- Scalable HeterOgeneous Computing benchmark suite
  - CUDA / OpenCL kernels and microbenchmarks (ORNL)

[Figure: trace view; CUDA memory transfer (white)]


HMPP SGEMM (CAPS Entreprise)

[Figure: two trace views, each with rows labeled Host Process, Transfer Kernel, and Compute Kernel]






Conclusions

- Heterogeneous parallel systems will require parallel performance tools that integrate performance perspectives
- Need to rely on hardware and software support in heterogeneous components to access performance information
- Experimental Linux CUDA driver provided by NVIDIA facilitates access to CUDA / GPU performance information
- TAUcuda merges with TAU (CPU) performance data
- TAU/TAUcuda provides powerful, scalable heterogeneous performance measurement and analysis
- NVIDIA is incorporating performance tools requirements in next-generation driver/device libraries
- TAUopencl is in development (working prototype)




Support Acknowledgements

- Department of Energy (DOE)
  - Office of Science
  - ASC/NNSA
- Department of Defense (DoD)
  - HPC Modernization Office (HPCMO)
- NSF Software Development for Cyberinfrastructure (SDCI)
- Research Centre Juelich
- Argonne National Laboratory
- Technical University Dresden
- ParaTools, Inc.
- NVIDIA







