Posted on: 3/2/2011
CUDA Programming Model Overview: Hardware Hierarchy and Optimization
Yukai Hung
a0934147@gmail.com
Department of Mathematics, National Taiwan University

CUDA Development Environment
Two interfaces are provided: one for developers who want low-level APIs and one for developers who want high-level APIs; both share the same back-end compiler and optimization technology.

CUDA source files must be compiled by nvcc, which can
- create host object code directly
- create device object code directly
- create Parallel Thread eXecution (PTX) source code directly
- compile into device emulation mode for host simulation

Executable CUDA code requires two libraries
- dynamic linking with the CUDA core library
- dynamic linking with the CUDA runtime library

CUDA emulation mode
- uses the host to simulate kernel execution
- needs no device and no CUDA driver
- emulates each device thread with a host thread

Running in device emulation mode can
- use native host debugger support
- call any host function from device code and vice versa
- access any device-specific data from host code and vice versa
- detect deadlock situations caused by improper synchronization

Linux and Windows environment (toolchain screenshots)

Compilation flow: the C/C++ CUDA application source (virtual layer)

    float4 me=gx[gtid];
    me.x=me.x+me.y*me.z;

is split by NVCC into CPU code and PTX code; a PTX-to-target compiler (physical layer) then generates GPU target code for G80 and later devices:

    ld.global.v4.f32 {$f1,$f3,$f5,$f7},[$r9+0];
    mad.f32 $f1,$f5,$f3,$f1;

How to install CUDA?
How to compile a CUDA file?
How to compile a CUDA file into PTX?
CUDA Development Environment
Windows Nexus IDE
- first IDE for massively parallel threaded applications
- accelerates coprocessing application development
- complete Visual Studio integrated development environment

Linux and Mac Visual Profiler (screenshot)

Start from Matrix-Matrix Multiplication: naïve and tiled algorithm

Naïve Matrix-Matrix Multiplication
Each thread calculates one result element
- read one row and one column from global memory
- perform the inner product and store the result back to global memory
(figure: M, N, and P are WIDTH x WIDTH matrices with P = M*N)

    //compute matrix-matrix multiplication P=M*N
    __global__ void SGEMM(float* Mmatrix,float* Nmatrix,float* Pmatrix)
    {
        float value;
        float Melement;
        float Nelement;

        //calculate each thread global index
        ...

        value=0.0f;

        //each thread performs its inner product
        for(int loop=0;loop<dim;loop++)
        {
            Melement=Mmatrix[yidx*dim+loop];
            Nelement=Nmatrix[loop*dim+xidx];
            value=value+Melement*Nelement;
        }

        //store the final result into the P matrix
        Pmatrix[yidx*dim+xidx]=value;
    }

Analyze the global memory usage
- each thread computes one element of the result matrix
- each thread loads one row and one column of elements from matrices M and N
- one multiply and one addition are performed for each pair of elements
- the compute to global-memory-access ratio is close to 1:1

Analyze the global memory bandwidth
- 4 bytes x 346.5 GFLOPS = 1386 GB/s is required to reach the peak
- 86.4 GB/s limits the actual performance to about 21.6 GFLOPS
- the actual code for the naïve algorithm runs at about 15 GFLOPS

Need to drastically cut down memory accesses
- each row of matrix M is read many times
- each column of matrix N is read many times

Tiled Matrix-Matrix Multiplication
Use a tiled algorithm to reuse matrix data
- each block computes one square sub-matrix Psub of the result
- each thread computes one element of that sub-matrix
(figure: M, N, and P divided into TILE_WIDTH x TILE_WIDTH tiles)
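As a host-side check of the indexing above, here is a minimal C++ sketch of the same row-times-column inner product, with a sequential double loop standing in for the per-thread work (the function name is illustrative; `dim` and the index names follow the kernel):

```cpp
#include <cassert>
#include <vector>

// sequential emulation of the naive kernel: each (yidx,xidx) pair
// plays the role of one CUDA thread computing one element of P
std::vector<float> naiveSGEMM(const std::vector<float>& M,
                              const std::vector<float>& N,int dim)
{
    std::vector<float> P(dim*dim,0.0f);
    for(int yidx=0;yidx<dim;yidx++)
        for(int xidx=0;xidx<dim;xidx++)
        {
            float value=0.0f;
            for(int loop=0;loop<dim;loop++)     //inner product
                value+=M[yidx*dim+loop]*N[loop*dim+xidx];
            P[yidx*dim+xidx]=value;             //one store per thread
        }
    return P;
}
```

Each output element touches one full row of M and one full column of N, which is exactly why the compute to memory-access ratio stays near 1:1.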
Tiled Matrix-Matrix Multiplication
Each block should have many threads
- tile size 16 creates 16x16=256 threads per block
- a 1024x1024 matrix needs 64x64=4096 blocks on the grid

Each thread block loads two sub-matrices and computes results
- load 2x256=512 floats from global memory to shared memory
- perform 256x(2x16)=512x16 operations on the shared memory data
- global memory bandwidth is no longer a limiting factor
- the compute to global-memory-access ratio is close to 16:1

    //compute matrix-matrix multiplication P=M*N
    void SGEMM(float* h_A,float* h_B,float* h_C)
    {
        dim3 gridSize;
        dim3 blockSize;

        //setup the execution configuration
        blockSize.x=TILED_WIDTH;
        blockSize.y=TILED_WIDTH;
        gridSize.x=WIDTH/TILED_WIDTH;
        gridSize.y=WIDTH/TILED_WIDTH;

        //launch device matrix-matrix multiplication kernel
        sgemm<<<gridSize,blockSize>>>(d_A,d_B,d_C);
        ...
    }

    //compute matrix-matrix multiplication P=M*N
    __global__ void SGEMM(float* d_A,float* d_B,float* d_C)
    {
        //setup the block index and thread index
        int bx=blockIdx.x;
        int by=blockIdx.y;
        int tx=threadIdx.x;
        int ty=threadIdx.y;

        float value=0.0f;

        //declare the shared memory per block
        __shared__ float Mshared[TILED_WIDTH][TILED_WIDTH];
        __shared__ float Nshared[TILED_WIDTH][TILED_WIDTH];
        ...
    //loop over all sub-matrices of M and N that are
    //required to compute the block sub-matrix result
    for(int loop=0;loop<WIDTH/TILED_WIDTH;loop++)
    {
        //get the pointers to the current sub-matrices of M and N
        float* Msub=GetSubMatrix(M,loop,bx,by,tx,ty,WIDTH);
        float* Nsub=GetSubMatrix(N,loop,bx,by,tx,ty,WIDTH);

        //each thread loads one element into shared memory
        Mshared[ty][tx]=GetMatrixElement(Msub,tx,ty);
        Nshared[ty][tx]=GetMatrixElement(Nsub,tx,ty);

        //synchronize to make sure the sub-matrices are loaded
        //into shared memory before computing the partial result
        __syncthreads();

The void __syncthreads() function
- synchronizes all executing threads in the same block
- execution resumes when all threads have reached the same point
- is used to avoid RAW/WAR/WAW hazards when accessing memory

        //each thread computes one element of the sub-matrix
        for(int loope=0;loope<TILED_WIDTH;loope++)
            value=value+Mshared[ty][loope]*Nshared[loope][tx];

        //synchronize to make sure that the preceding
        //computation is done before loading two new
        //sub-matrices of M and N in the next iteration
        __syncthreads();
    }

    //get the pointer to the block sub-matrix of P
    float* Psub=GetSubMatrix(P,0,bx,by,tx,ty,WIDTH);

    //store the block sub-matrix result to global memory
    SetMatrixElement(Psub,value,bx,by,tx,ty,WIDTH);

    return;
    }

What is the limitation on the matrix tile size?
- the compute to memory-access ratio depends on the tile size
- a larger tile size gives a higher memory reuse ratio
- the ratio is limited by the shared memory size and the block size

What is the performance of the two algorithms?
- naïve matrix-matrix multiplication runs at about 15 GFLOPS
- tiled matrix-matrix multiplication runs at about 90 GFLOPS

What kind of tiled algorithm is used in the Intel Math Kernel Library?
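The tiling scheme can be sanity-checked on the host with a C++ sketch that stages explicit tile buffers in place of shared memory, assuming WIDTH is a multiple of the tile width (names and the small tile size are illustrative):

```cpp
#include <cassert>
#include <vector>

const int TILED_WIDTH=2; //illustrative tile size; the slides use 16

// sequential emulation of the tiled kernel: for each output tile,
// stage one tile of M and one of N (the "shared memory" buffers),
// then accumulate the partial inner products tile by tile
std::vector<float> tiledSGEMM(const std::vector<float>& M,
                              const std::vector<float>& N,int width)
{
    std::vector<float> P(width*width,0.0f);
    float Mshared[TILED_WIDTH][TILED_WIDTH];
    float Nshared[TILED_WIDTH][TILED_WIDTH];

    for(int by=0;by<width/TILED_WIDTH;by++)
    for(int bx=0;bx<width/TILED_WIDTH;bx++)
        for(int loop=0;loop<width/TILED_WIDTH;loop++)
        {
            //each "thread" (ty,tx) loads one element of each tile
            for(int ty=0;ty<TILED_WIDTH;ty++)
            for(int tx=0;tx<TILED_WIDTH;tx++)
            {
                Mshared[ty][tx]=M[(by*TILED_WIDTH+ty)*width+loop*TILED_WIDTH+tx];
                Nshared[ty][tx]=N[(loop*TILED_WIDTH+ty)*width+bx*TILED_WIDTH+tx];
            }
            //partial inner products from the staged tiles
            for(int ty=0;ty<TILED_WIDTH;ty++)
            for(int tx=0;tx<TILED_WIDTH;tx++)
                for(int k=0;k<TILED_WIDTH;k++)
                    P[(by*TILED_WIDTH+ty)*width+bx*TILED_WIDTH+tx]
                        +=Mshared[ty][k]*Nshared[k][tx];
        }
    return P;
}
```

Every element staged into the tile buffers is reused TILED_WIDTH times, which is where the 16:1 compute to memory-access ratio comes from.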
Matrix-Matrix Multiplication
Memory hierarchy comparison (figure):
- GPU version: device memory and on-chip shared memory
- CPU version: host memory with L3 and L2 caches

How about the memory access pattern? (figure: rows of M are read with stride 1; columns of N are read with stride WIDTH)

Tiled Matrix-Matrix Multiplication
How to accelerate the current tiled algorithm?
- prefetch the next sub-matrix data while computing
- overlap the loading and computing time to hide latency

Recommended Program Strategy
Global memory resides in device memory
- much slower access than shared memory
A profitable way of programming on the device is to tile the data
- divide data into subsets that fit into shared memory
- handle each data subset with one thread block by:
  1: load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
  2: perform the computation on the subset from shared memory
  3: store the final results from shared memory back to global memory

Same Idea for Matrix Transpose: naïve and tiled algorithm

Naïve Matrix Transpose
Each thread handles one element of the matrix
- load one row element from the original matrix
- store one column element into the destination matrix
(figure: loads from global memory have stride 1, but stores into global memory have stride equal to the matrix dimension)

Tiled Matrix Transpose
Consider a tiled algorithm on the matrix!
(figure: the 3x3 grid of blocks is transposed block-wise, so block (i,j) of the source is written to block position (j,i) of the destination)
- load one tile from global memory into shared memory
- transpose the tile within shared memory
- store the tile from shared memory to the transposed block position in global memory, so both the loads and the stores walk consecutive addresses
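The block-wise transpose idea can be sketched on the host in C++, with a small tile array standing in for shared memory (the tile width and function name are illustrative; the dimension is assumed to be a multiple of the tile width):

```cpp
#include <cassert>
#include <vector>

const int TILE=2; //illustrative tile width

// block-wise transpose: stage one tile, then write it to the
// transposed block position so that both the loads and the stores
// walk consecutive addresses within each tile row
std::vector<float> tiledTranspose(const std::vector<float>& in,int width)
{
    std::vector<float> out(width*width,0.0f);
    float tile[TILE][TILE];

    for(int by=0;by<width/TILE;by++)
    for(int bx=0;bx<width/TILE;bx++)
    {
        //load one tile from source block (bx,by)
        for(int ty=0;ty<TILE;ty++)
        for(int tx=0;tx<TILE;tx++)
            tile[ty][tx]=in[(by*TILE+ty)*width+bx*TILE+tx];

        //store the transposed tile into destination block (by,bx)
        for(int ty=0;ty<TILE;ty++)
        for(int tx=0;tx<TILE;tx++)
            out[(bx*TILE+ty)*width+by*TILE+tx]=tile[tx][ty];
    }
    return out;
}
```

The large-stride accesses are confined to the on-chip tile; global memory is only ever touched row by row.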
Hardware Hierarchy: Architecture Details

Hardware Hierarchy Architecture
(the following slides annotate the same G80 block diagram: host interface, input assembler, setup/raster/z-cull, vertex and pixel thread issue, work distribution, streaming processors (SP), texture filter units (TF), L1/L2 caches, and framebuffer partitions (FB))
- the work distribution unit distributes all blocks onto the shader multiprocessors
- block scheduling is round-robin based; this means you need sufficient blocks to fill all the shader pipelines
- the texture units perform graphics texture operations: texture filtering and pixel interpolation
- the shader processor array (SPA) is on chip; the shader count depends on the particular hardware
- a shader multiprocessor (SM) sits on the SPA; each thread block is scheduled onto one SM
- a shader processor (SP) handles regular arithmetic; each thread is scheduled onto one shader processor
- the low-precision special function unit (SFU) supports intrinsic functions in hardware

Intrinsic functions are supported by the hardware SFU
- intrinsic functions come from traditional graphics usage
- they are less accurate but faster versions of their counterparts
- their names are prefixed with __function()

Selectable operation rounding modes
- functions suffixed with _rn use round-to-nearest-even
- functions suffixed with _rz use round-towards-zero
- functions suffixed with _rd use round-down
- functions suffixed with _ru use round-up

The SFU and SP units are two independent execution units
- SFU and SP operations can be performed simultaneously

- the on-chip shared memory on each SM shares data within one thread block
- the instruction unit fetches and dispatches one instruction to all threads in the same warp

Warp Architecture
A purely hardware concept, unrelated to software design
- designed to handle variable block sizes
- designed to handle divergence effectively
- designed to hide memory access latency

Each block is divided into several warps
- each warp contains 32 threads on current shaders
- the warp size is determined by hardware considerations

Each shader can execute only one warp at a time
- each shader can monitor and switch between several warps
- synchronization is not necessary for threads in the same warp

Warp counts per block
- block size 256: divided into warps 1 to 8, warp number is 8
- block size 280: divided into warps 1 to 9, warp number is 9, and 8 threads of the last warp are idle
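The warp arithmetic above can be checked with a small host-side C++ sketch (the constant and helper names are illustrative, not CUDA API calls):

```cpp
#include <cassert>

//warp size on the hardware described in these slides
const int WARP_SIZE=32;

//number of warps for a block: ceiling division, since a
//partially filled warp still occupies a whole warp slot
int warpsPerBlock(int blockSize)
{
    return (blockSize+WARP_SIZE-1)/WARP_SIZE;
}

//threads that sit idle in the last warp of the block
int idleThreads(int blockSize)
{
    int rem=blockSize%WARP_SIZE;
    return rem==0 ? 0 : WARP_SIZE-rem;
}
```

Block size 256 gives 8 full warps; block size 280 gives 9 warps with 8 idle threads in the last one, matching the slide. Choosing block sizes that are multiples of 32 avoids wasted warp slots.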
Warp Architecture
Shader multiprocessor scheduling and synchronization (figures: warps 0 to 3 of blocks 0 and 1 are interleaved on one shader multiprocessor)

The warp scheduler on each shader is very smart!

SM hardware implements zero-overhead warp scheduling
- warps whose next instruction has its operands ready for consumption are eligible for execution, tracked via scoreboarding
- eligible warps are selected for execution by a prioritized scheduling policy via a warp scheduling priority table

A thread block is partitioned into several warps
- the partition is always the same on different devices
- the warp size may change from generation to generation
- thread indices within a warp are consecutive and increasing

All threads in the same warp execute the same instruction
- synchronization is not necessary for threads in the same warp

(figure: blocks 1 to 12 pass through the block scheduler; the warp scheduler then selects ready warps via scoreboarding)

Why is the warp size equal to 32 threads?
- it depends on the shader instruction dispatch hardware
- dispatching the same instruction to all threads takes 4 clock cycles
- each warp needs 4 clock cycles to get its next instruction
- 8 concurrent threads x 4 clock cycles = 32 threads in one warp
(figure: instruction k occupies clock cycles 1 to 4, then instruction k+1 follows)

How to hide the global memory access latency?
- suppose one global access is needed for every n instructions
- more than 200/(4n) warps are needed to fully tolerate the latency
(figure: warps 0 to 5 interleave execution while waiting on memory)
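The rule of thumb above can be made concrete in a host-side C++ sketch; the 200-cycle latency and 4-cycle issue time are the slide's assumptions for this hardware generation, and the function name is illustrative:

```cpp
#include <cassert>

//warps needed to hide a memory latency of `latency` cycles when one
//global access occurs every n instructions and each warp instruction
//takes 4 clock cycles to issue (the slide's model: more than 200/(4n))
int warpsToHideLatency(int n,int latency=200)
{
    int cyclesPerAccess=4*n; //issue cycles between two global accesses
    return (latency+cyclesPerAccess-1)/cyclesPerAccess; //ceiling
}
```

For n = 10 this gives 5 resident warps; for n = 1 it gives 50: the lower the arithmetic intensity, the more warps the SM needs to keep busy.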
Control Flow
Each instruction is issued per 32 threads, that is, per warp
The main performance concern with branching is divergence
- threads within a single warp take different control paths
- different execution paths within a warp are serialized
- the control paths taken by the threads in the same warp are traversed one at a time until there are no more
Different warps are independent with respect to divergence

    //divergent condition
    if(threadIdx.x<2)
     {......}
    else
     {......}

- this creates two different control paths for threads in a block
- the branch granularity is less than the warp size (only the first warp diverges)
- threads 0 and 1 follow a different path than the rest of the threads in warp 0

    //no divergent condition
    if(threadIdx.x/WARP_SIZE<2)
     {......}
    else
     {......}

- this creates two different control paths for threads in a block
- the branch granularity is a whole multiple of the warp size
- all threads in any given warp follow the same control path

The profiler counts events on each shader multiprocessor
- counts branches and divergent branches
- counts warp serializations and instructions

Parallel Vector Reduction
Given any array, reduce it to a single value in parallel
Some examples:
- sum reduction: sum of all values in the array
- max reduction: maximum of all values in the array

Assume an in-place reduction using shared memory
- the original vector is in device global memory
- shared memory is used to hold the partial sum results
- each iteration brings the partial sums closer to the final sum
- the final result in shared memory will be in element 0

    //naïve parallel vector sum reduction
    //raw data has been loaded into shared memory
    __shared__ float array[size];
    ....
    //outer loop for parallel reduction
    for(int stride=1;stride<blockDim.x;stride=stride*2)
    {
        //check whether the thread is active or not
        if(threadIdx.x%(2*stride)==0)
            array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+stride];

        //make sure all active threads are done
        __syncthreads();
    }

(table: time and bandwidth measurements)

Each warp traverses two control paths sequentially
- threads that do not perform the addition may cost extra cycles, caused by the implementation of divergence

No more than half of the threads are executed at any time
- all odd-index threads are disabled right from the beginning
- on average, less than 1/4 of the threads are active across all warps
- after the 5th iteration, entire warps in each block are disabled; this is poor resource utilization but no divergence

BAD: the mod test above causes divergence due to interleaved branch decisions.

First improvement: take out the mod operation

    //outer loop for parallel reduction
    for(int stride=1;stride<blockDim.x;stride=stride*2)
    {
        int index=threadIdx.x*stride*2;

        if(index<blockDim.x)
            array[index]=array[index]+array[index+stride];

        //make sure all active threads are done
        __syncthreads();
    }

(table: time, bandwidth, step speedup, and cumulative speedup)

Second improvement: sequential addressing with a halving stride

    //outer loop for parallel reduction
    for(int stride=blockDim.x/2;stride>0;stride=stride>>1)
    {
        //check whether the thread is active or not
        if(threadIdx.x<stride)
            array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+stride];

        //make sure all active threads are done
        __syncthreads();
    }

(table: time, bandwidth, step speedup, and cumulative speedup)

Idle thread problem
- half of the threads are idle on the first loop iteration; this is wasteful!

Halve the block size
- replace the original single load with two loads
- add the first two elements as the first-level reduction

Original data loading:

    //load data from global memory to shared memory
    array[threadIdx.x]=ivector[threadIdx.x];

New data loading, which halves the original block size:

    //load data from global memory to shared memory
    array[threadIdx.x]=ivector[threadIdx.x]
                      +ivector[threadIdx.x+blockDim.x];

(table: time, bandwidth, step speedup, and cumulative speedup)

It still seems far from bandwidth bound at 17 GB/s
- we know reduction has low arithmetic intensity
- therefore a likely bottleneck is instruction overhead
- consider the address arithmetic and loop overhead

The number of active threads decreases as the reduction proceeds
- only one warp is active once the stride is less than 32
- instructions are synchronous within the same warp, so the block synchronization can be removed for the last warp

Strategy: unroll the last 6 iterations of the inner loop

    //outer loop for parallel reduction
    for(int stride=blockDim.x/2;stride>32;stride=stride>>1)
    {
        //check whether the thread is active or not
        if(threadIdx.x<stride)
            array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+stride];

        //make sure all active threads are done
        __syncthreads();
    }

    //unroll the last 6 iterations and skip synchronization;
    //this causes no divergent branch on the last active warp
    //and removes the loop instructions and block synchronization
    if(threadIdx.x<32)
    {
        array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+32];
        array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+16];
        array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+ 8];
        array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+ 4];
        array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+ 2];
        array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+ 1];
    }

(table: time, bandwidth, step speedup, and cumulative speedup)
Note: this saves useless work in all warps, not just the last one; without unrolling, all warps execute all the for-loop and if instructions.

Completely unroll the loop reduction
- the block size is limited to 512 threads
- we are sticking to power-of-2 block sizes
- suppose we know the number of iterations

It is easy to unroll for a fixed block size
- we need a smart method to unroll all possible sizes
- how can we unroll all possible block sizes at compile time?

Templates to the rescue!
- CUDA supports C++ template parameters on device functions

Specify the block size as a function template parameter:

    //use template to unroll all possible block sizes
    template<int blockSize>
    __global__ void reduce(float* ivector,float* ovector)

How to invoke the template kernel for different block sizes?
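To see how a template parameter enables compile-time unrolling, here is a host-side C++ sketch that emulates the stride-halving tree reduction sequentially (a single-threaded stand-in for the shared-memory array; the function name and the sequential model are illustrative):

```cpp
#include <cassert>
#include <vector>

//emulate the stride-halving tree reduction on a power-of-2 array,
//the way the kernel's outer loop walks shared memory
template<int blockSize>
float reduceEmulated(std::vector<float> array)
{
    //blockSize is a compile-time constant, so the compiler can
    //fully unroll this loop for each instantiated block size
    for(int stride=blockSize/2;stride>0;stride>>=1)
        for(int tid=0;tid<stride;tid++)   //all "active threads"
            array[tid]+=array[tid+stride];
    return array[0];
}
```

Because each instantiation fixes blockSize at compile time, every comparison against it folds away, mirroring how the device kernel's conditional unrolls are resolved before the code ever runs.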
    //invoke the template kernel for different block sizes
    reduce<512><<<blockNumber,blockSize>>>(ivector,ovector);
    reduce<256><<<blockNumber,blockSize>>>(ivector,ovector);
    reduce<128><<<blockNumber,blockSize>>>(ivector,ovector);
    reduce< 64><<<blockNumber,blockSize>>>(ivector,ovector);
    reduce< 32><<<blockNumber,blockSize>>>(ivector,ovector);
    reduce< 16><<<blockNumber,blockSize>>>(ivector,ovector);
    reduce<  8><<<blockNumber,blockSize>>>(ivector,ovector);
    reduce<  4><<<blockNumber,blockSize>>>(ivector,ovector);
    reduce<  2><<<blockNumber,blockSize>>>(ivector,ovector);
    reduce<  1><<<blockNumber,blockSize>>>(ivector,ovector);

    //unroll the first 3 iterations conditionally
    if(blockSize>=512)
    {
        if(threadIdx.x<256)
            array[threadIdx.x]+=array[threadIdx.x+256];
        __syncthreads();
    }
    if(blockSize>=256)
    {
        if(threadIdx.x<128)
            array[threadIdx.x]+=array[threadIdx.x+128];
        __syncthreads();
    }
    if(blockSize>=128)
    {
        if(threadIdx.x<64)
            array[threadIdx.x]+=array[threadIdx.x+64];
        __syncthreads();
    }

    //unroll the last active warp conditionally
    if(threadIdx.x<32)
    {
        if(blockSize>=64) array[threadIdx.x]+=array[threadIdx.x+32];
        if(blockSize>=32) array[threadIdx.x]+=array[threadIdx.x+16];
        if(blockSize>=16) array[threadIdx.x]+=array[threadIdx.x+ 8];
        if(blockSize>= 8) array[threadIdx.x]+=array[threadIdx.x+ 4];
        if(blockSize>= 4) array[threadIdx.x]+=array[threadIdx.x+ 2];
        if(blockSize>= 2) array[threadIdx.x]+=array[threadIdx.x+ 1];
    }

Note: all the comparisons against blockSize (shown in red on the slide) are evaluated at compile time!
This results in a very efficient inner loop with no loop instructions.

(table: time, bandwidth, step speedup, and cumulative speedup)

Combine both sequential and parallel reduction
- each thread loads and sums multiple elements into shared memory
- the final parallel part performs a tree-based reduction in shared memory
(figure: the input is folded into one block-sized chunk before the tree reduction)

Original data loading:

    //load data from global memory to shared memory
    array[threadIdx.x]=vector[threadIdx.x]
                      +vector[threadIdx.x+blockDim.x];

New data loading, where each thread accumulates multiple elements:

    int index=threadIdx.x;
    array[threadIdx.x]=0.0f;

    //load and sum data from global memory into shared memory
    while(index<maxn)
    {
        array[threadIdx.x]+=vector[index]
                           +vector[index+blockDim.x];
        index=index+(blockSize<<1);
    }

(table: time, bandwidth, step speedup, and cumulative speedup)

Why is this suitable for traditional parallel architectures?

Matrix-Matrix Multiplication
How about unrolling the matrix-matrix multiplication?

Reference
- Mark Harris: http://www.markmark.net/
- Wei-Chao Chen: http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu: http://impact.crhc.illinois.edu/people/current/hwu.php