CUDA Lecture 6: Embarrassingly Parallel Computations. Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

Embarrassingly Parallel Computations
A computation that can obviously be divided into a number of completely independent parts, each of which can be executed by a separate process(or). There is no communication, or very little communication, between processes: each process can do its tasks without any interaction with the other processes.
Embarrassingly Parallel Computations – Slide 2

Embarrassingly Parallel Computation Examples
Low-level image processing
Mandelbrot set
Monte Carlo calculations
Embarrassingly Parallel Computations – Slide 3

Topic 1: Low-Level Image Processing
Many low-level image processing operations involve only local data, with very limited (if any) communication between areas of interest.
Embarrassingly Parallel Computations – Slide 4

Partitioning into Regions for Individual Processes
Square region for each process; can also use strips.
Embarrassingly Parallel Computations – Slide 5

Some Geometrical Operations
Shifting: object shifted by Δx in the x-dimension and Δy in the y-dimension:
  x' = x + Δx
  y' = y + Δy
where x and y are the original and x' and y' are the new coordinates.
Scaling: object scaled by a factor of Sx in the x-direction and Sy in the y-direction:
  x' = x * Sx
  y' = y * Sy
Embarrassingly Parallel Computations – Slide 6

Some Geometrical Operations (cont.)
Rotation: object rotated through an angle θ about the origin of the coordinate system:
  x' = x cos θ + y sin θ
  y' = -x sin θ + y cos θ
Clipping: applies defined rectangular boundaries to a figure and deletes points outside the defined area. Display the point (x, y) only if
  xlow ≤ x ≤ xhigh and ylow ≤ y ≤ yhigh
otherwise discard it.
Embarrassingly Parallel Computations – Slide 7

Notes
GPUs are sent starting row numbers. Since these are related to block and thread IDs, each GPU should be able to figure them out for itself. The returned results are single values; this is not good programming, but is done to simplify a first exposure to CUDA programs. No terminating code is shown.
Embarrassingly Parallel Computations – Slide 8

Topic 2: Mandelbrot Set
The set of points in the complex plane that are quasi-stable (the iterates will increase and decrease but not exceed some limit) when computed by iterating the function
  z_{k+1} = z_k^2 + c
where z_{k+1} is the (k+1)th iterate of the complex number z = a + bi and c is a complex number giving the position of the point in the complex plane. The initial value for z is zero.
Embarrassingly Parallel Computations – Slide 9

Mandelbrot Set (cont.)
Iterations continue until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit. The magnitude of z is the length of the vector, given by
  |z| = sqrt(a^2 + b^2)
Embarrassingly Parallel Computations – Slide 10

Sequential routine computing the value of one point, returning the number of iterations:

struct complex {
  float real;
  float imag;
};

int cal_pixel(complex c)
{
  int count = 0, max = 256;
  complex z;
  float temp, lengthsq;
  z.real = 0;
  z.imag = 0;
  do {
    temp = z.real * z.real - z.imag * z.imag + c.real;
    z.imag = 2 * z.real * z.imag + c.imag;
    z.real = temp;
    lengthsq = z.real * z.real + z.imag * z.imag;
    count++;
  } while ((lengthsq < 4.0) && (count < max));
  return count;
}

Embarrassingly Parallel Computations – Slide 11

Notes
The square of the length is compared to 4 (rather than comparing the length to 2) to avoid a square root computation. Given the terminating conditions, all of the Mandelbrot points must lie within a circle centered at the origin having radius 2. Resolution can be expanded at will to obtain interesting images.
Embarrassingly Parallel Computations – Slide 12

Mandelbrot set (figure)
Embarrassingly Parallel Computations – Slide 13

Parallelizing Mandelbrot Set Computation
Dynamic task assignment: have
processors request new regions after computing their previous regions. This doesn't lend itself to a CUDA solution.
Static task assignment: simply divide the region into a fixed number of parts, each computed by a separate processor. Not very successful, because different regions require different numbers of iterations and times, but it is what we'll do in CUDA.
Embarrassingly Parallel Computations – Slide 14

Dynamic Task Assignment: Work Pool/Processor Farms (figure)
Embarrassingly Parallel Computations – Slide 15

Simplified CUDA Solution

#include <stdio.h>
#include <conio.h>
#include <math.h>
#include "../common/cpu_bitmap.h"

#define DIM 1000

struct cuComplex {
  float r, i;
  __device__ cuComplex(float a, float b) : r(a), i(b) {}
  __device__ float magnitude2(void) { return r*r + i*i; }
  __device__ cuComplex operator*(const cuComplex& a) {
    return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
  }
  __device__ cuComplex operator+(const cuComplex& a) {
    return cuComplex(r + a.r, i + a.i);
  }
};

Embarrassingly Parallel Computations – Slide 16

Simplified CUDA Solution (cont.)

__device__ int mandelbrot(int x, int y)
{
  float jx = 2.0 * (x - DIM/2) / (DIM/2);
  float jy = 2.0 * (y - DIM/2) / (DIM/2);
  cuComplex c(jx, jy);
  cuComplex z(0.0, 0.0);
  int i = 0;
  do {
    z = z * z + c;
    i++;
  } while ((z.magnitude2() < 4.0) && (i < 256));
  return i % 8;
}

Embarrassingly Parallel Computations – Slide 17

Simplified CUDA Solution (cont.)
__global__ void kernel(unsigned char *ptr)
{
  // map from blockIdx to pixel position
  int x = blockIdx.x;
  int y = blockIdx.y;
  int offset = x + y * gridDim.x;

  // now calculate the value at that position
  int mValue = mandelbrot(x, y);
  ptr[offset*4 + 0] = 0;
  ptr[offset*4 + 1] = 0;
  ptr[offset*4 + 2] = 0;
  ptr[offset*4 + 3] = 255;
  switch (mValue) {
    case 0: break;                           // black
    case 1: ptr[offset*4 + 0] = 255; break;  // red
    case 2: ptr[offset*4 + 1] = 255; break;  // green
    case 3: ptr[offset*4 + 2] = 255; break;  // blue
    case 4: ptr[offset*4 + 0] = 255;         // yellow
            ptr[offset*4 + 1] = 255; break;
    case 5: ptr[offset*4 + 1] = 255;         // cyan
            ptr[offset*4 + 2] = 255; break;
    case 6: ptr[offset*4 + 0] = 255;         // magenta
            ptr[offset*4 + 2] = 255; break;
    default: ptr[offset*4 + 0] = 255;        // white
             ptr[offset*4 + 1] = 255;
             ptr[offset*4 + 2] = 255; break;
  }
}

Embarrassingly Parallel Computations – Slide 18

Simplified CUDA Solution (cont.)

int main(void)
{
  CPUBitmap bitmap(DIM, DIM);
  unsigned char *dev_bitmap;

  cudaMalloc((void**) &dev_bitmap, bitmap.image_size());

  dim3 grid(DIM, DIM);
  kernel<<<grid,1>>>(dev_bitmap);

  cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(),
             cudaMemcpyDeviceToHost);

  bitmap.display_and_exit();
  cudaFree(dev_bitmap);
}

Embarrassingly Parallel Computations – Slide 19

Topic 3: Monte Carlo Methods
An embarrassingly parallel computation named for Monaco's gambling resort city; the method's first important use was in the development of the atomic bomb during World War II. Monte Carlo methods use random selections.
Given a very large set and a probability distribution over it, draw a set of samples identically and independently distributed; the distribution can then be approximated using these samples.
Embarrassingly Parallel Computations – Slide 20

Applications of the Monte Carlo Method
Evaluating integrals of arbitrary functions of 6+ dimensions
Predicting future values of stocks
Solving partial differential equations
Sharpening satellite images
Modeling cell populations
Finding approximate solutions to NP-hard problems
Embarrassingly Parallel Computations – Slide 21

Monte Carlo Example: Calculating π
Form a circle within a square, with unit radius, so that the square has sides of length 2. The ratio of the area of the circle to the area of the square is
  π(1)^2 / (2 × 2) = π/4
Randomly choose points within the square and keep score of how many happen to lie within the circle. The fraction of points within the circle will approach π/4, given a sufficient number of randomly selected samples.
Embarrassingly Parallel Computations – Slide 22

Calculating π (cont.) (figure)
Embarrassingly Parallel Computations – Slide 23

Calculating π (cont.) (figure: random points scattered over the square, with those inside the circle counted)
Embarrassingly Parallel Computations – Slide 24

Relative Error
Relative error is a way to measure the quality of an estimate: the smaller the error, the better the estimate. With a the actual value and e the estimated value,
  relative error = |e - a| / a
Embarrassingly Parallel Computations – Slide 25

Increasing Sample Size Reduces Error

  n              Estimate   Rel. Error   1/(2√n)
  10             2.40000    0.23606      0.15811
  100            3.36000    0.06952      0.05000
  1,000          3.14400    0.00077      0.01581
  10,000         3.13920    0.00076      0.00500
  100,000        3.14132    0.00009      0.00158
  1,000,000      3.14006    0.00049      0.00050
  10,000,000     3.14136    0.00007      0.00016
  100,000,000    3.14154    0.00002      0.00005
  1,000,000,000  3.14155    0.00001      0.00002

Embarrassingly Parallel Computations – Slide 26

Next Example: Computing an Integral
One quadrant of the construction can be described by the integral
  ∫₀¹ √(1 - x²) dx = π/4
Random number pairs (xr, yr) are generated, each between 0 and 1, and counted as being in the circle if xr² + yr² ≤ 1.
Embarrassingly Parallel Computations – Slide 27

Example: Computing the Integral
  ∫ from x1 to x2 of (x² - 3x) dx
Sequential code:

sum = 0;
for (i = 0; i < N; i++) {
  xr = randv(x1, x2);
  sum += xr * xr - 3 * xr;
}
area = (sum / N) * (x2 - x1);

The routine randv(x1, x2) returns a pseudorandom number between x1 and x2. The Monte Carlo method is very useful if the function cannot be integrated numerically (perhaps because it has a large number of variables).
Embarrassingly Parallel Computations – Slide 28

Why Monte Carlo Methods Are Interesting
The error in a Monte Carlo estimate decreases by the factor 1/√n, and this rate of convergence is independent of the integrand's dimension. Deterministic numerical integration methods do not share this property; hence Monte Carlo is superior when the integrand has 6 or more dimensions. Furthermore, Monte Carlo methods are often amenable to parallelism: with p processors we can find an estimate about p times faster, or reduce the error of the estimate by a factor of √p.
Embarrassingly Parallel Computations – Slide 29

Summary
Embarrassingly parallel computation examples: low-level image processing, the Mandelbrot set, Monte Carlo calculations.
Applications: numerical integration.
Related topics: Metropolis algorithm, simulated annealing.
For parallelizing such applications, we need the best way to generate random numbers in parallel.
Embarrassingly Parallel Computations – Slide 30

End Credits
Based on original material from:
The University of Akron: Tim O’Neil, Saranya Vinjarapu
The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
Oregon State University: Michael Quinn
Revision history: last updated 7/28/2011.
Embarrassingly Parallel Computations – Slide 31