									Linear Algebra – I

 Sashi Kumar Penta
   GPGP class presentation
       Spring 2007
                  LA – 1: Outline
• Motivation
• Dense matrices and vectors
  –   Representation
  –   Vector-Vector operations
  –   Vector reduction type operations
  –   Matrix-Matrix multiplication
       • Larsen et al.
       • Fatahalian et al.
       • Naga Govindaraju et al.
• Introduction to CUDA
  – Matrix multiplication
            LA – 2: Outline
• Memory requirements for Balanced
  computer architectures
• Sparse matrices
  – Banded matrices
  – Random sparse matrices
• Sparse Matrix-Vector multiplication
• Solving Navier-Stokes on the GPU
• …
         Why LA on GPUs?
• Applications of LA
  – Scientific computations
  – Search (LSI)
  – Graphics Simulations
• The GPU is a fast streaming processor
  – LA operations are easily “streamable”
• The result of the computation is already on
  the GPU and ready for display
  – For simulations
          Arrays = textures
• 2D arrays are the native GPU data layout

• Maximum array (texture) size on the GPU: 8K x 8K

• Texture coordinates to access values
  stored in the textures
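
For concreteness, a minimal sketch of uploading an N x N float array as a texture, assuming the ARB_texture_rectangle and ARB_texture_float extensions (the helper name uploadMatrix is illustrative):

#include <GL/glew.h>

/* Upload an N x N array of floats as a single-channel rectangle texture;
   texel (j, i) holds element a[i * N + j]. */
GLuint uploadMatrix(const float *a, int N)
{
    GLuint matTex;
    glGenTextures(1, &matTex);
    glBindTexture(GL_TEXTURE_RECTANGLE_ARB, matTex);
    /* Nearest filtering: we want the exact array value, not an interpolation. */
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    /* 32-bit float luminance storage; rectangle coordinates are unnormalized (0 .. N-1). */
    glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_LUMINANCE32F_ARB,
                 N, N, 0, GL_LUMINANCE, GL_FLOAT, a);
    return matTex;
}

An RGBA float format could be used instead to pack four values per texel, as discussed on the following slides.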
                 Representation
• An N x N matrix is split into N vectors, and each length-N vector is
  packed into its own 2D texture
  [Figure: Matrix → N Vectors → N 2D-Textures]

                                                      Courtesy Jens Krüger
     Matrix multiplication: Why is it
              important?


•   Inherently parallel
•   Memory intensive
•   Benefits greatly from GPUs
•   Used for benchmarking hardware
       Early Pre-floating point
• Larsen and McAllister described an early
  pre-floating point implementation of matrix
  multiplies.
• Uses 2D textures and blending operations to
  perform the matrix product
• NVIDIA GeForce3 GPU: 8-bit fixed-point output
• 4.4 GB ops
     L-M GPU-based Nested Loop
         Matrix multiplication
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
         Z[i,j] = 0;
         // Each iteration of the following loop is a quadrilateral
         // rasterization of size N x N
        for(k = 0; k < N; k++)
             Z[i,j] = Z[i,j] + X[i,k] * Y[k,j];
     }
}
              Continued ..

Multitexture modulation lets the graphics hardware multiply values
fetched from two separate textures in a single rendering pass
                                   Continued…
Load texture texA with matrix A.
Load texture texB with matrix B.
Set the multitexturing mode to modulate.
Set the framebuffer write mode to accumulate.
For i = 0 .. n -1
    draw an n x n pixel rectangle with
         texA coords (0,i), (0,i), (n-1, i), (n-1, i) and
         texB coords (i, 0), (i, n-1), (i, n-1), (i,0).
End
Screen contains result of A x B
(pass i accumulates the outer product of one line of A with the
corresponding line of B into the framebuffer)
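
A sketch of the fixed-function OpenGL state that "modulate" and "accumulate" refer to; the texture objects texA and texB are assumed to already hold the matrices:

/* Bind matrix A to texture unit 0 and matrix B to unit 1; the modulate
   environment multiplies the two fetches per fragment. */
glActiveTextureARB(GL_TEXTURE0_ARB);
glBindTexture(GL_TEXTURE_2D, texA);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);

glActiveTextureARB(GL_TEXTURE1_ARB);
glBindTexture(GL_TEXTURE_2D, texB);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE);

/* Accumulate each pass into the framebuffer: dst = dst + src. */
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE);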
        Floating point textures
• Texture targets
  – GL_TEXTURE_2D
     • Coordinates have to be normalized [0, 1]
  – GL_TEXTURE_RECTANGLE_ARB
     • Coordinates are not normalized


• Internal formats (packing a length-N vector)
  – GL_LUMINANCE: 1 float per texel, so a sqrt(N) x sqrt(N) texture
  – GL_RGBA: 4 floats per texel, so a sqrt(N/4) x sqrt(N/4) texture

                                                   Courtesy Jens Krüger
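
A small sketch, in plain C, of padding and packing a length-N vector into RGBA texels to match the sqrt(N/4) sizing above (packVectorRGBA is an illustrative helper name):

#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Pack a length-N vector into texSize x texSize RGBA texels (4 floats each).
   Returns the padded buffer to pass to glTexImage2D; *texSize receives the side. */
float *packVectorRGBA(const float *v, int N, int *texSize)
{
    int side = (int)ceil(sqrt(N / 4.0));
    float *buf = (float *)calloc((size_t)side * side * 4, sizeof(float));
    memcpy(buf, v, N * sizeof(float));   /* unused tail texels stay zero */
    *texSize = side;
    return buf;
}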

                                        Operations
• Vector-Vector Operations
     – Reduced to 2D texture operations
     – Coded in pixel shaders
• Example: Vector1 + Vector2 → Vector3
  [Figure: render a static quad; the vertex shader passes coordinates
   through; the pixel shader returns tex0 + tex1, with Vector 1 and
   Vector 2 bound to TexUnit 0 and TexUnit 1; the output (Vector 3) is
   written to a render texture]
 Vector-vector operation: saxpy
• Scalar alpha * x plus y, i.e. (alpha * x + y)
• Y_new = y_old + alpha * x
• On CPU:
     for (int i = 0; i < N; i++)
         Y[i] = Y[i] + alpha * X[i]

• Calculations are independent
• Computational kernel
         Y_new[i] = Y_old[i] + alpha * X[i]
             Kernels = shaders
float saxpy (
    float2 coords : TEXCOORD0,
    uniform sampler2D textureY,
    uniform sampler2D textureX,
    uniform float alpha) : COLOR
{
    float result;
    float y = tex2D(textureY, coords);
    float x = tex2D(textureX, coords);
    result = y + alpha * x;
    return result;
}
Four-channeled RGBA texture

Using 4-channel textures improves the efficiency of both fetching and computation.

float4 saxpy (
    float2 coords : TEXCOORD0,
    uniform sampler2D textureY,
    uniform sampler2D textureX,
    uniform float alpha) : COLOR
{
    float4 result;
    float4 y = tex2D(textureY, coords);
    float4 x = tex2D(textureX, coords);
    result = y + alpha * x;
    return result;
}
           Computing=drawing
glPolygonMode(GL_FRONT,GL_FILL);
glBegin(GL_QUADS);
 glTexCoord2f(0.0, 0.0); glVertex2f(0.0, 0.0);
 glTexCoord2f(texSize, 0.0); glVertex2f(texSize, 0.0);
 glTexCoord2f(texSize, texSize); glVertex2f(texSize, texSize);
 glTexCoord2f(0.0, texSize); glVertex2f(0.0, texSize);
glEnd();

                     Texture rectangles
           Computing=drawing
glPolygonMode(GL_FRONT,GL_FILL);
glBegin(GL_QUADS);
 glTexCoord2f(0.0, 0.0); glVertex2f(0.0, 0.0);
 glTexCoord2f(1.0, 0.0); glVertex2f(texSize, 0.0);
 glTexCoord2f(1.0, 1.0); glVertex2f(texSize, texSize);
 glTexCoord2f(0.0, 1.0); glVertex2f(0.0, texSize);
glEnd();

                        Texture2D
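
Neither variant shows where the output lands; one way, sketched here under the assumption that EXT_framebuffer_object is available (bindRenderTarget and resultTex are illustrative names), is to attach the result texture to an offscreen framebuffer before drawing the quad:

#include <GL/glew.h>

/* Attach the destination texture to an offscreen framebuffer so that
   drawing the quad writes directly into it. */
void bindRenderTarget(GLuint resultTex, int texSize)
{
    GLuint fb;
    glGenFramebuffersEXT(1, &fb);
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fb);
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                              GL_TEXTURE_RECTANGLE_ARB, resultTex, 0);
    /* One output pixel per texel. */
    glViewport(0, 0, texSize, texSize);
}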
    Reduction-type operations
• Maximum, vector norm, dot products etc.
• One method: render a 1x1 output
     • only one parallel processing element (PE)
     • Might exceed the maximum allowed shader length
• Local maximum of each 2x2 group of
  elements in one kernel
• Input: M by M, output: M/2 by M/2
• Recursively repeated until the final 1 by 1
• Logarithmic number of iterations
Rendering a quad
      // Draws a w x h quad; each vertex carries four texture coordinates
      // addressing its 2x2 neighborhood in the (2w x 2h) input texture.
      void drawQuad(int w, int h) {
         glBegin(GL_QUADS);
           glMultiTexCoord2f(GL_TEXTURE0, -0.5, -0.5);
           glMultiTexCoord2f(GL_TEXTURE1, 0.5, -0.5);
           glMultiTexCoord2f(GL_TEXTURE2, -0.5, 0.5);
           glMultiTexCoord2f(GL_TEXTURE3, 0.5, 0.5);
           glVertex2f(0.0, 0.0);

            glMultiTexCoord2f(GL_TEXTURE0, 2*w-0.5, -0.5);
            glMultiTexCoord2f(GL_TEXTURE1, 2*w+0.5, -0.5);
            glMultiTexCoord2f(GL_TEXTURE2, 2*w-0.5, 0.5);
            glMultiTexCoord2f(GL_TEXTURE3, 2*w+0.5, 0.5);
            glVertex2f(w, 0.0);

            glMultiTexCoord2f(GL_TEXTURE0, 2*w-0.5, 2*h-0.5);
            glMultiTexCoord2f(GL_TEXTURE1, 2*w+0.5, 2*h-0.5);
            glMultiTexCoord2f(GL_TEXTURE2, 2*w-0.5, 2*h+0.5);
            glMultiTexCoord2f(GL_TEXTURE3, 2*w+0.5, 2*h+0.5);
            glVertex2f(w, h);

             glMultiTexCoord2f(GL_TEXTURE0, -0.5, 2*h-0.5);
             glMultiTexCoord2f(GL_TEXTURE1, 0.5, 2*h-0.5);
             glMultiTexCoord2f(GL_TEXTURE2, -0.5, 2*h+0.5);
             glMultiTexCoord2f(GL_TEXTURE3, 0.5, 2*h+0.5);
             glVertex2f(0.0, h);
          glEnd();
      }
                     maximum
float maximum (float2 left: TEXCOORD0, float2 right: TEXCOORD1,
                  float2 top: TEXCOORD2, float2 bottom: TEXCOORD3,
                  uniform samplerRECT texture) : COLOR
{
    float val1 = texRECT(texture, left);
    float val2 = texRECT(texture, right);
    float val3 = texRECT(texture, top);
    float val4 = texRECT(texture, bottom);
    return max(val1,max(val2,max(val3,val4)));
}
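
The per-pass pieces above still need a driver loop that halves the texture until a single texel remains. A sketch, assuming N is a power of two and that setOutputTexture/setInputTexture are helpers (not part of the original) that bind the render target and the input sampler:

/* Assumed helpers:
   void setOutputTexture(GLuint tex, int w, int h);  // FBO attachment + viewport
   void setInputTexture(GLuint tex);                 // bind to the 'maximum' sampler */

/* Reduce an N x N texture to its global maximum in log2(N) passes.
   Each pass reads a (2w x 2w) input and writes per-2x2-block maxima
   into a (w x w) output texture.  Returns the 1x1 result texture. */
GLuint reduceMax(GLuint inTex, GLuint outTex, int N)
{
    for (int w = N / 2; w >= 1; w /= 2) {
        setOutputTexture(outTex, w, w);
        setInputTexture(inTex);
        drawQuad(w, w);                  /* quad from the previous slide */

        /* Ping-pong: this pass's output becomes the next pass's input. */
        GLuint tmp = inTex; inTex = outTex; outTex = tmp;
    }
    return inTex;                        /* now a 1x1 texture holding the max */
}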
     Matrix-Matrix multiplication
• On CPUs            for (i = 0; i < N; i++)
                         for( j = 0; j < N; j++)
                         {
                             C[i, j] = 0;
                             for(k = 0; k < N; k++)
                                 C[i, j] += A[i,k] * B[k, j];
                         }

• Suffers from poor locality: elements of B are accessed
  column-wise, i.e. not in sequential order in memory.
  CPU implementations therefore "block" the inputs into small
  sub-matrices that fit in the processor cache.
• Bandwidth limited: the size of the per-loop working set
  prevents bandwidth amplification from the closest and
  fastest levels of the memory hierarchy.
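
For reference, a minimal sketch of the cache blocking mentioned above (plain C, row-major storage; the tile size BS is illustrative and C is assumed to be zeroed beforehand):

#define BS 32   /* tile edge, chosen so three BS x BS tiles fit in cache */

/* Blocked (tiled) matrix multiply: C += A * B, all N x N, row-major,
   with N assumed to be a multiple of BS. */
void matmul_blocked(const float *A, const float *B, float *C, int N)
{
    for (int ib = 0; ib < N; ib += BS)
        for (int jb = 0; jb < N; jb += BS)
            for (int kb = 0; kb < N; kb += BS)
                /* Work entirely inside the three cached tiles. */
                for (int i = ib; i < ib + BS; i++)
                    for (int k = kb; k < kb + BS; k++) {
                        float a = A[i * N + k];
                        for (int j = jb; j < jb + BS; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}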
       Fatahalian et al.: Matrix-matrix
          multiplication on the GPU
• 2x2 blocks of matrix elements packed into 4-component texels
• Reads 2 rows from matrix A and 2 columns of B to
  compute 4 elements of the output matrix
• Each shader program performs 4 inner-loop iterations
  of the CPU program
• GPU texture caches are organized to capture 2D
  locality, so sequential access along either axis is
  cache coherent
   for (k = 0; k < N/2; k++)
        C[i, j].xyzw += A[i,k].xxzz * B[k, j].xyxy +
                         A[i,k].yyww * B[k, j].zwzw;
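
A small sketch of the 2x2 packing this scheme assumes (plain C, row-major input; pack2x2 is an illustrative name). With this layout the swizzles in the loop above produce the four correct output elements per texel:

/* Pack row-major M (N x N, N even) into (N/2) x (N/2) RGBA texels:
   texel (i, j) = { M[2i][2j], M[2i][2j+1], M[2i+1][2j], M[2i+1][2j+1] }
   in (x, y, z, w), i.e. one 2x2 block per texel. */
void pack2x2(const float *M, float *texels, int N)
{
    int half = N / 2;
    for (int i = 0; i < half; i++)
        for (int j = 0; j < half; j++) {
            float *t = &texels[(i * half + j) * 4];
            t[0] = M[(2 * i)     * N + 2 * j];       /* x */
            t[1] = M[(2 * i)     * N + 2 * j + 1];   /* y */
            t[2] = M[(2 * i + 1) * N + 2 * j];       /* z */
            t[3] = M[(2 * i + 1) * N + 2 * j + 1];   /* w */
        }
}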
              Multipass algorithm
• Single pass: the instruction count exceeds the limits of GPU
  architectures for sufficiently large matrices
• 4 consecutive elements from a matrix column packed into a texel
• 4x4 matrix by 4x1 vector products
• 4x reuse of data in texel B
• 6 fetches for 4 MAD operations: 1.5x more fetches

  C[i, j].xyzw = accum[i,j].xyzw +
                      A[i,k].xyzw * B[k/4, j].xxxx +
                      A[i,k+1].xyzw * B[k/4, j].yyyy +
                      A[i,k+2].xyzw * B[k/4, j].zzzz +
                      A[i,k+3].xyzw * B[k/4, j].wwww
  Measurements – multiplication of
      1024 x 1024 matrices
                    Time (s)    GFLOPS    BW (GB/s)
NV5900 single        0.781       2.75       7.22
NV6800 single        0.283       7.59      15.36
ATI 9800 single      0.445       4.83      12.06
P4 ATLAS             0.289       7.78      27.68
NV6800 multi         0.469       4.58      12.79



Available floating point bandwidth from the closest cache on these GPUs is
up to several times lower than that of current CPUs to their L1 caches
    Naga Govindaraju et al.: Matrix-matrix
        multiplication on the GPU
• A memory model for scientific algorithms on
  Graphics Processors
• Explicit blocking to avoid hardware
  dependencies
• Computation on the tiles of the size T x T is
  invoked by drawing quadrilaterals of size T x T
  on the screen.
• A single fragment program evaluates the dot
  product from vectors of size D in X and Y
  (inputs).
     GPU-based nested loop matrix
            multiplication
for(ib = 0; ib < N; ib = ib + T)
   for(jb = 0; jb < N; jb = jb + T)
       for(kb = 0; kb < N; kb = kb + D)
           // following two loops invoked using a quadrilateral of size TxT
           for(i = ib; i < ib + T; i = i +1)
                 for(j = jb; j < jb + T; j = j + 1)
                       // following loop is performed inside
                       // a fragment program
                       for(k = kb; k < kb + D; k = k + 1)
                            Z[i,j] = Z[i,j] + X[i,k] * Y[k,j]
           Measurements

• NVIDIA 7900 GTX GPU
  – 17.6 GFLOPS
  – 40 GB/s
• NVIDIA 6800 Ultra
  – 8 GFLOPS
   Matrix-Matrix multiplication in
           CUDA/CTM
• ~ 100 GFLOPS
• How?
  – Use of extended memory formats (fetch4) and
    deeper loop unrolling (R580)
  – Shared memory (G80)
    • On-chip shared memory allows you to keep blocks
      of data close to the ALUs rather than having to
      ping-pong off-chip to an FBO.
    • Shared memory is only exposed through CUDA
        Introduction to CUDA
• Thread Batching
  – Thread Block (threads identified by threadID)
    • A batch of threads that can cooperate
    • Share data through fast shared memory
    • Synchronize their execution
  – Grid of Thread Blocks (blocks identified by blockID)
    • Blocks that execute the same kernel can be batched together
    • At the cost of reduced thread cooperation: threads in
      different blocks cannot synchronize with one another
Memory model
  Matrix multiplication using CUDA
• Each thread block is responsible for
  computing one square sub-matrix
  Csub of C
• Csub has size BLOCK_SIZE x BLOCK_SIZE,
  with BLOCK_SIZE = 16
void Mul(const float * A, const float * B, int hA, int wA, int wB, float * C)
{
     // Load A into device global memory
     float * Ad;
     int size = hA * wA * sizeof(float);
     cudaMalloc((void **) &Ad, size);
     cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);

     // Load B into device global memory
     float * Bd;
     size = wA * wB * sizeof(float);
     cudaMalloc((void **) &Bd, size);
     cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

     // Allocate C in device global memory
     float * Cd;
     size = hA * wB * sizeof(float);
     cudaMalloc((void **) &Cd, size);

     // Execution configuration
     dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
     dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

     // Launch the device computation
     Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

     // Read C back from the device
     cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

     // Free device memory
     cudaFree(Ad);
     cudaFree(Bd);
     cudaFree(Cd);
}
__global__ void Muld(const float * A, const float * B, int wA, int wB, float * C)
{
    // Block and thread indices
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block,
    // end index (exclusive), and step between consecutive sub-matrices
    int aBegin = wA * BLOCK_SIZE * by;
    int aEnd = aBegin + wA;
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block and its step
    int bBegin = BLOCK_SIZE * bx;
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix computed by this thread
    float Csub = 0;

    // …

    // Iterate through all sub-matrices of A and B
    for( int a = aBegin, b = bBegin; a < aEnd; a += aStep, b += bStep)
    {
          __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
          __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

          // Load one sub-matrix of A and one of B from global memory
          // to shared memory
          As[ty][tx] = A[a + wA * ty + tx];
          Bs[ty][tx] = B[b + wB * ty + tx];

          // Synchronize to make sure both sub-matrices are fully loaded
          __syncthreads();

          // Compute the product of the two sub-matrices
          for(int k = 0; k < BLOCK_SIZE; ++k)
              Csub += As[ty][k] * Bs[k][tx];

          // Synchronize to make sure the product of the two sub-matrices
          // is done before loading the next pair
          __syncthreads();
    }

    // Write the block sub-matrix to device global memory
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
              Conclusions
• Different ways of representing matrices
  and vectors on the GPU
• Various algorithms for matrix-matrix
  multiplication and vector-vector operations
  on the GPU
• Computation throughput of the GPU
• Memory performance of the GPU
Questions?
Thank you
                   References
• E. S. Larsen and D. McAllister. Fast Matrix Multiplies Using
  Graphics Hardware. Supercomputing 2001.
• Dominik Göddeke. GPGPU Tutorials.
  http://www.mathematik.uni-
  dortmund.de/~goeddeke/gpgpu/index.html
• Adam Moravanszky. Dense Matrix Algebra on the GPU.
  2003. http://www.shaderx2.com/shaderx.PDF
• Nico Galoppo, Naga K. Govindaraju, Michael Henson, and
  Dinesh Manocha. LU-GPU: Efficient Algorithms for Solving
  Dense Linear Systems on Graphics Hardware. Supercomputing 2005.
• CUDA Programming Guide and CUDA BLAS.
  http://developer.nvidia.com/object/cuda.html
               References ctd..
• K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding
  the Efficiency of GPU Algorithms for Matrix-Matrix
  Multiplication. Graphics Hardware 2004.
• Jens Krüger. Linear Algebra on GPUs. SIGGRAPH 2004
  course notes.
• Naga K. Govindaraju, Scott Larsen, Jim Gray, and Dinesh
  Manocha. A Memory Model for Scientific Algorithms on
  Graphics Processors. Supercomputing 2006.

								