Generating GPU-Accelerated Code From a High-level Domain-specific Language

Graham Markall
Software Performance Optimisation Group
Imperial College London
http://www.doc.ic.ac.uk/~grm08

Joint work with David Ham and Paul Kelly

October 2009

Problem & Proposal

  How do we exploit multicore architectures to improve the performance of
  the assembly phase?
    Writing code for new architectures is time-consuming and error-prone.

  Proposal: provide a hardware-independent abstraction for the specification
  of finite element methods.
    Future-proofing of code
    Easier development

  Background:
    A conjugate gradient GPU solver is 10x faster than one CPU core.
    Solvers are generic – someone else will solve that problem.

This Study

  We present a pilot study into using the Unified Form Language (UFL) to
  generate CUDA code.

  Part 1:
    1. Nvidia Tesla Architecture & CUDA
    2. Test Problems
    3. Translation Methodology
    4. Performance Optimisations
    5. Performance Results

  Part 2:
    1. UFL
    2. UFL Compiler Design
    3. Test Results
    4. Discussion

  Why bother with Part 1?
    To prove we can speed up assembly using GPUs
    To provide a guide for the output we expect from the compiler
    To experiment with different performance optimisations

NVIDIA Tesla Architecture & CUDA

  GT200 architecture:
    1-4 GiB RAM
    Data transfer between host and device memory

  For high performance:
    Coalescing: accesses from the 16 threads of a half-warp should fall
    within a 64B window
    Use many threads (10,000+)

  Caches:
    Texture cache (read-only)
    Shared memory

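  As a concrete illustration of the last two points, the minimal CUDA sketch
  below (the kernel and array names are illustrative, not from the slides)
  launches far more threads than there are cores and lets consecutive threads
  of a half-warp touch consecutive addresses, so their loads coalesce into a
  few wide memory transactions.

    // Illustrative sketch: each thread handles elements i, i+stride, ... so
    // that at every step the 16 threads of a half-warp read 16 consecutive
    // doubles, which the hardware can coalesce.
    __global__ void scale(double *data, double alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        int stride = gridDim.x * blockDim.x;             // total thread count
        for (; i < n; i += stride)
            data[i] *= alpha;    // consecutive threads, consecutive addresses
    }

    // Host side: launch many threads (10,000+) to hide memory latency, e.g.
    //   scale<<<128, 256>>>(d_data, 2.0, n);
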
The Test Problems

  Test_laplacian: solves ∆u = f on the unit square
    Analytical solution: allows us to examine the accuracy of the computed
    solution

  Test_advection_diffusion: advection-diffusion is more representative:
    Time-dependent, nonlinear
    Multiple assemble/solve cycles

From Fortran to CUDA (Test_laplacian)

  Assembly loop in Fortran (one element per iteration):

    do ele=1,num_ele
      call assemble(ele,A,b)
    end do
    call petsc_solve(x,A,b)

  Assembly loop in CUDA (all elements at once):

    call gpu_assemble()
    call gpucg_solve(x)

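  A minimal sketch of what the gpu_assemble() entry point might look like. In
  the slides it is called from Fortran without arguments (the device state is
  held elsewhere); here the device pointers are passed explicitly, and all
  names (assemble_kernel, d_coords, d_lmat, n_ele) are assumptions so the
  sketch stays self-contained.

    // One CUDA thread assembles one element, replacing one iteration of the
    // Fortran "do ele=1,num_ele" loop.
    __global__ void assemble_kernel(const double *coords, double *lmat, int n_ele)
    {
        int ele = blockIdx.x * blockDim.x + threadIdx.x;
        if (ele < n_ele) {
            // ... compute the local matrix of element `ele` from its node
            // coordinates and write it to lmat[ele*9 .. ele*9+8] ...
        }
    }

    void gpu_assemble(const double *d_coords, double *d_lmat, int n_ele)
    {
        int threads = 256;
        int blocks  = (n_ele + threads - 1) / threads;
        assemble_kernel<<<blocks, threads>>>(d_coords, d_lmat, n_ele);
    }
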
From Fortran to CUDA (Adv.-Diff.)

  Original:
    Assemble
    Solve
    The output of each solve is the input to the next assemble

  CUDA: avoid transferring the solution at every iteration:
    Upload the initial conditions
    Iterate on the GPU
    Transfer the solution only when required

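  A minimal host-side sketch of this flow, with hypothetical function names:
  the solution stays resident on the GPU across timesteps and is only copied
  back when output is needed.

    #include <cuda_runtime.h>

    // Hypothetical helpers: stand-ins for the device-side assemble/solve of
    // Part 1 and for the output schedule.
    void assemble_and_solve_on_gpu(double *d_T);
    bool output_required(int step);

    void run_advection_diffusion(double *h_T, int n_nodes, int n_timesteps)
    {
        double *d_T;
        size_t bytes = n_nodes * sizeof(double);
        cudaMalloc(&d_T, bytes);
        // Upload the initial conditions once.
        cudaMemcpy(d_T, h_T, bytes, cudaMemcpyHostToDevice);

        for (int step = 0; step < n_timesteps; ++step) {
            // Assemble and solve entirely on the device; the solution of one
            // solve stays resident as the input to the next assemble.
            assemble_and_solve_on_gpu(d_T);
            // Transfer the solution only when required.
            if (output_required(step))
                cudaMemcpy(h_T, d_T, bytes, cudaMemcpyDeviceToHost);
        }
        cudaMemcpy(h_T, d_T, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_T);
    }
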
Performance Optimisations

  Coalescing: maximise memory bandwidth
    [Figure: a triangular element with nodes (x1,y1), (x2,y2), (x3,y3); its
    coordinates stored interleaved as x1 y1 x2 y2 x3 y3 versus reordered into
    separate arrays indexed 1..n across all elements, so that consecutive
    threads access consecutive addresses. A sketch of this layout change
    follows.]
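
  A minimal sketch of the layout change the diagram describes, with assumed
  array names: interleaved per-node (x, y) pairs are transposed once into
  separate x and y arrays so that later kernels read them with coalesced
  accesses.

    // Sketch (assumed names). Interleaved layout: xy[2*i] = x_i, xy[2*i+1] = y_i.
    // Separated layout: x[i] and y[i] live in their own arrays, so a half-warp
    // reading x[i..i+15] touches one contiguous block instead of strided
    // locations.
    __global__ void transpose_coords(const double *xy, double *x, double *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            x[i] = xy[2 * i];        // strided reads happen here once...
            y[i] = xy[2 * i + 1];
        }
    }
    // ...so every subsequent kernel can read x[i] and y[i] fully coalesced.
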
  Specialisation of kernels (reduced register usage): loop bounds that are
  variables in the generic version become compile-time constants (a
  template-based sketch follows).

    Generic:
      for(int x=0; x<nodes; x++) {
        for(int y=0; y<nodes; y++) {
          ...;
        }
      }

    Specialised:
      for(int x=0; x<3; x++) {
        for(int y=0; y<3; y++) {
          ...;
        }
      }
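
  One way a code generator could obtain such specialised kernels, sketched
  here as an alternative illustration rather than the approach the slides
  describe, is to make the node count a template parameter so the loop bounds
  are compile-time constants.

    // Sketch: `nodes` is a compile-time constant, so the loops can be fully
    // unrolled and fewer registers are needed than with a runtime bound.
    template <int nodes>
    __device__ void zero_local_matrix(double *lmat /* nodes*nodes entries */)
    {
        for (int x = 0; x < nodes; x++)
            for (int y = 0; y < nodes; y++)
                lmat[x * nodes + y] = 0.0;   // placeholder for the real integration code
    }
    // Instantiated as zero_local_matrix<3>(...) for linear triangles.
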
  Texture memory for the matrix sparsity: the matrix is stored in CSR form
  (row_ptr, col_idx, val), and the read-only sparsity lookups are routed
  through the texture cache (a sketch follows).
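
  A minimal sketch of how the CSR sparsity arrays might be routed through the
  read-only texture cache. The texture references and the search routine below
  are illustrative, not the code the compiler generates; row_ptr and col_idx
  are the array names from the slide.

    // Sketch (assumed names): binding the sparsity arrays to textures caches
    // the irregular, read-only lookups performed by the addto kernels.
    texture<int, 1, cudaReadModeElementType> tex_row_ptr;
    texture<int, 1, cudaReadModeElementType> tex_col_idx;

    __device__ int find_entry(int row, int col)
    {
        // Linear search of row `row` for column `col`; both arrays are read
        // through the texture cache.
        for (int k = tex1Dfetch(tex_row_ptr, row);
             k < tex1Dfetch(tex_row_ptr, row + 1); ++k)
            if (tex1Dfetch(tex_col_idx, k) == col)
                return k;
        return -1;
    }

    // Host side, once during initialisation:
    //   cudaBindTexture(0, tex_row_ptr, d_row_ptr, (n_rows + 1) * sizeof(int));
    //   cudaBindTexture(0, tex_col_idx, d_col_idx, nnz * sizeof(int));
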
Performance Results

  For the advection-diffusion problem. Test setup:
    Nvidia GTX 280 – 1GB RAM (use a Tesla C1060 for 4GB)
    Intel Core 2 Duo E8400 @ 3.00GHz
    2GB RAM in the host machine
    Intel C++ and Fortran Compilers V10.1
      (V11.0 suffers from bugs and cannot compile Fluidity)
    CPU implementations compiled with -O3
    CUDA implementation compiled using NVCC 2.2

  Run the problem for 200 timesteps
  Increasingly fine meshes (increasing element count)
  Five runs of each problem; averages reported
  Double precision computations

Advection-Diffusion Assembly Time
  [Chart]

Speedup in the Assembly Phase
  [Chart]

Overall Speedup (Assemble & Solve)
  [Chart]

Proportion of GPU Time in each Kernel

  Which kernels should we focus on optimising?
  Addto kernels: 84% of execution time
  [Chart]

The Impact of Atomic Operations
  [Chart]

  Colouring optimisation on GPUs: [1] D. Komatitsch, D. Michea and G.
  Erlebacher. Porting a high-order finite-element earthquake modelling
  application to Nvidia graphics cards using CUDA. J. Par. Dist. Comp.,
  69(5):451-460, 2009.
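
  The addto kernels accumulate contributions from many elements into shared
  global matrix entries. GT200-class hardware has no native double-precision
  atomicAdd, so each conflicting update goes through a compare-and-swap loop
  like the standard workaround sketched below, which is what makes contended
  addtos expensive; colouring the elements so that no two elements of a colour
  share a node [1] allows plain, non-atomic additions instead.

    // Standard CAS-based double-precision atomic add. Each conflicting
    // update retries until its compare-and-swap succeeds.
    __device__ double atomicAddDouble(double *address, double val)
    {
        unsigned long long int *address_as_ull = (unsigned long long int *)address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);
        return __longlong_as_double(old);
    }
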
Summary of Part 1

  8x speedup over 2 CPU cores for assembly
  6x speedup overall
  Further performance gains expected from:
    Colouring elements & non-atomic operations [1]
    Alternative matrix storage formats
    Fusing kernels [2], mesh partitioning [3]

  Fusing kernels: [2] J. Filipovic, I. Peterlik and J. Fousek. GPU
  Acceleration of Equations Assembly in Finite Elements Method – Preliminary
  Results. In SAAHPC Symposium on Application Accelerators in HPC, July 2009.
  Mesh partitioning: [3] A. Klockner, T. Warburton, J. Bridge and
  J. S. Hesthaven. Nodal Discontinuous Galerkin methods on graphics
  processors. Journal of Computational Physics, in press, 2009.

Part 2: A UFL [4] Example (Laplacian)

  Solving: ∆u = f on the unit square
  Weak form (ignoring boundary terms): ∫ ∇v·∇u dx = ∫ v f dx

    Psi = state.scalar_fields("psi")
    v = TestFunction(Psi)
    u = TrialFunction(Psi)
    f = Function(Psi, "sin(x[0])+cos(x[1])")
    A = dot(grad(v),grad(u))*dx
    RHS = v*f*dx
    Solve(Psi,A,RHS)

  Close to mathematical notation
  No implementation details
    Allows code generation for multiple backends and the choice of
    optimisations to be explored

  [4] M. Alnaes and A. Logg. Unified Form Language Specification and User's
  Manual. http://www.fenics.org/pub/documents/ufl/ufl-user-manual/ufl-user-manual.pdf.
  Retrieved 15 Sep 2009.

From UFL to CUDA

  We "parse" UFL using the ufl.algorithms package, leading to the creation of
  a DAG representing the assembly.

  Intermediate representation:

    UFL → Frontend → IR → Backend → CUDA

  The right-hand side is handled similarly.
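
  A minimal sketch of one possible shape for that intermediate representation.
  The names stringList, kernelLaunch and launchList appear in the backend
  example on the next slide; their definitions here are assumptions made for
  illustration only.

    #include <list>
    #include <string>
    #include <vector>

    // Assumed IR: an ordered list of kernel launches, each naming the kernel
    // and its parameters.
    typedef std::vector<std::string> stringList;     // kernel parameter names

    struct kernelLaunch {
        std::string name;      // e.g. "shape_shape", "matrix_addto"
        stringList *params;    // parameters, in the order they are passed
        kernelLaunch(const std::string &n, stringList *p) : name(n), params(p) {}
    };

    // The frontend appends one kernelLaunch per generated kernel call; the
    // backend walks this list and emits the corresponding CUDA launches.
    std::list<kernelLaunch> launchList;
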
Testing

  Frontend (UFL input):

    psi = state.scalar_fields("psi")
    v = TestFunction(P)
    u = TrialFunction(P)
    f = Function(P)
    f.name = "shape_rhs"
    A = dot(grad(v),grad(u))*dx
    solve(P, A, f)

  Backend (example):

    stringList *params = new stringList();
    (*params).push_back(string("val"));
    (*params).push_back(string("size_val"));
    (*params).push_back(string("ele_psi"));
    (*params).push_back(string("lmat"));
    (*params).push_back(string("n"));
    launchList.push_back(
      kernelLaunch("matrix_addto",params));

  [Side-by-side comparison of the hand translation and the generated code.]

Testing - continued

  Helmholtz equation.
  Bilinear form (as encoded below): ∫ (∇v·∇u + 20 v u) dx

    A = (dot(grad(v), grad(u))+(20)*dot(v,u))*dx

  This adds extra calls to shape_shape and matrix_addto.

  [Side-by-side comparison of the FEniCS Dolfin solution and the generated
  code solution.]

Conclusions

  We obtain speedups of 8x over a 2-core CPU in the assembly phase using CUDA
    An overall speedup of 6x over 2 cores

  Generation of CUDA code from UFL source is feasible
    UFL is "future proof"
    UFL is easier to use than CUDA, Fortran, etc.
    Allows automated exploration of optimisations
    Other backends (Cell, multicore CPU, Larrabee, etc.) should be possible

Further work

  On the UFL compiler:
    Support for a more complete subset of UFL
    Development of a more expressive intermediate representation
      Facilitates the development of other backends
      Generation of kernels from the IR
    Automatic tuning

  On the conjugate gradient solver:
    Integration with a blocked SpMV implementation [5]
      Expected: further performance improvements

  Blocked SpMV: [5] A. Monakov and A. Avetisyan. Implementing Blocked Sparse
  Matrix-Vector Multiplication on Nvidia GPUs. In SAMOS IX: International
  Symposium on Systems, Architectures, Modeling and Simulation, July 2009.

Spare Slides

Test Advection Diffusion UFL

  Advection:

    T=state.scalar_fields(Tracer)
    U=state.vector_fields(Velocity)
    UNew=state.vector_fields(NewVelocity)

    # We are solving for the Tracer, T.
    t=Function(T)
    p=TrialFunction(T)
    q=TestFunction(T)

    #The value of the advecting velocity U is known.
    u=Function(U)
    unew=Function(UNew)

    #Mass matrix.
    M=p*q*dx

    #Solve for T1-T4.
    rhs=dt*dot(grad(q),u)*t*dx
    t1=solve(M,rhs)
    rhs=dt*dot(grad(q),(0.5*u+0.5*unew))*(t+0.5*t1)*dx
    t2=solve(M,rhs)
    rhs=dt*dot(grad(q),(0.5*u+0.5*unew))*(t+0.5*t2)*dx
    t3=solve(M,rhs)

    #Solve for T at the next time step.
    rhs=action(M,t) + 1.0/6.0*t1 + 1.0/3.0*t2
                    + 1.0/3.0*t3 + 1.0/6.0*t4
    t=solve(M,rhs)

  Diffusion:

    mu=state.tensor_fields(TracerDiffusivity)
    i,j=indices(2)
    M=p*q*dx
    d=-grad(q)[i]*mu[i,j]*grad(p)[j]*dx
    A=M-0.5*d
    rhs=action(M+0.5*d,t)
    T=solve(A,rhs)

Memory Bandwidth Utilisation
  [Chart. Orange: using atomic operations; blue: using non-atomic operations.]

Proportion of GPU Time in each Kernel
  [Chart. Orange: using atomic operations; blue: using non-atomic operations.]

Assembly Throughput
  [Chart]

Code Generation

  A list of variables, kernels and parameters is passed to the backend.
  The backend uses the ROSE Compiler Infrastructure [6] to build an AST for
  gpu_assemble.cu, containing:

    Declarations:    int, double, ...
    Initialisation:  cudaMalloc(), cudaBindTexture(), cudaMemcpy()
    Streaming:       cudaMemcpy()
    Assembly:        kernel<<<...>>>() launches
    Finalisation:    cudaFree(), cudaUnbindTexture()

  CUDA keywords (__global__, the <<<...>>> notation) are inserted as
  arbitrary strings.
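
  A hedged sketch of the shape of a generated gpu_assemble.cu: the variable
  and kernel names are placeholders, and only the five sections correspond to
  the structure described above.

    // Declarations (placeholder names throughout)
    static double *d_psi;
    static int     n_psi;
    texture<int, 1, cudaReadModeElementType> tex_col_idx;

    __global__ void shape_rhs(double *psi, int n)
    {
        // ... generated kernel body ...
    }

    extern "C" void gpu_initialise(const double *h_psi, const int *d_col_idx,
                                   int n, int nnz)
    {
        // Initialisation
        n_psi = n;
        cudaMalloc((void **)&d_psi, n * sizeof(double));
        cudaMemcpy(d_psi, h_psi, n * sizeof(double), cudaMemcpyHostToDevice);
        cudaBindTexture(0, tex_col_idx, d_col_idx, nnz * sizeof(int));
    }

    extern "C" void gpu_assemble()
    {
        // Assembly: one launch per kernel in the launch list
        shape_rhs<<<128, 256>>>(d_psi, n_psi);
    }

    extern "C" void gpu_stream(double *h_psi)
    {
        // Streaming: copy fields back to the host when required
        cudaMemcpy(h_psi, d_psi, n_psi * sizeof(double), cudaMemcpyDeviceToHost);
    }

    extern "C" void gpu_finalise()
    {
        // Finalisation
        cudaUnbindTexture(tex_col_idx);
        cudaFree(d_psi);
    }
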
NVIDIA Tesla Architecture & CUDA

  GT200 architecture:
    10 TPCs
    8 banks of DRAM: 1-4 GiB

  y = αx + y in C:

    void daxpy(double a, double* x,
               double* y, int n)
    {
      for (int i=0; i<n; i++)
        y[i] = y[i] + a*x[i];
    }

  CUDA kernel:

    // T_ID stands for the thread's global index and T_COUNT for the total
    // number of threads launched; each thread strides through the array.
    __global__ void daxpy(double a,
                          double* x,
                          double* y,
                          int n)
    {
      for (int i=T_ID; i<n; i+=T_COUNT)
        y[i] = y[i] + a*x[i];
    }
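
  For reference, a host-side driver for this kernel might look as follows; the
  grid and block sizes and the d_x/d_y names are illustrative.

    // Illustrative host code: copy operands to the device, launch many
    // threads, copy the result back.
    void daxpy_on_gpu(double a, const double *x, double *y, int n)
    {
        double *d_x, *d_y;
        size_t bytes = n * sizeof(double);
        cudaMalloc(&d_x, bytes);
        cudaMalloc(&d_y, bytes);
        cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);
        daxpy<<<128, 256>>>(a, d_x, d_y, n);   // 32768 threads, strided loop
        cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_y);
    }
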
Variable naming

  How do we ensure the output of a kernel is correctly input to successive
  kernels? Invent names consistently.

    [Kernel chain annotated with the invented names: one kernel outputs
    dshape_psi; the next takes dshape_psi as input and outputs lmat_psi_psi;
    the next takes lmat_psi_psi as input.]

    Psi = state.scalar_fields("psi")
    v = TestFunction(Psi)
    u = TrialFunction(Psi)
    f = Function(Psi, "sin(x[0])+cos(x[1])")
    A = dot(grad(v),grad(u))*dx
    RHS = v*f*dx
    Solve(Psi,A,RHS)