CUDA Lecture 1: Introduction to Massively ...


Prepared 6/23/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
 Your own PCs running G80 emulators
   Better debugging environment
   Sufficient for the first couple of weeks
 Your own PCs with a CUDA-enabled GPU
 NVIDIA boards in department
   GeForce family of processors for high-performance gaming
   Tesla C2070 for high-performance computing – no graphics output (?) and more memory



                                          CUDA at the University of Akron – Slide 2
Description            Card Models        Where Available
Low Power              Ion                Netbooks in CAS 241.
Consumer Graphics      GeForce 8500GT     Add-in cards in Dell Optiplex 745s
Processors             GeForce 9500GT     in department.
                       GeForce 9600GT
2nd Generation GPUs    GeForce GTX275     In Dell Precision T3500s in department.
Fermi GPUs             GeForce GTX480     In select Dell Precision T3500s in department.
                       Tesla C2070        In Dell Precision T7500 Linux server
                                          (tesla.cs.uakron.edu).



                                            CUDA at the University of Akron – Slide 3
 Basic building block is a “streaming multiprocessor”
 different chips have different numbers of these SMs:

                 Product           SMs    Compute Capability
                 GeForce 8500GT     2          v. 1.1
                 GeForce 9500GT     4          v. 1.1
                 GeForce 9600GT     8          v. 1.1




                                             CUDA at the University of Akron – Slide 4
 Basic building block is a “streaming multiprocessor” with
   8 cores, each with 2048 registers
   up to 128 threads per core
   16KB of shared memory
   8KB cache for constants held in device memory
 different chips have different numbers of these SMs:

         Product    SMs    Bandwidth    Memory     Compute Capability
         GTX275      30    127 GB/s     1–2 GB          v. 1.3


                                           CUDA at the University of Akron – Slide 5
 each streaming multiprocessor has
   32 cores, each with 1024 registers
   up to 48 threads per core
   64KB of shared memory / L1 cache (see the sketch after the table)
   8KB cache for constants held in device memory
 there’s also a unified 384KB L2 cache
 different chips again have different numbers of SMs:

         Product       SMs    Bandwidth    Memory      Compute Capability
         GTX480         15    180 GB/s     1.5 GB           v. 2.0
         Tesla C2070    14    140 GB/s     6 GB ECC         v. 2.0
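
On compute 2.x devices the 64KB shared memory / L1 array can be split per kernel
(48KB/16KB either way) through the runtime call cudaFuncSetCacheConfig. A minimal
sketch, written for these notes rather than taken from the lecture; the kernel
name and body are made up:

    #include <cuda_runtime.h>

    // Illustrative kernel: stages data through the shared-memory side of the
    // SM's 64KB shared memory / L1 array.
    __global__ void stencilKernel(float *out, const float *in, int n)
    {
        extern __shared__ float tile[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        if (i < n)
            out[i] = tile[threadIdx.x];
    }

    void preferSharedMemory(void)
    {
        // Ask for the 48KB shared / 16KB L1 split (rather than 16KB / 48KB)
        // whenever stencilKernel runs; this is only a preference.
        cudaFuncSetCacheConfig(stencilKernel, cudaFuncCachePreferShared);
    }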
                                            CUDA at the University of Akron – Slide 6
Feature                                                     v. 1.1    v. 1.3, 2.x
Integer atomic functions operating on 64-bit words             no            yes
in global memory
Integer atomic functions operating on 32-bit words             no            yes
in shared memory
Warp vote functions                                            no            yes
Double-precision floating-point operations                     no            yes
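
As a hedged illustration (the kernel below is written for these notes, not code
from the lecture), two of the v. 1.3 features appear in ordinary device code once
the build targets that architecture; compiling with -arch=sm_13 or higher is
assumed so that doubles are not demoted to float:

    // Made-up kernel using two features from the table above.
    __global__ void featureDemo(unsigned long long *count, double *sq,
                                const double *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            sq[i] = x[i] * x[i];       // double-precision arithmetic (v. 1.3, 2.x only)
            atomicAdd(count, 1ULL);    // 64-bit integer atomic in global memory (v. 1.3, 2.x only)
        }
    }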




                                         CUDA at the University of Akron – Slide 7
Feature                                                    v. 1.1, 1.3      v. 2.x
3D grid of thread blocks                                            no         yes
Floating-point atomic addition operating on 32-bit words            no         yes
in global and shared memory
__ballot()                                                          no         yes
__threadfence_system()                                              no         yes
__syncthreads_count(), __syncthreads_and(),                         no         yes
__syncthreads_or()
Surface functions                                                   no         yes
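
A small made-up kernel (not from the lecture) illustrating two of the v. 2.x-only
intrinsics above, __ballot() and __syncthreads_count(); compile with -arch=sm_20
and note the assumption that blockDim.x is a multiple of 32:

    // warpMask gets one bit per thread of each warp whose element is positive;
    // blockCount gets, per block, how many threads saw a positive element.
    __global__ void voteDemo(const float *x, unsigned int *warpMask,
                             int *blockCount, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int pred = (i < n) && (x[i] > 0.0f);

        unsigned int mask = __ballot(pred);     // warp-wide bit mask of pred
        if ((threadIdx.x & 31) == 0)
            warpMask[i >> 5] = mask;            // one word per warp

        int c = __syncthreads_count(pred);      // barrier that also counts pred
        if (threadIdx.x == 0)
            blockCount[blockIdx.x] = c;
    }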




                                              CUDA at the University of Akron – Slide 8
Spec
Maximum x- or y-dimension of a grid of thread blocks                       65536
Maximum dimensionality of a thread block                                       3
Maximum z-dimension of a block                                                64
Warp size                                                                     32
Maximum number of resident blocks per multiprocessor                           8
Constant memory size                                                       64 KB
Cache working set per multiprocessor for constant memory                    8 KB
Maximum width for a 1D texture reference bound to linear memory             2^27
Maximum width, height and depth for a 3D texture reference
bound to linear memory or a CUDA array                       2048 x 2048 x 2048
Maximum number of textures that can be bound to a kernel                     128
Maximum number of instructions per kernel                               2 million
                                                            CUDA at the University of Akron – Slide 9
Spec                                                   v. 1.1   v. 1.3   v. 2.x
Maximum number of resident warps per multiprocessor        24       32       48
Maximum number of resident threads per multiprocessor     768     1024     1536
Number of 32-bit registers per multiprocessor             8 K     16 K     32 K




                                                  CUDA at the University of Akron – Slide 10
Spec                                                        v. 1.1, 1.3    v. 2.x
Maximum dimensionality of a grid of thread blocks                     2         3
Maximum x- or y-dimension of a block                                512      1024
Maximum number of threads per block                                 512      1024
Maximum amount of shared memory per multiprocessor                 16 K      48 K
Number of shared memory banks                                        16        32
Amount of local memory per thread                                  16 K     512 K
Maximum width for a 1D texture reference bound to a CUDA array     8192     32768




                                                 CUDA at the University of Akron – Slide 11
Spec                                             v. 1.1, 1.3               v. 2.x
Maximum width and number of layers for a          8192 x 512         16384 x 2048
1D layered texture reference
Maximum width and height for a 2D texture      65536 x 32768        65536 x 65536
reference bound to linear memory or a CUDA array
Maximum width, height, and number of layers  8192 x 8192 x 512  16384 x 16384 x 2048
for a 2D layered texture reference
Maximum width for a 1D surface reference       Not supported                 8192
bound to a CUDA array
Maximum width and height for a 2D surface      Not supported          8192 x 8192
reference bound to a CUDA array
Maximum number of surfaces that can be         Not supported                    8
bound to a kernel
                                                           CUDA at the University of Akron – Slide 12
 CUDA (Compute Unified Device Architecture) is NVIDIA’s program development environment:
   based on C with some extensions (see the sketch after this list)
   C++ support increasing steadily
   FORTRAN support provided by PGI compiler
   lots of example code and good documentation – 2-4 week learning curve for
    those with experience of OpenMP and MPI programming
   large user community on NVIDIA forums
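
For a first feel of those C extensions, here is a minimal vector-add program
written for these notes (not code from the lecture): __global__ marks a kernel,
<<<grid, block>>> launches it, and the built-in threadIdx / blockIdx / blockDim
variables identify each thread.

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Kernel: each thread adds one pair of elements.
    __global__ void addVectors(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1024;
        size_t bytes = n * sizeof(float);
        float h_a[n], h_b[n], h_c[n];
        for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2.0f * i; }

        float *d_a, *d_b, *d_c;
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // 256 threads per block, enough blocks to cover all n elements.
        addVectors<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[10] = %f\n", h_c[10]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }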




                                       CUDA at the University of Akron – Slide 13
 When installing CUDA on a system, there are 3 components:
   driver
      low-level software that controls the graphics card
      usually installed by sys-admin
   toolkit
      nvcc CUDA compiler
      some profiling and debugging tools
      various libraries
      usually installed by sys-admin in /usr/local/cuda




                                             CUDA at the University of Akron – Slide 14
 SDK
   lots of demonstration examples
   a convenient Makefile for building applications
   some error-checking utilities
   not supported by NVIDIA
   almost no documentation
   often installed by user in own directory




                                        CUDA at the University of Akron – Slide 15
 Remotely access the front end:
             ssh tesla.cs.uakron.edu
   ssh sends your commands over an encrypted stream so
    your passwords, etc., can’t be sniffed over the network




                                        CUDA at the University of Akron – Slide 16
 The first time you do this:
   After login, run
      /root/gpucomputingsdk_3.2.16_linux.run
    and just take the default answers to get your own
    personal copy of the SDK.
   Then:

      cd ~/NVIDIA_GPU_Computing_SDK/C
      make -j12 -k

    will build all that can be built.

                                        CUDA at the University of Akron – Slide 17
 The first time you do this:
   Binaries end up in:
      ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
   In particular header file <cutil_inline.h> is in
    ~/NVIDIA_GPU_Computing_SDK/C/common/inc


 Can then get a summary of technical specs and
  compute capabilities by executing
~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
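
deviceQuery is just a wrapper around the runtime API; a hand-rolled sketch
(assuming nothing beyond the standard cudaGetDeviceProperties call) that prints
the compute capability and a few of the limits tabulated earlier:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0

        printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
        printf("  multiprocessors:        %d\n",  prop.multiProcessorCount);
        printf("  shared memory / block:  %zu bytes\n", prop.sharedMemPerBlock);
        printf("  registers / block:      %d\n",  prop.regsPerBlock);
        printf("  max threads / block:    %d\n",  prop.maxThreadsPerBlock);
        return 0;
    }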




                                      CUDA at the University of Akron – Slide 18
 Two choices:
   use nvcc within a standard Makefile
   use the special Makefile template provided in the SDK
 The SDK Makefile provides some useful options:
   make emu=1
      uses an emulation library for debugging on a CPU
   make dbg=1
      activates run-time error checking

 In general just use a standard Makefile



                                           CUDA at the University of Akron – Slide 19
GENCODE_ARCH := -gencode=arch=compute_10,code=\"sm_10,compute_10\" \
                -gencode=arch=compute_13,code=\"sm_13,compute_13\" \
                -gencode=arch=compute_20,code=\"sm_20,compute_20\"

INCLOCS := -I$(HOME)/NVIDIA_GPU_Computing_SDK/shared/inc \
           -I$(HOME)/NVIDIA_GPU_Computing_SDK/C/common/inc

LIBLOCS := -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib \
           -L$(HOME)/NVIDIA_GPU_Computing_SDK/C/lib

LIBS := -lcutil_x86_64

<progName>: <progName>.cu <progName>.cuh
	nvcc $(GENCODE_ARCH) $(INCLOCS) <progName>.cu $(LIBLOCS) $(LIBS) -o <progName>




                                             CUDA at the University of Akron – Slide 20
 Parallel Thread Execution (PTX)
   Virtual machine and ISA
   Programming model
   Execution resources and state




             CUDA Tools and Threads – Slide 21
 Any source file containing CUDA extensions must be
  compiled with NVCC
 NVCC is a compiler driver
   Works by invoking all the necessary tools and compilers
    like cudacc, g++, cl, …
 NVCC outputs
   C code (host CPU code)
      Must then be compiled with the rest of the application using
       another tool
   PTX
      Object code directly, or PTX source interpreted at runtime

                                                CUDA Tools and Threads – Slide 22
 Any executable with CUDA code requires two dynamic
 libraries
   The CUDA runtime library (cudart)
   The CUDA core library (cuda)




                                        CUDA Tools and Threads – Slide 23
 An executable compiled in device emulation mode (nvcc -deviceemu) runs
  completely on the host using the CUDA runtime
   No need for a device or the CUDA driver
   Each device thread is emulated with a host thread




                                          CUDA Tools and Threads – Slide 24
 Running in device emulation mode, one can
   Use host native debug support (breakpoints, inspection, etc.)
   Access any device-specific data from host code and vice versa
   Call any host function from device code (e.g. printf) and vice versa
   Detect deadlock situations caused by improper usage of __syncthreads



                                           CUDA Tools and Threads – Slide 25
 Emulated device threads execute sequentially, so simultaneous access of the
  same memory location by multiple threads could produce different results
 Dereferencing device pointers on the host or host pointers on the device can
  produce correct results in device emulation mode, but will generate an error
  in device execution mode (see the sketch below)
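
A made-up snippet illustrating the second pitfall: the commented-out host
dereference of a device pointer only “works” under -deviceemu, where host and
device share one address space; on real hardware the data must move through
cudaMemcpy.

    #include <cuda_runtime.h>

    int main(void)
    {
        float *d_x;
        cudaMalloc((void **)&d_x, sizeof(float));

        // Wrong on a real device: d_x is a device pointer, so the host must
        // never dereference it directly.
        // *d_x = 3.0f;        // only "works" in device emulation mode

        float h_x = 3.0f;
        cudaMemcpy(d_x, &h_x, sizeof(float), cudaMemcpyHostToDevice);  // correct path

        cudaFree(d_x);
        return 0;
    }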




                                        CUDA Tools and Threads – Slide 26
 Results of floating-point computations will slightly
 differ because of
   Different compiler outputs, instructions sets
   Use of extended precision for intermediate results
      There are various options to force strict single precision on
       the host




                                                  CUDA Tools and Threads – Slide 27
 New Visual Studio-based GPU integrated development environment (NVIDIA Nexus)

 http://developer.nvidia.com/object/nexus.html


 Available in Beta (as of October 2009)




                                           CUDA Tools and Threads – Slide 28
 Based on original material from
   http://en.wikipedia.com/wiki/CUDA, accessed 6/22/2011.
   The University of Akron: Charles Van Tilburg
   The University of Illinois at Urbana-Champaign
      David Kirk, Wen-mei W. Hwu
   Oxford University: Mike Giles
   Stanford University
      Jared Hoberock, David Tarjan

 Revision history: last updated 6/23/2011.



                                      CUDA at the University of Akron – Slide 29

								