PGI Accelerator

Member of the Helmholtz Association




June 9, 2010
                JuRoPA

Jan H. Meinke
  Getting Started with SSH

        Setting a passphrase
          ssh-keygen -p
        Logging in
          ssh userid@juropa.fz-juelich.de
              -X : turn on X-forwarding
              -i /path/to/private/key : Choose different key file
              -A : forward the authentication agent connection
        Learn about ssh-agent
          Enter passphrase once with ssh-add
          Secure passphrase-less authentication


June 9, 2010                   1st SimLab Porting Workshop           Slide 2
  Submitting Jobs

       No direct access to compute nodes
       MOAB
          Queuing and job scheduling system
          Most frequently used commands
             msub : submit a new job
             showq -w userid : shows the jobs of userid
             canceljob jobid : cancels a queued job
             mjobctl : queries and changes job properties
             checkjob -v jobid : information about a job




  Calculating Pi

      =4 arctan 1
            1     N        4
          ≃
            N
                ∑i=0   1i0.52


       We can split the sum in any way we want.
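
The splitting of the sum can be sketched in C as follows. This is a minimal illustration, not the workshop's pi.c: the function names pi_partial, pi_midpoint, and pi_split are hypothetical, and the sum is assumed to be the midpoint rule for the integral of 4/(1+x^2) from 0 to 1, which equals 4 arctan 1.

```c
#include <assert.h>

/* Partial midpoint-rule sum over the index range [begin, end),
   with sample points x_i = (i + 0.5) / n. The result is already
   scaled by 1/n, so partial sums can simply be added together. */
double pi_partial(long begin, long end, long n)
{
    assert(n > 0 && begin >= 0 && begin <= end && end <= n);
    double sum = 0.0;
    for (long i = begin; i < end; ++i) {
        double x = (i + 0.5) / (double)n;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum / (double)n;
}

/* The whole sum in one piece. */
double pi_midpoint(long n)
{
    return pi_partial(0, n, n);
}

/* The same sum cut into `parts` contiguous chunks. Each chunk is
   independent of the others, which is what makes the work easy to
   hand to separate threads or MPI tasks. */
double pi_split(long n, int parts)
{
    assert(parts > 0);
    double total = 0.0;
    for (int p = 0; p < parts; ++p) {
        long begin = n * p / parts;
        long end = n * (p + 1) / parts;
        total += pi_partial(begin, end, n);
    }
    return total;
}
```

Calling pi_split(N, 8) mimics eight workers that each handle one chunk. The result may differ from pi_midpoint(N) in the last digits because floating-point addition is not associative.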




        Examples
           svn checkout --username guest --password guest
            https://svn.version.fz-juelich.de/simlab/workshop/june2010
          pi.c : simple, single-threaded implementation
          pi_omp.c : OpenMP parallelized version
          pi_mpi.c : MPI parallelized version
        Use make to compile
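
As a rough idea of what an OpenMP version can look like, here is a minimal sketch. It is not the workshop's pi_omp.c; the function name pi_openmp is hypothetical, and the loop is assumed to be the midpoint-rule sum for pi. The reduction clause gives every thread a private copy of sum and adds the copies at the end of the loop.

```c
#include <assert.h>

/* Midpoint-rule approximation of pi with the loop iterations shared
   among OpenMP threads. If the compiler's OpenMP flag (for example
   -fopenmp with GCC) is not given, the pragma is ignored and the loop
   runs serially with the same result up to rounding. */
double pi_openmp(long n)
{
    assert(n > 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        double x = (i + 0.5) / (double)n;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum / (double)n;
}
```

The number of threads is taken from OMP_NUM_THREADS at run time, so the same binary can be used to test different thread counts.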

  A Simple Launch Script

               #!/bin/bash -x
               #MSUB -l nodes=1:ppn=8
               cd $PBS_O_WORKDIR

               mpiexec -np 1 ./pi



        Compile pi with make pi
        Submit job with msub launch_pi.sh
          Requests 1 node with 8 processor slots (mpiexec -np 1 starts a single task)
          -l nodes=1:ppn=8 can be given on command line
        Check your jobs with showq -w your_userid

  An OpenMP Launch Script

               #!/bin/bash -x
               #MSUB -l nodes=1:ppn=8
               #MSUB -v tpt=8
               export OMP_NUM_THREADS=8

               cd $PBS_O_WORKDIR

               ./pi_omp
        Compile pi_omp with make pi_omp
        Submit job with msub launch_pi_omp.sh
          Starts job on 1 node with 1 MPI task and 8 threads
          -v tpt=8 sets the number of threads per task to 8
        Compare time with the simple implementation

  An MPI/OpenMP Launch Script

               #!/bin/bash -x
               #MSUB -l nodes=2:ppn=8
               #MSUB -v tpt=4
               export OMP_NUM_THREADS=4
               NSLOTS=4
               cd $PBS_O_WORKDIR

               mpiexec --exports=OMP_NUM_THREADS -np $NSLOTS ./pi_mpi
        Compile pi_mpi with make pi_mpi_hybrid
        Submit job with msub launch_pi_mpi.sh
          Starts job on 2 nodes with 4 MPI tasks and 4 threads per
           task
        Change the ratio between MPI tasks and OpenMP threads and
           find out which ratio is fastest. (One node is enough.)

  Interactive Session

        msub -I -X -l nodes=1:ppn=8:turbomode
          -I make the session interactive (capital i)
          -X turns X-forwarding on
          -l walltime=00:12:30 : sets the running time of the
           session to 12:30 min (lowercase L).
          -v tpt=2 : threads per task as before
          -N name_of_job
          … (see, e.g.,
           http://www.fz-juelich.de/jsc/juropa/usage/quick-intro)
          All these options can go in launch scripts, too.



  Interactive Session

        Start interactive session
          msub -I -X -l nodes=1:ppn=8:turbomode -v tpt=8
        Use export OMP_NUM_THREADS=2 (4,8) to test the scaling
           of pi_omp




  Chain Jobs

        Define job dependencies using MOAB
          -W x=depend:afterok:jobname
          after, afterany, afterok, afternotok, before, beforeany,
           beforeok, beforenotok, on (
               http://www.clusterresources.com/products/moab/11.5jobdependencies.shtml)
        Somewhat buggy at the moment




  Overview
  UNICORE: UNiform Interface to COmputing REsources
      Transparent, secure, intuitive access to
       distributed computing and data resources
      Easy installation and configuration
      Application and workflow support
      Active European developer community,
       quick and efficient support
      Open source, BSD licensed
       http://www.unicore.eu




  Clients
      URC: Graphical user interface
      UCC: Commandline client
       and Emacs integration
      HiLA: API




      HiLA layers (top to bottom)
        hila-unicore6 : UNICORE 6 implementation
        hila-grid-common : Common Grid implementations
        hila-grid-api : Grid API
        hila-api : Generic hila-api, not Grid related

  Workflow
  Features
      Simple graphs (DAGs)
      Workflow variables
      Loops and control constructs
        While
        For-each
        If-else
      Conditions
        Exit code
        File existence
        File size
        Workflow variables
  Clients
      UNICORE Rich client
      Commandline client

  Execution environments
      Improved MPI support for e.g.
         MPI
         OpenMP
         Hybrid




  Application integration
      GridBean integrates
       application specific
       panels and visualizer


  Generalized Matrix-Matrix Multiplication (GEMM)

      C = ABC
      C kl = ∑i Aki Bil  C kl

        Core routine in linpack
        Triple loop over k, l, and I
        For the exercises, we assume square matrices of size N
        Per element C(k,l), we have
          1 multiplication
          N multiplications and additions as part of the sum
          1 multiplication + 1 addition
        Total: N * N * (3 + 2N) operations
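
The triple loop and the operation count can be sketched like this. It is a minimal illustration under stated assumptions, not the workshop's dgemm.cpp: the names gemm_naive and gemm_flops are hypothetical, and the matrices are assumed to be square and stored row by row in flat arrays.

```c
#include <assert.h>
#include <stddef.h>

/* C = alpha * A * B + beta * C for square n x n matrices in row-major
   storage, written as the straightforward triple loop over k, l, i. */
void gemm_naive(size_t n, double alpha, const double *A, const double *B,
                double beta, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t k = 0; k < n; ++k) {
        for (size_t l = 0; l < n; ++l) {
            double sum = 0.0;
            for (size_t i = 0; i < n; ++i) {
                /* n multiplications and n additions per element */
                sum += A[k * n + i] * B[i * n + l];
            }
            /* 2 multiplications and 1 addition per element */
            C[k * n + l] = alpha * sum + beta * C[k * n + l];
        }
    }
}

/* Operation count used on the slide: N * N * (3 + 2N). */
double gemm_flops(size_t n)
{
    double dn = (double)n;
    return dn * dn * (3.0 + 2.0 * dn);
}
```

For N = 2 the count is 4 * (3 + 4) = 28 operations. For large N the 2N term dominates, so the total is close to 2N^3.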

  Generalized Matrix-Matrix Multiplication (GEMM)

      $C = \alpha A B + \beta C$
      $C_{kl} = \alpha \sum_i A_{ki} B_{il} + \beta C_{kl}$

        Simple double precision implementation
          Look at code dgemm.cpp
          make dgemm1
          Start interactive session
          Time routine for different matrix sizes
          How many floating point operations per second?
          Percentage of single-processor peak performance
           (11.72 GFLOPS)?
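
Turning a measured run time into GFLOPS and a percentage of peak is simple arithmetic. The sketch below assumes the operation count N * N * (3 + 2N) and a wall-clock time in seconds that you measured yourself. The function names gemm_gflops and percent_of_peak are hypothetical.

```c
#include <assert.h>

/* Floating point rate in GFLOPS for one GEMM call on n x n matrices
   that took `seconds` of wall-clock time. */
double gemm_gflops(double n, double seconds)
{
    assert(n > 0.0 && seconds > 0.0);
    double flops = n * n * (3.0 + 2.0 * n);
    return flops / seconds / 1.0e9;
}

/* Achieved rate as a percentage of a given peak rate,
   for example 11.72 GFLOPS for a single processor core. */
double percent_of_peak(double gflops, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return 100.0 * gflops / peak_gflops;
}
```

As an example with made-up numbers: a run with N = 1000 that takes 2.003 s performs 2.003e9 operations, which is about 1 GFLOPS or roughly 8.5% of an 11.72 GFLOPS peak.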

  Generalized Matrix-Matrix Multiplication (GEMM)

      $C = \alpha A B + \beta C$
      $C_{kl} = \alpha \sum_i A_{ki} B_{il} + \beta C_{kl}$

        Simple double precision implementation
          Look at code dgemm_loop_exchanged.cpp
          What is different?
          make dgemm2
          Start interactive session (if necessary)
          Time routine for different matrix sizes
          How many floating point operations per second?
          Percentage of single-processor peak performance?

  Generalized Matrix-Matrix Multiplication (GEMM)

      $C = \alpha A B + \beta C$
      $C_{kl} = \alpha \sum_i A_{ki} B_{il} + \beta C_{kl}$

        Simple double precision implementation
          What happens if we change the two inner loops back?
          Code is just as fast. Why?
          Try -opt-report as compiler option
          Compiler exchanges inner loops.
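
Why the loop order matters can be seen in a sketch with the two inner loops exchanged. This is an illustration under the assumption of row-major storage in flat arrays, not the workshop's dgemm_loop_exchanged.cpp, and the name gemm_exchanged is hypothetical. With l as the innermost index, both B and C are walked through contiguous memory, whereas an innermost loop over i reads B with a stride of n elements.

```c
#include <assert.h>
#include <stddef.h>

/* C = alpha * A * B + beta * C for square n x n row-major matrices,
   with the loops ordered k, i, l instead of k, l, i. */
void gemm_exchanged(size_t n, double alpha, const double *A, const double *B,
                    double beta, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t k = 0; k < n; ++k) {
        /* Scale row k of C by beta once, before accumulating. */
        for (size_t l = 0; l < n; ++l) {
            C[k * n + l] *= beta;
        }
        for (size_t i = 0; i < n; ++i) {
            const double a = alpha * A[k * n + i];
            /* Innermost loop runs along row i of B and row k of C,
               so consecutive iterations touch consecutive addresses. */
            for (size_t l = 0; l < n; ++l) {
                C[k * n + l] += a * B[i * n + l];
            }
        }
    }
}
```

An optimizing compiler may perform this interchange on its own, which would explain why both loop orders can end up equally fast.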




  Generalized Matrix-Matrix Multiplication (GEMM)

      $C = \alpha A B + \beta C$
      $C_{kl} = \alpha \sum_i A_{ki} B_{il} + \beta C_{kl}$

        Simple double precision implementation
          Look at code matmultadd1.c
          What is different?
          make dgemm3
          Start interactive session (if necessary)
          Time routine for different matrix sizes
          How many floating point operations per second?
          Percentage of single-processor peak performance?

  Using Intel's Math Kernel Library (MKL)

      $C = \alpha A B + \beta C$
      $C_{kl} = \alpha \sum_i A_{ki} B_{il} + \beta C_{kl}$

        Highly optimized implementation
          Look at code dgemm_mkl.cpp
          cblas_dgemm(order, transA, transB, width, width, width,
           alpha, A, width, B, width, beta, C, width);
          module load mkl; module load intel
          make dgemm4
          Prepare chain job to time different matrix sizes


  Using a job chain

           #!/bin/bash
           # This script starts a chain of jobs.
           # The job name is also used to pass the size of the matrix

           lastsize=1024

           # The first job is launched without any dependency
           msub -N $lastsize launch_dgemm4.sh

           # All other jobs depend on the previous one.
           for s in 2048 4096 8192 16384
           do
              msub -N $s -W x=depend:afterok:$lastsize launch_dgemm4.sh
              lastsize=$s
           done




           Time routine for different matrix sizes by running
            jobchain.sh
           How many floating point operations per second?
           Percentage of single-node peak performance (94
            GFLOPS)?

  MKL and OMP_NUM_THREADS

      $C = \alpha A B + \beta C$
      $C_{kl} = \alpha \sum_i A_{ki} B_{il} + \beta C_{kl}$

        Highly optimized implementation
          Try different values of OMP_NUM_THREADS
          Time routine for different matrix sizes
          How many floating point operations per second?
          Percentage of single-node peak performance?
        If you do this in an interactive session don't forget to load
            the modules mkl and intel


  Single Precision GEMM (SGEMM)

      $C = \alpha A B + \beta C$
      $C_{kl} = \alpha \sum_i A_{ki} B_{il} + \beta C_{kl}$

        Highly optimized implementation
          Copy dgemm_mkl.cpp to sgemm_mkl.cpp
          Modify code, so that it uses float instead of double
          cblas_sgemm(order, transA, transB, width, width, width,
           alpha, A, width, B, width, beta, C, width);
          make sgemm
          Time routine for different matrix sizes
          How many floating point operations per second?
          Percentage of single-node peak performance?

  Peak Performance!

  SGEMM (MKL)

       More than twice the double precision performance
       191 GFLOPS, compared to 197 GFLOPS on an NVIDIA Tesla
          1060 card (354.96 GFLOPS for the kernel alone)
       100% of peak
        Use optimized libraries whenever you can




  Overlapping Computation and Communication

       MPI_Isend and MPI_Irecv offer non-blocking
           communication
       Run the overlap_full example
          make
          write launch script or use interactive session
          look at overlap
       Does it work?



