Member of the Helmholtz Association
June 9, 2010
JuRoPA
Jan H. Meinke
Getting Started with SSH
Setting a passphrase
ssh-keygen -p
Logging in
ssh userid@juropa.fz-juelich.de
-X : turns on X-forwarding
-i /path/to/private/key : chooses a different key file
-A : forwards the connection to the authentication agent
Learn about ssh-agent
Enter the passphrase once with ssh-add
Secure passphrase-less authentication
June 9, 2010 1st SimLab Porting Workshop Slide 2
Submitting Jobs
No direct access to compute nodes
MOAB
Queuing and job scheduling system
Most frequently used commands
msub : submits a new job
showq -w userid : shows the jobs of userid
canceljob jobid : cancels a queued job
mjobctl : queries and changes job properties
checkjob -v jobid : information about a job
Calculating Pi
\pi = 4 \arctan 1 \simeq \frac{1}{N} \sum_{i=0}^{N-1} \frac{4}{1 + \left(\frac{i + 0.5}{N}\right)^2}
We can split the sum in any way we want.
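As a minimal sketch of what splitting the sum means, the following C functions evaluate the midpoint sum over a range of indices and then add up several such ranges. The names pi_partial and pi_chunked are hypothetical and are not taken from the workshop sources (pi.c, pi_omp.c, pi_mpi.c); the sketch assumes the midpoint form of the sum with x_i = (i + 0.5)/N.

```c
#include <assert.h>

/* Midpoint-rule contribution of the intervals begin <= i < end (out of n). */
double pi_partial(long begin, long end, long n)
{
    assert(n > 0 && 0 <= begin && begin <= end && end <= n);
    double sum = 0.0;
    for (long i = begin; i < end; ++i) {
        double x = (i + 0.5) / n;        /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return sum / n;
}

/* Split the index range into `chunks` contiguous pieces and add the partial sums. */
double pi_chunked(long n, int chunks)
{
    assert(chunks > 0);
    double total = 0.0;
    for (int c = 0; c < chunks; ++c) {
        long begin = n * c / chunks;         /* first index of chunk c */
        long end = n * (c + 1) / chunks;     /* one past its last index */
        total += pi_partial(begin, end, n);
    }
    return total;
}
```

Each chunk is independent of the others, which is what lets threads or MPI tasks work on different chunks at the same time. The result for 8 chunks should agree with the result for 1 chunk up to floating-point rounding.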
Calculating Pi
Examples
svn checkout --username guest --password guest https://svn.version.fz-juelich.de/simlab/workshop/june2010
pi.c : simple, single-threaded implementation
pi_omp.c : OpenMP parallelized version
pi_mpi.c : MPI parallelized version
Use make to compile
A Simple Launch Script
#!/bin/bash -x
#MSUB -l nodes=1:ppn=8
cd $PBS_O_WORKDIR
mpiexec -np 1 ./pi
Compile pi with make pi
Submit job with msub launch_pi.sh
Reserves 1 node with 8 processors; mpiexec -np 1 starts a single task on it
-l nodes=1:ppn=8 can also be given on the command line
Check your jobs with showq -w your_userid
An OpenMP Launch Script
#!/bin/bash -x
#MSUB -l nodes=1:ppn=8
#MSUB -v tpt=8
export OMP_NUM_THREADS=8
cd $PBS_O_WORKDIR
./pi_omp
Compile pi_omp with make pi_omp
Submit job with msub launch_pi_omp.sh
Starts job on 1 node with 1 MPI task and 8 threads
-v tpt=8 sets the number of threads per task to 8
Compare time with the simple implementation
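For orientation, an OpenMP version of the pi sum could look roughly like the sketch below. This is not the workshop's pi_omp.c; the function name pi_omp_sketch is hypothetical, and the sketch assumes the midpoint form of the sum. The number of threads is taken from OMP_NUM_THREADS at run time, as set in the launch script.

```c
#include <assert.h>

/* Midpoint-rule estimate of pi; loop iterations are shared among OpenMP threads. */
double pi_omp_sketch(long n)
{
    assert(n > 0);
    double sum = 0.0;
    /* Each thread accumulates a private partial sum; the reduction adds them up. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        double x = (i + 0.5) / n;        /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return sum / n;
}
```

The pragma only takes effect when the compiler's OpenMP flag is switched on (for example -fopenmp with GCC). Without it the loop runs on a single thread and gives the same result.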
An MPI/OpenMP Launch Script
#!/bin/bash -x
#MSUB -l nodes=2:ppn=8
#MSUB -v tpt=4
export OMP_NUM_THREADS=4
NSLOTS=4
cd $PBS_O_WORKDIR
mpiexec --exports=OMP_NUM_THREADS -np $NSLOTS ./pi_mpi
Compile pi_mpi with make pi_mpi_hybrid
Submit job with msub launch_pi_mpi.sh
Starts job on 2 nodes with 4 MPI tasks and 4 threads per task
Change the ratio between MPI tasks and OpenMP threads and find out which is fastest. (One node is enough.)
Interactive Session
msub -I -X -l nodes=1:ppn=8:turbomode
-I : makes the session interactive (capital i)
-X : turns X-forwarding on
-l walltime=00:12:30 : sets the running time of the session to 12 min 30 s (lowercase L)
-v tpt=2 : threads per task as before
-N name_of_job : sets the name of the job
… (see, e.g.,
http://www.fz-juelich.de/jsc/juropa/usage/quick-intro)
All these options can go in launch scripts, too.
Interactive Session
Start interactive session
msub -I -X -l nodes=1:ppn=8:turbomode -v tpt=8
Use export OMP_NUM_THREADS=2 (then 4 and 8) to test the scaling of pi_omp
Chain Jobs
Define job dependencies using MOAB
-W x=depend:afterok:jobname
Dependency types: after, afterany, afterok, afternotok, before, beforeany, beforeok, beforenotok, on (see http://www.clusterresources.com/products/moab/11.5jobdependencies.shtml)
Somewhat buggy at the moment
Overview
UNICORE: UNiform Interface to COmputing REsources
Transparent, secure, intuitive access to distributed computing and data resources
Easy installation and configuration
Application and workflow support
Active European developer community, quick and efficient support
Open source, BSD licensed
http://www.unicore.eu
Clients
URC: Graphical user interface
UCC: Command-line client and Emacs integration
HiLA: API
hila-unicore6: UNICORE 6 implementation
hila-grid-common: Common Grid implementations
hila-grid-api: Grid API
hila-api: Generic HiLA API, not Grid related
Workflow
Features
Simple graphs (DAGs)
Workflow variables
Loops and control constructs: while, for-each, if-else
Conditions: exit code, file existence, file size, workflow variables
Clients
UNICORE Rich Client
Command-line client
Execution environments
Improved support for parallel execution environments, e.g.
MPI
OpenMP
Hybrid
Application integration
A GridBean integrates application-specific panels and a visualizer
Generalized Matrix-Matrix Multiplication (GEMM)
C = \alpha A B + \beta C
C_{kl} = \alpha \sum_i A_{ki} B_{il} + \beta C_{kl}
Core routine in LINPACK
Triple loop over k, l, and i
For the exercises, we assume square matrices of size N
Per element C(k,l), we have
1 multiplication (scaling the old C(k,l) by beta)
N multiplications and N additions as part of the sum
1 multiplication + 1 addition (scaling the sum by alpha and adding it)
Total: N * N * (3 + 2N) operations
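The triple loop and the operation count can be sketched in C as follows. This is an illustration only, not the workshop's dgemm.cpp; the function names gemm_naive and gemm_operations are hypothetical, and row-major storage of the square matrices is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* C = alpha * A * B + beta * C for square n x n matrices (row-major assumed). */
void gemm_naive(size_t n, double alpha, const double *A, const double *B,
                double beta, double *C)
{
    assert(A && B && C);
    for (size_t k = 0; k < n; ++k) {
        for (size_t l = 0; l < n; ++l) {
            double sum = 0.0;
            for (size_t i = 0; i < n; ++i)
                sum += A[k * n + i] * B[i * n + l];   /* N multiplications, N additions */
            /* 2 multiplications and 1 addition per element C(k,l) */
            C[k * n + l] = alpha * sum + beta * C[k * n + l];
        }
    }
}

/* Operation count from the slide: N * N * (3 + 2N). */
double gemm_operations(double n)
{
    return n * n * (3.0 + 2.0 * n);
}
```

For N = 2 the count is 2 * 2 * (3 + 4) = 28 operations; for large N the 2N term dominates, so the total grows like 2N^3.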
Generalized Matrix-Matrix Multiplication (GEMM)
Simple double precision implementation
Look at code dgemm.cpp
make dgemm1
Start interactive session
Time routine for different matrix sizes
How many floating point operations per second?
Percentage of single-processor peak performance (11.72 GFLOPS)?
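One way to turn a measured run time into the two numbers asked for is sketched below. The helper names gemm_gflops and percent_of_peak are hypothetical; the sketch assumes the operation count N * N * (3 + 2N) for square matrices and a wall-clock time in seconds that you measured yourself.

```c
#include <assert.h>

/* Achieved GFLOPS for an n x n GEMM that took `seconds` of wall-clock time. */
double gemm_gflops(double n, double seconds)
{
    assert(n > 0.0 && seconds > 0.0);
    double operations = n * n * (3.0 + 2.0 * n);   /* operation count for size n */
    return operations / seconds / 1e9;
}

/* Achieved performance as a percentage of a given peak (both in GFLOPS). */
double percent_of_peak(double gflops, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return 100.0 * gflops / peak_gflops;
}
```

As a purely hypothetical example, a run with N = 1000 that took one second would correspond to about 2 GFLOPS, or roughly 17% of the 11.72 GFLOPS peak.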
Generalized Matrix-Matrix Multiplication (GEMM)
Simple double precision implementation
Look at code dgemm_loop_exchanged.cpp
What is different?
make dgemm2
Start interactive session (if necessary)
Time routine for different matrix sizes
How many floating point operations per second?
Percentage of single-processor peak performance?
Generalized Matrix-Matrix Multiplication (GEMM)
Simple double precision implementation
What happens if we change the two inner loops back?
The code is just as fast. Why?
Try -opt-report as a compiler option
The compiler exchanges the inner loops itself.
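What exchanging the two inner loops looks like can be sketched as follows. This is a simplified illustration (alpha = 1, beta = 0) under the assumption of row-major storage; it is not the content of dgemm.cpp or dgemm_loop_exchanged.cpp, and the function names matmul_kli and matmul_kil are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Loop order k, l, i: the inner loop steps through B with stride n. */
void matmul_kli(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t k = 0; k < n; ++k)
        for (size_t l = 0; l < n; ++l) {
            double sum = 0.0;
            for (size_t i = 0; i < n; ++i)
                sum += A[k * n + i] * B[i * n + l];
            C[k * n + l] = sum;
        }
}

/* Loop order k, i, l: the inner loop steps through B and C with stride 1. */
void matmul_kil(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t k = 0; k < n; ++k) {
        for (size_t l = 0; l < n; ++l)
            C[k * n + l] = 0.0;              /* clear row k before accumulating */
        for (size_t i = 0; i < n; ++i) {
            double a = A[k * n + i];         /* constant within the inner loop */
            for (size_t l = 0; l < n; ++l)
                C[k * n + l] += a * B[i * n + l];
        }
    }
}
```

Both functions compute the same product; only the order in which memory is traversed differs. An optimizing compiler may perform this interchange on its own, which would explain why the hand-exchanged and the original version run equally fast.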
Generalized Matrix-Matrix Multiplication (GEMM)
Simple double precision implementation
Look at code matmultadd1.c
What is different?
make dgemm3
Start interactive session (if necessary)
Time routine for different matrix sizes
How many floating point operations per second?
Percentage of single-processor peak performance?
Using Intel's Math Kernel Library (MKL)
Highly optimized implementation
Look at code dgemm_mkl.cpp
cblas_dgemm(order, transA, transB, width, width, width,
alpha, A, width, B, width, beta, C, width);
module load mkl; module load intel
make dgemm4
Prepare chain job to time different matrix sizes
Using a job chain
#!/bin/bash
# This script starts a chain of jobs.
# The job name is also used to pass the size of the matrix.
lastsize=1024
# The first job is launched without any dependency.
msub -N "$lastsize" launch_dgemm4.sh
# Every other job depends on the successful completion of the previous one.
for s in 2048 4096 8192 16384
do
    msub -N "$s" -W x=depend:afterok:"$lastsize" launch_dgemm4.sh
    lastsize=$s
done
Time routine for different matrix sizes by running jobchain.sh
How many floating point operations per second?
Percentage of single-node peak performance (94 GFLOPS)?
MKL and OMP_NUM_THREADS
Highly optimized implementation
Try different values of OMP_NUM_THREADS
Time routine for different matrix sizes
How many floating point operations per second?
Percentage of single-node peak performance?
If you do this in an interactive session, don't forget to load the modules mkl and intel
Single Precision GEMM (SGEMM)
Highly optimized implementation
Copy dgemm_mkl.cpp to sgemm_mkl.cpp
Modify code, so that it uses float instead of double
cblas_sgemm(order, transA, transB, width, width, width,
alpha, A, width, B, width, beta, C, width);
make sgemm
Time routine for different matrix sizes
How many floating point operations per second?
Percentage of single-node peak performance?
Peak Performance!
SGEMM (MKL)
More than twice the double precision performance
191 GFLOPS compared to 197 GFLOPS on an NVIDIA Tesla 1060 card (354.96 GFLOPS for the kernel)
100% of peak
Use optimized libraries whenever you can
Overlapping Computation and Communication
MPI_Isend and MPI_Irecv offer non-blocking communication
Run the overlap_full example
make
write launch script or use interactive session
look at overlap
Does it work?