http://www.osc.edu/supercomputing/computing/opt/index.shtml




Supercomputing Environments
Using Glenn, the IBM Opteron Cluster at OSC
The Ohio Supercomputer Center (OSC) provides supercomputing services to Ohio colleges,
universities, and companies.

The Ohio Supercomputer Center's IBM Cluster 1350, named "Glenn", combines AMD Opteron
multi-core technology with IBM Cell processors. The system offers a peak performance of more
than 22 trillion floating-point operations per second and a variety of memory and processor
configurations. The cluster also includes blade systems based on the Cell Broadband Engine
processor, allowing Ohio researchers and industries to easily use this hybrid HPC architecture.

Please see the hardware section for current system specifications.

      Getting started
          o File system
          o Executing programs
          o Batch requests
          o Interactive batch requests
          o Estimating Queue Time
      Programming environment
          o Compiling systems
          o Shared memory
          o Message Passing Interface (MPI)
          o Debugging
          o Performance Analysis
      Software
      Training

Getting started

To log in to Glenn at OSC, ssh to the following hostname:
          glenn.osc.edu

From there, you have access to the compiling systems, performance-analysis tools, and
debugging tools. You can run programs interactively or through batch requests. See the
following sections for details.
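
For example, from a terminal on your local machine (the username shown is a placeholder for your
own OSC username):

          ssh username@glenn.osc.edu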

File system

Glenn accesses the user home directories found on the OSC mass storage environment.
Therefore, users have the same home directory on Glenn as on the Itanium 2 cluster.

The system also has fast local disk space intended for temporary files. You are encouraged to
perform the majority of your work in the temporary space and only store permanent files in your
home directory. To ensure fast access to required files, copy the files to the temporary area at the
start of your session.

The following example shows how to use /tmp, the temporary directory.

       mkdir /tmp/$USER          Create your own temporary directory.
       cp files /tmp/$USER       Copy the necessary files.
       cd /tmp/$USER             Move to the directory.
       ...                       Do work (compile, execute, etc.).
       ...
       cp new files $HOME        Copy important new files back home.
       cd $HOME                  Return to your home directory.
       rm -rf /tmp/$USER         Remove your temporary directory.
       exit                      End the session.

Use this procedure when compiling and executing interactively. The temporary space is not
backed up, and old files may be purged when the temporary file system gets full.

A simpler procedure is available for batch jobs through the TMPDIR environment variable. See
"Batch requests" for more information.

There are times when $TMPDIR has insufficient resources. After system requirements use some
of the hard drive, e.g. for swap space, anywhere from 45 GB to 1.8 TB of local temporary disk
space is available on each node. Jobs which require significant amounts of temporary disk space
(>10 GB) in $TMPDIR should specify that using the PBS -l file=amount directive described
below. Any job requiring either more than 1.8 TB of temporary space or shared temporary space
should use the /pvfs parallel file system, which is a high-performance, high-capacity shared
temporary space. For more information on parallel file system usage, please consult the web
page for PVFS at OSC.

Executing programs
Commands on Glenn can be executed either interactively or through batch requests. There are
fixed usage limits for interactive execution; jobs that take more than the allowed CPU time must
be executed using batch requests. Current interactive limits are 2 hours of CPU time and 1 GB of
memory. To use the resources of the cluster most efficiently, you are encouraged to use batch
requests whenever possible. See "Batch requests" for more information.

For information on how to execute an MPI program, see the "MPI" section.

To execute a non-MPI program, simply enter the name of the executable. Unless otherwise
specified, the number of processors used for a non-MPI parallel program is determined by the
operating system at runtime. To control the number of processors, set the environment variable
OMP_NUM_THREADS. If the number of available processors (four per node) is less than
OMP_NUM_THREADS, then at least one processor will run multiple threads.

The following ksh example causes a.out to use 4 processors if they are available.

          export OMP_NUM_THREADS=4
          ./a.out

The omp_get_num_threads() function can be called within a parallel region to determine the
number of threads assigned to that program.

          integer function omp_get_num_threads()      (Fortran)
          int omp_get_num_threads(void);              (C)

Batch requests

Batch requests are handled by the TORQUE resource manager and Moab Scheduler. Use the
qsub command to submit a batch request, qstat to view the status of your requests, and qdel to
delete unwanted requests. For more information, see the manual pages for each command.

The following options are often useful when submitting batch requests. The options may appear
on the qsub command line or preceded by #PBS at the beginning of the batch request file.

   Option                           Meaning

   -N jobname                       Name the job.

   -S shell                         Use shell rather than your default login shell to
                                    interpret the job script.

   -l walltime=time                 Total wall-clock time limit, in seconds or
                                    hours:minutes:seconds.

   -l nodes=numnodes:ppn=numprocs   Request use of numprocs processors (max 4) on each
                                    of numnodes nodes (max 175).

   -l mem=amount                    (OPTIONAL) Request use of amount of memory per node.
                                    Default units are bytes; can also be expressed in
                                    megabytes (e.g. mem=1000MB) or gigabytes
                                    (e.g. mem=2GB).

   -l file=amount                   (OPTIONAL) Request use of amount of local scratch
                                    disk space per node. Default units are bytes; can
                                    also be expressed in megabytes (e.g. file=10000MB)
                                    or gigabytes (e.g. file=10GB). Only required for
                                    jobs using more than 10 GB of local scratch space
                                    per node.

   -l software=package[+N]          (OPTIONAL) Request use of N licenses for package.
                                    If omitted, N=1. Only required for jobs using
                                    specific software packages with limited numbers of
                                    licenses; see the software documentation for
                                    details.

   -j oe                            Redirect stderr to stdout.

   -m ae                            Send e-mail when the job finishes or aborts.
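
As an illustrative example (the script name and limits are hypothetical), several of these options can
be given on the qsub command line rather than in the script; command-line options generally take
precedence over the corresponding #PBS lines:

          qsub -N myjob -l walltime=2:00:00 -l nodes=1:ppn=1 -l mem=2GB -m ae my_script.job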

By default, your batch jobs begin execution in your home directory. This is true even if you
submit the job from another directory.

To facilitate the use of temporary disk space, a unique temporary directory is automatically
created at the beginning of each batch job. This directory is also automatically removed at the
end of the job. Therefore, it is critical that all files required for further analysis be copied
back to permanent storage in your $HOME area prior to the end of your batch script. You
access the directory through the $TMPDIR environment variable. Note that in jobs using more
than one node, $TMPDIR is not shared -- each node has its own distinct instance of $TMPDIR.

Single-CPU sequential jobs should either set the -l nodes resource limit to 1:ppn=1 or leave it
unset entirely. The following is an example of a sequential job which uses $TMPDIR as its
working area.

#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -N myscience
#PBS -j oe
#PBS -S /bin/ksh

cd $HOME/science
cp my_program.f mysci.in $TMPDIR
cd $TMPDIR
pgf77 -O3 my_program.f -o mysci
/usr/bin/time ./mysci > mysci.hist
cp mysci.hist mysci.out $HOME/science

If you have the above request saved in a file named my_request.job (and my_program.f saved in
a subdirectory called science/), the following command will submit the request.

          opt-login01:~> qsub my_request.job
          1151787.opt-batch.osc.edu

You can use the qstat command to monitor the progress of the resulting batch job. In the above
example, the number 1151787 is the job identifier, or jobid. When the job finishes, mysci.hist and
mysci.out will appear in the science subdirectory, and the standard output generated by the job will
appear in a file called myscience.oN, where N is the numeric part of the jobid (1151787 in this
example). The N differentiates multiple submissions of the same job, since each submission
generates a different number. This file will appear in the directory where you executed the qsub
command. From within a PBS batch script, that directory can be referenced through the
environment variable $PBS_O_WORKDIR.
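
For example, using the jobid from the submission above, the job can be monitored or removed with
the following commands (output not shown):

          qstat 1151787             Show the status of this job.
          qstat -u $USER            Show all of your jobs.
          qdel 1151787              Delete the job if it is no longer needed.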

All batch jobs must set the -l walltime resource limit, as this allows the Moab Scheduler to
backfill small, short running jobs in front of larger, longer running jobs. This in turn helps
improve turnaround time for all jobs.

Sample large memory serial job:

#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -l mem=16gb
#PBS -N cdnz3d
#PBS -j oe
#PBS -S /bin/ksh

cd $HOME/Beowulf/cdnz3d
cp cdnz3d cdin.dat acq.dat cdnz3d.in $TMPDIR
cd $TMPDIR
./cdnz3d > cdnz3d.hist
cp cdnz3d.hist cdnz3d.out $HOME/Beowulf/cdnz3d
ja

Single-node jobs that request 16 GB or more of memory will be scheduled on the quad-socket
large memory nodes. The maximum amount of memory available on a node is 64 GB.

Sample large disk serial job:

#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -l file=96gb
#PBS -N cdnz3d
#PBS -j oe
#PBS -S /bin/ksh

cd $HOME/Beowulf/cdnz3d
cp cdnz3d cdin.dat acq.dat cdnz3d.in $TMPDIR
cd $TMPDIR
./cdnz3d > cdnz3d.hist
cp cdnz3d.hist cdnz3d.out $HOME/Beowulf/cdnz3d
ja

Single-node jobs that request more than 45 GB of temporary space will be scheduled on the
quad-socket nodes. The maximum amount of local disk space available on a node is 1800 GB;
jobs in need of more temporary space than that must use the /pvfs parallel file system instead.

Estimating Queue Time

To get an estimate of how long before a job (identified by jobid) starts, use the following
command:

      showstart [jobid]

This will query the Moab scheduler for an estimate of the job's start time. Please keep in mind
that this is an estimate and may change over time, depending on system load and other factors.

Programming environment

Glenn supports two programming models of parallel execution: shared memory on exactly one
node, through compiler directives and automatic parallelization; and distributed memory across
multiple nodes, through message passing. See the sections below for more information.

Compiling systems

FORTRAN 77, Fortran 90, C, and C++ are supported on the IBM Opteron cluster. The IBM
Opteron cluster has the Intel and Portland Group suites of optimizing compilers, which tend to
generate faster code than that generated by the standard GNU compilers.

The following examples produce the Linux executable a.out for each type of source file for the
Portland Group and Intel compilers. Options which have been found to produce good
performance with many (though not necessarily all) programs are given under "Recommended
Options".
C
     Portland Group:        pgcc sample.c
     Recommended options:   -Xa -tp x64 -fast -Mvect=assoc,cachesize:1048576
     Intel:                 icc sample.c
     Recommended options:   -O2 -ansi

C++
     Portland Group:        pgCC sample.C
     Recommended options:   -A -fast -tp x64 -Mvect=assoc,cachesize:1048576 --prelink-objects
     Intel:                 icpc sample.C
     Recommended options:   -O2 -ansi

FORTRAN 77
     Portland Group:        pgf77 sample.f
     Recommended options:   -fast -Mvect=assoc,cachesize:1048576
     Intel:                 ifort sample.f
     Recommended options:   -O2

Fortran 90
     Portland Group:        pgf90 sample.f90
     Recommended options:   -fast -Mvect=assoc,cachesize:1048576
     Intel:                 ifort sample.f90
     Recommended options:   -O2

For more information on command-line options for each compiling system, see the manual pages
(man pgf77, man icpc, etc...).
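
For example, the following commands (the source file name is a placeholder) compile a Fortran 90
program with the recommended options from the table above; each produces an executable named
a.out unless -o is used to choose another name:

          pgf90 -fast -Mvect=assoc,cachesize:1048576 sample.f90
          ifort -O2 sample.f90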

Shared memory

Users can automatically optimize single-node sequential programs for shared-memory parallel
execution using the Portland Group -Mconcur or Intel -parallel compiler option.

          pgf77 -O2 -Mconcur sample.f
          pgf90 -O2 -Mconcur sample.f90
          pgcc -O2 -Mconcur sample.c
          pgCC -O2 -Mconcur sample.C

          ifort -O2 -parallel sample.f
          ifort -O2 -parallel sample.f90
          icc -O2 -parallel sample.c
          icpc -O2 -parallel sample.C


In addition to automatic parallelization, both the Fortran and C/C++ compilers understand the
OpenMP set of directives, which give the programmer a finer control over the parallelization.
The -mp (Portland Group) and -openmp (Intel) compiler options activate translation of source-
level OpenMP directives and pragmas.
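
For example, assuming an OpenMP source file named omp_program.f (the name is hypothetical), the
program can be compiled with either suite as follows:

          pgf77 -O2 -mp omp_program.f
          ifort -O2 -openmp omp_program.f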

A sample batch script appears below. The request first copies a Fortran file from a subdirectory
of the user's home directory to the temporary space. It then compiles the file for OpenMP
threaded execution, runs the executable using 4 threads on 1 node, and copies the results back to
the previous subdirectory. Notice that the careful use of full file names allows this request to be
submitted safely from any subdirectory.

        #PBS   -l   walltime=1:00:00
        #PBS   -l   nodes=1:ppn=4
        #PBS   -N   my_job
        #PBS   -S   /bin/ksh
        #PBS   -j   oe
        cd $TMPDIR
        cp $HOME/science/my_program.f .
        pgf77 -O2 -mp my_program.f
        export OMP_NUM_THREADS=4
        ./a.out > my_results
        cp my_results $HOME/science

Message Passing Interface (MPI)

The system uses the MPICH implementation of the Message Passing Interface (MPI), optimized
for the high-speed Infiniband interconnect. MPI is a standard library for performing parallel
processing using a distributed-memory model. For more information on MPI, see the Training
section of the OSC website.

Each program file using MPI must include the MPI header file. The following statement must
appear near the beginning of each C or Fortran source file, respectively.

        #include <mpi.h>
        include 'mpif.h'

To compile an MPI program, use the MPI wrapper scripts which invoke the Portland Group or
Intel compilers depending on which module is loaded prior to executing the compilation
command. The MPI compilers take the same options as the compiler they wrap. Here are some
examples which produce an executable named a.out:

        mpif77 sample.f

        mpif90 sample.f90

        mpicc sample.c

        mpiCC sample.C

Use the mpiexec command to run the resulting executable in a batch job; this command will
automatically determine how many processors to use based on your batch request.

         mpiexec a.out

Here is an example of an MPI job which uses 8 of the Infiniband-equipped nodes on the IBM
Opteron cluster:

        #PBS   -l   walltime=1:00:00
        #PBS   -l   nodes=8:ppn=4
        #PBS   -N   my_job
        #PBS   -S   /bin/ksh
        #PBS   -j   oe

        cd $HOME/science
        mpif77 -O3 mpiprogram.f
        pbsdcp a.out $TMPDIR
        cd $TMPDIR
        mpiexec ./a.out > my_results
        cp my_results $HOME/science

Jobs that request a large number of nodes (for instance more than 100) are very difficult to
schedule and may sit in the queue for a very long time. In practice it is best to start out requesting
nodes=2:ppn=4 and then increase the number of nodes as you confirm that your code's
performance scales up with larger numbers of processors.

mpiexec will normally spawn one MPI process per CPU requested in a batch job. However, this
behavior may be modified with the -pernode command-line option, which requests that one MPI
process be spawned per node. This option is intended for codes which mix MPI message passing
with some form of shared-memory programming model, such as OpenMP or POSIX threads.

If you wish to use fewer than the assigned number of processors, set the -n option to mpiexec to
the required number. Here is an example:

          #PBS   -l   walltime=1:00:00
          #PBS   -l   nodes=5:ppn=4
          #PBS   -N   my_job
          #PBS   -S   /bin/ksh
          #PBS   -j   oe

          ...
          mpiexec -n 19 a.out
          # running 19 MPI processes

If you wish to run one MPI process on each node for benchmarking or multithreading purposes,
you need to continue specifying ppn=4, but add the -pernode option to mpiexec. Here is an
example:


          #PBS   -l   walltime=1:00:00
          #PBS   -l   nodes=5:ppn=4
          #PBS   -N   my_job
          #PBS   -S   /bin/ksh
          #PBS   -j   oe

          ...
          mpiexec -pernode a.out
          # running 5 MPI processes, one on each node

The pbsdcp command used in the example above is a distributed copy command; it copies the
listed file or files to the specified destination (the last argument) on each node of the cluster
assigned to your job. This is needed when copying files to directories which are not shared
between nodes, such as /tmp or $TMPDIR.
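
For example, in a multi-node job the executable and input files (names hypothetical) can be scattered
to every node's $TMPDIR before the MPI run; a plain cp would place them only on the node where
the batch script itself runs:

        pbsdcp a.out input.dat params.in $TMPDIR
        cd $TMPDIR
        mpiexec ./a.out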

Debugging
The GNU debugger gdb is recommended for interactive or post-mortem analysis of sequential
programs. To debug a program with gdb, first compile the program with the -g option.

          pgf77 -g program.f
          pgf90 -g program.f90
          pgcc -g program.c
          pgCC -g program.C

To debug a program interactively, run the debugger on the appropriate executable.

          gdb a.out

To analyze a core file after an unsuccessful execution, run the debugger on the core file and
supply the executable that generated the file.

          gdb a.out core
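
A typical post-mortem session might look like the following sketch (the frame number and variable
name are hypothetical); see the gdb documentation for the full command set:

          gdb a.out core
          (gdb) bt                 Show the call stack at the point of failure.
          (gdb) frame 2            Select a stack frame.
          (gdb) print myvar        Inspect a variable in that frame.
          (gdb) quit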

A graphical front-end for gdb, the Data Display Debugger (ddd), is also available. As with gdb, the
program must first be compiled with the -g option as given above. To debug a program
interactively, run the debugger on the appropriate executable.

          ddd a.out

Further information and documentation on DDD can be found at
http://www.gnu.org/software/ddd.

The totalview debugger is designed to run on parallel programs using MPI, OpenMP, or
pthreads. The user interacts with totalview via a graphical user interface (GUI). All OSC clusters
are designed to run compiled parallel code via the PBS batch system. Using the standard batch
submission process a user cannot interact directly with their running program. However PBS also
permits running in interactive batch mode. This allows the user to use GUI programs such as
totalview to run a parallel code. The resource (memory, CPU) limits for an interactive batch job
are the same as the standard batch limits for that user. The following is a sample interactive batch
script named mybatchfile:

          #PBS   -j   oe
          #PBS   -N   totalview
          #PBS   -S   /bin/ksh
          #PBS   -l   nodes=2:ppn=4
          #PBS   -l   walltime=1:00:00
          #PBS   -v   DISPLAY

There is no script section as this is intended to run interactively. The PBS lines are there to
request resources. On the command line use qsub to request an interactive shell:

          >> qsub -I mybatchfile
          qsub: waiting for job 0.opt-batch.osc.edu to start
          qsub: job 0.opt-batch.osc.edu ready
The same request may also be accomplished without a batch file by typing all the resource requests
directly on the command line:

        >> qsub -I -v DISPLAY -l nodes=2:ppn=4 -l walltime=1:00:00 -j oe -N
totalview -S /bin/tcsh

Once you have an interactive shell on one of the compute nodes, you can treat it like any other
shell, except that it also defines the extended PBS environment variables, such as
$PBS_O_WORKDIR and $TMPDIR. To invoke totalview, run mpiexec with the -tv option on your
MPI program:

          [optXXXX]% mpiexec -tv myMPIprogram

For more information on using interactive batch see the manual page for qsub.

Within totalview, you can set breakpoints and examine variables on a per-process basis.

Performance Analysis
Software
Training

https://hpc.cineca.it/docs/HPCUserGuide/OldIBMCLXUserGuide




System architecture
CLX is an IBM Linux Cluster 1350, made (mostly) of 512 2-way IBM x335 nodes. Each
computing node contains 2 Xeon (Pentium IV class) processors. All the compute nodes have 2 GB
of memory (1 GB per processor). 768 of the CLX processors are Xeon at 3.06 GHz with 512 KB of
L2 cache and a Front Side Bus (FSB) at 533 MHz. 256 processors, bought at the beginning of 2005,
are Xeon EM64T (Nocona) at 3.00 GHz with 1024 KB of L2 cache and an FSB at 800 MHz, and
they support Hyper-Threading Technology. All the CLX processors are capable of 2 double-
precision floating-point operations per cycle, using the Intel SSE2 extensions. Login and service
node processors run at 2.8 GHz and have more memory. All the nodes are interconnected through a
Myrinet network (http://www.myricom.com), capable of a maximum bandwidth of 256 MB/s
between each pair of nodes. The core component of the network is a pair of M3-CLOS
Myrinet-2000 switches in a CLOS 256+256 configuration. The global peak performance of CLX is
6.1 TFlops. The queuing system of CLX is LSF (it was OpenPBS up to August 2005).

Disks and Filesystems
Since August 2005, CLX conforms fully to the standard CINECA infrastructure.
Programming environment
The programming environment of the IBM CLX machine consists of a choice of compilers for
the main scientific languages (Fortran, C and C++), debuggers to help users find bugs and errors in
their codes, and profilers to help in code optimisation.

In general you must "load" the correct module environment even to use programming tools such as
compilers.

If you use a given set of modules to compile an application, you will very probably need the same
modules to run it, because linking is dynamic by default on Linux systems, and at runtime the
application will need the shared libraries of the compiler and of any libraries it uses. To minimize
the number of modules needed at runtime, use static linking when compiling your applications.
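
As a minimal sketch (the module name comes from the examples below; the program name is
hypothetical), the same compiler module is loaded both when building and when running a
dynamically linked executable:

module load compiler/intel
ifort -O3 -o myprog myprog.f90

and later, in the job script or interactive session that runs the program:

module load compiler/intel
./myprog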

Compilers

Available compilers are the standard GNU gcc and g77, GNU g95, and the Intel and Portland Group
(PGI) compilers. After loading the appropriate module, the command:

man <compiler_command>

gives the complete list of flags supported by the compiler.

INTEL Compiler

Initialize the environment with one of the module commands:

module load compiler/intel
module load compiler/intel7
module load compiler/intel8

       ifort: Fortran77 and Fortran90 compiler
       icc: C and C++ compiler

Find the documentation of the two compilers respectively in the directories:

$IFORT_DOC
$ICC_DOC

Some optimizations we suggest:

       to align data on cache-line boundaries, tune for the Pentium 4/Xeon processor, and enable
        -O3 optimizations:

-align -tpp7 -O3

       to also enable loop vectorization (SSE/SSE2 instructions) to improve loop performance:

-align -tpp7 -O3 -xN -ftz
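
For example (the source file name is hypothetical), a complete compilation line using the vectorizing
set of options would be:

ifort -align -tpp7 -O3 -xN -ftz -o myprog myprog.f90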

PGI

Initialize the environment with one of the module commands:

module load compiler/pgi

The names of the PGI compilers are:

 *    pgf90: Fortran90 compiler
 *    pgf77: Fortran77 compiler
 *    pgcc: C compiler
 *    pgCC: C++ compiler

Find the documentation of the PGI compilers in the directory:

$PGI_DOC

Some optimizations we suggest:

         to align data on cache-line boundaries, tune for the Pentium 4/Xeon processor, and enable
          -O3 optimizations:

-Mcache_align -tp p7 -O3

         best PGI suggested combination:

-Mcache_align -tp p7 -fast

         best PGI-suggested combination, adding loop vectorization (SSE/SSE2 instructions) to
          improve loop performance:

-Mcache_align -tp p7 -fastsse
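
For example (the source file name is hypothetical), a complete compilation line using the vectorizing
combination would be:

pgf90 -Mcache_align -tp p7 -fastsse -o myprog myprog.f90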

GNU

g77 and gcc are always available but are not the best optimizing compilers.

Optimize with -O2 or -O3.

To try the GNU g95 compiler you must first load the corresponding module:

module load compiler/g95
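
For example (the source file name is hypothetical):

module load compiler/g95
g95 -O3 -o myprog myprog.f90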

Debuggers
Enabling compiler runtime checks

Pay attention: some flags are available only for the Fortran compiler.

INTEL

Compile and link using the options:

-O0 -g -traceback -fpstkchk -check bounds -fpe0

These options give no optimization, debug information, array bounds checking, and floating-point
exception trapping. If your code dies at runtime, there is a problem; you can run the code under the
debugger or analyze the core file (core files are not available with the PGI compilers).
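
For example (the source file name is hypothetical; as noted above, some of these flags apply only to
the Fortran compiler), a Fortran code would be rebuilt for checking with:

ifort -O0 -g -traceback -fpstkchk -check bounds -fpe0 -o myprog myprog.f90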

PGI

Compile and link using

         -O0 -g -C -Ktrap=ovf,divz,inv

These options give no optimization, debug information, array bounds checking, and floating-point
exception trapping.

GNU

Compile and link using

-O0 -g -Wall -fbounds-check

These options give no optimization, debug information, a high warning level, and array bounds checking.

Intel: idb (serial debugger)

idb -gdb ./executable
see gdb and idb documentation

PGI: pgdbg (serial debugger)
pgdbg      ./executable
see pgdbg documentation

GNU: gdb (serial debugger)
gdb ./executable

Valgrind

Valgrind is a runtime debugging tool, very useful for finding insidious errors in codes such as
memory leaks, use of uninitialized memory, mismatched use of malloc/new and free/delete, and
overlapping src and dst pointers in memcpy, strcpy, etc. Compile and link your code with the
-O0 -g options (add -traceback if you are using the Intel compiler), and run it with:

valgrind --tool=memcheck ./your_executable your_input_params

Your code will run very slowly, but suspected memory errors are reported on stderr. See
http://valgrind.org/ for documentation.

Core file analysis

Create core files ONLY in the /scratch area, so as not to exceed your home quota! First, enable core
dumping:

bash:            ulimit -c unlimited
csh/tcsh:        limit coredumpsize unlimited

If you are using Intel compiler, set the following environment variable:

bash:            export decfort_dump_flag=TRUE
csh/tcsh:        setenv decfort_dump_flag TRUE
Run your code and create the core file. To analyze it:

INTEL
idb -gdb ./executable core

GNU
gdb ./executable core
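
Putting the steps together, a minimal sketch of a session under bash (the scratch path, program name
and input file are hypothetical):

cd /scratch/$USER
ulimit -c unlimited
export decfort_dump_flag=TRUE
./myprog my_input.dat
gdb ./myprog core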

Parallel debugger Totalview and DDT

Totalview is available for debugging parallel codes. See the documentation at
http://www.etnus.com/Documentation/. Only the GNU and Intel compilers are supported:

GNU:       compile and link with "-O0 -g"
INTEL:     compile and link with "-O0 -g -traceback"

To run Totalview:

Environment setup:

Initialize the environment with the command:

module load totalview

If the DISPLAY is not correctly set (e.g., you are running within a "bsub -Is" session), set it with
something similar to "export DISPLAY=your_IP:0.0".

Run Totalview:

bsub -a tv -n 2 -Is mpirun.lsf ./your_executable -tvopt -no_ask_on_dlopen

When Totalview starts, the source being debugged is not shown immediately; you must follow these
steps:

GO > YES
VIEW > Lookup Function > and find
"MAIN__"           for a Fortran code,
"main"             for a C code

When the source code is displayed you can set breakpoints, etc.

DDT

DDT is a graphical debugger for serial and parallel programs. First of all, compile your code with
the typical debugging flags:

GNU:       compile and link with "-O0 -g"
INTEL:     compile and link with "-O0 -g -traceback"
PGI:       compile and link with "-O0 -g"

Then load the correct module:

module load ddt

and run your code with a command like:

bsub -n 4 -Is ddt $PWD/executable_name

follow the instructions in the graphical windows, choosing "mpich gm" as the "MPI implementation"
if you are running a parallel code. Further documentation: http://www.allinea.com

Profilers

gprof

In order to check where your code spends most of its time, you can use gprof. It uses data
collected by the -pg compiling option to construct a text display of the functions within your
application (call tree and CPU time spent in every subroutine). gprof provides quick access to the
profiled data, which lets you identify the functions that are the most CPU-intensive. The text
display also lets you manipulate the display in order to focus on the application's critical areas.
Usage:

compiler_name -pg <optimization flags> -o filename filename.f

Run the program filename to produce the profiling output file gmon.out. Finally, perform the
profiling:

gprof filename gmon.out

It is also possible to profile at the code line level (see man gprof for other options). In this case you
must also use the "-g" flag:

compiler_name -g -pg <optimization flags> -o filename filename.f
gprof --annotated-source filename gmon.out
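
As a concrete sketch (the compiler, optimization flags and file names are hypothetical), the whole
profiling cycle for a Fortran source file might look like:

ifort -pg -O2 -o myprog myprog.f
./myprog
gprof myprog gmon.out > profile.txt

ifort -g -pg -O2 -o myprog myprog.f
./myprog
gprof --annotated-source myprog gmon.out > annotated.txt

The first gprof run gives the flat profile and call graph; the second, compiled with -g, adds per-line
annotation of the source.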

ODT

To be completed. Documentation: http://www.allinea.com

Scientific libraries
MKL

MKL is the Intel Math Kernel Library. It contains highly optimized BLAS and LAPACK libraries
and other highly optimized Intel routines. To compile and link with MKL you need to add the
following flags:

compiling:

-I$MKL_INCLUDE

linking:

-L$MKL_LIB -lmkl_lapack -lmkl_ia32 -lguide

Find the MKL documentation on CLX in $MKL_DOC. You can view the MKL manual online with
the command:

acroread $MKL_DOC/mklman.pdf

or download and view it locally.
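
For example (the source file name is hypothetical), a Fortran program calling BLAS or LAPACK
routines could be built with:

ifort -I$MKL_INCLUDE -o myprog myprog.f90 -L$MKL_LIB -lmkl_lapack -lmkl_ia32 -lguide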


Parallel programming
Parallel programming is mainly based on the MPICH-GM version of MPI (Myrinet-enabled MPI).
The four main MPI compiler wrappers available are:

mpif90
mpif77
mpicc
mpiCC

for Fortran90, Fortran77, C and C++ respectively. These command names are the same for all suites
of compilers, but they behave differently depending on the module you have loaded. In all cases you
will run the applications compiled with the parallel compilers with the command:

mpirun <executable>

Remember: you can use mpirun only within LSF scripts or LSF interactive sessions. To choose the
desired underlying compiler, select the appropriate environment with one of the commands:

module load mpich/intel
module load mpich/pgi
module load mpich/gnu

Use

module avail

to see what versions are available. A version of MPICH that is especially useful for debugging and
for third-party codes is the one that does not rely on the Myrinet protocol but uses standard TCP/IP
instead. The names of the modules for that version of MPICH are:

mpich-p4/gnu
mpich-p4/intel
mpich-p4/pgi
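
As a minimal sketch of a complete parallel run under LSF (the job and program names are
hypothetical, queue selection is omitted, and the standard #BSUB directives may need adjusting to
local policies; -J sets the job name, -n the number of processors, and -o the output file):

#!/bin/bash
#BSUB -J myjob
#BSUB -n 4
#BSUB -o myjob.%J.out

module load mpich/intel
mpirun ./myprog

Such a script would typically be submitted with a command like "bsub < myjob.sh".
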
Default environment

The default environment of CLX has the following modules loaded:

mpich/intel
compiler/intel

The Intel compiler generally gives the best performance for HPC codes on this system. If you need
to use another compiler and MPI version, unload the Intel modules, or clean the environment
completely with:

module purge

				