                    Introduction to OpenMP

For a more detailed tutorial, see the
    presentations at http://www.openmp.org
               Concepts

• Directive based programming
  – declare properties of language structures
    (sections, loops)
  – scope variables
• A few service routines
  – get information
• Compiler options
• Environment variables
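
A minimal sketch of how these pieces combine (the loop and names
below are illustrative, not from the slides):

        #include <omp.h>
        #include <stdio.h>

        int main(void)
        {
            int i, sum = 0;
            /* directive: declare the loop parallel, scope the variable sum */
            #pragma omp parallel for reduction(+:sum)
            for (i = 0; i < 100; i++)
                sum += i;
            /* service routine: ask how many threads are available */
            printf("sum = %d, max threads = %d\n", sum, omp_get_max_threads());
            return 0;
        }

The program is built with an OpenMP compiler option, and its thread
count is controlled with the OMP_NUM_THREADS environment variable.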
 OpenMP Programming Model
• fork-join parallelism
• Master thread spawns a team of threads as
  needed.
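
A small illustration of fork-join (the output lines are what a
conforming implementation reports):

        #include <omp.h>
        #include <stdio.h>

        int main(void)
        {
            printf("before: %d thread(s)\n", omp_get_num_threads()); /* 1 */
            #pragma omp parallel                                     /* fork */
            {
                #pragma omp master
                printf("inside: %d thread(s)\n", omp_get_num_threads());
            }                                                        /* join */
            printf("after:  %d thread(s)\n", omp_get_num_threads()); /* 1 */
            return 0;
        }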
             Typical OpenMP Use
 • Generally used to parallelize loops
     – Find the most time-consuming loops
     – Split their iterations up between the threads

Serial version:

        int main()
        {
          double Res[1000];
          for(int i=0;i<1000;i++) {
              do_huge_comp(Res[i]);
          }
        }

OpenMP version:

        int main()
        {
          double Res[1000];
        #pragma omp parallel for
          for(int i=0;i<1000;i++) {
              do_huge_comp(Res[i]);
          }
        }
           Thread Interaction
• OpenMP operates using shared memory
  – Threads communicate via shared variables
• Unintended sharing can lead to race conditions
  – output changes due to thread scheduling
• Control race conditions using synchronization
  – synchronization is expensive
  – change the way data is stored to minimize the need
    for synchronization
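
A runnable sketch of the hazard and one synchronized fix (the
million-increment loop is illustrative):

        #include <omp.h>
        #include <stdio.h>

        int main(void)
        {
            int i, total = 0;
            #pragma omp parallel for
            for (i = 0; i < 1000000; i++)
                total += 1;          /* race: concurrent updates can be lost */
            printf("racy total: %d\n", total);   /* often < 1000000 */

            total = 0;
            #pragma omp parallel for
            for (i = 0; i < 1000000; i++)
            {
        #pragma omp critical
                total += 1;          /* one thread at a time: correct, slow */
            }
            printf("safe total: %d\n", total);   /* always 1000000 */
            return 0;
        }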
              Syntax format
• Compiler directives
  – C/C++
     • #pragma omp construct [clause [clause] …]
  – Fortran
     • C$OMP construct [clause [clause] … ]
     • !$OMP construct [clause [clause] … ]
     • *$OMP construct [clause [clause] … ]
• Since we use directives, no changes need
  to be made to a program for a compiler
  that doesn’t support OpenMP
                 Using OpenMP
• Compilers can automatically insert directives with an option
   – -qsmp=auto (IBM)
   – xlf_r and xlc do a good job (IBM)
   – some loops may speed up, some may slow down
• A compiler option is required when you write in directives
   – -qsmp=omp (IBM)
   – -mp (SGI)
• Can mix directives with automatic parallelization
   – -qsmp=auto:omp (IBM)


• Scoping variables is the hard part!
   – shared variables, thread private variables
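
For example, on an IBM system the invocations might look like this
(the file name is illustrative):

        xlc_r -qsmp=omp  prog.c      (compile hand-written directives)
        xlc_r -qsmp=auto prog.c      (automatic parallelization)
        xlc_r -qsmp=auto:omp prog.c  (mix both)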
         OpenMP Directives
• 5 categories
  – Parallel Regions
  – Worksharing
  – Data Environment
  – Synchronization
  – Runtime functions / environment variables
• Basically the same between C/C++ and
  Fortran
              Parallel Regions
• Create threads with omp parallel

         double A[1000];
         omp_set_num_threads(4);
         #pragma omp parallel
         {
               int ID = omp_get_thread_num();
               dosomething(ID, A);
         }
• Threads share A (default behavior)
• Threads all start at the same time, then synchronize
  at a barrier at the end before the serial code continues.
           Sections construct
• The sections construct gives a different
  structured block to each thread
• By default there is a barrier at the end; use the
  nowait clause to turn it off.
           #pragma omp parallel
           #pragma omp sections
           {
                  x_calculation();
           #pragma omp section
                  y_calculation();
           #pragma omp section
                  z_calculation();
           }
       Work-sharing constructs
• The for construct splits loop iterations among the threads

• By default, there is a barrier at the end of the “omp
  for”. Use the “nowait” clause to turn off the barrier.

          #pragma omp parallel
          #pragma omp for
               for (I=0;I<N;I++)
               {
                      NEAT_STUFF(I);
               }
         Short-hand notation
• Can combine parallel and work sharing
  constructs
         #pragma omp parallel for
                for (I=0;I<N;I++){
                          NEAT_STUFF(I);
                }




• There is also a “parallel sections”
  construct
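
For instance, the sections example from earlier can be written with
the combined construct (same illustrative functions):

          #pragma omp parallel sections
          {
                 x_calculation();
          #pragma omp section
                 y_calculation();
          #pragma omp section
                 z_calculation();
          }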
                           A Rule
• In order to be made parallel, a loop must
  have canonical “shape”
       for (index = start; index OP end; INCR)

   where OP is one of

       <    <=    >=    >

   and INCR is one of

       index++;         ++index;
       index--;         --index;
       index += inc;    index -= inc;
       index = index + inc;
       index = inc + index;
       index = index - inc;
                          An example
             #pragma omp parallel for private(j)
             for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
                        for (j = 0; j < n; j++)
                                   a[i][j] = MIN(a[i][j], a[i][k] + tmp[j]);

By definition, private variable values are undefined at loop entry and exit

To change this behavior, you can use the
        firstprivate(var) and lastprivate(var)
clauses

             x[0] = complex_function();
             #pragma omp parallel for private(j) firstprivate(x)
             for (i = 0; i < n; i++)
             {
                        for (j = 0; j < m; j++)
                                   x[j] = g(i, x[j-1]);
                        answer[i] = x[j] - x[i];
             }
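
Similarly, lastprivate copies the value from the sequentially last
iteration back out of the loop. A minimal fragment, assuming a double
x and placeholders n and f:

             #pragma omp parallel for lastprivate(x)
             for (i = 0; i < n; i++)
                        x = f(i);      /* each thread uses its own x */
             /* x now holds the value from iteration i = n-1 */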
           Scheduling Iterations
• The schedule clause affects how loop iterations are
  mapped onto threads
• schedule(static [,chunk])
   – Deal-out blocks of iterations of size “chunk” to each thread.
• schedule(dynamic[,chunk])
   – Each thread grabs “chunk” iterations off a queue until all
     iterations have been handled.
• schedule(guided[,chunk])
   – Threads dynamically grab blocks of iterations. The size of the
     block starts large and shrinks down to size “chunk” as the
     calculation proceeds.
• schedule(runtime)
   – Schedule and chunk size taken from the OMP_SCHEDULE
     environment variable.
                    An example

          #pragma omp parallel for private(j) schedule(static, 2)
          for (i = 0; i < n; i++)
                     for (j = 0; j < m; j++)
                                 x[j] = g(i, x[j-1]);



You can tune the chunk size to address load-balancing issues, etc.
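
With schedule(runtime), the same loop can be retuned without
recompiling; a sketch (the shell line is illustrative):

          export OMP_SCHEDULE="dynamic,4"

          #pragma omp parallel for private(j) schedule(runtime)
          for (i = 0; i < n; i++)
                     for (j = 0; j < m; j++)
                                x[j] = g(i, x[j-1]);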
    Scheduling considerations
• Dynamic is most general and provides
  load balancing
• If the choice of schedule has a (big) impact on
  performance, something is wrong:
  – overhead too big => work in the loop too small
• n can be a specification expression, not just a
  constant
   Synchronization Directives
• BARRIER
  – inside PARALLEL, all threads synchronize
• CRITICAL (lock) / END CRITICAL (lock)
  – a section that only one thread at a time may
    execute
  – lock is an optional name to distinguish several
    critical constructs from each other
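
A sketch of named critical constructs (the counters are
illustrative): updates guarded by different names may proceed
concurrently, while updates guarded by the same name exclude each
other.

        #pragma omp critical (xaxis)
               x_count++;
        #pragma omp critical (yaxis)
               y_count++;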
              An example
double area, pi, x;
int i, n;


area = 0.0;

#pragma omp parallel for private(x)
for (i = 0; i < n; i++)
{
           x = (i + 0.5)/n;

#pragma omp critical
         area += 4.0/(1.0 + x*x);
}
pi = area / n;
                   Reductions
• Sometimes you want each thread to
  calculate part of a value, then combine all
  the partial results into a single value
• This is done with the reduction clause
         area = 0.0;
         #pragma omp parallel for private(x) reduction (+:area)
         for (i = 0; i < n; i++)
         {
                    x = (i + 0.5)/n;
                    area += 4.0/(1.0 + x*x);
         }
         pi = area / n;
                        Another Example
/* A Monte Carlo algorithm for calculating pi */
int                 count;                /* points inside the unit quarter circle */
unsigned short      xi[3];                /* random number seed */
int                 i;                    /* loop index */
int                 samples;              /* Number of points to generate */
double              x,y;                  /* Coordinates of points */
double              pi;                   /* Estimate of pi */

xi[0] = 1;               /* These statements set up the random seed */
xi[1] = 1;
xi[2] = 0;
count = 0;
for (i = 0; i < samples; i++)
{
             x = erand48(xi);
             y = erand48(xi);
             if (x*x + y*y <= 1.0) count++;
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);

OpenMP issues:
• Each thread needs a different random number seed
• count is shared; we need the aggregate value
                        OpenMP Version
/* A Monte Carlo algorithm for calculating pi */
int                 count;                /* points inside the unit quarter circle */
unsigned short      xi[3];                /* random number seed */
int                 i;                    /* loop index */
int                 samples;              /* Number of points to generate */
double              x,y;                  /* Coordinates of points */
double              pi;                   /* Estimate of pi */

omp_set_num_threads(omp_get_num_procs());
xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num();
count = 0;

#pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count)
for (i = 0; i < samples; i++)
{
             x = erand48(xi);
             y = erand48(xi);
             if (x*x + y*y <= 1.0) count++;
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);
                        An alternate version
• In the version above, omp_get_thread_num() is called in the
  serial region, so it returns 0 and firstprivate(xi) gives every
  thread the same seed; here each thread seeds itself inside the
  parallel region.
…

#pragma omp parallel private(xi, t, i, tid, x, y, local_count)
{
           xi[0] = 1; xi[1] = 1;
           xi[2] = tid = omp_get_thread_num();
           t = omp_get_num_threads();
           local_count = 0;

           for (i = tid; i < samples; i += t)
           {
                        x = erand48(xi);
                        y = erand48(xi);
                        if (x*x + y*y <= 1.0) local_count++;
           }
#pragma omp critical
           count += local_count;
}
  pi = 4.0 * count / samples;
  printf("Estimate of pi: %7.5f\n", pi);
}
          Conditional Execution
•   Overhead of fork/join is high
•   If a loop is small, you don’t want to parallelize it
•   But, you may not know how big until runtime
•   Conditional clause for parallel execution
    – if ( expression )
          area = 0.0;
          #pragma omp parallel for private(x) reduction (+:area) if (n > 5000)
          for (i = 0; i < n; i++)
          {
                     x = (i + 0.5)/n;
                     area += 4.0/(1.0 + x*x);
          }
          pi = area / n;
                 Scope Rules
• Shared memory programming model
  – most variables are shared by default
• Global variables are shared
• But not everything is shared
  – stack variables in functions are private
• A variable that is set and then used in the DO is
  PRIVATE
• An array whose subscript is constant with respect to
  the PARALLEL DO, and that is set and then used within
  the DO, is PRIVATE
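
A small, runnable illustration of the defaults (names are
illustrative):

   #include <omp.h>

   int g = 0;                /* global variable: shared by all threads */

   void work(void)
   {
       int t = omp_get_thread_num();  /* stack variable: private per thread */
       if (t == 0)
           g = 1;                     /* every thread sees this update */
   }

   int main(void)
   {
       #pragma omp parallel
       work();
       return g;             /* 1 after the parallel region */
   }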
              Scope Clauses
• The DO and for directives take extra clauses; the
  most important are
  – PRIVATE (variable list)
  – REDUCTION (op: variable list)
     • op is e.g. sum, min, max
     • the variable is a scalar; XLF also allows arrays
         Scope Clauses (2)
• PARALLEL, PARALLEL DO, and
  PARALLEL SECTIONS also have
 – DEFAULT (PRIVATE | SHARED | NONE)
    • scope determined by rules
 – SHARED (variable list)
 – IF (scalar logical expression)
    • directives are like a programming language
      extension, not a compiler option
   integer i,j,n
   real*8 a(n,n), b(n)
   read (1) b
!$OMP PARALLEL DO PRIVATE (i,j) SHARED (a,b,n)
   do j=1,n
    do i=1,n
       a(i,j) = sqrt(1.d0 + b(j)*i)
    end do
   end do
!$OMP END PARALLEL DO
                 Matrix Multiply
!$OMP PARALLEL DO PRIVATE(i,j,k)
do j=1,n
 do i=1,n
   do k=1,n
    c(i,j) = c(i,j) + a(i,k) * b(k,j)
   end do
 end do
end do
                  Analysis
•   The outer loop is parallel: columns of c
•   Not optimal for cache use
•   Directives could be added to the inner loops as well
•   But then the granularity might be too fine
            OMP Functions
•   int omp_get_num_procs()
•   int omp_get_num_threads()
•   int omp_get_thread_num()
•   void omp_set_num_threads(int)
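
A sketch using all four (the output format is illustrative):

   #include <omp.h>
   #include <stdio.h>

   int main(void)
   {
       omp_set_num_threads(omp_get_num_procs()); /* one thread per processor */
       #pragma omp parallel
       {
           printf("thread %d of %d\n",
                  omp_get_thread_num(), omp_get_num_threads());
       }
       return 0;
   }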
           Serial Directives
• MASTER / END MASTER
  – executed by master thread only
• DO SERIAL / END DO SERIAL
  – loop immediately following should not be
    parallelized
  – useful with -qsmp=omp:auto
• SINGLE
  – only one thread executes the block
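
For example, MASTER restricts a block to the master thread
(do_work, more_work, and the message are illustrative); note that
there is no implied barrier at END MASTER:

   #pragma omp parallel
   {
          do_work();
   #pragma omp master
          printf("progress report\n");  /* master thread only, no barrier */
          more_work();
   }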
           Example Serial Execution
/* A Monte Carlo algorithm for calculating pi */
…
omp_set_num_threads(omp_get_num_procs());
xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num();
count = 0;

#pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count)
for (i = 0; i < samples; i++)
{
             x = erand48(xi);
             y = erand48(xi);
             if (x*x + y*y <= 1.0) count++;
#pragma omp single
             {
             printf("Loop Iteration: %d\n", i);
             }
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);

• Note: standard OpenMP does not allow a single construct
  to be closely nested inside a worksharing loop, so treat
  this example as an illustration of the directive only.
  Fortran Parallel Directives
• PARALLEL / END PARALLEL
• PARALLEL SECTIONS / SECTION /
  SECTION / END PARALLEL SECTIONS
• DO / END DO
  – work sharing directive for DO loop
    immediately following
• PARALLEL DO / END PARALLEL DO
  – combined parallel region and work-sharing
    directive

				