MPI Message Passing

					    Parallel Computing 3
    Models of Parallel Programming
               Ondřej Jakl
Institute of Geonics, Academy of Sci. of the CR
                Outline of the lecture
• Aspects of practical parallel programming
• Parallel programming models
• Data parallel
   – High Performance Fortran
• Shared variables/memory
   – compiler’s support: automatic/assisted parallelization
   – OpenMP
   – thread libraries
• Message passing

                         Parallel programming (1)

            • Primary goal: maximization of performance
               – specific approaches are expected to be more efficient than
                 universal ones
                   • considerable diversity in parallel hardware

                   • techniques/tools are much more dependent on the target platform than
                     in sequential programming
               – understanding the hardware makes it easier to write
                 high-performance programs
                   • back to the era of assembly programming?
            • On the other hand, standard/portable/universal methods
              increase the productivity in software development and

              Parallel programming (2)

• Parallel programs are more difficult to write and debug than
  sequential ones
   – parallel algorithms can be qualitatively different from the
     corresponding sequential ones
       • merely changing the form of the code may not be enough
   – several new classes of potential software bugs (e.g. race
     conditions)
   – difficult debugging
   – issues of scalability

                   General approaches

• Special programming language supporting concurrency
   – theoretically advantageous, in practice not very popular
   – ex.: Ada, Occam, Sisal, etc. (there are dozens of designs)
   – language extensions: CC++, Fortran M, etc.
• Universal programming language (C, Fortran,...) with
  parallelizing compiler
   – autodetection of parallelism in the sequential code
   – easier for shared memory, limited efficiency
       • a matter of the future? (despite 30 years of intense research)
   – ex.: Forge90 for Fortran (1992), some standard compilers
• Universal programming language plus a library of external
  parallelizing functions
   – mainstream nowadays
   – ex.: PVM (Parallel Virtual Machine), MPI (Message Passing
     Interface), Pthreads and others
         Parallel programming models

• A parallel programming model is a set of software
  technologies to express parallel algorithms and match
  applications with the underlying parallel systems
• Considered models:
    – data parallel [just introductory info in this course]
    – shared variables/memory [related to the OpenMP lecture
      in part II of the course]
   – message passing [continued in the next lecture (MPI)]

Data parallel model
                Hardware requirements
• Assumed underlying hardware: multicomputer or multiprocessor
    – originally associated with SIMD machines
      such as CM-200
       • multiple processing elements perform the same
         operation on multiple data simultaneously
       • array processors


                     Data parallel model
• Based on the concept of applying the same operation (e.g. “add 1 to every
  array element”) to the elements of a data ensemble in parallel
    – a set of tasks operates collectively on the same data structure (usually an
      array), each task on a different partition
• On multicomputers the data structure is split up and resides as “chunks” in the
  local memory of each task
• On multiprocessors, all tasks may have access to the data structure through
  global memory
• The tasks are loosely synchronized
    – at the beginning and end of the parallel operations
• SPMD execution model

Fortran 90 fragment (the array statement and its split between two tasks):

    real A(100)
    A = A+1

    Task 1                  Task 2
    do i = 1, 50            do i = 51, 100
      A(i) = A(i)+1           A(i) = A(i)+1
    enddo                   enddo
• Higher-level parallel programming
    – data distribution and communication done by compiler
        • transfer low-level details from programmer to compiler
            – compiler converts the program into standard code with calls to a message passing
              library (MPI usually); all message passing is done invisibly to the programmer
+ Ease of use
    – simple to write, debug and maintain
        • no explicit message passing
        • single-threaded control (no spawn, fork, etc.)
– Restricted flexibility and control
    – only suitable for certain applications
        • data in large arrays
        • similar independent operations on each element
        • naturally load-balanced
    – harder to get top performance
        • reliant on good compilers
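
The splitting of an array into per-task chunks, described above for the data parallel model, can be sketched in C. This is only an illustration under stated assumptions: the function names (`block_range`, `add_one_partition`, `add_one_data_parallel`) are hypothetical, a simple contiguous block distribution is assumed, and the tasks are simulated one after another by an ordinary loop instead of running in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: half-open index range [*lo, *hi) owned by `task`
   when n elements are split into ntasks contiguous blocks.
   The first (n % ntasks) tasks receive one extra element. */
void block_range(size_t n, size_t ntasks, size_t task, size_t *lo, size_t *hi)
{
    assert(ntasks > 0 && task < ntasks);
    size_t base  = n / ntasks;
    size_t extra = n % ntasks;
    *lo = task * base + (task < extra ? task : extra);
    *hi = *lo + base + (task < extra ? 1 : 0);
}

/* The work of one task: the same operation applied to its own partition. */
void add_one_partition(double *a, size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi; i++)
        a[i] += 1.0;
}

/* Simulates the tasks sequentially; in a real data parallel program
   each iteration of the outer loop would be a separate task. */
void add_one_data_parallel(double *a, size_t n, size_t ntasks)
{
    for (size_t t = 0; t < ntasks; t++) {
        size_t lo, hi;
        block_range(n, ntasks, t, &lo, &hi);
        add_one_partition(a, lo, hi);
    }
}
```

With 100 elements and 2 tasks, `block_range` yields the zero-based ranges [0, 50) and [50, 100), which correspond to the two loops of the Fortran fragment. In HPF this bookkeeping is done by the compiler; here it is written out by hand.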
              High Performance Fortran

• The best-known representative of data parallel programming languages
• HPF version 1.0 in 1993 (extends Fortran 90), version 2.0 in 1997
• Extensions to Fortran 90 to support data parallel model, including
   – directives to tell compiler how to distribute data
         • DISTRIBUTE, ALIGN directives
         • ignored as comments in serial Fortran compilers
    –   mathematical operations on array-valued arguments
    –   reduction operations on arrays
    –   FORALL construct
    –   assertions that can improve optimization of generated code
         • INDEPENDENT directive
   – additional intrinsics and library routines
• Available e.g. in the Portland Group PGI Workstation package
• Nowadays not frequently used
            HPF data mapping example

REAL A(12, 12)     ! declaration
REAL B(16, 16)     ! of the arrays
!HPF$ TEMPLATE T(16,16) ! and of a template
!HPF$ ALIGN B WITH T                 ! align B with T
!HPF$ ALIGN A(i, j) WITH T(i+2, j+2) ! align A with T, shifted by 2 in both dimensions
!HPF$ PROCESSORS P(2, 2) ! declare a 2*2 arrangement of processors
!HPF$ DISTRIBUTE T(BLOCK, BLOCK) ONTO P   ! distribute the template (and the aligned arrays)



                                                 [Mozdren 2010]
               Data parallel in MATLAB
• Parallel MATLAB (The MathWorks): Parallel Computing Toolbox
   – plus Distributed Computing Server for larger parallel environments
   – released in 2004; increasing popularity
• Some features consistent with the data parallel model
   – codistributed arrays: arrays partitioned into segments, each of which resides
     in the workspace of a different task
       • allow handling of larger data sets than in a single MATLAB session
       • support for more than 150 MATLAB functions (e.g. finding eigenvalues)
       • used in much the same way as regular arrays
   – parallel FOR loop: loop iterations without enforcing their particular ordering
       • distributes loop iterations over a set of tasks
       • iterations must be independent of each other

[Figure: codistributed array split into segments held by tasks L1, L2, L3, L4]

parfor i = (1:nsteps)
  x = i * step;
  s = s + (4 /(1 + x^2));
end
Shared variables model
                 Hardware requirements
• Assumed underlying hardware: multiprocessor
    – collection of processors that share
      common memory
    – interconnection fabric supporting
      single address space
• Not applicable to multicomputers
    – but: Intel Cluster OpenMP
• Easier to apply than message passing
    – allows incremental parallelization
• Based on the notion of threads

[Figure: multiprocessor with an interconnection fabric (bus, crossbar); after [Wilkinson2004]]

  Thread vs. process (1)

[Figure: a process with its code and interrupt routines, containing threads, each with its own stack]

                   Thread vs. process (2)
• A thread (“lightweight” process) differs from a (“heavyweight”) process:
    – all threads in a process share the same memory space
    – each thread has a thread-private area for its local variables
        • e.g. stack
    – threads can work on shared data structures
    – threads can communicate with each other via the shared data
• Threads originally not targeted at technical or HPC computing
    – low level, task (rather than data) parallelism
• Details of the thread/process relationship are very OS dependent

[Figure: a process with its heap and interrupt routines, containing threads, each with its own stack]

                Thread communication
• Parallel application generates, when appropriate, a set of cooperating
  threads
    – usually one per processor
    – distinguished by enumeration
• Shared memory provides means to exchange data among threads
    – shared data can be accessed by all threads
    – no message passing

Example (my_a is private to each thread, sh_a is shared):

            Thread 1            Thread 2
Program     my_a = 23
            sh_a = my_a         my_a = sh_a+1
Private     my_a: 23            my_a: 24
Shared                sh_a: 23
                 Thread synchronization
• Threads execute their programs asynchronously
• Writes and reads are always nonblocking
• Accessing shared data needs careful control
    – need some mechanisms to ensure that the actions occur in the correct order
        • e.g. write of A in thread 1 must occur before its read in thread 2
• Most common synchronization constructs:
   – master section: a section of code executed by one thread only
        • e.g. initialisation, writing a file
    – barrier: all threads must arrive at a barrier before any thread can
      proceed past it
        • e.g. delimiting phases of computation (e.g. a timestep)
    – critical section: only one thread at a time can enter a section of
      code
        • e.g. modification of shared variables
• Makes shared-variables programming error-prone
                  Accessing shared data
• Consider two threads each of which is to add 1 to a shared data item X,
  e.g. X = 10.
     1. read X
     2. compute X+1
     3. write X back
• If step 1 is performed at the same time
  by both threads, the result will be 11
  (instead of the expected 12)
• Race condition: two or more threads
  (processes) are reading or writing shared
  data, and the result depends on who runs
  precisely when
• X=X+1 must be an atomic operation
• Can be ensured by mechanisms of mutual exclusion
     – e.g. critical section, mutex, lock, semaphore, monitor

[Figure: Thread 1 and Thread 2 accessing the shared data item X] [Wilkinson2004]
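
The X = X+1 example can be made safe with a mutex. The following is a minimal sketch using POSIX threads, not the lecture's own code: the names `shared_x`, `x_lock`, `increment_worker` and `run_increments`, as well as the limit `MAX_THREADS`, are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16               /* arbitrary limit for this sketch */

static long shared_x;                /* the shared data item X */
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread performs *(long *)arg increments of X. Holding the lock makes
   the read / compute / write-back sequence indivisible for other threads. */
static void *increment_worker(void *arg)
{
    long n = *(long *)arg;
    for (long k = 0; k < n; k++) {
        pthread_mutex_lock(&x_lock);
        shared_x = shared_x + 1;     /* critical section */
        pthread_mutex_unlock(&x_lock);
    }
    return NULL;
}

/* Hypothetical driver: sets X = initial, lets nthreads threads increment it
   per_thread times each, and returns the final X (-1 if a thread could not
   be created). */
long run_increments(long initial, int nthreads, long per_thread)
{
    pthread_t tid[MAX_THREADS];
    int started = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    shared_x = initial;
    for (int t = 0; t < nthreads; t++) {
        if (pthread_create(&tid[t], NULL, increment_worker, &per_thread) != 0)
            break;
        started++;
    }
    for (int t = 0; t < started; t++)   /* join before per_thread goes away */
        pthread_join(tid[t], NULL);
    return started == nthreads ? shared_x : -1;
}
```

Starting from X = 10 with two threads that each add 1, `run_increments(10, 2, 1)` returns 12. Without the lock, the interleaving described above could produce 11.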

                  Fork/Join parallelism
• Initially only the master thread is active
    – executes sequential code
• Basic operations:
    – fork: master thread creates / awakens additional threads to execute
      in a parallel region
    – join: at the end of the parallel region the created threads die / are suspended
• Dynamic thread creation
    – the number of active threads changes during execution
    – fork is not an expensive operation
• Sequential program is a special / trivial case of a shared-memory parallel
  program

[Figure: master thread alternately forking into and joining with other threads] [Quinn 2004]
                 Computer realization
• Compiler’s support:
    – automatic parallelization
    – assisted parallelization
    – OpenMP
• Thread libraries:
    – POSIX threads, Windows threads

[next slides]

               Automatic parallelization
• The code instrumented automatically by the compiler
    – according to the compilation flags and/or environment variables
• Parallelizes independent loops only
    – processed by the prescribed number of parallel threads
• Usually provided by Fortran compilers for multiprocessors
    – as a rule proprietary solutions
• Simple and sometimes fairly efficient
• Applicable to programs with a simple structure
• Ex.:
    – XL Fortran (IBM, AIX): -qsmp=auto option, XLSMPOPTS
      environment variable (the number of threads)
    – Fortran (SUN, Solaris): -autopar flag, PARALLEL environment
      variable
    – PGI C (Portland Group, Linux): -Mconcur flag

                  Assisted parallelization
• The programmer provides the compiler with additional information by
  adding compiler directives
    – special lines of source code with meaning only to a compiler that
      understands them
         • in the form of stylized Fortran comments or #pragma in C
         • ignored by nonparallelizing compilers
• Assertive and prescriptive directives [next slides]
• Diverse formats of the parallelizing directives, but similar capabilities
     ⇒ a standard is required

                    Assertive directives
• Hints that state facts that the compiler might not guess from the code
• Evaluation context dependent
• Ex.: XL Fortran (IBM, AIX)
    – no dependencies (the references in the loop do not overlap, parallelization
      possible): !SMP$ ASSERT (NODEPS)
    – trip count (average number of iterations of the loop; helps to decide whether
      to unroll or parallelize the loop): !SMP$ ASSERT (ITERCNT(100))

                  Prescriptive directives
• Instructions for the parallelizing compiler, which it must obey
    – clauses may specify additional information
• A means for manual parallelization
• Ex.: XL Fortran (IBM, AIX)
    – parallel region: defines a block of code that can be executed by a team
      of threads concurrently
          !SMP$ PARALLEL <clauses>
            <block>
          !SMP$ END PARALLEL
    – parallel loop: specifies which loops the compiler should parallelize
          !SMP$ PARALLEL DO <clauses>
            <do loop>
          !SMP$ END PARALLEL DO

• Besides directives, additional constructs within the base language to
  express parallelism can be introduced
    – e.g. the forall statement in Fortran 95
                           OpenMP
• API for writing portable multithreaded applications based on the shared
  variables model
    – master thread spawns a team of threads as needed
    – relatively high level (compared to thread libraries)
• A standard developed by the OpenMP Architecture Review Board
    – first specification in 1997
• A set of compiler directives and library routines
• Language interfaces for Fortran, C and C++
    – OpenMP-like interfaces for other languages (e.g. Java)
• Parallelism can be added incrementally
    – i.e. the sequential program evolves into a parallel program
    – single source code for both the sequential and parallel versions
• OpenMP compilers available on most platforms (Unix, Windows, etc.)

[More in a special lecture]
                       Thread libraries
• Collection of routines to create, manage, and coordinate threads
• Main representatives:
   – POSIX threads (Pthreads),
   – Windows threads (Windows (Win32) API)
• Explicit threading not primarily intended for parallel programming
    – low level, quite complex coding

               Example: PI calculation

Calculation of π by the numerical integration formula

    \pi = \int_0^1 \frac{4}{1+x^2}\,dx,   F(x) = \frac{4}{1+x^2}

[Figure: plot of F(x) over the interval 0.0 ≤ x ≤ 1.0]

Numerical integration based on the rectangle method:

set n (number of strips)

for each strip
   calculate the height y of the strip (rectangle) at its midpoint
   sum all y to the result S
multiply S by the width of the strips
print result
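
Before the threaded versions, the pseudocode above can be written as a plain sequential C function. This is a minimal sketch; the name `pi_midpoint` is hypothetical, and the strips are numbered from 0 here, whereas the threaded listings count from 1.

```c
#include <assert.h>

/* Midpoint (rectangle) rule for the integral of 4/(1+x^2) over [0,1]. */
double pi_midpoint(long n)
{
    assert(n > 0);
    double width = 1.0 / (double)n;    /* width of one strip */
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * width;  /* midpoint of strip i */
        s += 4.0 / (1.0 + x * x);      /* height y of the strip */
    }
    return s * width;                  /* sum of heights times strip width */
}
```

The parallel versions distribute the iterations of this loop among threads and then combine the partial sums.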

             PI in Windows threads (1)
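/* Pi, Win32 API */
#include <windows.h>
#include <stdio.h>
#define NUM_THREADS 2
HANDLE thread_handles[NUM_THREADS];
CRITICAL_SECTION hUpdateMutex;      /* protects global_sum */
static long num_steps = 100000;
double step, global_sum = 0.0;

DWORD WINAPI Pi (LPVOID arg) {
  long i;
  int start;
  double x, sum = 0.0;              /* thread-private partial sum */
  start = *(int *)arg;
  /* cyclic distribution: strips start, start+NUM_THREADS, ... */
  for (i = start; i <= num_steps; i = i + NUM_THREADS){
    x = (i - 0.5) * step;           /* midpoint of strip i */
    sum = sum + 4.0 / (1.0 + x * x);
  }
  EnterCriticalSection(&hUpdateMutex);
  global_sum += sum;                /* one thread at a time */
  LeaveCriticalSection(&hUpdateMutex);
  return 0;
}
             PI in Windows threads (2)
int main (void) {
  double pi; int i;
  DWORD threadID;
  int threadArg[NUM_THREADS];
  step = 1.0/(double)num_steps;     /* set once, before the threads start */
  InitializeCriticalSection(&hUpdateMutex);
  for (i = 0; i < NUM_THREADS; i++) threadArg[i] = i + 1;
  for (i = 0; i < NUM_THREADS; i++) {
    thread_handles[i] = CreateThread(0,0,Pi,&threadArg[i],0,&threadID);
  }
  /* join: wait until all threads have finished */
  WaitForMultipleObjects(NUM_THREADS,thread_handles,TRUE,INFINITE);
  DeleteCriticalSection(&hUpdateMutex);
  pi = global_sum * step;
  printf(" pi is %f \n",pi);
  return 0;
}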

               PI in POSIX threads (1)
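/* Pi , pthreads library */
#define _REENTRANT
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 2
pthread_t thread_handles[NUM_THREADS];
pthread_mutex_t hUpdateMutex;       /* protects global_sum */
static long num_steps = 100000;
double step, global_sum = 0.0;
void* Pi (void *arg) {
  long i;
  int start;
  double x, sum = 0.0;              /* thread-private partial sum */
  start = *(int *)arg;
  /* cyclic distribution: strips start, start+NUM_THREADS, ... */
  for (i = start; i <= num_steps; i = i + NUM_THREADS){
    x = (i - 0.5) * step;           /* midpoint of strip i */
    sum = sum + 4.0 / (1.0 + x * x);
  }
  pthread_mutex_lock(&hUpdateMutex);
  global_sum += sum;                /* one thread at a time */
  pthread_mutex_unlock(&hUpdateMutex);
  return NULL;
}
               PI in POSIX threads (2)
int main (void) {
  double pi; int i;
  int retval;
  int threadArg[NUM_THREADS];
  step = 1.0 / (double)num_steps;   /* set once, before the threads start */
  pthread_mutex_init(&hUpdateMutex, NULL);
  for (i = 0; i < NUM_THREADS; i++) threadArg[i] = i + 1;
  for (i = 0; i < NUM_THREADS; i++) {
    retval = pthread_create(&thread_handles[i],NULL,Pi,&threadArg[i]);
    if (retval != 0) {
      fprintf(stderr, "pthread_create failed\n");
      exit(EXIT_FAILURE);
    }
  }
  /* join: wait until all threads have finished */
  for (i=0; i<NUM_THREADS; i++) {
    retval = pthread_join(thread_handles[i],NULL);
  }
  pthread_mutex_destroy(&hUpdateMutex);
  pi = global_sum * step;
  printf(" pi is %.10f \n",pi);
  return 0;
}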

                PI in OpenMP (1)
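/* Pi, OpenMP, using parallel for and reduction */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NUM_THREADS 2
static long num_steps = 1000000;
double step;

int main (void) {
  long i;
  double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);   /* request a team of NUM_THREADS threads */

                     PI in OpenMP (2)
  /* each thread sums its share of the strips into a private copy of sum;
     the reduction clause adds the copies together at the end of the loop */
#pragma omp parallel for reduction(+:sum) private(x)
  for (i = 1; i <= num_steps; i++){
    x = (i - 0.5) * step;             /* midpoint of strip i */
    sum += 4.0 / (1.0 + x*x);
  }
  pi = sum * step;
  printf("Pi is %.10f \n",pi);
  return 0;
}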

NB: Programs such as PI calculation are likely to be successfully
parallelized through automatic parallelization as well

Message passing model
                Hardware requirements
• Assumed underlying hardware: multicomputer
    – collection of processors,
      each with its own local memory
    – interconnection network supporting
      message transfer between every
      pair of processors
• Supported by all (parallel)
  architectures – the most general
    – naturally fits multicomputers
    – easily implemented on multiprocessors
• Complete control: data distribution and communication

[Figure: multicomputer] [Quinn2004]

• May not be easy to apply – sequential-to-parallel transformation
  requires major effort
    – one giant step rather than many tiny steps
    – message passing = “assembler” of parallel computing
                       Message passing
• Parallel application generates (next slide) a set of cooperating processes
    – process = instance of a running program
    – usually one per processor
    – distinguished by a unique ID number
        • rank (MPI), tid (PVM), etc.
• To solve a problem, processes alternately perform computations and
  exchange messages
    – basic operations: send, receive
    – no shared memory space necessary
• Messages transport the contents of variables of one process to variables
  of another process
• Message passing also has a synchronization function

Example (process 1 sends the contents of x to process 2, which receives it in y):

    Process 1                          Process 2
    x                                  y
    send(&x, 2)  -- data transfer -->  recv(&y, 1)
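
The send/receive exchange between two processes without shared memory can be tried out with standard POSIX calls. The sketch below is not MPI or PVM: two pipes stand in for the message passing system, and the names `send_int`, `recv_int` and `exchange` are hypothetical. Both helper operations are blocking.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical blocking send/receive of one int over a pipe end.
   Return 0 on success, -1 on failure. */
static int send_int(int fd, const int *value)
{
    return write(fd, value, sizeof *value) == (ssize_t)sizeof *value ? 0 : -1;
}

static int recv_int(int fd, int *value)
{
    return read(fd, value, sizeof *value) == (ssize_t)sizeof *value ? 0 : -1;
}

/* Process 1 sends x to process 2; process 2 receives it into its own
   variable y, computes y + 1 and sends the result back.
   Returns the reply, or -1 if something went wrong. */
int exchange(int x)
{
    int to_p2[2], to_p1[2];               /* one pipe per direction */
    if (pipe(to_p2) != 0) return -1;
    if (pipe(to_p1) != 0) { close(to_p2[0]); close(to_p2[1]); return -1; }

    pid_t pid = fork();
    if (pid < 0) {
        close(to_p2[0]); close(to_p2[1]);
        close(to_p1[0]); close(to_p1[1]);
        return -1;
    }

    if (pid == 0) {                       /* process 2 */
        int y;
        close(to_p2[1]); close(to_p1[0]);
        if (recv_int(to_p2[0], &y) != 0) _exit(EXIT_FAILURE);
        y = y + 1;                        /* local computation */
        if (send_int(to_p1[1], &y) != 0) _exit(EXIT_FAILURE);
        _exit(EXIT_SUCCESS);
    }

    /* process 1 */
    int reply = -1;
    close(to_p2[0]); close(to_p1[1]);
    int ok = send_int(to_p2[1], &x) == 0 && recv_int(to_p1[0], &reply) == 0;
    close(to_p2[1]); close(to_p1[0]);
    waitpid(pid, NULL, 0);                /* reap process 2 */
    return ok ? reply : -1;
}
```

Process 1 cannot read the reply before process 2 has written it, which illustrates the synchronization function of message passing. Note that the error value -1 cannot be told apart from a legitimate reply of -1 in this simplified sketch.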
                        Process creation
• Static process creation
    – fixed number of processes in time
    – specified before the execution (e.g. on the command line)
    – usually the processes follow the same code, but their control paths
      through the code can differ – depending on the ID
        • SPMD (Single Program Multiple Data) model
        • one master process (ID 0) – several slave processes
• Dynamic process creation
    – varying number of processes in time
        • just one process at the beginning
    – processes can create (destroy) other
      processes: the spawn operation
        • rather expensive!
    – the processes often differ in code
        • MPMD (Multiple Program Multiple Data) model

[Figure: process 1 calls spawn(), which starts process 2]
           Point-to-point communication
• Exactly two processes are involved
• One process (sender / source) sends a message and another process
  (receiver / destination) receives it
    – active participation of processes on both sides usually required
         • two-sided communication
• In general, the source and destination processes operate asynchronously
    – the source may complete sending a message long before the destination
      gets around to receiving it
    – the destination may initiate receiving a message that has not yet been sent
• The order of messages is guaranteed (messages do not overtake each other)
• Examples of technical issues
    –   handling more messages waiting to be received
    –   sending complex data structures
    –   using message buffers
    –   send and receive routines – blocking vs. nonblocking
     (Non-)blocking & (a-)synchronous
• Blocking operation: only returns (from the subroutine call) when the
  operation has completed
    – ex.: sending fax on a standard machine
• Nonblocking operation: returns immediately, the operation need not be
  completed yet, other work may be performed in the meantime
    – the completion of the operation can/must be tested
    – ex.: sending fax on a machine with memory

• Synchronous send: does not complete until the message has been
  received
    – provides (synchronizing) info about the message delivery
    – ex.: sending fax (on a standard machine)
• Asynchronous send: completes as soon as the message is on its way
    – sender only knows when the message has left
    – ex.: sending a letter

               Collective communication
• Transfer of data in a set of processes
• Provided by most message passing systems
• Basic operations [next slides]:
    –   barrier: synchronization of processes
    –   broadcast: one-to-many communication of the same data
    –   scatter: one-to-many communication of different portions of data
    –   gather: many-to-one communication of the (different, but related) data
    –   reduction: gather plus combination of data with an arithmetic/logical
        operation
• Root – in some collective operations, the single prominent
  source / destination
    – e.g. in broadcast
• Collective operations can be built from a set of point-to-point
  operations, but these “blackbox” routines
   – hide a lot of the messy details
   – are usually more efficient
         • can take advantage of special communication hardware
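
How a collective operation can be composed of point-to-point steps is sketched below for a sum reduction to root 0. This is a sequential simulation under stated assumptions, not code from any message passing library: the array `value` stands for the local data of the processes, each `+=` stands for one send/receive pair, and the names `reduce_sum_to_root` and `MAX_PROCS` are hypothetical. A binary-tree communication pattern is assumed; real implementations may choose a different one.

```c
#include <assert.h>

#define MAX_PROCS 64                 /* arbitrary limit for this sketch */

/* Simulated reduction (sum) to root 0. local[p] is the value held by
   process p. In the round with distance d = 1, 2, 4, ... every process p
   that is a multiple of 2d "receives" from process p + d and adds.
   If messages is non-null, the number of simulated messages is stored. */
int reduce_sum_to_root(const int *local, int nprocs, int *messages)
{
    int value[MAX_PROCS];
    int sent = 0;

    assert(nprocs >= 1 && nprocs <= MAX_PROCS);
    for (int p = 0; p < nprocs; p++)
        value[p] = local[p];

    for (int d = 1; d < nprocs; d *= 2) {
        for (int p = 0; p + d < nprocs; p += 2 * d) {
            value[p] += value[p + d];   /* p+d sends, p receives and adds */
            sent++;
        }
    }
    if (messages)
        *messages = sent;
    return value[0];                    /* result ends up at the root */
}
```

In this pattern every process except the root sends exactly one message, so n processes need n - 1 messages, but the messages of one round are independent of each other and could travel simultaneously. That gives about log2(n) communication rounds instead of the n - 1 sequential receives of a naive gather at the root.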
                            Barrier
• A basic mechanism for synchronizing processes
• Inserted at the point in each process
  where it must wait for the others
• All processes can continue from this
  point when all the processes have
  reached it
    – or when a stated number of processes
      have reached this point
• Often involved in other operations

                           Broadcast
• Distributes the same piece of data from a single source (root) to all
  processes (concerned with the problem)
    – multicast – sending the message to a defined group of processes

[Figure: the root's data item B is copied to every process: B B B B]

                            Scatter
• Distributes each element of an array in the root to a separate process
    – including the root
    – contents of the ith array element sent to the ith process

[Figure: the root's array A B C D is split, one element per process]

                            Gather
• Collects data from each process at the root
    – value from the ith process is stored in the ith array element (rank order)

[Figure: the items A, B, C, D of the individual processes are collected into the array A B C D at the root]

                           Reduction
• Gather operation combined with a specified arithmetic/logical operation
    1. collect data from each processor
    2. reduce these data to a single
       value (such as a sum or max)
    3. store the reduced result on the
       root processor

[Figure: Reduce, combining the data of the processes into a single result at the root]

           Message passing system (1)
• Computer realization of the message passing model
• Most popular message passing systems (MPS):
   – Message Passing Interface (MPI) [next lecture]
   – Parallel Virtual Machine (PVM)
   – in distributed computing: CORBA, Java RMI, DCOM, etc.

              Message passing system (2)
• Information needed by MPS to transfer a message includes:
    – sending process and location, type and amount of transferred data
         • no interest in data itself (message body)
    – receiving process(-es) and storage to receive the data
• Most of this information is attached as message envelope
    – may be (partially) available to the receiving process
• MPS may provide various information to the processes
    – e.g. about the progress of communication
• A lot of other technical aspects, e. g.:
    –   process enrolment in MPS
    –   addressing scheme
    –   content of the envelope
    –   using message buffers (system, user space)

                  WWW (what, when, why)
Message passing (MPI)
  + easier to debug
  + easiest to optimize
  + can overlap communication and computation
  + potential for high scalability
  + support on all parallel architectures
  – harder to program
  – load balancing, deadlock prevention, etc. need to be addressed
  ⇒ most freedom and ...

Shared variables (OMP)
  + easier to program than MP, code is simpler
  + implementation can be incremental
  + no message start-up costs
  + can cope with irregular communication patterns
  – limited to shared-memory systems
  – harder to debug and optimize
  – scalability limited
  – usually less efficient than MP equivalents

Data parallel (HPF)
  + easier to program than MP
  + simpler to debug than SV
  + does not require shared memory
  – DP style suitable only for certain applications
  – restricted control over data and work distribution
  – difficult to obtain top performance
  – few APIs available
  – out of date?


• The definition of parallel programming models is not
  uniform in literature; other models can be e.g.
   – thread programming model
   – hybrid models, e. g. the combination of the message
      passing and shared variables model
       • explicit message passing between the nodes of a cluster as well as
         shared-memory and multithreading within the nodes
• Models continue to evolve along with the changing world
  of computer hardware and software
   – CUDA parallel programming model for CUDA GPU architecture

                       Further study

• The message passing model and the shared variables model are
  treated in some form in all general textbooks on parallel
  programming
   • exception: [Foster 1995] almost skips data sharing
• There are plenty of books dedicated to shared objects,
  synchronisation and shared memory, e.g. [Andrews 2000]
  Foundations of Multithreaded, Parallel,
  and Distributed Programming
   • not necessarily focusing on parallel processing
• Data parallelism is usually a marginal topic

