Crash Course in Parallel Programming Using MPI
Adam Jacobs, HCS Research Lab, 01/10/07

Outline – PCA Preparation

- Parallel Computing
- Distributed Memory Architectures
- Programming Models
- Flynn's Taxonomy
- Parallel Decomposition
- Speedups

Parallel Computing

- Motivated by the high computational complexity and memory requirements of large applications
- Two approaches:
  - Shared memory
  - Distributed memory
- The majority of modern systems are clusters (distributed memory architecture)
  - Many simple machines connected with a powerful interconnect
  - Examples: ASCI Red, ASCI White, …, IBM Blue Gene
- A hybrid approach can also be used

Shared Memory Systems

- Memory resources are shared among processors
- Relatively easy to program for, since there is a single unified memory space
- Scales poorly with system size due to the need for cache coherency
- Example: Symmetric Multiprocessors (SMP)
  - Each processor has equal access to RAM
  - 4-way motherboards are MUCH more expensive than 2-way

Distributed Memory Systems

- Individual nodes consist of a CPU, RAM, and a network interface
  - A hard disk is not necessary; mass storage can be supplied using NFS
- Information is passed between nodes using the network
- No need for special cache-coherency hardware
- More difficult to write programs for distributed memory systems, since the programmer must keep track of memory usage

Programming Models

- Multiprogramming
  - Multiple programs running simultaneously
- Shared Address
  - Global address space available to all processors
  - Shared data is written to this global space
- Message Passing
  - Data is sent directly to processors using "messages"
- Data Parallel

Flynn's Taxonomy

- SISD – Single Instruction, Single Data
  - Normal instructions
- SIMD – Single Instruction, Multiple Data
  - Vector operations, MMX, SSE, AltiVec
- MISD – Multiple Instructions, Single Data
- MIMD – Multiple Instructions, Multiple Data
- SPMD – Single Program, Multiple Data

Parallel Decomposition

- Data Parallelism
  - Parallelism within a dataset, such that a portion of the data can be computed independently from the rest
  - Usually results in coarse-grained parallelism (compute farms)
  - Allows for automatic load-balancing strategies
- Functional Parallelism
  - Parallelism between distinct functional blocks, such that each block can be performed independently
  - Especially useful for pipeline structures

Speedup
Speedup(p) = Performance(p) / Performance(1)

Efficiency(p) = Speedup(p) / p
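As a quick worked example (the numbers here are illustrative, not from the slides): a job that takes 100 s on one processor and 30 s on four processors gives

Speedup(4)    = 100 s / 30 s ≈ 3.33
Efficiency(4) = 3.33 / 4     ≈ 0.83  (83%)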

Super-linear Speedup

- Linear speedup is the best that can be achieved
  - Or is it?
- Super-linear speedup occurs when parallelizing an algorithm results in a more efficient use of hardware resources
  - A 1 MB task doesn't fit on a single processor (e.g., in its cache)
  - Two 512 KB tasks do fit, resulting in lower effective memory access times

MPI: Message Passing Interface

Adam Jacobs, HCS Research Lab, 01/10/07

Slides created by Raj Subramaniyan

Outline – MPI Usage

- Introduction
- MPI Standard
- MPI Implementations
- MPICH: Introduction
- MPI Calls
- Present Emphasis

Parallel Computing

- Motivated by the high computational complexity and memory requirements of large applications
- Cooperation with other processes
  - Cooperative and one-sided operations
- Processes interact with each other by exchanging information
- Models:
  - SIMD
  - SPMD
  - MIMD

Cooperative Operations

- Cooperative: all parties agree to transfer data
- Message passing is an approach that makes the exchange of data cooperative
  - Data must be both explicitly sent and received
  - Any change in the receiver's memory is made with the receiver's participation

MPI: Message Passing Interface*

- MPI: a message-passing library specification
- A message-passing model, not a specific product
- Designed for parallel computers, clusters, and heterogeneous networks
- Standardization began in 1992, and the final draft was made available in 1994
  - Broad participation of vendors, library writers, application specialists, and scientists

* Message Passing Interface Forum, accessible at http://www.mpi-forum.org/

Features of MPI

- Point-to-point communication
- Collective operations
- Process groups
- Communication contexts
- Process topologies
- Bindings for Fortran 77 and C
- Environmental management and inquiry
- Profiling interface

Features NOT included in MPI

- Explicit shared-memory operations
- Operations that require more operating system support than the standard; for example, interrupt-driven receives, remote execution, or active messages
- Program construction tools
- Explicit support for threads
- Support for task management
- I/O functions

MPI Implementations*

- Listed below are MPI implementations available for free:
  - Appleseed (UCLA)
  - CRI/EPCC (Edinburgh Parallel Computing Centre)
  - LAM/MPI (Indiana University)
  - MPI for UNICOS Systems (SGI)
  - MPI-FM (University of Illinois); for Myrinet
  - MPICH (ANL)
  - MVAPICH (InfiniBand)
  - SGI Message Passing Toolkit
  - OpenMPI

* A detailed list of MPI implementations with features can be found at http://www.lam-mpi.org/mpi/implementations/

MPICH*

- MPICH: a portable implementation of MPI developed at Argonne National Laboratory (ANL) and Mississippi State University (MSU)
- Very widely used
- Supports all the specs of the MPI-1 standard
- Features that are part of the MPI-2 standard are under development (ANL alone)

* http://www-unix.mcs.anl.gov/mpi/mpich/

Writing MPI Programs
#include "mpi.h" // Gives basic MPI types, definitions #include <stdio.h> int main( argc, argv ) int argc; char **argv; { MPI_Init( &argc, &argv ); // Starts MPI :: Actual code including normal „C‟ calls and MPI calls :: MPI_Finalize(); // Ends MPI return 0; }
Part of all programs

Initialize and Finalize

- MPI_Init
  - Initializes all necessary MPI variables
  - Forms the MPI_COMM_WORLD communicator
    - A communicator is a list of all the connections between nodes
  - Opens necessary TCP connections
- MPI_Finalize
  - Waits for all processes to reach the function
  - Closes TCP connections
  - Cleans up

Rank and Size

- Environment details:
  - How many processes are there? (MPI_Comm_size)
  - Who am I? (MPI_Comm_rank)
- MPI_Comm_size( MPI_COMM_WORLD, &size )
- MPI_Comm_rank( MPI_COMM_WORLD, &rank )
- The rank is a number between 0 and size-1

Sample Hello World Program
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int my_rank, p;          // process rank and number of processes
    int source, dest;        // ranks of the sending and receiving process
    int tag = 0;             // tag for messages
    char message[100];       // storage for message
    MPI_Status status;       // stores status for MPI_Recv statements

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (my_rank != 0) {
        sprintf(message, "Greetings from %d!", my_rank);   // stores into character array
        dest = 0;                                           // destination for MPI_Send is process 0
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);  // sends string to process 0
    } else {
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status); // recv from each process
            printf("%s\n", message);                        // prints out greeting to screen
        }
    }

    MPI_Finalize();   // shuts down MPI
    return 0;
}

Compiling MPI Programs

- Two methods:
  - Compilation commands
  - Using a Makefile
- Compilation commands:
  - mpicc -o hello_world hello-world.c
  - mpif77 -o hello_world hello-world.f
  - Likewise, mpiCC and mpif90 are available for C++ and Fortran 90, respectively
- Makefile.in is a template Makefile
  - mpireconfig translates Makefile.in to a Makefile for a particular system

Running MPI Programs

- To run hello_world on two machines:
  - mpirun -np 2 hello_world
  - Must specify the full path of the executable
- To see the commands executed by mpirun:
  - mpirun -t
- To get all the mpirun options:
  - mpirun -help

MPI Communications

- Typical blocking send:
  - send (dest, type, address, length)
    - dest: integer representing the process to receive the message
    - type: data type being sent (often overloaded)
    - (address, length): contiguous area in memory being sent
  - MPI_Send/MPI_Recv provide point-to-point communication
- Typical global operation:
  - broadcast (type, address, length)
- Six basic MPI calls (init, finalize, comm, rank, send, recv)

MPI Basic Send/Recv
int MPI_Send( void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm )

- buf: initial address of send buffer
- count: number of elements in send buffer (nonnegative integer)
- datatype: datatype of each send buffer element (handle)
- dest: rank of destination (integer)
- tag: message tag (integer)
- comm: communicator (handle)

int MPI_Recv( void *buf, int count, MPI_Datatype datatype,
              int source, int tag, MPI_Comm comm, MPI_Status *status )

- source: rank of source (integer)
- status: status object (Status)
- status is mainly useful when messages are received with MPI_ANY_TAG and/or MPI_ANY_SOURCE
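A brief sketch of how status is used with the wildcards (a fragment; the communicator setup and buffer contents are assumed):

char buf[100];
MPI_Status status;
int count;

MPI_Recv(buf, 100, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
printf("message from rank %d with tag %d\n", status.MPI_SOURCE, status.MPI_TAG);
MPI_Get_count(&status, MPI_CHAR, &count);   // actual number of MPI_CHAR elements received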

Information about a Message

- The count argument in recv indicates the maximum length of a message
- The actual length of the message can be obtained using MPI_Get_count

  MPI_Status status;
  MPI_Recv( ..., &status );
  ... status.MPI_TAG;
  ... status.MPI_SOURCE;
  MPI_Get_count( &status, datatype, &count );

Example: Matrix Multiplication Program
/* MASTER SIDE: send matrix data to the worker tasks */

averow = NRA/numworkers;
extra  = NRA%numworkers;
offset = 0;
mtype  = FROM_MASTER;

for (dest=1; dest<=numworkers; dest++)
{
    rows = (dest <= extra) ? averow+1 : averow;        // if # rows is not evenly divisible by # workers,
    printf("sending %d rows to task %d\n", rows, dest); // some workers get an additional row
    MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);   // starting row being sent
    MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);     // # rows sent
    count = rows*NCA;                                             // total # elements of A being sent
    MPI_Send(&a[offset][0], count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
    count = NCA*NCB;                                              // equivalent to NRB*NCB; # elements in B
    MPI_Send(&b, count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
    offset = offset + rows;                                       // increment offset for the next worker
}

Example: Matrix Multiplication Program (contd.)
/* MASTER SIDE: wait for results from all worker tasks */

mtype = FROM_WORKER;
for (i=1; i<=numworkers; i++)       // get results from each worker
{
    source = i;
    MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
    count = rows*NCB;               // # elements in the result from the worker
    MPI_Recv(&c[offset][0], count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
}

/* print results */

}  /* end of master section */

Example: Matrix Multiplication Program (contd.)
/* WORKER SIDE: receive work from the master */

if (taskid > MASTER)                // implies a worker node
{
    mtype  = FROM_MASTER;
    source = MASTER;
    printf ("Master =%d, mtype=%d\n", source, mtype);
    // Receive the offset and number of rows
    MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
    printf ("offset =%d\n", offset);
    MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
    printf ("rows =%d\n", rows);
    count = rows*NCA;               // # elements to receive for matrix A
    MPI_Recv(&a, count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
    printf ("a[0][0] =%e\n", a[0][0]);
    count = NCA*NCB;                // # elements to receive for matrix B
    MPI_Recv(&b, count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);

Example: Matrix Multiplication Program (contd.)
/* WORKER SIDE: compute the assigned rows and send the result back */

    for (k=0; k<NCB; k++)
        for (i=0; i<rows; i++)
        {
            c[i][k] = 0.0;          // do the matrix multiplication for the # rows you are assigned
            for (j=0; j<NCA; j++)
                c[i][k] = c[i][k] + a[i][j] * b[j][k];
        }
    mtype = FROM_WORKER;
    printf ("after computing \n");
    MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
    MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
    MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);   // sending the actual result
    printf ("after send \n");
}  /* end of worker */

Asynchronous Send/Receive

- MPI_Isend() and MPI_Irecv() are non-blocking; control returns to the program after the call is made

  int MPI_Isend( void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm, MPI_Request *request )

  int MPI_Irecv( void *buf, int count, MPI_Datatype datatype,
                 int source, int tag, MPI_Comm comm, MPI_Request *request )

- request: communication request (handle); output parameter
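A minimal sketch of overlapping communication with computation using these calls (the neighbor ranks left and right and the routine do_local_work() are illustrative assumptions; the wait call is covered on the following slides):

enum { N = 1024 };
double send_buf[N], recv_buf[N];         // assume these are filled by the application
MPI_Request reqs[2];
MPI_Status  stats[2];

MPI_Irecv(recv_buf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(send_buf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

do_local_work();                         // hypothetical computation that does not touch the buffers

MPI_Waitall(2, reqs, stats);             // both transfers complete; buffers are safe to reuse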

Detecting Completions

- Non-blocking operations return (immediately) "request handles" that can be waited on and queried
- MPI_Wait waits for an MPI send or receive to complete

  int MPI_Wait ( MPI_Request *request, MPI_Status *status )

  - request matches the request on Isend or Irecv
  - status returns the status equivalent to the status for MPI_Recv when complete
  - for a send, blocks until the message is buffered or sent, so the message variable is free
  - for a receive, blocks until the message is received and ready

Detecting Completions (contd.)

- MPI_Test tests for the completion of a send or receive

  int MPI_Test ( MPI_Request *request, int *flag, MPI_Status *status )

  - request, status as for MPI_Wait
  - does not block
  - flag indicates whether the operation is complete or not
  - enables code that can repeatedly check for communication completion
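A sketch of a polling loop using MPI_Test (assumes a previously posted MPI_Irecv whose handle is req; do_other_work() is a placeholder for useful computation):

int flag = 0;
MPI_Status status;

while (!flag) {
    MPI_Test(&req, &flag, &status);   // returns immediately; flag != 0 once the receive completes
    if (!flag)
        do_other_work();              // hypothetical work performed while waiting
}
// Here the receive is complete; status.MPI_SOURCE / status.MPI_TAG describe the message.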

Multiple Completions

- Often desirable to wait on multiple requests; e.g., a master/slave program

  int MPI_Waitall( int count, MPI_Request array_of_requests[],
                   MPI_Status array_of_statuses[] )

  int MPI_Waitany( int count, MPI_Request array_of_requests[],
                   int *index, MPI_Status *status )

  int MPI_Waitsome( int incount, MPI_Request array_of_requests[],
                    int *outcount, int array_of_indices[],
                    MPI_Status array_of_statuses[] )

- There are corresponding versions of test for each of these
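For example, a master that has posted one MPI_Irecv per worker can service results in whatever order they arrive (a sketch; nworkers, the reqs array, and process_result() are assumed to be set up elsewhere):

for (int done = 0; done < nworkers; done++) {
    int which;
    MPI_Status status;
    MPI_Waitany(nworkers, reqs, &which, &status);   // blocks until any one request completes
    process_result(which);                          // hypothetical handling of worker 'which'
}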

Communication Modes

- Synchronous mode (MPI_Ssend): the send does not complete until a matching receive has begun
- Buffered mode (MPI_Bsend): the user supplies the buffer to the system
- Ready mode (MPI_Rsend): the user guarantees that a matching receive has been posted
- Non-blocking versions are MPI_Issend, MPI_Irsend, MPI_Ibsend

Miscellaneous Point-to-Point Commands

- MPI_Sendrecv
- MPI_Sendrecv_replace
- MPI_Cancel
- Used for buffered modes:
  - MPI_Buffer_attach
  - MPI_Buffer_detach
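A sketch of buffered-mode sending, tying MPI_Bsend to the attach/detach calls above (assumes mpi.h and stdlib.h are included and that dest and tag are defined; the payload size is illustrative):

int payload[100];                                     // assume this is filled by the application
int bufsize = sizeof(payload) + MPI_BSEND_OVERHEAD;   // room for the message plus bookkeeping
void *buf = malloc(bufsize);

MPI_Buffer_attach(buf, bufsize);
MPI_Bsend(payload, 100, MPI_INT, dest, tag, MPI_COMM_WORLD);  // returns once copied into 'buf'
MPI_Buffer_detach(&buf, &bufsize);                    // blocks until buffered messages are delivered
free(buf);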

Collective Communication

- One to Many (Broadcast, Scatter)
- Many to One (Reduce, Gather)
- Many to Many (Allreduce, Allgather)

Broadcast and Barrier

- Any type of message can be sent; the size of the message should be known to all

  int MPI_Bcast ( void *buffer, int count, MPI_Datatype datatype,
                  int root, MPI_Comm comm )

  - buffer: pointer to message buffer
  - count: number of items sent
  - datatype: type of item sent
  - root: sending processor
  - comm: communicator within which the broadcast takes place
  - Note: count and datatype should be the same on all processors

- Barrier synchronization (a broadcast without a message?)

  int MPI_Barrier ( MPI_Comm comm )
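A small usage sketch (rank is assumed to hold the result of MPI_Comm_rank; the parameter and its value are illustrative):

int nsteps = 0;
if (rank == 0)
    nsteps = 1000;                                   // e.g., read from a config file on the root
MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);   // every rank makes the same call; all get nsteps
MPI_Barrier(MPI_COMM_WORLD);                         // optional: synchronize before a timed phase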

Reduce

- Reverse of broadcast; all processors send to a single processor
- Several combining functions available:
  - MAX, MIN, SUM, PROD, LAND, BAND, LOR, BOR, LXOR, BXOR, MAXLOC, MINLOC

  int MPI_Reduce ( void *sendbuf, void *result, int count,
                   MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm )
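A typical use: summing per-rank partial results on the root (a sketch; rank is assumed from MPI_Comm_rank and local_work() stands in for the application's computation):

double partial = local_work();                 // hypothetical per-rank computation
double total   = 0.0;
MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("total = %f\n", total);             // only the root holds the combined value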

Scatter and Gather

- MPI_Scatter: the source (array) on the sending processor is spread to all processors
- MPI_Gather: the opposite of scatter; array locations at the receiver correspond to the rank of the senders
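A sketch of the usual scatter/compute/gather pattern (CHUNK and the assumption of 4 ranks are illustrative; the full array only needs valid data on the root):

enum { CHUNK = 256 };                      // per-rank element count (assumed)
double full[CHUNK * 4];                    // root's full array; 4 ranks assumed here
double piece[CHUNK];

MPI_Scatter(full,  CHUNK, MPI_DOUBLE,
            piece, CHUNK, MPI_DOUBLE, 0, MPI_COMM_WORLD);

for (int i = 0; i < CHUNK; i++)            // each rank works on its own piece
    piece[i] *= 2.0;

MPI_Gather(piece, CHUNK, MPI_DOUBLE,
           full,  CHUNK, MPI_DOUBLE, 0, MPI_COMM_WORLD);   // results land in rank order on the root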



Many-to-many Communication

- MPI_Allreduce
  - Syntax like reduce, except no root parameter
  - All nodes get the result
- MPI_Allgather
  - Syntax like gather, except no root parameter
  - All nodes get the resulting array
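A common use of MPI_Allreduce is a global convergence test, where every rank needs the same answer (a sketch; compute_local_error() is a hypothetical per-rank routine):

double local_err = compute_local_error();  // hypothetical per-rank value
double global_err;
int converged = 0;
MPI_Allreduce(&local_err, &global_err, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
if (global_err < 1.0e-6)                   // every rank sees the same value, so all take the same branch
    converged = 1;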

Evaluating Parallel Programs

- MPI provides tools to evaluate the performance of parallel programs
  - Timer
  - Profiling interface
- MPI_Wtime gives the wall clock time
- MPI_WTIME_IS_GLOBAL can be used to check whether the clocks are synchronized across all processes
- PMPI_.... is an entry point for all routines; can be used for profiling
- The -mpilog option at compile time can be used to generate logfiles
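A sketch of timing a phase with MPI_Wtime (rank is assumed from MPI_Comm_rank; do_parallel_phase() is a placeholder for the work being measured):

MPI_Barrier(MPI_COMM_WORLD);          // optional: start all ranks together
double t0 = MPI_Wtime();

do_parallel_phase();                  // hypothetical work being measured

double elapsed = MPI_Wtime() - t0;    // seconds of wall clock time on this rank
printf("rank %d: %.3f s\n", rank, elapsed);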

Recent Developments

- MPI-2
  - Dynamic process management
  - One-sided communication
  - Parallel file I/O
  - Extended collective operations
- MPI for grids; e.g., MPICH-G, MPICH-G2
- Fault-tolerant MPI; e.g., Starfish, CoCheck

One-sided Operations

- One-sided: one worker performs the transfer of data
- Remote memory reads and writes
- Data can be accessed without waiting for other processes
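A minimal sketch of MPI-2 one-sided communication using a window and fences (rank is assumed from MPI_Comm_rank; the value written is illustrative):

double local = 0.0;                               // memory exposed by every rank
MPI_Win win;
MPI_Win_create(&local, sizeof(double), sizeof(double),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);

MPI_Win_fence(0, win);                            // open an access epoch
if (rank == 0) {
    double value = 42.0;
    MPI_Put(&value, 1, MPI_DOUBLE, 1 /*target rank*/, 0 /*displacement*/, 1, MPI_DOUBLE, win);
}
MPI_Win_fence(0, win);                            // close the epoch; the put is now visible on rank 1

MPI_Win_free(&win);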





File Handling

- Similar to general programming languages
- Sample function calls:
  - MPI_File_open
  - MPI_File_read
  - MPI_File_seek
  - MPI_File_write
  - MPI_File_set_size
- Non-blocking reads and writes are also possible:
  - MPI_File_iread
  - MPI_File_iwrite
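A sketch of MPI-2 parallel I/O where every rank writes its own block of a shared file (the filename, COUNT, and rank variable are illustrative assumptions):

enum { COUNT = 1000 };
MPI_File fh;
MPI_Status status;
double block[COUNT];                               // assume this is filled locally

MPI_File_open(MPI_COMM_WORLD, "output.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
MPI_File_seek(fh, (MPI_Offset)rank * COUNT * sizeof(double), MPI_SEEK_SET);
MPI_File_write(fh, block, COUNT, MPI_DOUBLE, &status);
MPI_File_close(&fh);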

C Datatypes

- MPI_CHAR : char
- MPI_BYTE : see standard; like unsigned char
- MPI_SHORT : short
- MPI_INT : int
- MPI_LONG : long
- MPI_FLOAT : float
- MPI_DOUBLE : double
- MPI_UNSIGNED_CHAR : unsigned char
- MPI_UNSIGNED_SHORT : unsigned short
- MPI_UNSIGNED : unsigned int
- MPI_UNSIGNED_LONG : unsigned long
- MPI_LONG_DOUBLE : long double

mpiP

- A lightweight profiling library for MPI applications
  - To use it in an application, simply add the -lmpiP flag to the compile script
- Determines how much time a program spends in MPI calls versus the rest of the application
- Shows which MPI calls are used most frequently

Jumpshot

- Graphical profiling tool for MPI
  - Java-based
- Useful for determining communication patterns in an application
  - Color-coded bars represent time spent in an MPI function
  - Arrows denote message passing
  - A single line denotes actual processing time

Summary

- The parallel computing community has cooperated to develop a full-featured, standard message-passing library interface
- Several implementations are available
- Many applications are presently being developed or ported
- The MPI-2 process is beginning
- Lots of MPI material is available
- Very good facilities are available at the HCS Lab for MPI-based projects
  - The Zeta cluster will be available for class projects

References

[1] "The Message Passing Interface (MPI) Standard," http://www-unix.mcs.anl.gov/mpi/
[2] "LAM/MPI Parallel Computing," http://www.lam-mpi.org
[3] W. Gropp, "Tutorial on MPI: The Message-Passing Interface," http://www-unix.mcs.anl.gov/mpi/tutorial/gropp/talk.html
[4] D. Culler and J. Singh, Parallel Computer Architecture: A Hardware/Software Approach

Fault-Tolerant Embedded MPI

Motivations

- MPI functionality required for HPC space applications
  - De-facto standard / parallel programming model in HPC
- Fault-tolerant extensions for HPEC space systems
  - MPI is inherently fault-intolerant, an original design choice
- Existing HPC tools for MPI and fault-tolerant MPI
  - Good basis for ideas, API standards, etc.
  - Not readily amenable to HPEC platforms
- Focus on a lightweight fault-tolerant MPI for HPEC (FEMPI: Fault-tolerant Embedded Message Passing Interface)
  - Leverage prior work throughout the HPC community
  - Leverage prior work at UF on HPC with MPI

Primary Source of Failures in MPI

- Nature of failures
  - Individual processes of an MPI job crash (process failure)
  - Communication failure between two MPI processes (network failure)
- Behavior on failure
  - When a receiver node fails, the sender encounters a timeout on a blocking send call, as no matching receive is found, and returns an error
  - The whole communicator context crashes, and hence the entire MPI job
  - Many MPI implementations hold N×N open TCP connections; in such cases, the whole job crashes immediately on failure of any node
  - Applies to collective communication calls as well
- Avoid failure/crash of the entire application
  - Health status of nodes is provided by a failure detection service (via SR)
  - Check node status before communication with another node, to avoid establishing communication with a dead process
  - If the receiver dies after the status check and before communication, then timeout-based recovery will be used

FEMPI Software Architecture

- Low-level communication is provided through FEMPI using Self-Reliant's DMS
- Heartbeating via SR and a process notification extension to the SRP enables FEMPI fault detection
- Application and FEMPI checkpointing make use of existing checkpointing libraries; checkpoint communication uses DMS
- The MPI Restore process on the System Controller is responsible for recovery decisions based on application policies
[Architecture diagram: on each node, the application and the FEMPI runtime environment (JMA/FTMA) use Self-Reliant services (DMS for communication and checkpointing traffic, AMS+CMS, and JM/FTM for health monitoring); health information and failure notifications flow to the MPI Restore process on the System Controller, which applies the application's recovery policy.]

Fault Tolerance Actions

- Fault tolerance is provided through three stages:
  - Detection of a fault
  - Notification
  - Recovery
- Self-Reliant services are used to provide the detection and notification capabilities
  - Heartbeats and other functionality are already provided in the API
  - The notification service is built as an extension to the FTM of JMS
- FEMPI will provide features to enable recovery of an application
- Employs reliable communications to reduce faults due to communication failure
  - Low-level communications are provided through Self-Reliant services (DMS) instead of directly over TCP/IP

