Programming Clusters using
Message-Passing Interface (MPI)
Dr. Rajkumar Buyya
Cloud Computing and Distributed Systems (CLOUDS) Laboratory
The University of Melbourne
Melbourne, Australia
www.cloudbus.org
Outline
Introduction to Message Passing
Environments
HelloWorld MPI Program
Compiling and Running MPI programs
Elements of Hello World Program
MPI Routines Listing
Communication in MPI programs
Summary
Message-Passing Programming
Paradigm
Each processor in a message-passing program
runs a sub-program
written in a conventional sequential language
all variables are private
communicate via special subroutine calls
[Figure: each processor/node (P) has its own private memory (M); the nodes are connected by an interconnection network]
SPMD: A dominant paradigm for
writing data parallel applications
main(int argc, char **argv)
{
if(process is assigned Master role)
{
/* Assign work and coordinate workers and collect results */
MasterRoutine(/*arguments*/);
}
else /* it is worker process */
{
/* interact with master and other workers. Do the work and send
results to the master*/
WorkerRoutine(/*arguments*/);
}
}
Messages
Messages are packets of
data moving between sub-
programs.
The message passing system
has to be told the following
information
Sending processor
Source location
Data type
Data length
Receiving processor(s)
Destination location
Destination size
Messages
Access:
Each sub-program needs to be connected to a message passing
system
Addressing:
Messages need to have addresses to be sent to
Reception:
It is important that the receiving process is capable of dealing
with the messages it is sent
A message passing system is similar to:
Post-office, Phone line, Fax, E-mail, etc
Message Types:
Point-to-Point, Collective, Synchronous (telephone)/Asynchronous
(Postal)
Message Passing Systems and MPI
- www.mpi-forum.org
Initially each manufacturer developed their own message
passing interface
Wide range of features, often incompatible
The MPI Forum brought together vendors and users of HPC
systems from the US and Europe to overcome these limitations.
It produced a document defining a standard, called the
Message Passing Interface (MPI), derived from experience with
the common features and issues addressed by many
message-passing libraries. It aimed:
to provide source-code portability
to allow efficient implementation
to provide a high level of functionality
to support heterogeneous parallel architectures
to support parallel I/O (in MPI 2.0)
MPI 1.0 contains over 115 routines/functions that can be
grouped into 8 categories.
General MPI Program Structure
MPI Include File
Initialise MPI Environment
Do work and perform message communication
Terminate MPI Environment
MPI programs
MPI is a library - there are NO language
changes
Header Files
C: #include <mpi.h>
MPI Function Format
C: error = MPI_Xxxx(parameter,...);
MPI_Xxxx(parameter,...);
Example - C
#include <stdlib.h>   /* exit() */
#include <mpi.h>
/* include other usual header files */
int main(int argc, char **argv)
{
    /* initialize MPI */
    MPI_Init(&argc, &argv);
    /* main part of program */
    /* terminate MPI */
    MPI_Finalize();
    exit(0);
}
MPI helloworld.c
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv)
{
    int numtasks, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello World from process %d of %d\n",
           rank, numtasks);
    MPI_Finalize();
    return 0;
}
MPI Programs Compilation and
Execution
Manjra: GRIDS Lab Linux Cluster
Master Node: manjra.cs.mu.oz.au
Dual Xeon 2GHz
512 MB memory
250 GB integrated storage
Gigabit LAN
CDROM & Floppy Drives
Red Hat Linux release 7.3 (Valhalla)
Internal worker nodes: node1, node2, ..., node13
Worker Nodes(node1..node13)
Each of the 13 worker nodes
consists of the following:
Pentium 4 2GHz
512 MB memory
40 GB harddisk
Gigabit LAN
Red Hat Linux release 7.3
(Valhalla)
Manjra Linux cluster
[Photos: how the Manjra cluster looks, front view and back view]
Compile and Run Commands
Compile:
[mpicc helloworld.c -o helloworld (standard)]
manjra> mpicc helloworld.c -o helloworld.o (Use this on Manjra)
Run (-np sets the number of processes):
manjra> mpirun -np 3 helloworld.o [hosts picked from
configuration file automatically]
manjra> mpirun -np 3 -machinefile machines.list helloworld.o
NOTE: The first time you run, you need to enter your password again,
because of the customised (security-related) installation on the Manjra
cluster. Only students are given this privilege!
The file machines.list contains nodes list:
node1
..
node13
Some nodes may be unavailable on a given day if they have failed!
Sample Run and Output
A Run with 3 Processes:
manjra> mpirun -np 3 -machinefile machines.list helloworld.o
Hello World from process 0 of 3
Hello World from process 1 of 3
Hello World from process 2 of 3
A Run by default
manjra> helloworld.o
Hello World from process 0 of 1
You can also use mpirun to exec standard
commands
manjra> mpirun -np 4 -machinefile machines.list hostname
Sample Run and Output
A Run with 6 Processes:
manjra> mpirun -np 6 -machinefile machines.list helloworld
Hello World from process 0 of 6
Hello World from process 3 of 6
Hello World from process 1 of 6
Hello World from process 5 of 6
Hello World from process 4 of 6
Hello World from process 2 of 6
Note: Process execution need not be in
process number order.
Sample Run and Output
A Run with 6 Processes:
manjra> mpirun -np 6 -machinefile machines.list helloworld
Hello World from process 0 of 6
Hello World from process 3 of 6
Hello World from process 1 of 6
Hello World from process 2 of 6
Hello World from process 5 of 6
Hello World from process 4 of 6
Note: The output order has changed. The mapping of
processes to nodes can differ from run to run, and
the nodes may carry different loads, hence the
difference.
More on MPI Program Elements
and Error Checking
Handles
MPI controls its own internal data structures
MPI releases 'handles' to allow programmers
to refer to these
C handles are of distinct typedef'd types
and arrays are indexed from 0
Some arguments can be of any type - in C
these are declared as void *
Initializing MPI
The first MPI routine called in any MPI
program must be MPI_Init.
The C version accepts the arguments to main
int MPI_Init(int *argc, char ***argv);
MPI_Init must be called by every MPI
program
Making multiple MPI_Init calls is erroneous
MPI_Initialized is an exception to the first
rule: it may be called before MPI_Init to test
whether MPI has already been initialised
MPI_COMM_WORLD
MPI_INIT defines a
communicator called
MPI_COMM_WORLD for every
process that calls it.
All MPI communication calls
require a communicator
argument
MPI processes can only
communicate if they share a
communicator.
A communicator contains a
group which is a list of
processes
Each process has its own rank
within the communicator
A process can have several
communicators
Communicators
MPI uses objects called communicators, which
define the collection of processes that can communicate
with each other.
Every process has a unique integer identifier (its rank),
assigned by the system when the process initialises.
A rank is sometimes called a process ID.
Processes can request information from a
communicator
MPI_Comm_rank(MPI_Comm comm, int *rank)
Returns the rank of the process in comm
MPI_Comm_size(MPI_Comm comm, int *size)
Returns the size of the group in comm
Finishing up
An MPI program should call MPI_Finalize
when all communications have completed.
Once called no other MPI calls can be made
Aborting:
MPI_Abort(MPI_Comm comm, int errorcode)
Attempts to abort all processes listed in
comm
if comm = MPI_COMM_WORLD the whole program
terminates
Hello World with Error Check
Display Hostname of MPI Process
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv)
{
    int numtasks, rank;
    int resultlen;
    static char mpi_hostname[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* name of the processor (host) this process is running on */
    MPI_Get_processor_name(mpi_hostname, &resultlen);
    printf("Hello World from process %d of %d running on %s\n", rank,
           numtasks, mpi_hostname);
    MPI_Finalize();
    return 0;
}
MPI Routines
MPI Routines – C and Fortran
Environment Management
Point-to-Point Communication
Collective Communication
Process Group Management
Communicators
Derived Type
Virtual Topologies
Miscellaneous Routines
Environment Management Routines
Point-to-Point Communication
The simplest form of message passing
One process sends a message to another
Several variations on how sending a message
can interact with execution of the sub-
program
Point-to-Point variations
Synchronous Sends
provide information about the completion of the
message
e.g. fax machines
Asynchronous Sends
Only know when the message has left
e.g. post cards
Blocking operations
only return from the call when operation has completed
Non-blocking operations
return straight away - can test/wait later for
completion
Point-to-Point Communication
Collective Communications
Collective communication routines are higher
level routines involving several processes at a
time
Can be built out of point-to-point
communications
Barriers
synchronise processes
Broadcast
one-to-many communication
Reduction operations
combine data from several processes to produce a single
(usually) result
Collective Communication Routines
Process Group Management
Routines
Communicators Routines
Derived Type Routines
Virtual Topologies Routines
Miscellaneous Routines
MPI Communication Routines
and Examples
MPI Messages
A message contains a number of elements
of some particular data type
MPI data types
Basic Types
Derived types
Derived types can be built up from basic
types
“C” types are different from Fortran types
MPI Basic Data types - C
MPI datatype          C datatype
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE
MPI_PACKED
Point-to-Point Communication
Communication between two processes
Source process sends message to
destination process
Communication takes place within a
communicator
Destination process is identified by its rank
in the communicator
MPI provides four communication modes for
sending messages
standard, synchronous, buffered, and ready
Only one mode for receiving
Standard Send
Completes once the message has been sent
Note: it may or may not have been received
Programs should obey the following rules:
It should not assume the send will complete before the
receive begins - can lead to deadlock
It should not assume the send will complete after the
receive begins - can lead to non-determinism
processes should be eager readers - they should guarantee
to receive all messages sent to them - else network
overload
Can be implemented as either a buffered
send or synchronous send
Standard Send (cont.)
MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)
buf the address of the data to be sent
count the number of elements of datatype buf contains
datatype the MPI datatype
dest rank of destination in communicator comm
tag a marker used to distinguish different message types
comm the communicator shared by sender and receiver
ierror the Fortran-only error argument (in C the error code is the function's return value)
Standard Blocking Receive
Note: all sends so far have been blocking (but this
only makes a difference for synchronous sends)
Completes when message received
MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Status *status)
source - rank of source process in communicator comm
status - returns information about message
Synchronous Blocking Message-Passing
processes synchronise
sender process specifies the synchronous mode
blocking - both processes wait until transaction completed
For a communication to succeed
Sender must specify a valid destination
rank
Receiver must specify a valid source rank
The communicator must be the same
Tags must match
Message types must match
Receiver's buffer must be large enough
Receiver can use wildcards
MPI_ANY_SOURCE
MPI_ANY_TAG
actual source and tag are returned in status parameter
Standard/Blocked Send/Receive
MPI Send/Receive a Character
// mpi_com.c
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, tag=1;
    char inmsg, outmsg='X';
    MPI_Status Stat;
    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* rank 0 sends 'X' to rank 1, then waits for the reply */
        dest = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        printf("Rank0 sent: %c\n", outmsg);
        source = 1;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        printf("Rank0 recv: %c\n", inmsg);
    }
    else if (rank == 1) {
        /* rank 1 receives first, then replies with 'Y' */
        source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag,
                      MPI_COMM_WORLD, &Stat);
        printf("Rank1 received: %c\n", inmsg);
        dest = 0;
        outmsg = 'Y';
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag,
                      MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
Execution Demo
mpicc mpi_com.c
[raj@manjra mpi]$ mpirun -np 2 a.out
Rank0 sent: X
Rank0 recv: Y
Rank1 received: X
Non Blocking Message Passing
Exercise: Ping Pong
1. Write a program in which two processes
repeatedly pass a message back and forth.
2. Insert timing calls to measure the time
taken for one message.
3. Investigate how the time taken to exchange
messages varies with the size of the
message.
A simple Ping Pong.c
#include <stdio.h>
#include <string.h>   /* strcpy(), strlen() */
#include <mpi.h>
int main(int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, tag=1;
    char pingmsg[10]; char pongmsg[10]; char buff[100];
    MPI_Status Stat;
    strcpy(pingmsg, "ping");
    strcpy(pongmsg, "pong");
    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Why + 1? strlen() does not count the terminating '\0', which must
       travel with the string so the receiver can print it with %s. */
    if (rank == 0) { /* Send Ping, Receive Pong */
        dest = 1;
        source = 1;
        rc = MPI_Send(pingmsg, strlen(pingmsg)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(buff, strlen(pongmsg)+1, MPI_CHAR, source, tag, MPI_COMM_WORLD,
                      &Stat);
        printf("Rank0 Sent: %s & Received: %s\n", pingmsg, buff);
    }
    else if (rank == 1) { /* Receive Ping, Send Pong */
        dest = 0;
        source = 0;
        rc = MPI_Recv(buff, strlen(pingmsg)+1, MPI_CHAR, source, tag,
                      MPI_COMM_WORLD, &Stat);
        printf("Rank1 received: %s & Sending: %s\n", buff, pongmsg);
        rc = MPI_Send(pongmsg, strlen(pongmsg)+1, MPI_CHAR, dest,
                      tag, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
Timers
C: double MPI_Wtime(void);
Returns the elapsed wall-clock time in seconds (double
precision) on the calling processor.
The time taken by a task is measured by reading the timer
before and after it and subtracting.
Upcoming Evaluations
Mid term exam: “peer” evaluation (for Part A
& B) – just before the start of Part C.
Review your understanding of topics covered so far.
No official marking – just a test on “How you are going with
the subject?”.
Date: April 13 (Wed),
Time: 20 min (exam), 15min (for peer marking)
Assignment 2:
Implementation of “parallel” Matrix multiplication (using
MPI)
Deadline: April 19 (Tuesday) from 1pm-3pm at ICT Building,
Level 2 (Masters students Lab), 2.11 or 2.13.
Acknowledgements:
MPI Slides are Derived from
Dirk van der Knijff, High Performance
Parallel Programming, PPT Slides
MPI Notes, Maui HPC Centre:
http://www.buyya.com/csc433/MPITut.pdf
Melbourne Advanced Research Computing
Center
http://www.hpc.unimelb.edu.au
Cluster Book (Chapter 3):
MPI and PVM Programming
Self Study (Will NOT appear in Exam!)
ADDITIONAL CONTENT
Running Applications using PBS
(Portable Batch System) on
Manjra cluster
PBS
PBS is a batch system - jobs get submitted to a queue
The job is a shell script to execute your program
The shell script can contain job management instructions (note
that these instructions can also be in the command line)
PBS will allocate your job to some other computer, log in as you,
and execute your script, i.e. your script must contain cd commands or
absolute references to access files (or globus objects)
Useful PBS commands:
qsub - submits a job
qstat - monitors status
qdel - deletes a job from a queue
PBS directives
Some PBS directives to insert at the start
of your shell script:
#PBS -q <queue>
#PBS -e (stderr location)
#PBS -o (stdout location)
#PBS -j eo (combines stderr and stdout)
#PBS -l walltime=<hh:mm:ss> (maximum time)
#PBS -l <resource>=<value> (e.g. -l nodes=2)
Manjra and PBS
Manjra runs a batch system called PBS:
You submit a script telling the system how to run your job
The script requests the number of nodes in DEDICATED mode.
Queue Details
[raj@manjra mpi]$ qstat -q
Queue            Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
workq              --   10000:00 10000:00   13   0   0 --  E R
defaultq           --      --       --      --   0   0 --  E R
                                               --- ---
                                                 0
MPICH on Manjra
Run with
qsub jobscript
where jobscript is
#PBS -l nodes=2
mpirun
PBS Script
[raj@manjra mpi]$ cat hello.bat
cd mpi
/usr/local/mpich/mpich-1.2.5.2/bin/mpirun -np 5 helloworld-hostname
[raj@manjra mpi]$ cat hello.sh
#!/bin/bash
cd /home/mpi678-2010/mpi
mpirun -np 5 ./helloworld
Give the full path of your working directory, so that your
program executes there.
Submitting to a Queue
[raj@manjra mpi]$ qsub hello.bat
2811.manjra.cs.mu.oz.au
ID Assigned to your job
[raj@manjra mpi]$ qsub -V hello.sh
2811.manjra.cs.mu.oz.au
Q Status
[raj@manjra mpi]$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
2807.manjra hello.bat raj 0 Q workq
2813.manjra hello.bat raj 0 E workq
Output – Result/Error
Output
hello.bat.oXXXXX
Error, if any
hello.bat.eXXXXX
Where XXXXX is the ID assigned to your
job by PBS
References
PBS User Guide:
http://www.doesciencegrid.org/public/pbs