Grid Computing

Programming Clusters using Message-Passing Interface (MPI)

Dr. Rajkumar Buyya

Cloud Computing and Distributed Systems (CLOUDS) Laboratory

The University of Melbourne

Melbourne, Australia

www.cloudbus.org

Outline

➢ Introduction to Message Passing Environments
➢ HelloWorld MPI Program
➢ Compiling and Running MPI programs
➢ Elements of Hello World Program
➢ MPI Routines Listing
➢ Communication in MPI programs
➢ Summary

Message-Passing Programming Paradigm

➢ Each processor in a message-passing program runs a sub-program
  ▪ written in a conventional sequential language
  ▪ all variables are private
  ▪ communicate via special subroutine calls

[Figure: processors/nodes (P), each with its own memory (M), connected by an interconnection network]

SPMD: A dominant paradigm for writing data parallel applications

    main(int argc, char **argv)
    {
        if (process is assigned Master role)
        {
            /* Assign work and coordinate workers and collect results */
            MasterRoutine(/*arguments*/);
        }
        else /* it is worker process */
        {
            /* interact with master and other workers. Do the work and send
               results to the master */
            WorkerRoutine(/*arguments*/);
        }
    }

Messages

➢ Messages are packets of data moving between sub-programs.
➢ The message passing system has to be told the following information:
  ▪ Sending processor
  ▪ Source location
  ▪ Data type
  ▪ Data length
  ▪ Receiving processor(s)
  ▪ Destination location
  ▪ Destination size

Messages

➢ Access:
  ▪ Each sub-program needs to be connected to a message passing system
➢ Addressing:
  ▪ Messages need to have addresses to be sent to
➢ Reception:
  ▪ It is important that the receiving process is capable of dealing with the messages it is sent
➢ A message passing system is similar to:
  ▪ Post office, phone line, fax, e-mail, etc.
➢ Message Types:
  ▪ Point-to-Point, Collective, Synchronous (telephone)/Asynchronous (postal)

Message Passing Systems and MPI - www.mpi-forum.org

➢ Initially each manufacturer developed their own message passing interface
  ▪ Wide range of features, often incompatible
➢ The MPI Forum brought together several vendors and users of HPC systems from the US and Europe to overcome the above limitations.
➢ It produced a document defining a standard, called Message Passing Interface (MPI), which is derived from experience with the common features and issues addressed by many message-passing libraries. It aimed:
  ▪ to provide source-code portability
  ▪ to allow efficient implementation
  ▪ to provide a high level of functionality
  ▪ to support heterogeneous parallel architectures
  ▪ to support parallel I/O (in MPI 2.0)
➢ MPI 1.0 contains over 115 routines/functions that can be grouped into 8 categories.

General MPI Program Structure

1. MPI Include File
2. Initialise MPI Environment
3. Do work and perform message communication
4. Terminate MPI Environment

MPI programs

➢ MPI is a library - there are NO language changes
➢ Header Files
  ▪ C: #include <mpi.h>
➢ MPI Function Format
  ▪ C: error = MPI_Xxxx(parameter,...);
       MPI_Xxxx(parameter,...);

Example - C

    #include <mpi.h>
    #include <stdlib.h>   /* for exit() */
    /* include other usual header files */

    int main(int argc, char **argv)
    {
        /* initialize MPI */
        MPI_Init(&argc, &argv);

        /* main part of program */

        /* terminate MPI */
        MPI_Finalize();
        exit(0);
    }

MPI helloworld.c

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int numtasks, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);   /* total number of processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* rank of this process */

        printf("Hello World from process %d of %d\n",
               rank, numtasks);

        MPI_Finalize();
        return 0;
    }

MPI Programs Compilation and Execution

Manjra: GRIDS Lab Linux Cluster

➢ Master Node: manjra.cs.mu.oz.au
  ▪ Dual Xeon 2GHz
  ▪ 512 MB memory
  ▪ 250 GB integrated storage
  ▪ Gigabit LAN
  ▪ CDROM & Floppy Drives
  ▪ Red Hat Linux release 7.3 (Valhalla)
➢ Worker Nodes (node1..node13)
  ▪ Each of the 13 worker nodes consists of the following:
    ▪ Pentium 4 2GHz
    ▪ 512 MB memory
    ▪ 40 GB hard disk
    ▪ Gigabit LAN
    ▪ Red Hat Linux release 7.3 (Valhalla)

[Figure: Manjra Linux cluster, showing the master (manjra.cs.mu.oz.au) and the internal worker nodes node1, node2, ..., node13]

How Manjra cluster looks

[Figure: a snapshot of the Manjra cluster, front view and back view]

Compile and Run Commands

➢ Compile:
  ▪ [mpicc helloworld.c -o helloworld (standard)]
  ▪ manjra> mpicc helloworld.c -o helloworld.o (use this on Manjra)
➢ Run (-np gives the number of processes):
  ▪ manjra> mpirun -np 3 helloworld.o [hosts picked from configuration file automatically]
  ▪ manjra> mpirun -np 3 -machinefile machines.list helloworld.o
  ▪ NOTE: when you run for the first time, you need to enter your password again, due to the customised (security issue) installation on the Manjra cluster. Only students are given this privilege!
➢ The file machines.list contains the list of nodes:
  ▪ node1
  ▪ ..
  ▪ node13
➢ Some nodes may not work today if they have failed!

Sample Run and Output

➢ A Run with 3 Processes:
  ▪ manjra> mpirun -np 3 -machinefile machines.list helloworld.o
  ▪ Hello World from process 0 of 3
  ▪ Hello World from process 1 of 3
  ▪ Hello World from process 2 of 3
➢ A Run by default
  ▪ manjra> helloworld.o
  ▪ Hello World from process 0 of 1
➢ You can also use mpirun to exec standard commands
  ▪ manjra> mpirun -np 4 -machinefile machines.list hostname

Sample Run and Output

➢ A Run with 6 Processes:
  ▪ manjra> mpirun -np 6 -machinefile machines.list helloworld
  ▪ Hello World from process 0 of 6
  ▪ Hello World from process 3 of 6
  ▪ Hello World from process 1 of 6
  ▪ Hello World from process 5 of 6
  ▪ Hello World from process 4 of 6
  ▪ Hello World from process 2 of 6
➢ Note: Process execution need not be in process number order.

Sample Run and Output

➢ A Run with 6 Processes:
  ▪ manjra> mpirun -np 6 -machinefile machines.list helloworld
  ▪ Hello World from process 0 of 6
  ▪ Hello World from process 3 of 6
  ▪ Hello World from process 1 of 6
  ▪ Hello World from process 2 of 6
  ▪ Hello World from process 5 of 6
  ▪ Hello World from process 4 of 6
➢ Note: the output order has changed. On each run the mapping of processes to machines can be different, and the machines may be under different loads, hence the difference.

More on MPI Program Elements and Error Checking

Handles

➢ MPI controls its own internal data structures
➢ MPI releases 'handles' to allow programmers to refer to these
➢ "C" handles are of distinct typedef'd types and arrays are indexed from 0
➢ Some arguments can be of any type - in C these are declared as void *

Initializing MPI

➢ The first MPI routine called in any MPI program must be MPI_Init.
➢ The C version accepts the arguments to main
  ▪ int MPI_Init(int *argc, char ***argv);
➢ MPI_Init must be called by every MPI program
➢ Making multiple MPI_Init calls is erroneous
➢ MPI_INITIALIZED (MPI_Initialized in C) is an exception to the first rule: it may be called before MPI_Init

MPI_COMM_WORLD

➢ MPI_Init defines a communicator called MPI_COMM_WORLD for every process that calls it.
➢ All MPI communication calls require a communicator argument
➢ MPI processes can only communicate if they share a communicator.
➢ A communicator contains a group, which is a list of processes
  ▪ Each process has its rank within the communicator
  ▪ A process can have several communicators

Communicators

➢ MPI uses objects called communicators that define which collection of processes may communicate with each other.
➢ Every process has a unique integer identifier, its rank, assigned by the system when the process initialises. A rank is sometimes called a process ID.
➢ Processes can request information from a communicator
  ▪ MPI_Comm_rank(MPI_Comm comm, int *rank)
    ▪ Returns the rank of the process in comm
  ▪ MPI_Comm_size(MPI_Comm comm, int *size)
    ▪ Returns the size of the group in comm

Finishing up

➢ An MPI program should call MPI_Finalize when all communications have completed.
➢ Once called, no other MPI calls can be made
➢ Aborting:
    MPI_Abort(comm, errorcode)
  ▪ Attempts to abort all processes listed in comm
  ▪ If comm = MPI_COMM_WORLD the whole program terminates

Hello World with Error Check

Display Hostname of MPI Process

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int numtasks, rank;
        int resultlen;
        static char mpi_hostname[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(mpi_hostname, &resultlen);   /* name of the node running this process */

        printf("Hello World from process %d of %d running on %s\n", rank,
               numtasks, mpi_hostname);

        MPI_Finalize();
        return 0;
    }

MPI Routines

MPI Routines – C and Fortran

➢ Environment Management
➢ Point-to-Point Communication
➢ Collective Communication
➢ Process Group Management
➢ Communicators
➢ Derived Type
➢ Virtual Topologies
➢ Miscellaneous Routines

Environment Management Routines

Point-to-Point Communication

➢ The simplest form of message passing
➢ One process sends a message to another
➢ Several variations on how sending a message can interact with execution of the sub-program

Point-to-Point variations

➢ Synchronous Sends
  ▪ provide information about the completion of the message
  ▪ e.g. fax machines
➢ Asynchronous Sends
  ▪ Only know when the message has left
  ▪ e.g. post cards
➢ Blocking operations
  ▪ only return from the call when the operation has completed
➢ Non-blocking operations
  ▪ return straight away - can test/wait later for completion

Point-to-Point Communication

Collective Communications

➢ Collective communication routines are higher level routines involving several processes at a time
➢ Can be built out of point-to-point communications
➢ Barriers
  ▪ synchronise processes
➢ Broadcast
  ▪ one-to-many communication
➢ Reduction operations
  ▪ combine data from several processes to produce a (usually) single result

Collective Communication Routines

Process Group Management Routines

Communicators Routines

Derived Type Routines

Virtual Topologies Routines

Miscellaneous Routines

MPI Communication Routines and Examples

MPI Messages

➢ A message contains a number of elements of some particular data type
➢ MPI data types
  ▪ Basic Types
  ▪ Derived types
➢ Derived types can be built up from basic types
➢ "C" types are different from Fortran types

MPI Basic Data types - C

    MPI datatype          C datatype
    MPI_CHAR              signed char
    MPI_SHORT             signed short int
    MPI_INT               signed int
    MPI_LONG              signed long int
    MPI_UNSIGNED_CHAR     unsigned char
    MPI_UNSIGNED_SHORT    unsigned short int
    MPI_UNSIGNED          unsigned int
    MPI_UNSIGNED_LONG     unsigned long int
    MPI_FLOAT             float
    MPI_DOUBLE            double
    MPI_LONG_DOUBLE       long double
    MPI_BYTE
    MPI_PACKED

Point-to-Point Communication

➢ Communication between two processes
➢ Source process sends message to destination process
➢ Communication takes place within a communicator
➢ Destination process is identified by its rank in the communicator
➢ MPI provides four communication modes for sending messages
  ▪ standard, synchronous, buffered, and ready
➢ Only one mode for receiving

Standard Send

➢ Completes once the message has been sent
  ▪ Note: it may or may not have been received
➢ Programs should obey the following rules:
  ▪ It should not assume the send will complete before the receive begins - can lead to deadlock
  ▪ It should not assume the send will complete after the receive begins - can lead to non-determinism
  ▪ Processes should be eager readers - they should guarantee to receive all messages sent to them - else network overload
➢ Can be implemented as either a buffered send or synchronous send

Standard Send (cont.)

    MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)

    buf       the address of the data to be sent
    count     the number of elements of datatype buf contains
    datatype  the MPI datatype
    dest      rank of destination in communicator comm
    tag       a marker used to distinguish different message types
    comm      the communicator shared by sender and receiver
    ierror    the Fortran return value of the send

Standard Blocking Receive

➢ Note: all sends so far have been blocking (but this only makes a difference for synchronous sends)
➢ Completes when message received

    MPI_Recv(buf, count, datatype, source, tag, comm, status)

    source - rank of source process in communicator comm
    status - returns information about message

➢ Synchronous Blocking Message-Passing
  ▪ processes synchronise
  ▪ sender process specifies the synchronous mode
  ▪ blocking - both processes wait until transaction completed

For a communication to succeed

➢ Sender must specify a valid destination rank
➢ Receiver must specify a valid source rank
➢ The communicator must be the same
➢ Tags must match
➢ Message types must match
➢ Receiver's buffer must be large enough
➢ Receiver can use wildcards
  ▪ MPI_ANY_SOURCE
  ▪ MPI_ANY_TAG
  ▪ actual source and tag are returned in status parameter
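
The rules above can be written down as a toy checker in plain C. This is only a model for reasoning about the rules, not how MPI is implemented: the types Envelope and PostedRecv, the function can_deliver, and the constants ANY_SOURCE and ANY_TAG are hypothetical stand-ins (the real wildcards come from mpi.h). In real MPI a receive buffer that is too small is reported as an error rather than treated as a non-match.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for MPI_ANY_SOURCE and MPI_ANY_TAG. */
enum { ANY_SOURCE = -1, ANY_TAG = -1 };

typedef struct {      /* what the sender specified */
    int source;       /* rank of the sender */
    int dest;         /* rank of the destination */
    int tag;
    int comm_id;      /* stand-in for the communicator */
    int type_id;      /* stand-in for the MPI datatype */
    size_t bytes;     /* size of the message */
} Envelope;

typedef struct {      /* what the receiver posted */
    int self;         /* rank of the receiver */
    int source;       /* expected source, or ANY_SOURCE */
    int tag;          /* expected tag, or ANY_TAG */
    int comm_id;
    int type_id;
    size_t capacity;  /* size of the receive buffer in bytes */
} PostedRecv;

/* Returns 1 when the message satisfies every rule on the slide. */
int can_deliver(const Envelope *m, const PostedRecv *r, int comm_size)
{
    assert(m != NULL && r != NULL);
    if (m->dest < 0 || m->dest >= comm_size) return 0;   /* valid destination rank */
    if (m->dest != r->self) return 0;                    /* addressed to this receiver */
    if (r->source != ANY_SOURCE &&
        (r->source < 0 || r->source >= comm_size)) return 0;   /* valid source rank */
    if (m->comm_id != r->comm_id) return 0;              /* same communicator */
    if (m->type_id != r->type_id) return 0;              /* message types match */
    if (r->source != ANY_SOURCE && r->source != m->source) return 0;
    if (r->tag != ANY_TAG && r->tag != m->tag) return 0; /* tags match */
    if (m->bytes > r->capacity) return 0;                /* buffer large enough */
    return 1;
}
```

For example, a 4-byte message from rank 0 to rank 1 with tag 7 is accepted by a receive on rank 1 that uses both wildcards and an 8-byte buffer, but rejected if the receive asks for tag 3 or offers only 2 bytes.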

Standard/Blocked Send/Receive

MPI Send/Receive a Character

    // mpi_com.c
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int numtasks, rank, dest, source, rc, tag = 1;
        char inmsg, outmsg = 'X';
        MPI_Status Stat;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {   /* send first, then wait for the reply */
            dest = 1;
            rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
            printf("Rank0 sent: %c\n", outmsg);
            source = 1;
            rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
            printf("Rank0 recv: %c\n", inmsg);
        }
        else if (rank == 1) {   /* receive first, then reply */
            source = 0;
            rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag,
                          MPI_COMM_WORLD, &Stat);
            printf("Rank1 received: %c\n", inmsg);
            dest = 0;
            outmsg = 'Y';   /* reply with a different character */
            rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag,
                          MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Execution Demo

➢ mpicc mpi_com.c
➢ [raj@manjra mpi]$ mpirun -np 2 a.out

    Rank0 sent: X
    Rank0 recv: Y
    Rank1 received: X

Non Blocking Message Passing

Exercise: Ping Pong

1. Write a program in which two processes repeatedly pass a message back and forth.
2. Insert timing calls to measure the time taken for one message.
3. Investigate how the time taken to exchange messages varies with the size of the message.

A simple Ping Pong.c

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>   /* for strcpy() and strlen() */

    int main(int argc, char *argv[])
    {
        int numtasks, rank, dest, source, rc, tag = 1;
        char pingmsg[10]; char pongmsg[10]; char buff[100];
        MPI_Status Stat;

        strcpy(pingmsg, "ping");
        strcpy(pongmsg, "pong");

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Why + 1? strlen() does not count the terminating '\0', so one
           extra character is sent to make the received text a valid C string. */
        if (rank == 0) { /* Send Ping, Receive Pong */
            dest = 1;
            source = 1;
            rc = MPI_Send(pingmsg, strlen(pingmsg)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
            rc = MPI_Recv(buff, strlen(pongmsg)+1, MPI_CHAR, source, tag, MPI_COMM_WORLD,
                          &Stat);
            printf("Rank0 Sent: %s & Received: %s\n", pingmsg, buff);
        }
        else if (rank == 1) { /* Receive Ping, Send Pong */
            dest = 0;
            source = 0;
            rc = MPI_Recv(buff, strlen(pingmsg)+1, MPI_CHAR, source, tag,
                          MPI_COMM_WORLD, &Stat);
            printf("Rank1 received: %s & Sending: %s\n", buff, pongmsg);
            rc = MPI_Send(pongmsg, strlen(pongmsg)+1, MPI_CHAR, dest,
                          tag, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Timers

➢ C: double MPI_Wtime(void);
  ▪ Returns the elapsed wall clock time in seconds (double precision) on the calling processor.
  ▪ Time is measured in seconds
  ▪ Time to perform a task is measured by consulting the time before and after

Upcoming Evaluations

➢ Mid term exam: "peer" evaluation (for Part A & B), just before the start of Part C.
  ▪ Review your understanding of topics covered so far.
  ▪ No official marking, just a test on "How are you going with the subject?"
  ▪ Date: April 13 (Wed)
  ▪ Time: 20 min (exam), 15 min (for peer marking)
➢ Assignment 2:
  ▪ Implementation of "parallel" matrix multiplication (using MPI)
  ▪ Deadline: April 19 (Tuesday) from 1pm-3pm at ICT Building, Level 2 (Masters students Lab), 2.11 or 2.13.

Acknowledgements: MPI Slides are Derived from

➢ Dirk van der Knijff, High Performance Parallel Programming, PPT Slides
➢ MPI Notes, Maui HPC Centre:
  ▪ http://www.buyya.com/csc433/MPITut.pdf
➢ Melbourne Advanced Research Computing Center
  ▪ http://www.hpc.unimelb.edu.au
➢ Cluster Book (Chapter 3):
  ▪ MPI and PVM Programming

Self Study (Will NOT appear in Exam!)



ADDITIONAL CONTENT

Running Applications using PBS (Portable Batch System) on Manjra cluster

PBS

➢ PBS is a batch system - jobs get submitted to a queue
➢ The job is a shell script to execute your program
➢ The shell script can contain job management instructions (note that these instructions can also be in the command line)
➢ PBS will allocate your job to some other computer, log in as you, and execute your script, i.e. your script must contain cd's or absolute references to access files (or globus objects)
➢ Useful PBS commands:
  ▪ qsub - submits a job
  ▪ qstat - monitors status
  ▪ qdel - deletes a job from a queue

PBS directives

➢ Some PBS directives to insert at the start of your shell script:
  ▪ #PBS -q (queue name)
  ▪ #PBS -e (stderr location)
  ▪ #PBS -o (stdout location)
  ▪ #PBS -eo (combines stderr and stdout)
  ▪ #PBS -t (maximum time)
  ▪ #PBS -l resource=value (eg -l nodes=2)

Manjra and PBS

➢ Manjra runs a batch system called PBS:
  ▪ You submit a script telling the system how to run your job
  ▪ The script requests the number of nodes in DEDICATED mode.
➢ Queue Details

    [raj@manjra mpi]$ qstat -q
    Queue            Memory CPU Time Walltime Node Run Que Lm  State
    ---------------- ------ -------- -------- ---- --- --- --  -----
    workq              --   10000:00 10000:00   13   0   0 --   E R
    defaultq           --      --       --      --   0   0 --   E R
                                                   --- ---
                                                     0

mpich on Manjra

➢ Run with

    qsub jobscript

➢ where jobscript is

    #PBS -l nodes=2
    mpirun

PBS Script

    [raj@manjra mpi]$ cat hello.bat
    cd mpi
    /usr/local/mpich/mpich-1.2.5.2/bin/mpirun -np 5 helloworld-hostname

    [raj@manjra mpi]$ cat hello.sh
    #!/bin/bash
    cd /home/mpi678-2010/mpi
    mpirun -np 5 ./helloworld

Give the full path of your working directory for your program's execution.

Submitting to a Queue

    [raj@manjra mpi]$ qsub hello.bat
    2811.manjra.cs.mu.oz.au      (ID assigned to your job)

    [raj@manjra mpi]$ qsub -V hello.sh
    2811.manjra.cs.mu.oz.au

Q Status

    [raj@manjra mpi]$ qstat
    Job id           Name             User             Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    2807.manjra      hello.bat        raj                     0 Q workq

    2813.manjra      hello.bat        raj                     0 E workq

Output – Result/Error

➢ Output
  ▪ hello.bat.oXXXXX
➢ Error, if any
  ▪ hello.bat.eXXXXX
➢ Where XXXXX is the ID assigned to your job by PBS

References

➢ PBS User Guide:
  ▪ http://www.doesciencegrid.org/public/pbs


