FT-MPI
Graham E Fagg Making of the holy grail or a YAMI that is FT
FT-MPI
• What is FT-MPI (it’s no YAMI)
• Building an MPI for Harness
• First sketch of FT-MPI
• Simple FT enabled Example
• A bigger meaner example (PSTSWM)
• Second view of FT-MPI
• Future directions
FT-MPI is not just a YAMI
• FT-MPI as in Fault Tolerant MPI
• Why make an FT version?
– Harness is going to be very robust compared to previous systems.
• No single point of failure unlike PVM
• Allow MPI users to take advantage of this high level of robustness, rather than just provide Yet Another MPI Implementation (YAMI)
Why FT-MPI
• Current MPI applications live under the MPI fault tolerant model of no faults allowed.
– This is great on an MPP as if you lose a node you generally lose a partition anyway.
– Makes reasoning about results easy. If there was a fault you might have received incomplete/incorrect values and hence have the wrong result anyway.
Why FT-MPI
• No matter how we implement FT-MPI, it must follow current MPI-1.2 (or 2) practices. I.e. we can’t really change too much about how it works (semantics) or how it looks (syntax).
• Makes coding for FT a little interesting and very dependent on the target application classes, as will be shown.
So first what does MPI do?
• All communication is via a communicator
• Communicators form an envelope in which communication can occur, and contain information such as process groups, topology information and attributes (key values)
What does MPI do?
• When an application starts up, it has a single communicator that contains all members, known as MPI_COMM_WORLD
• Other communicators containing subsets of the original communicator can be created from this communicator using collective (meaning blocking, group) operations.
What does MPI do?
• Until MPI-2 and the advent of MPI_Spawn (which is not really supported by any implementations except LAM) it was not possible to add new members to the range of addressable members in an MPI application.
• If you can’t address (name) them, you can’t communicate directly with them.
What does MPI do?
• If a member of a communicator failed for some reason, the specification mandated that, rather than continuing (which would lead to unknown results in a doomed application), the communicator is invalidated and the application halted in a clean manner.
• Put simply, if something fails, everything does.
What we would like?
• Many applications are capable, or can be made capable, of surviving such a random failure.
• Initial Goal:
– Provide a version of MPI that allows a range of alternatives to an application when a sub-part of the application has failed.
– Range of alternatives depends on how the applications themselves will handle the failure.
Building an MPI for Harness
• Before we get into the nitty-gritty of what we do when we get an error, how are we going to build something in the first place?
• Two methods:
– Take an existing implementation (à la MPICH) and re-engineer it for our own uses (the most popular method currently)
– Build an implementation from the ground up.
Building a YAMI
• Taking MPICH and building an FT version should be simple…?
– It has a layering system: the MPI API sits on top of the data structures, which sit on top of a collective communication model, which calls an ADI that provides p2p communications.
Building a YAMI
• MSS tried this with their version of MPI for the Cray T3E
– Found that the layering was not very clean, with lots of short cuts and data passed between the layers without going through the expected APIs.
– Especially true of routines that handle startup (i.e. process management)
Building a YAMI
• Building a YAMI from scratch
– Not impossible but time consuming
– Too many function calls to support (200+)
– Can implement a subset (just like compiler writers did for HPF with subset HPF)
– If we later want a *full* implementation then we need a much larger team than we currently have. (Look at how long it has taken ANL to keep up to date, and look at their currently outstanding bug list).
Building a YAMI
• A subset of operations is the best way to go
– Allows us to test a few key applications and find out just how useful and applicable an FT-MPI would be.
Building an MPI for Harness
• What does Harness give us, and what do we have to build ourselves?
• Harness will give us basic functionality of starting tasks, some basic comms between them, some attribute storage (mboxes) and some indication of errors and failures.
– I.e. mostly what PVM gives us at the moment.
– As well as the ability to plug extra bits in...
Harness Basic Structure
[Diagram, built up over six slides. Labels: Application (twice), Pipes / sockets, TCP/IP basic link, Harness run-time, HARNESS Daemon, Repository, Internal Harness meta-data storage and, on the final slide, FM-Comms-Plugin.]
So what do we need to build for FT-MPI?
• Build the run-time components that provide the user application with an MPI API
• Build an interface in this run-time component that allows for fast communications so that we at least provide something that doesn’t run like a three-legged dog.
Building the run-time system
• The system can be built as several layers.
– The top layer is the MPI API
– The next layer handles the internal MPI data structures and some of the data buffering.
– The next layer handles the collective communications.
• Breaks them down to p2p, but in a modular way so that different collective operations can be optimised differently depending on the target architecture.
– The lowest layer handles p2p communications.
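The layering above can be illustrated with a toy model. This is only a sketch under stated assumptions: `P2PLayer`, `bcast_linear`, `bcast_binomial` and the `BROADCAST` table are hypothetical names invented for illustration, and the "network" is an in-memory log rather than a real transport. It shows one collective (broadcast) broken down into p2p sends, with the algorithm selectable per target architecture.

```python
# Illustrative sketch only: a toy p2p layer and two interchangeable broadcast
# algorithms built on top of it. None of these names come from FT-MPI.
from typing import Callable, Dict, List, Tuple


class P2PLayer:
    """Lowest layer: records point-to-point sends instead of using a network."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.inbox: Dict[int, object] = {}      # rank -> last data received
        self.log: List[Tuple[int, int]] = []    # (source, destination) pairs

    def send(self, src: int, dst: int, data: object) -> None:
        self.log.append((src, dst))
        self.inbox[dst] = data


def bcast_linear(p2p: P2PLayer, root: int, data: object) -> int:
    """Root sends to every other rank in turn; returns the number of rounds."""
    for dst in range(p2p.size):
        if dst != root:
            p2p.send(root, dst, data)
    return p2p.size - 1


def bcast_binomial(p2p: P2PLayer, root: int, data: object) -> int:
    """Binomial tree: every rank that holds the data forwards it each round."""
    rounds = 0
    holders = 1                      # relative ranks 0..holders-1 hold the data
    while holders < p2p.size:
        for rel in range(holders):
            peer = rel + holders
            if peer < p2p.size:
                p2p.send((root + rel) % p2p.size, (root + peer) % p2p.size, data)
        holders *= 2
        rounds += 1
    return rounds


# The collective layer picks an algorithm by name, e.g. per architecture.
BROADCAST: Dict[str, Callable[[P2PLayer, int, object], int]] = {
    "flat": bcast_linear,
    "tree": bcast_binomial,
}
```

Both variants issue the same number of p2p sends; the tree version simply needs fewer sequential rounds, which is the kind of trade-off a modular collective layer could make per platform.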
Building the run-time system
• Do we have any of this already?
• Yes… the MPI API layer is currently in a file called MPI_Connect/src/com_layer.c
• Most of the data structures are in com_list, msg_list.c, lists.c and hash.c
– Hint, try compiling the library with the flag -DNOMPI
• Means we know what we are up against.
Building the run-time system
• The most complex part is handling the collective operations and all the variants of vector operations.
– PACX and MetaMPI do not support them all, but MagPie is getting closer.
What is MagPie ?
• A black and white bird that collects shiny objects.
– A software system by Thilo Kielmann of Vrije Universiteit, Amsterdam, NL.
– ‘Collects’ is the important word here as it is a package that supports efficient collective operations across multiple clusters.
– Most collective operations in most MPI implementations break down into a series of broadcasts which scale well across switches as long as the switches are homogeneous, which is not the case for a cluster of clusters.
– I.e. we can use MagPie to provide the collective substrate.
Building the run-time system
• Just leaves the p2p system, and the interface to the Harness daemons themselves.
• The p2p system can be built on Martin’s fast message layer.
• The Harness interface can be implemented on top of PVM 3.4 for now, until Harness itself becomes available.
Building the run-time system
• The last detail to worry about is how we are going to change the MPI semantics to report errors and how we continue after them.
– Taking note of how we know there is a failure in the first place.
First sketch of FT-MPI
• First view of FT-MPI is where the user’s application is able to handle errors and all we have to provide is:
– A simple method for indicating errors/failures
– A simple method for recovering from errors
First sketch of FT-MPI
• 3 initial models of failure (another later on)
– (1) There is a failure and the application is shut down (MPI default; gains us little other than meeting the standard).
– (2) Failure only affects members of a communicator which communicate with the failed party. I.e. p2p comms still work within the communicator.
– (3) That communicator is invalidated completely.
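The three failure models can be compared with a small sketch. The enum and the `p2p_allowed` helper are hypothetical names chosen for illustration, not part of any FT-MPI interface; the sketch only encodes which p2p traffic would still be permitted under each model as described above.

```python
# Illustrative sketch only: names are invented, not an FT-MPI API.
from enum import Enum
from typing import Set


class FailureModel(Enum):
    ABORT = 1        # (1) the whole application is shut down
    PARTIAL = 2      # (2) only traffic involving the failed party is affected
    INVALIDATE = 3   # (3) the communicator is invalidated completely


def p2p_allowed(model: FailureModel, failed: Set[int], src: int, dst: int) -> bool:
    """Return True if a send from src to dst may still proceed."""
    if not failed:
        return True                  # no failure yet: every model behaves alike
    if model is FailureModel.PARTIAL:
        return src not in failed and dst not in failed
    return False                     # ABORT and INVALIDATE stop all traffic
```

For example, with rank 3 failed, model (2) still lets ranks 0 and 1 talk to each other, while models (1) and (3) do not.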
First sketch of FT-MPI
• How do we detect failure?
– 4 ways…
(1) We are told it’s going to happen by a member of a particular application (i.e. “I have NaNs everywhere... Panic”).
(2) A point-to-point communication fails.
(3) The p2p system tells us that someone failed (error propagation within a communicator at the run-time system layer), much like (1).
(4) Harness tells us via a message from the daemon.
First sketch of FT-MPI
• How do we tell the user application?
• Return it an MPI_ERR_OTHER
• Force it to check an additional MPI error call to find where the failure occurred.
– Via the cached attribute key values
• FT_MPI_PROC_FAILED which is a vector of length MPI_COMM_SIZE of the original communicator.
• How do we recover if we have just invalidated the communicator the application will use to recover on?
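The error-reporting path described above (a generic error code first, then a lookup of the cached failure vector) might look roughly like the following toy model. `ToyComm` and `failed_ranks` are hypothetical, and the numeric value given to `MPI_ERR_OTHER` here is a placeholder, since real MPI implementations choose their own values.

```python
# Illustrative sketch only: a toy communicator, not an MPI binding.
from typing import Dict, List

MPI_SUCCESS = 0
MPI_ERR_OTHER = 15   # placeholder value; real values are implementation-defined


class ToyComm:
    def __init__(self, size: int) -> None:
        self.size = size
        # One flag per rank of the original communicator.
        self.attrs: Dict[str, List[bool]] = {"FT_MPI_PROC_FAILED": [False] * size}

    def mark_failed(self, rank: int) -> None:
        self.attrs["FT_MPI_PROC_FAILED"][rank] = True

    def send(self, dst: int) -> int:
        """Return a generic error as soon as any member is known to have failed."""
        if any(self.attrs["FT_MPI_PROC_FAILED"]):
            return MPI_ERR_OTHER
        return MPI_SUCCESS

    def get_attr(self, key: str) -> List[bool]:
        return self.attrs[key]


def failed_ranks(comm: ToyComm) -> List[int]:
    """Second step: ask the cached attribute where the failure occurred."""
    flags = comm.get_attr("FT_MPI_PROC_FAILED")
    return [rank for rank, dead in enumerate(flags) if dead]
```

The application sees only `MPI_ERR_OTHER` from the failing call and has to make the second query to learn which ranks are gone.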
First sketch of FT-MPI
• Some functions are allowed to be used in a partial form to facilitate recovery.
– I.e. MPI_Barrier ( ) can still be used to sync processes, but will only wait for the surviving processes…
– The formation of a new communicator will also be allowed to work with a broken communicator.
– MPI_Finalize does not need a communicator specified.
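A barrier in "partial form" can be sketched as one that completes once every surviving rank has arrived. The `PartialBarrier` class is a hypothetical single-process model used only to show the completion rule; it does no real synchronisation.

```python
# Illustrative sketch only: models the completion rule, not real blocking.
from typing import Set


class PartialBarrier:
    def __init__(self, size: int) -> None:
        self.size = size
        self.failed: Set[int] = set()
        self.arrived: Set[int] = set()

    def mark_failed(self, rank: int) -> None:
        self.failed.add(rank)

    def is_complete(self) -> bool:
        """True once all ranks that are still alive have reached the barrier."""
        survivors = set(range(self.size)) - self.failed
        return survivors <= self.arrived

    def arrive(self, rank: int) -> bool:
        """Record an arrival and report whether the barrier can now release."""
        if rank in self.failed:
            raise ValueError(f"rank {rank} has failed and cannot arrive")
        self.arrived.add(rank)
        return self.is_complete()
```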
First sketch of FT-MPI
• Forming a new communicator that the application can use to continue is the important part.
• Two functions can be modified to be used:
– MPI_COMM_CREATE (comm, group, newcomm )
– MPI_COMM_SPLIT (comm, colour, key, newcomm )
First sketch of FT-MPI
• MPI_COMM_CREATE ( )
– Called with the group set to a new constant
• FT_MPI_LIVING (!)
– Creates a new communicator that contains all the processes that continue to survive.
• Special case could be to allow MPI_COMM_WORLD to be specified as both input and output communicator.
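A minimal sketch of the proposed MPI_COMM_CREATE behaviour, assuming a communicator is modelled as a list of process identifiers in rank order. The `comm_create` function is a hypothetical stand-in, and `FT_MPI_LIVING` is represented by a sentinel object rather than a real MPI group handle.

```python
# Illustrative sketch only: a list-based model of communicator membership.
from typing import Iterable, List, Set

FT_MPI_LIVING = object()   # sentinel standing in for the proposed group constant


def comm_create(comm: List[str], failed: Set[str], group: object) -> List[str]:
    """Return the members of the new communicator, in new-rank order."""
    if group is FT_MPI_LIVING:
        # Special group: every process that continues to survive.
        return [proc for proc in comm if proc not in failed]
    if failed:
        # Assumption: an ordinary group is refused on a broken communicator.
        raise RuntimeError("communicator is broken; use FT_MPI_LIVING")
    wanted = set(group)  # type: ignore[arg-type]
    return [proc for proc in comm if proc in wanted]
```

With members `["a", "b", "c", "d"]` and `"c"` failed, the new communicator holds `a`, `b` and `d`, so `d` moves from rank 3 to rank 2.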
First sketch of FT-MPI
• MPI_COMM_SPLIT ( )
– Called with the colour set to a new constant
• FT_MPI_NOT_DEAD_YET (!)
– key can be used to control the new rank of processes within the new communicator.
– Again creates a new communicator that contains all the processes that continue to survive.
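How the key could control new ranks is sketched below. The ordering rule (sort by key, ties broken by old rank) follows standard MPI_Comm_split semantics; the function name `comm_split_survivors` and the dictionary-based interface are assumptions made for illustration.

```python
# Illustrative sketch only: computes old-rank -> new-rank for the survivors.
from typing import Dict, Set


def comm_split_survivors(size: int, failed: Set[int], keys: Dict[int, int]) -> Dict[int, int]:
    """Model a split where every surviving rank passes the same special colour.

    keys maps each surviving old rank to the key it supplies.
    """
    survivors = [rank for rank in range(size) if rank not in failed]
    # Standard split ordering: by key, then by rank in the old communicator.
    ordered = sorted(survivors, key=lambda rank: (keys[rank], rank))
    return {old: new for new, old in enumerate(ordered)}
```

With equal keys the survivors simply close ranks in their old order; with distinct keys the application can, for instance, reverse the order or pin a particular process to rank 0.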
Simple FT enabled Example
• Simple application at first
– Bag of tasks, where the tasks know how to handle a failure.
• Server just divides up the next set of data to be calculated between the survivors.
• Clients nominate a new server if they have enough state.
– (Can get the state by using ALL2ALL communications for results).
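The server side of the bag-of-tasks example can be sketched as follows. Both functions are hypothetical and use round-robin assignment purely as an assumption; the point is only that unfinished work held by a failed client goes back to the survivors.

```python
# Illustrative sketch only: round-robin division is an assumed policy.
from typing import Dict, List, Set


def divide_work(tasks: List[int], workers: List[int]) -> Dict[int, List[int]]:
    """Deal the tasks out to the workers in round-robin order."""
    assignment: Dict[int, List[int]] = {w: [] for w in workers}
    for i, task in enumerate(tasks):
        assignment[workers[i % len(workers)]].append(task)
    return assignment


def reassign_after_failure(
    assignment: Dict[int, List[int]], completed: Set[int], failed: Set[int]
) -> Dict[int, List[int]]:
    """Give the unfinished tasks of failed workers to the surviving workers."""
    survivors = sorted(w for w in assignment if w not in failed)
    if not survivors:
        raise RuntimeError("no surviving workers")
    lost = [
        task
        for w in sorted(failed)
        if w in assignment
        for task in assignment[w]
        if task not in completed
    ]
    # Survivors keep their own unfinished tasks and take on the lost ones.
    new_assignment = {
        w: [t for t in assignment[w] if t not in completed] for w in survivors
    }
    for i, task in enumerate(lost):
        new_assignment[survivors[i % len(survivors)]].append(task)
    return new_assignment
```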
A bigger meaner example (PSTSWM)
• Parallel Spectral Transform Shallow Water Model – 2D grid calculation
• 3D in actual computation, with 1 axis performing FFTs, the second global reductions and the third layering sequentially upon each logical processor.
• Calculation cannot support reduced grids like those supported by the Parallel Community Climate Model (PCCM), a future target application for FT-MPI.
– I.e. if we lose a logical grid point (node) we must replace it!
A bigger meaner example (PSTSWM)
• First Sketch ideas for FT-MPI are fine for applications that can handle a failure and have functional calling sequences that are not too deep…
– I.e. MPI API calls can be buried deep within routines and any errors may take quite a while to bubble to the surface where the application can take effective action to handle them and recover.
A bigger meaner example (PSTSWM)
• This application proceeds in a number of well defined stages and can only handle failure by restarting from a known set of data.
– I.e. user checkpoints have to be taken, and must still be reachable.
• User requirement is for the application to be started and run to completion with the system automatically handling errors without manual intervention.
A bigger meaner example (PSTSWM)
• Invalidating only the failed communicators, as in the first sketch, is not enough for this application.
– PSTSWM creates communicators for each row and column of the 2-D grid.
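To see why per-communicator invalidation is limited here, the row and column communicators of a 2-D grid can be enumerated. This sketch assumes a row-major rank layout and invents the names `grid_communicators` and `broken_communicators`; PSTSWM's actual layout is not given in these slides.

```python
# Illustrative sketch only: row-major rank layout is an assumption.
from typing import Dict, List


def grid_communicators(rows: int, cols: int) -> Dict[str, List[int]]:
    """Build one communicator per row and one per column of the grid."""
    comms: Dict[str, List[int]] = {}
    for r in range(rows):
        comms[f"row{r}"] = [r * cols + c for c in range(cols)]
    for c in range(cols):
        comms[f"col{c}"] = [r * cols + c for r in range(rows)]
    return comms


def broken_communicators(comms: Dict[str, List[int]], failed_rank: int) -> List[str]:
    """Names of the communicators that contain the failed rank."""
    return sorted(name for name, members in comms.items() if failed_rank in members)
```

On a 4 x 4 grid a single failed node breaks exactly one row and one column communicator, leaving six others that appear healthy even though the grid as a whole can no longer make progress.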
A bigger meaner example (PSTSWM)
[Figure sequence on the 2-D processor grid. Labels: Failed Node; Failed Communicator (twice); “This communication works”; “This is unknown (butterfly p2p)”; “This is unknown as the previous failure on the axis might not have been detected...”]
• What is really wanted is for four things to happen….
– Firstly, ALL communicators are marked as broken… even if some are recoverable.
• The underlying system propagates error messages to all communicators, not just the ones directly affected by the failure.
A bigger meaner example (PSTSWM)
– Secondly, all MPI operations become NOPs where possible so that the application can bubble the error to the top level as fast as possible.
A bigger meaner example (PSTSWM)
• Thirdly, the run-time system spawns a replacement node on behalf of the application using a predetermined set of metrics.
• Finally, the system allows this new process to be combined with the surviving communicators at MPI_Comm_create time.
– Position (rank) of the new process is not so important in this application as restart data has to be redistributed anyway, but may be important for other applications.
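The four steps can be walked through with a toy run-time model. Everything here is hypothetical: `ToyRuntime` is not a Harness or FT-MPI interface, "spawning" just allocates a new process identifier, and placing the replacement in the failed process's old position is an assumption (the slide notes that position does not matter much for PSTSWM).

```python
# Illustrative sketch only: a single-process model of the recovery sequence.
from typing import Callable, Dict, List, Set


class ToyRuntime:
    def __init__(self, comms: Dict[str, List[int]]) -> None:
        self.comms = {name: list(members) for name, members in comms.items()}
        self.broken: Set[str] = set()
        self.replacements: Dict[int, int] = {}   # failed id -> replacement id
        self._next_id = 1 + max(p for m in self.comms.values() for p in m)

    def on_failure(self, failed: int) -> None:
        # Step 1: mark ALL communicators as broken, even recoverable ones.
        self.broken = set(self.comms)
        # Step 3: "spawn" a replacement on behalf of the application.
        self.replacements[failed] = self._next_id
        self._next_id += 1

    def operation(self, comm_name: str, fn: Callable[[], object]) -> object:
        # Step 2: operations on broken communicators become NOPs that
        # return an error marker so the caller can unwind quickly.
        if comm_name in self.broken:
            return "ERROR"
        return fn()

    def reform(self) -> None:
        # Step 4: combine the new process with the survivors.
        for name, members in self.comms.items():
            self.comms[name] = [self.replacements.get(p, p) for p in members]
        self.replacements.clear()
        self.broken.clear()
```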
A bigger meaner example (PSTSWM)
• For this to occur, we need a means of identifying if a process has been spawned for the purpose of recovery (by either the run-time system or an application itself).
– MPI_Comm_split (com, ft_mpi_still_alive,..) vs
– MPI_Comm_split (ft_mpi_external_com, ft_mpi_new_spawned,..)
– PSTSWM doesn’t care which task died and frankly doesn’t want to know!
• Just wants to continue calculating..
A bigger meaner example (PSTSWM)
• How are we going to build an FT version of this application?
– Patrick Worley (ORNL) is currently adding (user) checkpoint and restart capability into the application, as well as “on error, get to the top layer” functionality, so that a restart can be performed.
• FT-MPI will need to provide an MPI-2 spawn function as well as baseline MPI-1 calls.
– Initially the spawning will be performed by the PSTSWM code, and later by the run-time on its behalf.
A bigger meaner example (PSTSWM)
[Figure sequence showing recovery on the 2-D grid, in order: Failed Node; Error detected, comms invalidated; New task spawned; Application reforming communicators; Application back on-line.]
A bigger meaner example (PSTSWM)
• Hope to demo FT-PSTSWM at SC99
• Performance will not be great, as it is sensitive to latency and very sensitive to bandwidth.
• But it is probably one of the most difficult classes of applications to support.
– PCCM is the next big application on the list as this model can be reconfigured to handle different grid sizes dynamically.
Future Directions
• When we move from teraflop systems to petaflop machines we will have a mean time between failures (MTBF) that is shorter than the expected length of execution runs.
• Solutions like FT-MPI might help application developers better cope with this situation, without having to checkpoint their applications to (performance) death.
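The MTBF argument can be made concrete with a back-of-the-envelope model. This sketch assumes independent node failures with exponentially distributed lifetimes, so the system MTBF is the node MTBF divided by the node count; the function names and any numbers plugged in are hypothetical, not measurements from a real machine.

```python
# Illustrative sketch only: assumes independent, exponentially distributed failures.
import math


def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """MTBF of the whole machine when any single node failure counts."""
    return node_mtbf_hours / nodes


def prob_run_survives(run_hours: float, node_mtbf_hours: float, nodes: int) -> float:
    """Probability that a run of the given length sees no failure at all."""
    return math.exp(-run_hours / system_mtbf_hours(node_mtbf_hours, nodes))
```

As a hypothetical example, nodes with a 10,000 hour MTBF in a 1,000 node machine give a system MTBF of 10 hours, so a 24 hour run would complete without any failure less than one time in ten under this model.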
For now, what next?
• Implement a simple MPI implementation on top of PVM 3.4 using as much existing software as possible.
• Support functions needed by our two exemplars.
• Make sure the lower level systems will use the high performance comms layer efficiently when it becomes available.
• Fool some students into working for us (5 years?), for when we want to support the other 200+ functions in MPI.