DoD Contracts

Shared by: keara
posted: 11/10/2009
FT-MPI
Graham E Fagg
Making of the holy grail, or a YAMI that is FT

FT-MPI
• What is FT-MPI (it’s no YAMI)
• Building an MPI for Harness
• First sketch of FT-MPI
• Simple FT enabled example
• A bigger meaner example (PSTSWM)
• Second view of FT-MPI
• Future directions

FT-MPI is not just a YAMI
• FT-MPI as in Fault Tolerant MPI
• Why make an FT version?
  – Harness is going to be very robust compared to previous systems.
    • No single point of failure, unlike PVM
    • Allow MPI users to take advantage of this high level of robustness, rather than just provide Yet Another MPI Implementation (YAMI)

Why FT-MPI
• Current MPI applications live under the MPI fault-tolerance model of "no faults allowed".
  – This is great on an MPP: if you lose a node you generally lose a partition anyway.
  – Makes reasoning about results easy. If there was a fault you might have received incomplete/incorrect values and hence have the wrong result anyway.

Why FT-MPI
• No matter how we implement FT-MPI, it must follow current MPI-1.2 (or 2) practices, i.e. we can’t really change too much about how it works (semantics) or how it looks (syntax).
• Makes coding for FT a little interesting and very dependent on the target application classes, as will be shown.

So first what does MPI do?
• All communication is via a communicator.
• Communicators form an envelope in which communication can occur, and contain information such as process groups, topology information and attributes (key values).

What does MPI do?
• When an application starts up, it has a single communicator that contains all members, known as MPI_COMM_WORLD.
• Other communicators containing subsets of the original communicator can be created from this communicator using collective (meaning blocking, group) operations.

What does MPI do?
• Until MPI-2 and the advent of MPI_Spawn (which is not really supported by any implementations except LAM) it was not possible to add new members to the range of addressable members in an MPI application.
• If you can’t address (name) them, you can’t communicate directly with them.

What does MPI do?
• If a member of a communicator failed for some reason, the specification mandated that, rather than continuing (which would lead to unknown results in a doomed application), the communicator is invalidated and the application halted in a clean manner.
• Put simply: if something fails, everything does.

What would we like?
• Many applications are capable, or can be made capable, of surviving such a random failure.
• Initial goal:
  – Provide a version of MPI that allows a range of alternatives to an application when a sub-part of the application has failed.
  – The range of alternatives depends on how the applications themselves will handle the failure.

Building an MPI for Harness
• Before we get into the nitty-gritty of what we do when we get an error, how are we going to build something in the first place?
• Two methods:
  – Take an existing implementation (à la MPICH) and re-engineer it for our own uses (the most popular method currently)
  – Build an implementation from the ground up.

Building a YAMI
• Taking MPICH and building an FT version should be simple…?
  – It has a layering system: the MPI API sits on top of the data structures, which sit on top of a collective communication model, which calls an ADI that provides p2p communications.

Building a YAMI
• MSS tried this with their version of MPI for the Cray T3E
  – Found that the layering was not very clean: lots of shortcuts, and data passed between the layers without going through the expected APIs.
  – Especially true of routines that handle startup (i.e. process management)

Building a YAMI
• Building a YAMI from scratch
  – Not impossible but time consuming
  – Too many function calls to support (200+)
  – Can implement a subset (just like compiler writers did for HPF with subset HPF)
  – If we later want a *full* implementation then we need a much larger team than we currently have.
    (Look at how long it has taken ANL to keep up to date, and look at their currently outstanding bug list.)

Building a YAMI
• A subset of operations is the best way to go
  – Allows us to test a few key applications and find out just how useful and applicable an FT-MPI would be.

Building an MPI for Harness
• What does Harness give us, and what do we have to build ourselves?
• Harness will give us basic functionality for starting tasks, some basic comms between them, some attribute storage (mboxes) and some indication of errors and failures.
  – I.e. mostly what PVM gives us at the moment.
  – As well as the ability to plug extra bits in...

Harness Basic Structure
[Diagram sequence, built up over several slides. Labels: Application; Pipes / sockets; TCP/IP basic link; Harness run-time; TCP/IP; HARNESS Daemon; Repository; Internal Harness meta-data storage; FM-Comms-Plugin]

So what do we need to build for FT-MPI?
• Build the run-time components that provide the user application with an MPI API
• Build an interface in this run-time component that allows for fast communications, so that we at least provide something that doesn’t run like a three-legged dog.

Building the run-time system
• The system can be built as several layers.
  – The top layer is the MPI API
  – The next layer handles the internal MPI data structures and some of the data buffering.
  – The next layer handles the collective communications.
    • Breaks them down to p2p, but in a modular way so that different collective operations can be optimised differently depending on the target architecture.
  – The lowest layer handles p2p communications.

Building the run-time system
• Do we have any of this already?
• Yes… the MPI API layer is currently in a file called MPI_Connect/src/com_layer.c
• Most of the data structures are in com_list, msg_list.c, lists.c and hash.c
  – Hint: try compiling the library with the flag -DNOMPI
• Means we know what we are up against.

Building the run-time system
• The most complex part is handling the collective operations and all the variants of vector operations.
  – PACX and MetaMPI do not support them all, but MagPie is getting closer.

What is MagPie?
• A black and white bird that collects shiny objects.
  – A software system by Thilo Kielmann of Vrije Universiteit, Amsterdam, NL.
  – ‘Collects’ is the important word here, as it is a package that supports efficient collective operations across multiple clusters.
  – Most collective operations in most MPI implementations break down into a series of broadcasts, which scale well across switches as long as the switches are homogeneous. That is not the case for a cluster of clusters.
  – I.e. we can use MagPie to provide the collective substrate.

Building the run-time system
• Just leaves the p2p system, and the interface to the Harness daemons themselves.
• The p2p system can be built on Martin’s fast message layer.
• The Harness interface can be implemented on top of PVM 3.4 for now, until Harness itself becomes available.

Building the run-time system
• The last detail to worry about is how we are going to change the MPI semantics to report errors, and how we continue after them.
  – Taking note of how we know there is a failure in the first place.
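
As a rough illustration of the collective layer described above, which breaks collectives down into p2p operations, here is a minimal sketch of one possible decomposition of a broadcast. The binomial-tree schedule and the function name `binomial_bcast_schedule` are assumptions chosen for illustration; the slides do not say which algorithm FT-MPI or MagPie would use.

```python
def binomial_bcast_schedule(size, root=0):
    """Return the p2p sends that implement a broadcast as a binomial tree.

    The result is a list of rounds; each round is a list of (src, dst)
    rank pairs that could be sent in parallel. Illustrative only.
    """
    rounds = []
    have = 1  # number of ranks (relative to root) that already hold the data
    while have < size:
        sends = []
        for rel_src in range(have):
            rel_dst = rel_src + have
            if rel_dst < size:
                # Translate root-relative ranks back to real ranks.
                sends.append(((rel_src + root) % size, (rel_dst + root) % size))
        rounds.append(sends)
        have *= 2  # every holder has now passed the data to one new rank
    return rounds


if __name__ == "__main__":
    for i, sends in enumerate(binomial_bcast_schedule(6, root=2)):
        print("round", i, sends)
```

Keeping the schedule separate from the p2p layer that would execute it is what lets a different schedule be swapped in per target architecture.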
First sketch of FT-MPI
• The first view of FT-MPI is where the user’s application is able to handle errors and all we have to provide is:
  – A simple method for indicating errors/failures
  – A simple method for recovering from errors

First sketch of FT-MPI
• 3 initial models of failure (another later on)
  – (1) There is a failure and the application is shut down (MPI default; gains us little other than meeting the standard).
  – (2) Failure only affects members of a communicator which communicate with the failed party, i.e. p2p comms still work within the communicator.
  – (3) That communicator is invalidated completely.

First sketch of FT-MPI
• How do we detect failure?
  – 4 ways…
    (1) We are told it’s going to happen by a member of a particular application (i.e. "I have NaNs everywhere.. Panic").
    (2) A point-to-point communication fails.
    (3) The p2p system tells us that someone failed (error propagation within a communicator at the run-time system layer; much like (1)).
    (4) Harness tells us via a message from the daemon.

First sketch of FT-MPI
• How do we tell the user application?
• Return it an MPI_ERR_OTHER
• Force it to check an additional MPI error call to find where the failure occurred.
  – Via the cached attribute key values
    • FT_MPI_PROC_FAILED, which is a vector of length MPI_COMM_SIZE of the original communicator.
• How do we recover if we have just invalidated the communicator the application will use to recover on?

First sketch of FT-MPI
• Some functions are allowed to be used in a partial form to facilitate recovery.
  – I.e. MPI_Barrier ( ) can still be used to sync processes, but will only wait for the surviving processes…
  – The formation of a new communicator will also be allowed to work with a broken communicator.
  – MPI_Finalize does not need a communicator specified.

First sketch of FT-MPI
• Forming a new communicator that the application can use to continue is the important part.
• Two functions can be modified for this:
  – MPI_COMM_CREATE (comm, group, newcomm )
  – MPI_COMM_SPLIT (comm, colour, key, newcomm )

First sketch of FT-MPI
• MPI_COMM_CREATE ( )
  – Called with the group set to a new constant
    • FT_MPI_LIVING (!)
  – Creates a new communicator that contains all the processes that continue to survive.
    • A special case could be to allow MPI_COMM_WORLD to be specified as both input and output communicator.

First sketch of FT-MPI
• MPI_COMM_SPLIT ( )
  – Called with the colour set to a new constant
    • FT_MPI_NOT_DEAD_YET (!)
  – key can be used to control the new rank of processes within the new communicator.
  – Again creates a new communicator that contains all the processes that continue to survive.

Simple FT enabled Example
• Simple application at first
  – Bag of tasks, where the tasks know how to handle a failure.
    • Server just divides up the next set of data to be calculated between the survivors.
    • Clients nominate a new server if they have enough state.
      – (Can get the state by using ALL2ALL communications for results.)

A bigger meaner example (PSTSWM)
• Parallel Spectral Transform Shallow Water Model
  – 2D grid calculation
    • 3D in actual computation, with one axis performing FFTs, the second global reductions, and the third layering sequentially upon each logical processor.
    • Calculation cannot support reduced grids like those supported by the Parallel Community Climate Model (PCCM), a future target application for FT-MPI.
  – I.e. if we lose a logical grid point (node) we must replace it!

A bigger meaner example (PSTSWM)
• First-sketch ideas for FT-MPI are fine for applications that can handle a failure and have function calling sequences that are not too deep…
  – I.e. MPI API calls can be buried deep within routines, and any errors may take quite a while to bubble to the surface, where the application can take effective action to handle them and recover.
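
Going back to the MPI_COMM_SPLIT recovery path sketched earlier, the following toy model shows how a survivors-only communicator could assign new ranks. It relies on the standard MPI_Comm_split rule that ranks in the new communicator are ordered by key, with ties broken by rank in the old communicator. The function `split_survivors` and its arguments are hypothetical and only simulate the bookkeeping; nothing here is the FT-MPI implementation.

```python
def split_survivors(size, failed, keys=None):
    """Simulate a split with colour FT_MPI_NOT_DEAD_YET.

    size   -- number of ranks in the old (broken) communicator
    failed -- set of old ranks known to have failed
    keys   -- optional {old_rank: key}; missing ranks default to key 0

    Returns {old_rank: new_rank} for the surviving processes, ordered by
    (key, old_rank) as MPI_Comm_split orders ranks.
    """
    if keys is None:
        keys = {}
    survivors = [r for r in range(size) if r not in failed]
    ordered = sorted(survivors, key=lambda r: (keys.get(r, 0), r))
    return {old: new for new, old in enumerate(ordered)}


if __name__ == "__main__":
    # Rank 2 has died; with equal keys the survivors keep their relative order.
    print(split_survivors(5, {2}))
    # A lower key promotes old rank 4 to new rank 0 (e.g. a newly nominated server).
    print(split_survivors(5, {2}, keys={4: -1}))
```

In the bag-of-tasks example, the server could then divide the next set of data over the new ranks without caring which process died.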
A bigger meaner example (PSTSWM)
• This application proceeds in a number of well-defined stages and can only handle failure by restarting from a known set of data.
  – I.e. user checkpoints have to be taken, and must still be reachable.
• User requirement is for the application to be started and run to completion, with the system automatically handling errors without manual intervention.

A bigger meaner example (PSTSWM)
• Invalidating only the failed communicators, as in the first sketch, is not enough for this application.
  – PSTSWM creates communicators for each row and column of the 2-D grid.

A bigger meaner example (PSTSWM)
[Diagram sequence on the 2-D grid. Labels: Failed Node; Failed Communicator (twice); This is unknown (butterfly p2p); This communication works; This is unknown as the previous failure on the axis might not have been detected...]
• What is really wanted is for four things to happen…
  – Firstly, ALL communicators are marked as broken… even if some are recoverable.
    • The underlying system propagates error messages to all communicators, not just the ones directly affected by the failure.

A bigger meaner example (PSTSWM)
  – Secondly, all MPI operations become NOPs where possible, so that the application can bubble the error to the top level as fast as possible.

A bigger meaner example (PSTSWM)
  – Thirdly, the run-time system spawns a replacement node on behalf of the application using a predetermined set of metrics.
  – Finally, the system allows this new process to be combined with the surviving communicators at MPI_Comm_create time.
    • Position (rank) of the new process is not so important in this application, as restart data has to be redistributed anyway, but may be important for other applications.
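
The first two of the four steps above can be illustrated with a small toy model: one communicator per row and per column of the grid, a failure that marks ALL of them broken, and sends that turn into NOPs returning an error. The class `GridRuntime`, its methods, and the string return values are hypothetical stand-ins, not FT-MPI or Harness APIs.

```python
class GridRuntime:
    """Toy stand-in for the run-time system (hypothetical, not FT-MPI code)."""

    def __init__(self, rows, cols):
        # One communicator per row and per column, as PSTSWM creates.
        self.comms = {}
        for r in range(rows):
            self.comms[f"row{r}"] = {r * cols + c for c in range(cols)}
        for c in range(cols):
            self.comms[f"col{c}"] = {r * cols + c for r in range(rows)}
        self.broken = set()   # names of communicators marked as broken
        self.delivered = []   # messages that were actually sent

    def directly_affected(self, rank):
        """Communicators that contain the failed rank."""
        return sorted(name for name, members in self.comms.items()
                      if rank in members)

    def report_failure(self, rank):
        # The first sketch would break only directly_affected(rank).
        # The behaviour wanted for PSTSWM marks every communicator broken.
        self.broken = set(self.comms)

    def send(self, comm, src, dst, payload):
        if comm in self.broken:
            return "MPI_ERR_OTHER"   # NOP: nothing is delivered
        self.delivered.append((comm, src, dst, payload))
        return "MPI_SUCCESS"


if __name__ == "__main__":
    rt = GridRuntime(rows=3, cols=4)
    print(rt.directly_affected(5))        # row and column of rank 5
    print(rt.send("row0", 0, 1, "data"))  # works before the failure
    rt.report_failure(5)
    print(rt.send("row0", 0, 1, "data"))  # now a NOP that returns an error
```

Because even unaffected communicators fail fast, an MPI call buried deep in the application returns immediately and the error reaches the top level without waiting on a butterfly exchange that may never complete.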
A bigger meaner example (PSTSWM)
• For this to occur, we need a means of identifying whether a process has been spawned for the purpose of recovery (by either the run-time system or an application itself).
  – MPI_Comm_split (com, ft_mpi_still_alive,..) vs
  – MPI_Comm_split (ft_mpi_external_com, ft_mpi_new_spawned,..)
  – PSTSWM doesn’t care which task died and frankly doesn’t want to know!
    • Just wants to continue calculating..

A bigger meaner example (PSTSWM)
• How are we going to build an FT version of this application?
  – Patrick Worley (ORNL) is currently adding (user) checkpoint and restart capability to the application, as well as "on error, get to the top layer" functionality, so that a restart can be performed.
• FT-MPI will need to provide an MPI-2 spawn function as well as baseline MPI-1 calls.
  – Initially the spawning will be performed by the PSTSWM code, and later by the run-time on its behalf.

A bigger meaner example (PSTSWM)
[Diagram sequence on the 2-D grid. Labels: Failed Node; Error detected, comms invalidated; New task spawned; Application reforming communicators; Application back on-line.]

A bigger meaner example (PSTSWM)
• Hope to demo FT-PSTSWM at SC99
• Performance will not be great, as it is sensitive to latency and very sensitive to bandwidth.
• But it is probably one of the most difficult classes of applications to support.
  – PCCM is the next big application on the list, as this model can be reconfigured to handle different grid sizes dynamically.
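
The recovery sequence described for PSTSWM (error detected, error bubbles to the top layer, replacement spawned, communicators reformed, restart from the last user checkpoint) can be sketched as a top-level driver loop. This is a minimal sketch under stated assumptions: `CommFailure`, `run_with_restart`, and the placement of the spawn and rebuild steps are hypothetical, and the stages are plain Python callables rather than MPI code.

```python
class CommFailure(Exception):
    """Hypothetical signal that an MPI call reported a failed process."""


def run_with_restart(stages, max_restarts=3):
    """Run well-defined stages in order, restarting a stage after a failure.

    `checkpoint` is the index of the first stage not yet completed, standing
    in for the user checkpoint that must still be reachable after a failure.
    """
    checkpoint = 0
    restarts = 0
    log = []
    while checkpoint < len(stages):
        try:
            stages[checkpoint]()
            log.append(("done", checkpoint))
            checkpoint += 1              # take a user checkpoint
        except CommFailure:
            restarts += 1
            if restarts > max_restarts:
                raise
            # This is where a replacement task would be spawned and the
            # communicators reformed (assumed, not shown here).
            log.append(("recover", checkpoint))
    return log


if __name__ == "__main__":
    state = {"failed_once": False}

    def ok_stage():
        pass

    def flaky_stage():
        # Fails the first time it runs, succeeds after "recovery".
        if not state["failed_once"]:
            state["failed_once"] = True
            raise CommFailure()

    print(run_with_restart([ok_stage, flaky_stage, ok_stage]))
```

The driver never asks which task died, which matches the point that PSTSWM just wants to continue calculating.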
Future Directions
• When we move from teraflop systems to petaflop machines we will have a mean time between failures (MTBF) that is less than the expected length of execution runs.
• Solutions like FT-MPI might help application developers better cope with this situation, without having to checkpoint their applications to (performance) death.

For now, what next?
• Implement a simple MPI implementation on top of PVM 3.4 using as much existing software as possible.
• Support the functions needed by our two exemplars.
• Make sure the lower-level systems will use the high-performance comms layer efficiently when it becomes available.
• Fool some students into working for us (5 years?), for when we want to support the other 200+ functions in MPI.
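
To make the MTBF argument under Future Directions concrete, here is a back-of-the-envelope sketch. It assumes independent node failures at a constant rate, in which case the MTBF of the whole machine is the per-node MTBF divided by the node count. The numbers in the usage example are made up for illustration and do not come from the slides.

```python
def system_mtbf_hours(node_mtbf_hours, n_nodes):
    """MTBF of the whole machine, assuming independent constant-rate failures."""
    return node_mtbf_hours / n_nodes


def expected_failures(run_hours, node_mtbf_hours, n_nodes):
    """Expected number of node failures during one run of the given length."""
    return run_hours / system_mtbf_hours(node_mtbf_hours, n_nodes)


if __name__ == "__main__":
    # Hypothetical machine: 10,000 nodes, each with a 50,000 hour MTBF.
    print(system_mtbf_hours(50_000.0, 10_000))        # machine-level MTBF in hours
    print(expected_failures(24.0, 50_000.0, 10_000))  # failures in a 24 hour run
```

Under these assumed numbers a day-long run would see several failures, which is the regime where aborting on the first fault stops being workable.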
