
                          MPI Verification


    Ganesh Gopalakrishnan and Robert M. Kirby
                      Students
     Yu Yang, Sarvani Vakkalanka, Guodong Li,
Subodh Sharma, Anh Vo, Michael DeLisi, Geof Sawaya
     (http://www.cs.utah.edu/formal_verification)

               School of Computing
                University of Utah

                  Supported by:
             Microsoft HPC Institutes
                NSF CNS 0509379



                                                     1
            “MPI Verification”

                     or


         How to exhaustively verify
               MPI programs
     without the pain of model building,
while considering only “relevant interleavings”



                                                2
Computing is at an inflection point




                             (photo courtesy of Intel)




                                                         3
Our work pertains to these:


     MPI programs



     MPI libraries



     Shared Memory Threads based on Locks




                                             4
Name of the Game: Progress Through Precision



1.   Precision in Understanding


2.   Precision in Modeling


3.   Precision in Analysis


4.   Doing Modeling and Analysis with Low Cost



                                                 5
1. Need for Precision in Understanding:

The “crooked barrier” quiz

     P0                 P1                P2
     ---                ---               ---

     MPI_Isend ( P2 )   MPI_Barrier       MPI_Irecv ( ANY )

     MPI_Barrier        MPI_Isend( P2 )   MPI_Barrier



Will P1’s Send match P2’s Receive?



                                                              6
Need for Precision in Understanding:

The “crooked barrier” quiz

      P0                 P1                P2
      ---                ---               ---

      MPI_Isend ( P2 )   MPI_Barrier       MPI_Irecv ( ANY )

      MPI_Barrier        MPI_Isend( P2 )   MPI_Barrier



It will! Here is the animation:



                                                               7
(Slides 8–12: animation frames, identical in this text export, stepping
through how P1’s MPI_Isend(P2) – issued after P1’s barrier – can still be
the one matched by P2’s MPI_Irecv(ANY).)
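For concreteness, here is a minimal C sketch of the quiz program (our own
illustration, not code from the deck; 3 ranks, buffers and tags assumed).
The key MPI fact: MPI_Isend and MPI_Irecv only have to be *initiated* in
program order – their matches may complete after the barrier, so P1’s
post-barrier send competes with P0’s pre-barrier send for P2’s wildcard
receive.

    /* Minimal sketch of the "crooked barrier" quiz; run with 3 ranks. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, in = 0, out = 42;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                        /* P0 */
            MPI_Isend(&out, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, &req);
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {                 /* P1 */
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Isend(&out, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 2) {                 /* P2 */
            MPI_Irecv(&in, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Wait(&req, MPI_STATUS_IGNORE);  /* may have matched P0 OR P1 */
            MPI_Recv(&in, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* drain the other send */
        }
        MPI_Finalize();
        return 0;
    }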
Would you rather explain each conceivable situation
in a large API with an elaborate “bee dance” and
informal English…. or would you rather specify it
mathematically and let the user calculate the
outcomes?

     P0                 P1                P2
     ---                ---               ---

     MPI_Isend ( P2 )   MPI_Barrier       MPI_Irecv ( ANY )

     MPI_Barrier        MPI_Isend( P2 )   MPI_Barrier




                                                              13
TLA+ Spec of MPI_Wait (Slide 1/2)




                                    14
TLA+ Spec of MPI_Wait (Slide 2/2)




                                    15
Executable Formal Specification can help
validate our understanding of MPI …

Toolchain (diagram, flattened):

   Visual Studio 2005 / Verification Environment → Phoenix Compiler → MPIC IR
   TLA+ MPI Library Model + TLA+ Prog. Model → TLC Model Checker   (FMICS 07)
   MPIC IR → MPIC Program Model → MPIC Model Checker               (PADTAD 07)

                                                                               16
The Histrionics of FV for HPC (1)




                                    17
The Histrionics of FV for HPC (2)




                                    18
Error-trace Visualization in VisualStudio




                                            19
2. Precision in Modeling:
The “Byte-range Locking Protocol” Challenge

We were asked to see if a new protocol using MPI one-sided
communication was OK…

lock_acquire (start, end) {
    /* Stage 1 */
    val[0] = 1; /* flag */ val[1] = start; val[2] = end;
    while (1) {
        lock_win
        place val in win
        get values of other processes from win
        unlock_win
        for all i, if (Pi conflicts with my range)
            conflict = 1;
        /* Stage 2 */
        if (conflict) {
            val[0] = 0;              /* withdraw my flag */
            lock_win
            place val in win
            unlock_win
            MPI_Recv(ANY_SOURCE)     /* block until some process wakes me */
        } else {
            /* lock is acquired */
            break;
        }
    } // end while
}

The window holds one (flag, start, end) triple per process,
each initially (0, -1, -1):

    flag start end |  0  -1  -1 |  0  -1  -1 |  0  -1  -1

                                                                                20
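The pseudocode’s lock_win / place val in win / unlock_win steps map onto
MPI one-sided operations roughly as below – a sketch only, assuming the
window lives on rank 0 and holds one 3-int (flag, start, end) triple per
process; the function name and displacement scheme are illustrative.

    #include <mpi.h>

    /* Sketch: how the pseudocode's window steps map to MPI calls. */
    void place_val_in_win(MPI_Win win, int myrank, int val[3]) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);   /* lock_win   */
        MPI_Put(val, 3, MPI_INT,                       /* place val  */
                0 /* target rank */, 3 * myrank /* displacement */,
                3, MPI_INT, win);
        MPI_Win_unlock(0, win);                        /* unlock_win */
    }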
Precision in Modeling:
The “Byte-range Locking Protocol” Challenge
   Studied code
   Wrote Promela Verification Model (a week)
   Applied the SPIN Model Checker
   Found Two Deadlocks Previously Unknown
   Wrote Paper (EuroPVM / MPI 2006) with Thakur and Gropp – won
    one of the three best-paper awards
   With new insight, Designed Correct AND Faster Protocol !

   Still, we felt lucky … what if we had missed the error while
    hand-modeling?

   Also, hand-modeling was NO FUN – how about running the real
    MPI code “cleverly”?




                                                                         21
Measurement under Low Contention




                                   22
Measurement under High Contention




                                    23
4. Modeling and Analysis with Reduced Cost…


        Card Deck 0                          Card Deck 1

   0:                                       0:
   1:                                       1:
   2:                                       2:
   3:                                       3:
   4:                                       4:
   5:                                       5:

          • Only the interleavings of the red cards matter
          • So don’t try all riffle-shuffles – there are 12! / (6! · 6!) = 924 of them
          • Instead, just try TWO shuffles of the decks!!



                                                                       24
What works for cards works for MPI
(and for PThreads also)!!
    P0 (owner of window)            P1 (non-owner of window)

  0: MPI_Init                      0: MPI_Init
  1: MPI_Win_lock                  1: MPI_Win_lock
  2: MPI_Accumulate                2: MPI_Accumulate
  3: MPI_Win_unlock                3: MPI_Win_unlock
  4: MPI_Barrier                   4: MPI_Barrier
  5: MPI_Finalize                  5: MPI_Finalize

          • These are the dependent operations
          • 504 interleavings without POR in this example
          • 2 interleavings with POR!!



                                                               25
4. Modeling and Analysis with Reduced Cost
The “Byte-range Locking Protocol” Challenge

   Studied code → DID NOT STUDY CODE
   Wrote Promela Verification Model (a week) → NO MODELING
   Applied the SPIN Model Checker → NEW ISP VERIFIER
   Found Two Deadlocks Previously Unknown → FOUND SAME!
   Wrote Paper (EuroPVM / MPI 2006) with Thakur and Gropp – won
    one of the three best-paper awards → DID NOT WIN

   Still, we felt lucky … what if we had missed the error while
    hand-modeling? → NO NEED TO FEEL LUCKY (NO LOST
    INTERLEAVING – but also did not foolishly do ALL interleavings)

   Also, hand-modeling was NO FUN – how about running the real
    MPI code “cleverly”? → DIRECT RUNNING WAS FUN




                                                                         26
3. Precision in Analysis
The “crooked barrier” quiz again …
     P0                    P1                P2
     ---                   ---               ---

     MPI_Isend ( P2 )      MPI_Barrier       MPI_Irecv ( ANY )

     MPI_Barrier           MPI_Isend( P2 )   MPI_Barrier


Our cluster NEVER gave us the P0-to-P2 match!!

Elusive interleavings!!

These bite you the hardest when you port to a new platform!!


                                                                 27
3. Precision in Analysis
The “crooked barrier” quiz again …
     P0                 P1                P2
     ---                ---               ---

     MPI_Isend ( P2 )   MPI_Barrier       MPI_Irecv ( ANY )

     MPI_Barrier        MPI_Isend( P2 )   MPI_Barrier


SOLVED!! Using the new POE Algorithm

Partial Order Reduction in the presence of
Out of Order Operations and Elusive Interleavings



                                                              28
Precision in Analysis

   POE Works Great (all 41 Umpire Test-Suites Run)
   No need to “pad” delay statements to jiggle schedule
    and force “the other” interleaving
     – This is a very brittle trick anyway!
   Prelim Version Under Submission
     – Detailed Version for EuroPVM…


   Jitterbug uses this (delay-padding) approach
     – We don’t need it

   Siegel (MPI_SPIN): modeling effort
   Marmot: different coverage guarantees…




                                                           29
1-4: Finally! Precision and Low Cost in Modeling
     and Analysis, taking advantage of MPI
     semantics (in our heads…)

     P0                 P1                P2
     ---                ---               ---

     MPI_Isend ( P2 )   MPI_Barrier       MPI_Irecv ( ANY )

     MPI_Barrier        MPI_Isend( P2 )   MPI_Barrier




     This is how POE does it



                                                              30
Discover All Potential Senders by Collecting (but not
issuing) operations at runtime…



     P0                 P1                P2
     ---                ---               ---

     MPI_Isend ( P2 )   MPI_Barrier       MPI_Irecv ( ANY )

     MPI_Barrier        MPI_Isend( P2 )   MPI_Barrier




                                                              31
Rewrite “ANY” to ALL POTENTIAL SENDERS




    P0                 P1                P2
    ---                ---               ---

    MPI_Isend ( P2 )   MPI_Barrier       MPI_Irecv ( P0 )

    MPI_Barrier        MPI_Isend( P2 )   MPI_Barrier




                                                            32
Rewrite “ANY” to ALL POTENTIAL SENDERS




    P0                 P1                P2
    ---                ---               ---

    MPI_Isend ( P2 )   MPI_Barrier       MPI_Irecv ( P1 )

    MPI_Barrier        MPI_Isend( P2 )   MPI_Barrier




                                                            33
Recurse over all such configurations!




     P0                 P1                P2
     ---                ---               ---

     MPI_Isend ( P2 )   MPI_Barrier       MPI_Irecv ( P1 )

     MPI_Barrier        MPI_Isend( P2 )   MPI_Barrier




                                                             34
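A toy sketch of the recursion (our illustration in C, not ISP’s actual
code; replay_with_source is a hypothetical stand-in for re-executing the
program under the scheduler with MPI_Irecv(ANY) rewritten to
MPI_Irecv(src)):

    #include <stdio.h>

    static void replay_with_source(int src) {
        printf("explore interleaving: Irecv matched to P%d\n", src);
        /* ... re-run the program under the scheduler, recursing on any
         *     further wildcard matches discovered along the way ... */
    }

    int main(void) {
        int potential_senders[] = { 0, 1 };  /* collected at the match point */
        for (int i = 0; i < 2; i++)
            replay_with_source(potential_senders[i]); /* one run per rewrite */
        return 0;
    }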
 If we now have P0-P2 doing this, and P3-5 doing
 the same computation between themselves, no
 need to interleave these groups…

P0                P1                P2                P3                P4                P5
---               ---               ---               ---               ---               ---

MPI_Isend ( P2 )  MPI_Barrier       MPI_Irecv ( * )   MPI_Isend ( P5 )  MPI_Barrier       MPI_Irecv ( * )

MPI_Barrier       MPI_Isend( P2 )   MPI_Barrier       MPI_Barrier       MPI_Isend( P5 )   MPI_Barrier




                                                                                                 35
Why is all this worth doing?




                                36
    MPI is the de-facto standard
    for programming cluster machines





(BlueGene/L - Image courtesy of IBM / LLNL)   (Image courtesy of Steve Parker, CSAFE, Utah)


  Our focus:
  Help Eliminate Concurrency Bugs from HPC Programs
  Apply similar techniques for other APIs also (e.g. PThreads, OpenMP)



                                                                                        37
The success of MPI (Courtesy of Al Geist, EuroPVM / MPI 2007)




                                                                38
The Need for Formal Semantics for MPI

 –   Send                         –   Rendezvous mode
 –   Receive                      –   Blocking mode
 –   Send / Receive               –   Non-blocking mode
 –   Send / Receive / Replace     –   Reliance on system buffering
 –   Broadcast                    –   User-attached buffering
 –   Barrier                      –   Restarts / Cancels of MPI Operations
 –   Reduce

 –   Non-wildcard receives
 –   Wildcard receives
 –   Tag matching
 –   Communication spaces

An MPI program is an interesting (and legal)
combination of elements from these spaces.

                                                                       39
      MPI Library Implementations Would Also Change
      Multi-core – how it affects MPI (Courtesy, Al Geist)


       The core count rises but the number of pins on a socket is
       fixed. This accelerates the decrease in the bytes/flops ratio
       per socket.

       The bandwidth to memory (per core) decreases

       The bandwidth to interconnect (per core) decreases

       The bandwidth to disk (per core) decreases


Need Formal Semantics for MPI,
because we can’t imitate any existing
implementation…

                                                                       40
We are only after “low hanging” bugs…


 Look for commonly committed mistakes
  automatically
  – Deadlocks
  – Communication Races
  – Resource Leaks




                                     41
Deadlock pattern…

     P0        P1                 P0         P1
     ---       ---                ---        ---

     s(P1);    s(P0);             Bcast;     Barrier;
     r(P1);    r(P0);             Barrier;   Bcast;

                                           42
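A concrete C sketch of the first pattern (assuming exactly 2 ranks;
MPI_Ssend is used so the hang does not depend on system buffering):

    /* Both ranks block in their synchronous send, so neither reaches
     * its receive: a head-to-head deadlock. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, peer, out = 1, in;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;                                  /* assumes 2 ranks */
        MPI_Ssend(&out, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);     /* s(peer) */
        MPI_Recv(&in, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                               /* r(peer) */
        MPI_Finalize();
        return 0;
    }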
Communication Race Pattern…

             P0       P1       P2
             ---      ---      ---
             r(*);    s(P0);   s(P0);
             r(P1);

  OK:  r(*) matches P2’s send; r(P1) then matches P1’s send.
  NOK: r(*) matches P1’s send; r(P1) is left with no matching
       send – deadlock.

                                        43
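A runnable sketch of the same pattern (our illustration; 3 ranks
assumed): whether the run deadlocks depends on which sender the
wildcard receive happens to match.

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, in, out = 1;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Recv(&in, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* r(*) */
            MPI_Recv(&in, 1, MPI_INT, 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* r(P1): hangs if
                                                             r(*) took P1's send */
        } else {                                          /* ranks 1 and 2 */
            MPI_Send(&out, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  /* s(P0) */
        }
        MPI_Finalize();
        return 0;
    }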
Resource Leak Pattern…



            P0
            ---
            some_allocation_op(&handle);


            FORGOTTEN DEALLOC !!




                                           44
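A small C sketch of the pattern, with a concrete MPI resource (a
committed derived datatype) standing in for some_allocation_op:

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Datatype row;                          /* the "handle" */
        MPI_Init(&argc, &argv);
        MPI_Type_contiguous(64, MPI_DOUBLE, &row); /* some_allocation_op */
        MPI_Type_commit(&row);
        /* ... use row ... */
        /* FORGOTTEN: MPI_Type_free(&row);  <- the leak a checker flags */
        MPI_Finalize();
        return 0;
    }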
Bugs are hidden within huge state-spaces…




                                            45
    Partial Order Reduction Illustrated…
    With 3 processes, the
     size of an interleaved
     state space is 3³ = 27

    Partial-order reduction
     explores representative
     sequences from each
     equivalence class

    Delays the execution of
     independent transitions

    In this example, it is
     possible to “get away”
     with 7 states (one
     interleaving)

                                           46
A Deadlock Example… (off by one → deadlock)

  /* Add up integrals calculated by each process */
  if (my_rank == 0) {
      total = integral;
      for (source = 0; source < p; source++) {  /* BUG: includes rank 0,
                                                   which never sends to itself */
          MPI_Recv(&integral, 1, MPI_FLOAT, source,
                   tag, MPI_COMM_WORLD, &status);
          total = total + integral;
      }
  } else {
      MPI_Send(&integral, 1, MPI_FLOAT, dest,
               tag, MPI_COMM_WORLD);
  }

  Trace fragment:
      p0:fr 0   p0:fr 1   p0:fr 2
      p1:to 0   p2:to 0   p3:to 0

                                                                      47
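The fix (a sketch against the loop above): start at 1, so rank 0
receives exactly p − 1 messages and never waits on itself.

    for (source = 1; source < p; source++) {   /* skip rank 0 itself */
        MPI_Recv(&integral, 1, MPI_FLOAT, source,
                 tag, MPI_COMM_WORLD, &status);
        total = total + integral;
    }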
Organization of ISP

  MPI Program → Simplifications → Simplified MPI Program

  Simplified MPI Program → compile → executable (Proc 1 … Proc n)

  Proc 1 … Proc n ⇄ scheduler   (request / permit)

  Proc 1 … Proc n → PMPI calls → Actual MPI Library and Runtime

                                                                        48
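The “PMPI calls” box relies on MPI’s standard profiling interface: a
tool can define MPI_X itself and forward to the library’s PMPI_X entry
point after consulting its scheduler. A minimal sketch of the idea
(scheduler_request_permit is a hypothetical placeholder, not ISP’s
actual API):

    #include <mpi.h>

    /* hypothetical: blocks until the scheduler permits this operation */
    extern void scheduler_request_permit(const char *op, int target);

    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm) {
        scheduler_request_permit("Send", dest);               /* request/permit */
        return PMPI_Send(buf, count, type, dest, tag, comm);  /* real call */
    }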
Summary (have posters for each)
   Formal Semantics for a large subset of MPI 2.0
     – Executable semantics for about 150 MPI 2.0 functions
     – User interactions through VisualStudio API
   Direct execution of user MPI programs to find issues
     – Downscale code, remove data that does not affect control, etc
     – New Partial Order Reduction Algorithm
         » Explores only Relevant Interleavings
     – User can insert barriers to contain complexity
         » New Vector-Clock algorithm determines if barriers are safe
     – Errors detected
         » Deadlocks
         » Communication races
         » Resource leaks
   Direct execution of PThread programs to find issues
     – Adaptation of Dynamic Partial Order Reduction reduces interleavings
     – Parallel implementation – scales linearly



                                                                         49
Also built a POR explorer for C / Pthreads programs,
called “Inspect”

  Multithreaded C/C++ program → instrumentation → instrumented program

  Instrumented program → compile → executable (thread 1 … thread n)

  thread 1 … thread n ⇄ scheduler   (request / permit)

  thread 1 … thread n → Thread library wrapper

                                                                        50
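The “thread library wrapper” box can be approximated with an
LD_PRELOAD-style interposer; a sketch of the idea
(inspect_request_permit is a hypothetical name, not Inspect’s real API):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <dlfcn.h>

    /* hypothetical: blocks until the scheduler permits the operation */
    extern void inspect_request_permit(const char *op, void *obj);

    int pthread_mutex_lock(pthread_mutex_t *m) {
        static int (*real_lock)(pthread_mutex_t *);
        if (!real_lock)                        /* find the real library call */
            real_lock = (int (*)(pthread_mutex_t *))
                        dlsym(RTLD_NEXT, "pthread_mutex_lock");
        inspect_request_permit("mutex_lock", m);      /* request / permit */
        return real_lock(m);
    }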
    Dynamic POR is almost a “must”!

(Dynamic POR as in Flanagan and Godefroid, POPL 2005)




                                                     51
Why Dynamic POR?

             a[ j ]++   a[ k ]--




   • Ample Set depends on whether j == k

   • Can be very difficult to determine statically

   • Can determine dynamically




                                                     52
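A tiny pthreads sketch of the point (our illustration): whether the two
updates conflict hinges on the runtime values of j and k.

    #include <pthread.h>

    static int a[16];
    static int j = 3, k = 3;   /* equal here, so a[j]++ and a[k]-- conflict */

    static void *incr(void *arg) { a[j]++; return NULL; }
    static void *decr(void *arg) { a[k]--; return NULL; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, incr, NULL);
        pthread_create(&t2, NULL, decr, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* a static tool must conservatively assume j == k is possible;
         * a dynamic checker simply observes the indices at runtime */
        return 0;
    }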
Why Dynamic POR?


The notion of action dependence (crucial to POR
methods) is a function of the execution




                                                  53
Computation of “ample” sets in Static POR
versus in DPOR

  Static POR: the ample set is determined using “local” criteria.

  DPOR: each state carries a { Backtrack } set and a { Done } set.
  Looking back from the next move of the Red process in the current
  state, find the nearest dependent transition and add the Red process
  to the “Backtrack Set” there (Blue is already in the “Done” set).
  This builds the ample set incrementally, based on observed
  dependencies.

                                                                                 54
    Putting it all together …
 We target C/C++ PThread Programs
 Instrument the given program (largely automated)
 Run the concurrent program “till the end”
 Record interleaving variants while advancing
 When # recorded backtrack points reaches a soft
  limit, spill work to other nodes
 In one larger example, an 11-hour run finished in
  11 minutes using 64 nodes

 Heuristic to avoid recomputations was essential
  for speed-up.
 First known distributed DPOR



                                                     55
A Simple DPOR Example

 t0:                    { Backtrack }, { Done } = {}, {}
  lock(t)
  unlock(t)

 t1:
  lock(t)
  unlock(t)

 t2:
  lock(t)
  unlock(t)




                                 56
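As a runnable pthreads version of the slide (names are ours): the only
dependence is on mutex t, so DPOR needs only the 3! = 6 orders in which
the threads acquire t, not all instruction-level interleavings.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t t = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        pthread_mutex_lock(&t);                  /* lock(t)   */
        printf("t%ld holds t\n", (long)arg);
        pthread_mutex_unlock(&t);                /* unlock(t) */
        return NULL;
    }

    int main(void) {
        pthread_t tid[3];
        for (long i = 0; i < 3; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < 3; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }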
For this example, DPOR ends up exploring all the paths.

For other programs, it explores a proper subset.




                                                       57
Idea for parallelization: Explore computations from the
backtrack set in other processes.

“Embarrassingly Parallel” – it seems so, anyway!




                                                          58
We then devised a work-distribution scheme…

                          load balancer

  A worker with too much work requests unloading; the load balancer
  replies with an idle node id; the requesting worker then sends a
  work description to that node (worker a → worker b), and workers
  report results back.

                                                          59
Speedup on aget




                  60
Speedup on bbuf




                  61
Historical Note


   Model Checking
     – Proposed in 1981
     – 2007 ACM Turing Award for Clarke, Emerson, and Sifakis


   Bug discovery facilitated by
     – The creation of simplified models
     – Exhaustively checking the models
         » Exploring only relevant interleavings




                                                                62
Looking ahead…

Plans for one year out…




                          63
Finish tool implementation for MPI and others…

     Static Analysis to reduce some cost
     Inserting Barriers (to contain cost) using new vector-
      clocking algorithm for MPI
     Demonstrate on meaningful apps (e.g. Parmetis)
     Plug into MS VisualStudio
     Development of PThread (“Inspect”) tool with same
      capabilities
     Evolving these tools to Transaction Memory, Microsoft
      TPL, OpenMP, …




                                                               64
              Thanks, Microsoft!
    and Dennis Crain, Shahrokh Mortazavi
In these times of unpredictable NSF funding, the HPC
Institute Program made it possible for us to produce a
                    great cadre of
             Formal Verification Engineers
  Robert Palmer (PhD – to join Microsoft soon),
Sonjong Hwang (MS), Steve Barrus (BS), Salman
                   Pervez (MS)
    Yu Yang (PhD), Sarvani Vakkalanka (PhD),
Guodong Li (PhD), Subodh Sharma (PhD), Anh Vo
(PhD), Michael DeLisi (BS/MS), Geof Sawaya (BS)
   (http://www.cs.utah.edu/formal_verification)
             Microsoft HPC Institutes
                NSF CNS 0509379


                                                         65
Extra Slides




               66
Looking Further Ahead: Need to clear “idea log-jam in
multi-core computing…”
“There isn’t such a thing as Republican clean air or
Democratic clean air. We all breathe the same air.”




  There isn’t such a thing as an architectural-
  only solution, or a compilers-only solution to
  future problems in multi-core computing…


                                                        67
Now you see it; now you don’t!

On the menace of non-reproducible bugs.

   Deterministic replay must ideally be an option
   User-programmable schedulers are greatly
    emphasized by expert developers
   Runtime model-checking methods with state-space
    reduction hold promise in meshing with
    current practice…




                                                    68

								