
The Paxos Commit Algorithm

					       Databases 2



                             Agenda

        Paxos Commit Algorithm: Overview
        The participating processes
           The resource managers
           The leader
           The acceptors
        Paxos Commit Algorithm: the base version
        Failure scenarios
        Optimizations for Paxos Commit
        Performance
        Paxos Commit vs. Two-Phase Commit
        Using a dynamic set of resource managers



               Paxos Commit Algorithm: Overview

        Paxos was applied to transaction commit by L. Lamport
         and Jim Gray in “Consensus on Transaction Commit”
        One instance of Paxos (consensus algorithm) is
         executed for each resource manager, in order to agree
         upon a value (Prepared/Aborted) proposed by it
        “Not-synchronous” Commit algorithm
        Fault-tolerant (unlike 2PC)
           Intended to be used in systems where failures are
            fail-stop only, for both processes and network
        Safety is guaranteed (unlike 3PC)
        Formally specified and checked
        Can be optimized to the theoretically best performance



              Participants: the resource managers
      N resource managers (“RM”) execute the distributed
       transaction, then each chooses a value (“locally chosen value” or
       “LCV”: ‘p’ for prepared iff it is willing to commit, ‘a’ for aborted otherwise)
      Every RM tries to get its LCV accepted by a majority set of
       acceptors (“MS”: any subset with a cardinality strictly greater
       than half of the total).
      Each RM is the first proposer in its own instance of Paxos
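The majority-set definition above can be made concrete with a small Python sketch. The function names are hypothetical; the sketch only encodes “any subset with a cardinality strictly greater than half of the total” and checks the property Paxos relies on: any two majority sets share at least one acceptor.

```python
from itertools import combinations


def is_majority_set(ms, acceptors):
    """True iff ms is a subset of acceptors with |ms| > |acceptors| / 2."""
    ms, acceptors = set(ms), set(acceptors)
    return ms <= acceptors and 2 * len(ms) > len(acceptors)


def majority_sets(acceptors):
    """Enumerate every majority set (MS) of the given acceptors."""
    acceptors = sorted(acceptors)
    n = len(acceptors)
    return [set(c) for k in range(n // 2 + 1, n + 1)
            for c in combinations(acceptors, k)]


# Example: F = 2, so 2F+1 = 5 acceptors; any 3 of them form a majority set.
acceptors = {"AC1", "AC2", "AC3", "AC4", "AC5"}
all_ms = majority_sets(acceptors)
# Any two majority sets intersect, so two different values can never both
# be accepted by a majority in the same ballot.
assert all(a & b for a in all_ms for b in all_ms)
```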

                      Participants: the leader

      Coordinates the commit algorithm
      All the instances of Paxos share the same leader
      It is not a single point of failure (unlike 2PC)
      Assumed to be always defined (reasonable, since many leader-(s)election
       algorithms exist) and unique (not necessarily true, but, unlike in
       3PC, safety does not rely on it)



                     Participants: the acceptors
        A denotes the set of acceptors
        All the instances of Paxos share the same set A of acceptors
        2F+1 acceptors are needed in order to tolerate F failures
        We will consider only F+1 acceptors, leaving F more as
         “spares” (less communication overhead)
        Each acceptor keeps track of its own progress in an Nx1 vector
        The vectors need to be merged into an Nx|MS| table, called
         aState, in order to take the global decision (we want “many” p’s)

   [Figure: RM1 proposes a, RM2 and RM3 propose p; each value goes through
    its own instance of Paxos in the consensus box (a majority set MS of the
    acceptors AC1-AC5), which answers “Ok!” to every RM.]

        aState         Acc1  Acc2  Acc3  Acc4  Acc5
        1st instance    a     a     a     a     a
        2nd instance    p     p     p     p     p
        3rd instance    p     p     p     p     p
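A minimal Python sketch of the merge step, assuming each acceptor's Nx1 vector is a plain list with one value per Paxos instance (one per RM). The names `merge_vectors` and `a_state` are hypothetical, and ballots are left out for simplicity even though the global commit condition also compares them.

```python
def merge_vectors(vectors):
    """Merge per-acceptor Nx1 vectors into the aState table.

    vectors: dict acceptor name -> list of values ('p', 'a' or None if nothing
    accepted yet), one entry per Paxos instance.
    Returns one row per instance, each row a dict acceptor -> value.
    """
    acceptors = sorted(vectors)
    lengths = {len(v) for v in vectors.values()}
    if len(lengths) != 1:
        raise ValueError("every acceptor vector must have length N")
    (n,) = lengths
    return [{acc: vectors[acc][i] for acc in acceptors} for i in range(n)]


# The slide's example: 5 acceptors, 3 instances (RM1 chose a, RM2 and RM3 chose p).
vectors = {f"Acc{k}": ["a", "p", "p"] for k in range(1, 6)}
a_state = merge_vectors(vectors)
for i, row in enumerate(a_state, 1):
    print(f"instance {i}:", " ".join(row[acc] for acc in sorted(row)))
```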



                                  Paxos Commit (base version)

 [Figure: message flow between the resource managers RM0-RM4, the leader L
  and the acceptors AC0-AC2, time running downward; T1 and T2 mark the
  failure points discussed in the next slides.]
 Legend: “:” marks a write on the log; rm ∈ RM (N=5); acc ∈ MS ⊆ A (F=2); v ∈ {p, a}

     p2a 0 0 v(0)
     1 x            BeginCommit
     (N-1) x        prepare
     (N(F+1)-1) x   p2a rm 0 v(rm)
     F x            rm 0 v(rm)   (Opt.)
                    p2b acc rm 0 v(rm)
     Not blocked iff F acceptors respond
     If (Global Commit) then p3 commit (x N)
     else p3 abort



                      Global Commit Condition

                           Global Commit

$$\forall rm \;\exists b \;\exists MS \;\forall acc \in MS:\ \langle \mathrm{p2b}, acc, rm, b, p \rangle \text{ was sent / received}$$

         That is: there must be exactly one row for each RM
          involved in the commit; in each of those rows
          there must be at least F+1 entries that have ‘p’ as their
          value and refer to the same ballot
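The condition can be evaluated mechanically. The Python sketch below assumes the received phase 2b messages are available as `(acc, rm, ballot, value)` tuples; the function name and message encoding are hypothetical. For each RM it looks for a single ballot in which at least F+1 distinct acceptors (a majority of 2F+1) reported ‘p’.

```python
from collections import defaultdict


def global_commit(p2b_msgs, rms, f):
    """Global commit condition over received <p2b, acc, rm, b, v> messages."""
    voters = defaultdict(set)  # (rm, ballot) -> acceptors that reported 'p'
    for acc, rm, ballot, value in p2b_msgs:
        if value == "p":
            voters[(rm, ballot)].add(acc)
    # Every RM needs some ballot with at least F+1 distinct 'p' acceptors.
    return all(
        any(r == rm and len(accs) >= f + 1 for (r, _b), accs in voters.items())
        for rm in rms
    )


# F = 1: both RMs are prepared in ballot 0 at AC0 and AC1, so commit.
msgs = [("AC0", "RM0", 0, "p"), ("AC1", "RM0", 0, "p"),
        ("AC0", "RM1", 0, "p"), ("AC1", "RM1", 0, "p")]
print(global_commit(msgs, ["RM0", "RM1"], f=1))
```

Grouping by `(rm, ballot)` is what enforces “refer to the same ballot”: two ‘p’ entries from different ballots do not add up.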



           [T1] What if some RMs do not submit their LCV?

       For each missing RM (j ∈ RM_missing ⊆ RM, v ∈ {p, a}), the leader
       starts a new ballot bL1 > 0 in j’s instance of Paxos with one majority
       of acceptors:

 p1a  “prepare?”   Leader: «Has resource manager j ever proposed you a value?»

 p1b  “promise”    (1) Acceptor_i: «Yes, in my last session (ballot) b_i with it
                       I accepted its proposal v_i»
                   (2) Acceptor_i: «No, never»
                   (Promise not to answer any bL2 < bL1)

 p2a  “accept?”    If (at least |MS| acceptors answered)
                       If (for ALL of them case (2) holds) then V = ‘a’     [FREE]
                       else V = v_i of the maximum b_i                     [FORCED]
                   Leader: «I am j, I propose V»
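The leader's choice of V can be sketched in Python as follows. The reply encoding is an assumption: each p1b reply from a distinct acceptor is either `None` (case (2), “No, never”) or a `(b_i, v_i)` pair (case (1)); `recovery_value` is a hypothetical name.

```python
def recovery_value(p1b_replies, num_acceptors):
    """Value V the leader proposes in p2a on behalf of a missing RM j.

    Returns None while fewer than a majority (|MS|) of acceptors answered.
    """
    if 2 * len(p1b_replies) <= num_acceptors:
        return None  # not enough answers yet: the leader must wait
    accepted = [r for r in p1b_replies if r is not None]
    if not accepted:
        return "a"  # FREE: nobody accepted anything for j, propose aborted
    # FORCED: re-propose the value accepted in the highest ballot b_i
    _ballot, value = max(accepted, key=lambda r: r[0])
    return value


# 5 acceptors (F = 2): three answers form a majority.
print(recovery_value([None, None, None], 5))          # FREE case
print(recovery_value([None, (0, "p"), None], 5))      # FORCED case
```

Choosing ‘a’ in the free case is safe because no acceptor has accepted any value for j, so no decision for j can have been reached in an earlier ballot.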



                             [T2] What if the leader fails?

                 If the leader fails, some leader-(s)election algorithm is
                  executed. A faulty election (2+ leaders) does not compromise
                  safety (unlike 3PC), but it can impede progress…

 [Figure: two leaders L1 and L2 contact the same majority MS in turn with
  increasing ballots b1 > 0, b2 > b1, b3 > b2, b4 > b3, each waiting a timeout T
  before retrying; every new ballot is trusted and the previous one is ignored.]

          Non-terminating example: an infinite sequence of p1a-p1b-p2a
           messages from 2 leaders
          Not really likely to happen
          It can be avoided (random T?)
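The non-terminating example can be reproduced with a small Python simulation, a sketch with hypothetical names that tracks only the acceptors' ballot bookkeeping. Two leaders alternate: each new p1a carries a higher ballot and is trusted, so the previous leader's p2a arrives too late and is ignored. The value carried by p2a and its recording are omitted.

```python
class Acceptor:
    """Ballot bookkeeping of one Paxos acceptor (accepted values omitted)."""

    def __init__(self):
        self.promised = -1  # highest ballot promised in a p1b

    def on_p1a(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True   # p1b: promise not to answer lower ballots
        return False      # ignored

    def on_p2a(self, ballot):
        # Accepted iff no higher ballot has been promised in the meantime.
        return ballot >= self.promised


def duel(rounds, num_acceptors=3):
    """Return how many p2a messages get accepted when two leaders interleave."""
    accs = [Acceptor() for _ in range(num_acceptors)]
    accepted = 0
    pending = None  # ballot whose p2a is still in flight
    for ballot in range(1, rounds + 1):
        # L1 uses odd ballots, L2 even ones: after its timeout T the other
        # leader starts a higher ballot, which every acceptor trusts...
        assert all(a.on_p1a(ballot) for a in accs)
        # ...so the p2a of the previous ballot arrives too late.
        if pending is not None:
            accepted += sum(a.on_p2a(pending) for a in accs)
        pending = ballot
    return accepted


print(duel(10))  # no p2a is ever accepted: no progress
```

With equal, fixed timeouts the interleaving can repeat forever; randomizing T (as the slide suggests) makes it likely that one leader's p2a reaches the acceptors before the other leader's next p1a.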



                            Optimizations for Paxos Commit (1)

          Co-Location: each acceptor is on the same node as a RM and the
           initiating RM is on the same node as the initial leader

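 [Figure: co-located deployment. RM0 shares its node with AC0 and the leader L,
  RM1 with AC1, RM2 with AC2; RM3 and RM4 host no acceptor. BeginCommit, the
  p2a messages and p3 are drawn between co-located processes.]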


                    -1 message phase (BeginCommit), -(F+2) messages


          “Real-time assumptions”: RMs can prepare spontaneously. The
           prepare phase is no longer needed: RMs just “know” they have to
           prepare within some amount of time
 [Figure: the same co-located deployment (RM0 with AC0 and L, RM1 with AC1,
  RM2 with AC2); the (N-1) x prepare messages are not needed anymore!]


                    -1 message phase (Prepare), -(N-1) messages



                         Optimizations for Paxos Commit (2)

           Phase 3 elimination: the acceptors send their phase 2b messages (the
            columns of aState) directly to the RMs, which evaluate the global commit
            condition themselves

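 [Figure: two message diagrams with RM0-RM4, the leader L and AC0-AC2.
  Left (base version): the acceptors send p2b to the leader, which sends p3
  to the RMs. Right (phase 3 elimination): the acceptors send p2b directly
  to the RMs.]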


                        Paxos Commit + Phase 3 Elimination = Faster Paxos Commit (FPC)
                        FPC + Co-location + R.T.A. = Optimal Consensus Algorithm



                                                Performance

                             2PC                  Paxos Commit              Faster Paxos Commit
                      No coloc.   Coloc.    No coloc.     Coloc.        No coloc.      Coloc.

     Message delays*      4         3           5            4              4             3
     Messages*          3N-1      3N-3     NF+F+3N-1     NF+3N-3        2NF+3N-1    2FN-2F+3N-3
     Stable storage
      write delays**          2                      2                           2
     Stable storage
      writes**               N+1                   N+F+1                       N+F+1

     *Not assuming RMs’ concurrent preparation (slides-like scenario)
     **Assuming RMs’ concurrent preparation (real-time constraints needed)




         If we deploy only one acceptor for Paxos Commit (F=0),
          its fault tolerance and cost are the same as 2PC’s. Are
          they exactly the same protocol in that case?
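The formulas in the table can be evaluated directly. This Python sketch (function names are hypothetical) encodes the “Messages” and “Stable storage writes” rows and checks two things the slides state: with F = 0 Paxos Commit and Faster Paxos Commit have the same message and write counts as 2PC, and co-location saves F+2 messages in Paxos Commit.

```python
def messages(protocol, n, f, coloc):
    """'Messages' row of the Performance table (PC = Paxos Commit, FPC = Faster PC)."""
    table = {
        ("2PC", False): 3 * n - 1,
        ("2PC", True): 3 * n - 3,
        ("PC", False): n * f + f + 3 * n - 1,
        ("PC", True): n * f + 3 * n - 3,
        ("FPC", False): 2 * n * f + 3 * n - 1,
        ("FPC", True): 2 * f * n - 2 * f + 3 * n - 3,
    }
    return table[(protocol, coloc)]


def stable_writes(protocol, n, f):
    """'Stable storage writes' row of the Performance table."""
    return n + 1 if protocol == "2PC" else n + f + 1


# With F = 0 both Paxos variants cost the same as 2PC for any N.
for n in range(1, 20):
    for coloc in (False, True):
        assert messages("PC", n, 0, coloc) == messages("2PC", n, 0, coloc)
        assert messages("FPC", n, 0, coloc) == messages("2PC", n, 0, coloc)
    assert stable_writes("PC", n, 0) == stable_writes("2PC", n, 0)

# The slides' setting, N = 5 and F = 2:
for p in ("2PC", "PC", "FPC"):
    print(p, messages(p, 5, 2, False), messages(p, 5, 2, True))
```

The message-delay row does not depend on F, so the check above covers only message counts and stable-storage writes.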



                            Paxos Commit vs. 2PC

             Yes, but…
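 [Figure: two message diagrams between the TM, RM1 and the other RMs, time
  running downward: 2PC as described in Lamport and Gray’s paper, and 2PC as
  presented in the slides of the course; T1 and T2 are marked on the diagrams.]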
             …two slightly different versions of 2PC!



                         Using a dynamic set of RM
        You add one process, the registrar, that acts just like another
         resource manager, except for the following:
           v_registrar ∉ {p, a}
           v_registrar = {rm : rm joined the transaction}
        RMs can join the transaction until the commit protocol begins
        The global commit condition now holds on the set of resource
         managers proposed by the registrar and decided in its own
         instance of Paxos:

 [Figure: RM1, RM2 and RM3 send join to the registrar REG, which answers
  “Ok!”; each RM proposes its value to the consensus box (a majority set MS of
  AC1-AC5), and REG proposes RM1;RM2;RM3 in its own instance of Paxos.]

                               Global Commit DynRM

$$\forall rm \in v_{registrar} \;\exists b \;\exists MS \;\forall acc \in MS:\ \langle \mathrm{p2b}, acc, rm, b, p \rangle \text{ was sent / received}$$
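A Python sketch of the dynamic variant, under stated assumptions: the hypothetical `Registrar` class accepts joins only until the commit protocol begins, and the set it then proposes is taken as already decided by its own Paxos instance (that consensus step is not simulated). The commit check ranges over that decided set instead of a fixed RM set.

```python
class Registrar:
    """Hypothetical registrar: collects joins until the commit protocol starts."""

    def __init__(self):
        self.joined = set()
        self.commit_started = False

    def join(self, rm):
        if self.commit_started:
            return False       # too late: the set of RMs is frozen
        self.joined.add(rm)
        return True            # "Ok!"

    def begin_commit(self):
        self.commit_started = True
        return frozenset(self.joined)  # v_registrar, proposed in its own instance


def global_commit_dyn(v_registrar, p2b_msgs, f):
    """Global Commit DynRM over received (acc, rm, ballot, value) p2b messages."""
    voters = {}
    for acc, rm, ballot, value in p2b_msgs:
        if value == "p":
            voters.setdefault((rm, ballot), set()).add(acc)

    def prepared(rm):  # some ballot with at least F+1 'p' acceptors
        return any(r == rm and len(accs) >= f + 1
                   for (r, _b), accs in voters.items())

    return all(prepared(rm) for rm in v_registrar)


reg = Registrar()
for rm in ("RM1", "RM2", "RM3"):
    reg.join(rm)
v_registrar = reg.begin_commit()
late = reg.join("RM4")  # rejected: the commit protocol has begun
# F = 2: each RM needs 'p' from F+1 = 3 acceptors in the same ballot.
msgs = [(acc, rm, 0, "p") for rm in v_registrar for acc in ("AC1", "AC2", "AC3")]
print(global_commit_dyn(v_registrar, msgs, f=2))
```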



                   Thank You!




                   Questions?

				
Posted: 3/22/2012