                    Fault-Tolerant Distributed Simulation

          Om P. Damani                          Vijay K. Garg
    Dept. of Computer Sciences          Dept. of Elec. and Comp. Eng.
           University of Texas at Austin, Austin, TX 78712
                    http://maple.ece.utexas.edu

                      Abstract

   In traditional distributed simulation schemes, the entire simulation needs to be restarted if any of the participating LPs crashes. This is highly undesirable for long running simulations. Some form of fault-tolerance is required to minimize the wasted computation.
   In this paper, a rollback based optimistic fault-tolerance scheme is integrated with an optimistic distributed simulation scheme. In rollback recovery schemes, checkpoints are periodically saved on stable storage. After a crash, these saved checkpoints are used to restart the computation.
   We make use of the novel insight that a failure can be modeled as a straggler event with the receive time equal to the virtual time of the last checkpoint saved on stable storage. This results in a saving of implementation effort, as well as reduced overheads. We define stable global virtual time (SGVT) as the virtual time such that no state with a lower timestamp will ever be rolled back despite crash failures. A simple change is made in existing GVT algorithms to compute SGVT. Our use of transitive dependency tracking eliminates antimessages. LPs are clubbed in clusters to minimize stable storage access time.

1 Introduction

   In a distributed simulation, a crash of any Logical Process (LP) causes the entire computation to halt. As the number of LPs taking part in a simulation increases, so does the probability that one of them will crash during the simulation. Simply restarting the failed LP may leave the simulation in an inconsistent state [5]. So far, the only recourse in such a situation has been to restart the entire system. However, restarting the system is unacceptable for simulations that run for hours or days. Clearly, some form of fault-tolerance is required to minimize the wasted computation.

   * Supported in part by the NSF Grants CCR-9520540 and ECS-9414780, THECB ARP-320, a General Motors Fellowship, and an IBM grant.

   An LP may crash due to a bug in the application code, simulator code, or operating system code. Even when all the code is correct, the code being run with a distributed simulator may have been written for a sequential simulator [15]. In such cases, it is difficult to find and correct the source of the crash. Even when the bug lies entirely in the application code, the user of an application may not be the developer of the code. So the user may be unable or unwilling to debug the application. The situation would be hopeless if, every time the system is restarted, the same bug were to lead to the same crash. Luckily, experiments with different kinds of software systems have shown that most of the bugs encountered in practice are transient [9, 17]. For example, a process may crash if it does not check for a `null pointer' value on returning from a memory allocation routine. When the process is restarted, more memory may be available, thereby avoiding the crash. Crashes should especially be transient in optimistic simulation, where a different message ordering or a different process scheduling results in a different execution, possibly bypassing the bug that caused the original crash. Hence, restarting the failed process is a viable option, provided steps are taken to ensure that the resulting system state is consistent: for example, by rolling back the other processes.
   A fault tolerance strategy should also be able to tolerate hardware failures. Hardware failures may be in the form of processor malfunctioning, power failure, or someone tripping over the connecting wires.
   We assume that processes fail by simply crashing: they do not send out any erroneous messages or do some other harm. A process loses all its volatile memory in a failure. To reduce the amount of wasted computation, it periodically writes its checkpoints to stable storage. After a failure, it is restarted from its last stable checkpoint. We model a failure as a straggler event with a timestamp equal to the timestamp of the highest checkpoint saved on stable storage. In this model, computation lost due to a failure can be treated in the same way as a computation rolled back
due to a straggler.
   Note that a naive application of the above idea to Time Warp does not work. In Time Warp, when an LP receives a straggler, it sends out antimessages for the messages in its output queue. But the output queue is lost in a failure. In fact, all volatile storage is lost in a failure. Hence, to tolerate failures, we need a simulation scheme that does not use any state information of the rolled back computation.
   In [6], we presented one such simulation scheme. We integrate that scheme with the optimistic fault-tolerant method presented in [19] to come up with a low overhead, fault-tolerant, optimistic, distributed simulation protocol. The contributions of this paper are the following:
   - Development of a formal framework for fault-tolerant distributed simulation.
   - Modeling of a failure as a straggler to facilitate low overhead fault-tolerance.
   - Integration of the optimistic simulation scheme in [6] with the optimistic fault-tolerance scheme in [19].
   - Definition of Stable GVT: Global Virtual Time in failure prone systems.
   - Identification of the issues involved in the design of a fault-tolerant simulation.
There has not been much discussion in the simulation literature about fault-tolerance. In the seminal paper on Time Warp [11], Jefferson mentions that processes may co-ordinate to take a consistent snapshot and save it on stable storage. When any process fails, the entire system may roll back to the last snapshot. In contrast, our method rolls back only those states that are dependent on a lost state, and thus minimizes the wasted computation. In [1], a replication based strategy is presented to provide fault-tolerance. Our scheme has much lower overhead than that of replication. Some degree of fault-tolerance was built into the original TWOS. Signal handlers were installed to `catch exceptions'. Once a signal indicating an error was `caught', partially processed events were cleaned up and processing was resumed. This approach takes care of some errors, but not all. It supplements our solution well but does not replace it.
   In our method, checkpoints are saved to stable storage asynchronously, i.e., while computation is in progress. Any failure during the saving of checkpoints is indistinguishable from a scenario where no checkpoints were being saved at that time. Similar treatment of failure and straggler, coupled with asynchronous stable storage operations, results in reduced overhead for fault-tolerance.

2 Background in Fault-Tolerance

   There are many well-known techniques for providing fault-tolerance to distributed applications. They can be classified into two broad categories: replication based [4] and checkpointing based [7]. Replication based techniques consume extra resources and have synchronization overhead to maintain consistency between replicas. In checkpointing based techniques, checkpoints are saved on stable storage, so that a failed process can be restarted from its last stable checkpoint. We only consider the checkpointing based schemes. Checkpointing schemes require synchronization between processes, or else they suffer from the domino effect, where all the processes may roll back to their initial states [7]. Note that checkpoints need not be immediately saved on stable storage. Applications with high checkpointing activity may take many checkpoints before writing them to stable storage.
   To avoid both synchronization and the domino effect, some schemes also save the received messages on stable storage. This is called message logging. After a failure, a process restores its last checkpoint and replays the logged messages. It may inform other processes about its failure and may also request some information from other processes. This method is similar to the `periodic checkpointing' and `coast forward' mechanisms used in simulation [13]. Message logging schemes can be divided into three categories [7]: pessimistic, optimistic, and causal. Pessimistic and causal schemes recreate the pre-failure computation. Pessimistic logging requires that each message be synchronously saved in stable storage before a process acts on it. This is unacceptable in distributed simulation, where message activity is high. Causal logging piggybacks the processing order of messages on each outgoing message. This will also result in high overhead due to the high message activity in simulation.
   In optimistic schemes, messages are saved to stable storage asynchronously. As a result, the processing order of messages may be lost in a failure. So the execution after a failure may be different from the pre-failure execution, resulting in lost states. States dependent on a lost state are called orphan states. Correctness of computation requires that these orphan states be rolled back. It is no coincidence that this reminds one of optimistic simulation. The seminal work on optimistic recovery by Strom and Yemini [18] was inspired by the Time Warp mechanism.
   In an optimistic scheme, a process may fail without logging any of its received messages since its last checkpoint. This implies that, to reduce the cost of accessing stable storage, messages can be logged only
when checkpoints are being written to stable storage. This makes optimistic schemes well suited for distributed simulation, where message activity is high.

3 Model of Simulation

   We consider an event-driven optimistic simulation. The execution of an LP consists of a sequence of states, where a state transition is caused by the execution of an event. In addition to causing a state transition, executing an event may also schedule new events for other LPs or the local LP by sending messages. When LP P1 acts on a message from P2, P1 becomes dependent on P2. This dependency relation is transitive.
   An LP periodically saves its checkpoints on stable storage. After a failure, an LP restores its last checkpoint from stable storage and starts executing from there. The resulting execution may be different from the pre-failure execution. States that are not re-executed after the failure are called lost states.
   The arrival of a straggler causes an LP to roll back. A state that is lost, rolled back, or transitively dependent on a lost or rolled back state is called an orphan state. We denote transitive dependency by the happened before (→) relation, which we define later in this section. As stated earlier, we model a failure as a straggler event with a timestamp equal to the timestamp of the highest checkpoint saved on stable storage. We formally define orphan states as follows:

   straggled(s) ≡ state s was rolled back due to a straggler
   stable(P)   ≡ timestamp of the last stable checkpoint of LP P
   failure(P)  ≡ event scheduled for P at time stable(P)
   orphan(s)   ≡ straggled(s) ∨ (∃u : orphan(u) ∧ u → s)

   Note that in this model lost states are treated as straggled states. A message sent from an orphan state is called an orphan message. For correctness of a simulation, all orphan states must be rolled back and all orphan messages must be discarded. To distinguish the computation before and after the rollback, we say that an LP starts a new incarnation after each rollback. An example of a distributed simulation is shown in Figure 1. Numbers shown in parentheses are either the virtual times of states or the virtual times of scheduled events carried by the messages. Solid lines indicate useful computation, while dashed lines indicate rolled back computation.
   In Figure 1(a), s00 schedules an event for P1 at time 5 by sending message m0. P1 optimistically executes this event, resulting in the state transition from s10 to s11, and schedules an event for P2 at time 7 by sending message m1. Then P1 receives message m2, which schedules an event at time 2 and is detected as a straggler. Execution after the arrival of this straggler is shown in Figure 1(b). P1 rolls back by restoring the state s10. It then takes the actions needed for maintaining the correctness of the simulation, which, for our scheme, consist of broadcasting a rollback announcement (shown by dotted arrows). It restarts from r10, acts on m2, and then acts on m0. Upon receiving the rollback announcement from P1, P2 realizes that it is dependent on a rolled back state and so it also rolls back, restores state s20, takes the actions needed, and restarts from state r20. Finally, the orphan message m1 is discarded by P2. In [6] we have shown that, by tracking transitive dependency, only the LP receiving the straggler needs to send a rollback announcement. On receiving this announcement, all other LPs roll back their orphan states and discard the received orphan messages. Other LPs do not need to send a rollback announcement while rolling back in response to a rollback. Simulation proceeds correctly, without requiring antimessages.
   Now we describe a simulation in a failure-prone system. Let us look at Figure 1 again. Assume that the system has performed the computation shown in Figure 1(a), but P1 has not yet received the message m2. Let P1 fail before it receives the message m2. It loses its volatile memory, which includes the message m0. Now Figure 1(b) shows the post-failure computation. P1 restores its last stable checkpoint s10. It then broadcasts a failure announcement. On receiving this announcement, P0 and P2 resend the messages m0 and m2, as they might have been lost in the failure. P2 also realizes that it is dependent on a lost state and rolls back, restores state s20, takes the actions needed, and restarts from state r20. P1, on the other hand, processes m0 and m2 in the correct order. This shows how we handle a failure and a straggler in the same way.
   In order to track transitive dependency in the presence of rollback, in [6] we extended the happened before (→) relation defined by Lamport [12]. Intuitively, state s happens before state u if, in the simulation diagram, there exists a directed path from s to u consisting of solid or dashed arrows. Failure or rollback announcements, denoted by dotted arrows, do not contribute to the happened before relation. For example, in Figure 1(a), s10 → s11 and s00 → s21, and in Figure 1(b) s11 ̸→ r10. Saying s happened before u is equivalent to saying that u is transitively dependent on s.

4 The Fault-Tolerant Protocol

   For fault-tolerance, checkpoints need to be saved on stable storage. The overhead of separately accessing stable storage for the checkpoint of each LP is unacceptable. Therefore we club LPs into clusters and
[Figure 1 appears here: a space-time diagram over LPs P0, P1, and P2, showing states s00, s10, s11, s20, s21 in part (a) and, after the rollback/failure, r10, s12, s13, r20 in part (b); messages m0(5), m1(7), and m2(2); and the dependency vector of each state in a box beside it.]

   Figure 1: A Distributed Simulation. (a) Pre-straggler/failure computation. (b) Post-straggler/failure computation.
take checkpoints of entire clusters. The idea of clustering has already been used in [16] and [2]. In [16], inter-cluster execution is conservative, whereas intra-cluster execution is optimistic. In [2], inter-cluster execution is optimistic, whereas intra-cluster execution is sequential. We assume inter-cluster execution to be optimistic. Our scheme works with both conservative and optimistic policies for intra-cluster execution. For purposes of exposition, we assume that intra-cluster execution is sequential. Details of intra-cluster execution can be found in [2].
   Since the simulation inside a cluster is sequential, the state of a cluster at a given virtual time is well-defined. This state includes the input messages in all the LPs' input queues. The state also includes the cluster output queue, which is described later. Clusters periodically save their state on stable storage. For discussion we use the term `state of a cluster'. For implementation, only the states of those LPs that have changed since the last state-saving operation need be saved on stable storage. From here on, we refer to intra-cluster messages as `internal' messages and inter-cluster messages as `external' messages.
   From here on, i, j refer to cluster numbers; v refers to an incarnation number; s, u refer to states; Pi refers to cluster i; m refers to a message; and e refers to an event.

4.1 Data Structures

   The data structures used by a cluster are shown in figure 2.

   Cluster Pi :
     type entry = (int inc, int sii) ;         /* incarnation, state interval index */
     var cvt : real ;                          /* cluster virtual time */
         max_inc : int ;                       /* stored on stable storage */
         dv : array [0..n-1] of entry ;        /* dependency vector */
         iet : array [0..n-1] of set of entry ;   /* incarnation end table */
         token : entry ;                       /* rollback announcement */
         CIQ : set of pointers ;               /* cluster input queue for external messages */
         COQ : set of messages ;               /* cluster output queue for external messages */

   Initialize :
     cvt = 0 ; max_inc = 0 ;
     ∀j : dv[j].inc = 0 ; dv[j].sii = -1 ;
     dv[i].sii = 0 ;
     ∀j : iet[j] = {} ;                        /* empty set */
     token = (0,-1) ; CIQ = {} ; COQ = {} ;

   Figure 2: Data Structures used by a cluster

   We describe the main data structures below.
   Dependency Vector: To track transitive dependency information, we use a Dependency Vector (DV) [6]. It has n entries, where n is the number of clusters in the system. Each entry contains an incarnation number and a state interval index. A state interval is the sequence of states between two events scheduled by the external messages. The index in the ith entry of the DV of Pi corresponds to the number of external messages that Pi has acted on. The index in the jth entry corresponds to the index of the latest state of Pj on which Pi depends. The incarnation number in
the ith entry is equal to the number of times Pi has rolled back. The incarnation number in the jth entry is equal to the highest incarnation number of Pj on which Pi depends. Let an entry en be a tuple (incarnation v, index t). We define a lexicographical ordering between entries as follows:

   en1 < en2 ≡ (v1 < v2) ∨ ((v1 = v2) ∧ (t1 < t2)).

   Suppose Pi schedules an event e for Pj by sending a message m. Pi attaches its current DV to m. If m is neither an orphan nor a straggler, it is kept in the incoming queue by Pj. When the event corresponding to m is executed, Pj updates its DV with m's DV by taking the componentwise lexicographical maximum, as shown in the routine Execute_message in figure 3. An example of DVs is shown in figure 1, where the DV of each state is shown in a box near it.
   In [6], entries in the DV include the virtual time instead of the state interval index. This method does not work in the presence of failures. Let P receive two messages with the same scheduling time. Let P fail after scheduling one of the events. We need to distinguish between the states resulting from these two events. Hence we replace the timestamp in each DV entry with a state interval index.
   In general, the DV of the sending state needs to be attached to each message to correctly track transitive dependencies and detect orphans. But we reduce this overhead by making the observation that, with clustering, the DV needs to be attached to inter-cluster messages only. For intra-cluster messages, it suffices to track the send time, as the receiver's DV is always greater than or equal to the sender's DV.
   Incarnation End Table: Besides a dependency vector, each cluster also maintains an incarnation end table (iet). The jth component of iet is a set of entries of the form (v, sii), where all states of the vth incarnation of Pj with indices greater than sii have been rolled back. The iet allows a cluster to detect orphan messages.
   Cluster Input Queue: Both external and internal messages are kept in the destination LP's input queue. A cluster keeps track of the external messages by keeping pointers to them in the cluster input queue (CIQ).
   Cluster Output Queue: There is one output queue per cluster, called the Cluster Output Queue (COQ). Only inter-cluster messages are kept in the COQ. Intra-cluster messages are not saved in any per-LP output queue.

4.2 Protocol Description

   The formal description of our protocol is given in figures 2 to 6. In addition to the receive time, internal messages also carry the send time, which is the same as the local virtual time (lvt) of the sending LP. External messages carry the DV of the sender. Actions taken by an LP are shown in figure 3.

   Logical Process LP :                      /* belongs to cluster Pi */
     var lvt : real ;                        /* local virtual time */
         input_queue : set of message ;
   Execute_message(m) :
     lvt = cvt = m.receive_time ;            /* cvt is cluster virtual time */
     if m is an external message then
       ∀j : dv[j] = max(dv[j], m.dv[j]) ;    /* dv is cluster dependency vector */
       dv[i].sii ++ ;
     Act on the event scheduled by m ;
   Send_message(m) :
     if intra_cluster(m) then
       put (m, lp.lvt, m.receive_time) in destination LP's input queue ;
     if inter_cluster(m) then
       send(m, dv, m.receive_time) ;
   Rollback(ts) :
     Restore the latest state s such that s.lvt ≤ ts ;
     Discard the states that follow s ;
     ∀m ∈ input_queue : if m.send_time > ts then discard m ;   /* m is orphan */

   Figure 3: Actions of an LP

   Actions taken by a cluster on receiving an external message are shown in figure 4. Upon receiving an external message m, Pi discards m if m is an orphan. This is the case when, for some j, Pi's iet and the jth entry of m's DV indicate that m is dependent on a rolled back state of Pj. A straggler for any LP in the cluster is also considered a straggler for the entire cluster. If Pi detects that m is a straggler, it broadcasts a token containing the ith entry (v, sii) of the DV of its highest checkpoint s such that the virtual time of s is no greater than the receive time of m. The token basically indicates that all states of incarnation v with index greater than sii are orphans. States dependent on any of these orphan states are also orphans. For simplicity of presentation, we show Pi rolling back in figure 5 after it receives its own broadcast. In practice, it will roll back immediately.
   Steps taken by a cluster on receiving a token are shown in figure 5. On receiving a token (v, sii) from Pj, Pi acknowledges the token and enters it in its incarnation end table. It discards all the orphan messages in the cluster input queue. A message is an orphan if it is dependent on a rolled back state of Pj.
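The DV operations above are small enough to sketch in executable form. The following Python fragment is our own illustration, not code from the paper: `new_dv` and `execute_external` are hypothetical names, a system of three clusters is assumed, and each DV entry is a native `(inc, sii)` tuple, so Python's tuple comparison gives exactly the lexicographical ordering defined above.

```python
# Illustrative sketch (our own, not the paper's code) of Dependency Vector
# operations from figures 2-3. Entries are (incarnation, state-interval-index)
# tuples; Python compares tuples lexicographically, matching the paper's order.

N = 3  # number of clusters in the system (assumed for the example)

def new_dv(i):
    """Initial DV of cluster i: (0, -1) everywhere, own index set to 0."""
    dv = [(0, -1)] * N
    dv[i] = (0, 0)
    return dv

def execute_external(dv, i, m_dv):
    """Cluster i acts on an external message carrying DV m_dv:
    componentwise lexicographic maximum, then bump own index (one more
    external message acted on)."""
    dv = [max(a, b) for a, b in zip(dv, m_dv)]
    inc, sii = dv[i]
    dv[i] = (inc, sii + 1)
    return dv

# P1 acts on a message sent from P0's initial state.
dv1 = execute_external(new_dv(1), 1, new_dv(0))
print(dv1)  # -> [(0, 0), (0, 1), (0, -1)]
```

Note that a rollback in an earlier incarnation dominates any index of a later one only in the other direction: `(0, 5) < (1, 0)` holds under tuple comparison, which is why a single `max` per entry suffices.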
Receive messagem :                                          Restart       after failure
 if 9j; t : m:dv j :inc; t 2 iet j  ^ t m:dv j :sii         Restore last checkpoint s from stable storage;
   then discard m ; return ; m is orphan                          Broadcastdv i ; dv i :inc = + + max inc ;
  Copy m in input queue of the destination LP;                    Block till all clusters acknowledge.
  In the CIQ, add a pointer to m;
  if m:receive time cvt then m is a straggler               Figure 6: Cluster actions on restarting after a failure
    Let s be the latest cluster checkpoint
          such that s:cvt  m:receive time ;                ber and the index of the restored state. Other clusters
    token = s:dv i  ;                                     react to this token in the same way as they do to the
    Broadcasttoken;                                       token due to a straggler.
       Pi receives its own broadcast and rolls back.            There is one important di erence between a failure
    Block till all clusters acknowledge ;                   and a straggler, which we have not shown in gure 5
  Figure 4: Cluster actions on an external message          for clarity. In a failure, a cluster loses its volatile state,
                                                            i.e., its iet and all the messages that it has received
                                                            but not acted on till its last stable checkpoint. So on
If the cluster is orphan then it restores the maximal       learning about the failure, other clusters must resend
non-orphan state. This is done by rolling back all or-      messages to the failed cluster. Of these messages, the
phan LPs and discarding the orphan messages in each         failed cluster should replay only those messages, which
LPs input queue. Note that in routine Rollback in           it did not act on before the last checkpoint. For this
  gure 3, an LP does not have to rollback if it is not      purpose, we need each sender to put a sequence num-
orphan. After the rollback, cluster incarnation num-        ber on outgoing messages on a per cluster basis. We
bersaved in max inc is incremented. To survive fail-      assume FIFO message order. Each cluster keeps only
ures, max_inc is stored in stable storage. It does not broadcast a token, which is an important property of this protocol. Note that internal orphans are detected using a separate mechanism, as compared to external orphans. If the current state of a cluster is not orphan, then no LP, and consequently no internal message, can be orphan. Routine Rollback is not called for any LP. All external orphans are discarded using the CIQ.

    Receive_token(v, sii) from Pj:
      Send acknowledgment;
      iet[j] = iet[j] ∪ {(v, sii)};
      ∀mp ∈ CIQ:                        // mp: message pointer
        if mp.dv[j].inc == v ∧ sii ≤ mp.dv[j].sii
          then discard mp;              // mp is orphan
      if dv[j].inc == v ∧ sii ≤ dv[j].sii then
        // Cluster is orphan
        Let s be the latest checkpoint such that
          s.dv[j] < (v, sii);           // s: highest non-orphan state
        ∀lp ∈ cluster: lp.Rollback(s.cvt);
        dv = s.dv; dv[i].inc = ++max_inc;

         Figure 5: Cluster actions on receiving a token

    Steps taken by a cluster on restarting after a failure are shown in figure 6. The failure is handled in the same way as a straggler. After the failure, the cluster is restarted from its last checkpoint on stable storage. It broadcasts a token containing the incarnation number and the expected sequence number of the next message to be received from every other cluster. Now the sender needs to resend only those messages whose sequence number is greater than the sequence number of the last message received till the checkpoint. These sequence numbers are broadcast along with the failure announcement. Note that the above scheme can handle an arbitrary number of concurrent failures. When a processor fails, all clusters on that processor need to be restarted as if each one of them had failed independently.
    A rolled back state is called rollback inconsistent with the states that occur after the corresponding rollback [15]. For example, in figure 1, s11 is rollback inconsistent with s12. Allowing a state to become dependent on two rollback inconsistent states has been identified by Nicol and Liu as a potential source of crash [15]. The next theorem shows that our protocol avoids this problem.

Theorem 1 A state cannot become dependent on two rollback inconsistent states.

Proof: After a rollback, a process blocks till it receives acknowledgment of its rollback announcement from all processes. Therefore, all processes receive the rollback announcement before becoming dependent on a post-rollback state. Now, as per the routine Receive_token in figure 5, any state dependent on a rolled back state is rolled back on receiving the corresponding token. Hence no state can become dependent on two rollback inconsistent states.
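To make the token handling concrete, the following is a minimal Python sketch of the Receive_token actions of figure 5. The class layout, field names (dv, iet, ciq, checkpoints), and tuple-based dependency entries are illustrative assumptions rather than the authors' implementation; the acknowledgment to the sender and the per-LP rollback are stubbed out.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    cvt: int    # cluster virtual time of the saved state
    dv: dict    # dependency vector at checkpoint time: j -> (incarnation, interval)

class Cluster:
    def __init__(self, my_id):
        self.my_id = my_id
        self.max_inc = 0
        self.dv = {my_id: (0, 0)}   # current dependency vector
        self.iet = {}               # incarnation end table: j -> {(v, sii), ...}
        self.ciq = []               # cluster input queue: (dv_tag, payload) pairs
        self.checkpoints = []       # saved states
        self.rolled_back_to = None  # cvt restored by the last rollback, if any

    @staticmethod
    def _orphan(entry, v, sii):
        # A dependency (inc, idx) is orphan w.r.t. token (v, sii) when it falls
        # in the rolled-back suffix of incarnation v: inc == v and sii <= idx.
        return entry is not None and entry[0] == v and entry[1] >= sii

    def receive_token(self, j, v, sii):
        # (acknowledgment to the sender omitted in this sketch)
        self.iet.setdefault(j, set()).add((v, sii))
        # Discard external orphan messages waiting in the cluster input queue.
        self.ciq = [(tag, m) for tag, m in self.ciq
                    if not self._orphan(tag.get(j), v, sii)]
        # If the cluster's current state is orphan, restore the highest
        # non-orphan checkpoint and start a new incarnation.
        if self._orphan(self.dv.get(j), v, sii):
            s = max((c for c in self.checkpoints
                     if not self._orphan(c.dv.get(j), v, sii)),
                    key=lambda c: c.cvt)
            self.rolled_back_to = s.cvt      # all LPs roll back to s.cvt
            self.dv = dict(s.dv)
            self.max_inc += 1
            self.dv[self.my_id] = (self.max_inc, 0)

# Example: token (1, 3) orphans all dependencies on intervals >= 3 of incarnation 1.
c = Cluster(my_id=0)
c.dv[1] = (1, 5)
c.checkpoints = [Checkpoint(10, {0: (0, 1), 1: (1, 2)}),
                 Checkpoint(20, {0: (0, 2), 1: (1, 4)})]
c.ciq = [({1: (1, 4)}, "orphan"), ({1: (1, 1)}, "kept")]
c.receive_token(j=1, v=1, sii=3)
print(c.rolled_back_to, [m for _, m in c.ciq])   # -> 10 ['kept']
```

Note how the same orphan test serves both external messages (via their tagged DV) and the cluster's own state, matching the two branches of figure 5.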
4.3 Correctness Proof
The following lemma states that Dependency Vectors correctly track transitive dependency information.

Lemma 1 If s happens before u, then s.dv ≤ u.dv.

Proof: As in [6], the proof follows by induction on the length of the path from s to u.

    The next theorem proves that our protocol correctly completes the simulation in the presence of failures.

Theorem 2 At the end of the simulation, the following conditions are true:
    - All LPs have received all the messages that they would have received in a sequential simulation.
    - All LPs have processed all the messages in increasing order of their receive time.
    - All orphan states have been rolled back.
    - All orphan messages have been discarded.

Proof: To simplify the presentation, we assume that each cluster contains only a single LP. The proof can easily be extended to multiple LPs. First note that, in the presence of reliable delivery, the actions taken after receiving two tokens commute with each other, and the actions taken after receiving a token commute with a failure. Therefore, f concurrent failures are no different from the case where f processes fail one after another, such that between failures each process receives the others' failure announcements and takes no other action. Hence, we only consider the single, non-concurrent failure case.
    We have modeled a failure as a straggler event. However, a failure also results in the loss of the volatile state. We do not use any of the lost information, even if the failure were truly a straggler event. The only information we need is the received messages and the iet entries. This information is collected from the other processes. The only tricky case is when the sender itself has failed and cannot resend some message because the state that sent that message is lost. This is harmless because, according to the protocol, messages sent from a lost state are also orphan and need to be discarded upon receipt anyway.
    So our only remaining proof obligation is to show that, in the absence of failures, our protocol handles straggler messages and orphan states correctly. This follows directly from the proof of theorem 2 in [6]. The heart of that proof is that lemma 1 assures us that all orphan states are detected upon the arrival of the corresponding token.
4.4 Stable Global Virtual Time (SGVT)
    Global Virtual Time (GVT) is defined as the virtual time such that no state prior to GVT will ever be rolled back [3]. Traditional methods for computing GVT fail in the presence of failures: a failure of a cluster may result in the restoration of a state with virtual time less than GVT.
    It is interesting to note that our modeling of a failure as a straggler event can be used directly in the standard GVT algorithm to come up with a value, which we call SGVT, such that no state with virtual time less than SGVT will ever be rolled back. Since a failure is treated as a straggler, a potential failure of process P can be treated as an unacknowledged message with time-stamp stable(P).
    GVT is usually approximated by taking the minimum of the receive times of all unacknowledged messages and the local virtual times of all processes. We note that GVT can equally be approximated by the minimum of the receive times of all unacknowledged messages and of all unprocessed messages in the input queue of each process; for a process P, we denote these by unacked(P) and unprocessed(P) respectively. Now we can define SGVT as follows:

    T_i = min{stable(P_i), unacked(P_i), unprocessed(P_i)}
    SGVT = min{∀i : T_i}

For example, if P1 has stable(P1) = 12 and an unacknowledged message with receive time 15, while P2 has stable(P2) = 30 and an unprocessed message with receive time 9, then T_1 = 12, T_2 = 9, and SGVT = 9.

Theorem 3 No state with virtual time less than SGVT can ever be rolled back.

Proof: Every rollback has a first cause in a straggler or a failure. A failure cannot restore a state with a timestamp less than the global minimum of stable. A straggler cannot have a timestamp less than the global minimum of unacked and unprocessed. Hence the result follows.
    In addition to being useful for fossil collection and output commit, SGVT has another interesting application. The DV is used by a process to determine whether it needs to roll back due to a rollback of another process. We observe that only those entries need to be kept in the DV whose associated states have virtual time greater than SGVT. Dependency on a state with virtual time less than SGVT need not be tracked, because the corresponding state will never be rolled back. This reduces the overhead associated with the DV. In fact, the DV starts with only one entry, the process's own. As processes interact with one another, the size of the DV grows; however, SGVT also keeps increasing, so we expect the average number of entries in the DV to be sufficiently small.
4.5 Overheads
    Our scheme incurs the following overheads for providing fault-tolerance:
    Accessing stable storage: We need to periodically save checkpoints on stable storage. This seems a
necessary cost in the absence of redundant resources like those used for replication. We save checkpoints asynchronously, so computation is not blocked while stable storage is being accessed.
    Dependency information: We tag a DV with each inter-cluster message. We expect the number of inter-cluster messages to be much smaller than the total number of messages. The size of the DV is O(n) entries, where n is the number of clusters in the system. But, as explained in section 4.4, we expect the number of entries to be much smaller in practice.
    Cluster output queue: Only inter-cluster messages are saved in the COQ. So we expect this overhead to be much smaller than that for Time Warp.
    Clustered rollback: Rollback of a single LP means rollback of the entire cluster. This slows down the simulation. But each cluster rolls back at most once in response to each straggler or failure. There is no possibility of an avalanche of antimessages or of echoing [14]. This should compensate for the slowdown owing to the clustered rollback.

5 Implementation Issues
    So far we have discussed the modifications required in simulation schemes when failures can occur. Now we discuss some general issues that any distributed computation must address to survive failures:
    Failure Detection: In theory, it is impossible to distinguish a failed process from a very slow process [8]. In practice, many failure detectors have been built that work well in most practical situations [10]. Most of these detectors use a timeout mechanism.
    Stable Storage: Stable storage must be available across failures. In a multi-processor environment this is easy, as other processors can access the disk even if one of the processors fails. In a networking environment, the local disk may be inaccessible when the corresponding processor fails, so a network server must be used to make checkpoints stable.
    Process Identity: When a failed process is restarted, it may have a different port number or IP address. So location-independent identifiers should be used for inter-process communication.
    Environment Variables: If a process is restarted on a different processor, some inconsistency may arise due to a mismatch between the values of environment variables in the pre- and post-failure computation. Logging and resetting of environment variables is required.
Acknowledgement
    We would like to thank the anonymous referees, whose thoughtful comments have helped us improve the presentation of the paper.

References
 [1] D. Agrawal and J. R. Agre. Replicated Objects in Time Warp Simulations. Proc. Winter Simulation Conf., 657-664, 1992.
 [2] H. Avril and C. Tropper. Clustered Time Warp and Logic Simulation. Proc. 9th Workshop on Parallel and Distributed Simulation, 112-119, 1995.
 [3] S. Bellenot. Global Virtual Time Algorithms. Proc. Multiconference on Dist. Simulation, 122-127, 1990.
 [4] K. P. Birman. Building Secure and Reliable Network Applications. CT: Manning Pub. Co., 1996.
 [5] K. M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM TOCS, 3(1): 63-75, Feb. 1985.
 [6] O. P. Damani, Y. M. Wang and V. K. Garg. Optimistic Distributed Simulation Based on Transitive Dependency Tracking. Proc. 11th Workshop on Parallel and Distributed Simulation, 90-97, 1997.
 [7] E. N. Elnozahy, D. B. Johnson and Y. M. Wang. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. Tech. Rep. No. CMU-CS-96-181, Dept. of Computer Science, Carnegie Mellon Univ., ftp://ftp.cs.cmu.edu/user/mootaz/papers/S.ps, 1996.
 [8] M. J. Fischer, N. Lynch and M. S. Paterson. Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM, 32(2), 374-382, April 1985.
 [9] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. San Mateo, CA: Morgan Kaufmann Publishers, 117-119, 1993.
[10] Y. Huang and C. Kintala. Software Implemented Fault Tolerance: Technologies and Experience. Proc. 23rd Fault-Tolerant Computing Symp., 2-9, 1993.
[11] D. R. Jefferson. Virtual Time. ACM Trans. Prog. Lang. and Sys., 7(3), 404-425, 1985.
[12] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7), 558-565, 1978.
[13] Y. B. Lin, B. R. Preiss, W. M. Loucks, and E. D. Lazowska. Selecting the Checkpoint Interval in Time Warp Simulation. Proc. 7th PADS, 3-10, 1993.
[14] B. D. Lubachevsky, A. Schwartz, and A. Weiss. Rollback Sometimes Works ... if Filtered. Proc. Winter Simulation Conference, 630-639, 1989.
[15] D. M. Nicol and X. Liu. The Dark Side of Risk (What your mother never told you about Time Warp). Proc. 11th PADS, 188-195, 1997.
[16] H. Rajaei, R. Ayani, and L. E. Thorelli. The Local Time Warp Approach to Parallel Simulation. Proc. 7th PADS, 119-126, 1993.
[17] G. Suri, Y. Huang, Y. M. Wang, W. K. Fuchs and C. Kintala. An Implementation and Performance Measurement of the Progressive Retry Technique. Proc. IEEE ICPDS, 41-48, 1995.
[18] R. E. Strom and S. Yemini. Optimistic Recovery in Distributed Systems. ACM Trans. on Computer Systems, 204-226, Aug. 1985.
[19] Y. M. Wang, O. P. Damani, and V. K. Garg. Distributed Recovery with K-Optimistic Logging. Proc. 17th Intl. Conf. Dist. Comp. Sys., 60-67, 1997.