CMPT 401

Document Sample
CMPT 401 Powered By Docstoc
					   Chapter 7

   Physical clock synchronization
   Logical clock synchronization
       Causality relation
       Lamport’s logical clock
       Vector logical clock
       Multicast
       ISIS vector clock
   Snapshot
New Issues in DS
   Global time
   Event order
       e1 at 1:00pm on machine m1, e2 at
        1:01pm at machine m2. Which event
        happens earlier?
   Global state
       snapshot
   Mutual exclusion & Synchronization
Time and Clock
   Two roles of time
     - Defines temporal order among events
     - Duration (measured by timer)
   UTC (Universal Coordinated Time) is based
     on Cesium-133 atom oscillation; located at
    over 200 labs in the world
   With satellites, 0.5ms accuracy is possible.
     (100 MIPS  50,000 instructions in 0.5ms).
       Clock Skew
   Skew: Clock reading (from single
    clock) is location-dependent, e.g.,
    distance from satellite or clock
    source on a circuit board
   Drift: Multiple clocks.
        t = the real time
      Cp(t) = the reading of a clock p

        at time t (Cp(t)= t for ideal
      dCp(t) /dt = ticking rate (dCp(t)

        /dt = 1 for ideal clock)
    Consequence: An Example
   When each machine has its own clock, an
    event that occurred after another event may
    nevertheless be assigned an earlier time. But
    it is a different story in DS.
Physical Clock Synchronization
   Cristian’s Algorithm
   The Berkeley Algorithm
   Network Time Protocol (NTP)
Cristian’s Algorithm: Architecture
   WWV-node receiving UTC-signals, serving as
    the central UTC-time server (CUTCS) for the
       WWV is a short wave radio station in Colorado.
   Periodically every node in the DS sends a
    time request to the central UTC time server
   The CUTCS responds with its current time
Adjust Time
   When the client gets the reply, it simply
    set its clock to tUTC
       Time may run backward
   Introduce the change gradually
   Consider the time for message
    Adjust Time (continued)
                             Tp                  I
                 T1                          t

              Client                        Server
   Estimate Tp (propagation delay) from
      T1 – T0 = 2 x Tp + I. where I = processing time.
   Current time = t (server’s time in message) + Tp.
       Assumption?
The Berkeley Algorithm
   In Cristian’s algorithm the central time server
    was passive
   Now it’s active, i.e. it periodically polls all
    other nodes to hand out their current local
    times ci(t).
   Based on the answers it calculates a mean
    and tells all other machines to advance or
    slow down their clocks accordingly.
Relative Clock Synchronization
   Time server periodically sends its time to
    clients and asks for theirs.
   Clients respond with how far ahead or behind
    the server they are
   Time server uses the estimated local times
    for building the arithmetic mean
   Deviations from this arithmetic mean are sent
    to nodes enabling them to slow down
    respectively to speed up.
     The Berkeley Algorithm

a)   The time daemon asks all the other machines for their clock values
b)   The machines answer
c)   The time daemon tells everyone how to adjust their clock
   Cristian’s method and the Berkeley algorithm
    are intended for intranets
   Both may be improved with fault tolerance
       Instead of one UTC server in Cristian’s algorithm
        use n time servers and always take the first
        answer from whatever time serve
       Instead of taking the arithmetic mean from all
        clients in the Berkeley algorithm take the fault
        tolerance mean, i.e. skip deviations with a certain
Network Time Protocol (NTP)
   Goal
       absolute (UTC)-time service in large nets (e.g. Internet)
       high availability (via fault tolerance)
       protection against fabrication (via authentication)
   Architecture
       time-servers build up a hierarchical synchronization subnet
       all primary servers have an UTC-receiver
       secondary servers are synchronized by their parent primary
       all other stations are leaves on level 3 being synchronized by
        level 2 time servers
       accuracy of clocks decreases with increasing level number
       the net is able to reconfigure

Reliability from redundant paths, scalability,
authenticated time sources
Synchronization of Servers (NTP)
   Synchronization subnet can reconfigure if failures
       Primary having lost its UTC source can become a secondary
       Secondary having lost its primary can use another one
   Modes of synchronization:
       Multicast mode (for quick LANs, low accuracy)
            A server within a LAN periodically multicasts time to other
             leaves in the LAN which set their clocks assuming some delay
       Procedure-call mode (Cristian’s algorithm with medium
            A server responds to requests with its actual timestamp
       Symmetric mode (high accuracy)
            Pairs of servers exchange message containing times
     OSF DCE
         Time is an interval [t-e, t+e].

                                New time interval

Two intervals overlap  cannot say which time is earlier
(In case of overlap, Unix make should recompile).
Logical Clock Synchronization
   A powerful building block in DS
       Duplicate detection
       Cache consistency
       Leases
       Commitment
       …
Leslie Lamport

            Time, Clocks, and the Ordering of
            Events in a Distributed System

            Best known for his work on
            1)Temporal logic
The Paper
   Handles problems of clock drift in a DS
       Identifies main function of computer
        clocks, i.e. ordering of events
   Indicates which conditions clocks must
    satisfy to fulfill their role
   Introduces logical clocks
   Benefits of logical clocks?
       Needed for determining causality
Logical Time
   In many cases it’s sufficient just to order the
    related relevant events, i.e. we want to be
    able to position these events relatively, but
    not absolutely.
   Interesting: Relative position of an event on
    the time axis  no need for any scaling on
    this time axis
Ordering Events
   Event ordering linked with concept of
       Saying that event a happened before
        event b is same as saying that event a
        could have affected the outcome of event
       If events a and b happen on processes that
        do not exchange any data, their exact
        ordering is not important.
Causality Relationship
       p1            p2             p3           p4

                    q2         q3        q4      q5

R         r1         r2                   r3          r4

     p1        p2   p1    q2    p1       r3 transitive

    • Event changes state of process.
    • State remains same till next event occurs.
Formal Definition
   a b defined by:
       If a and b are in the same process, and a occurs
        earlier than b, then ab.
       If a is a sending event and b is receiving event of
        same message, then a b.
       If ab and bc, then ac. [Transitive]
   If a b, then a causally precedes (or
    happened before) b; a and b are causally
   a and b are concurrent if neither ab nor
Message-Related Events
   Sending event
   Receiving event
   Message arrival (at kernel) and delivery
     (to user process): Kernel can control
    timing of delivery after arrival.

e11e12e21e22e32 , furthermore e31e32,
whereas e31 is neither related (has happened before) to
e11, nor to e12, nor to e21, nor to e22.
e31 is concurrent to e11, e12, e21, and e22.
Lamport Clock
   Suppose E= {events}, each e in E gets a
    Lamport time stamp L(e), as follows:
       1. e is a pure local event or a sending-event: if e
        has no local predecessor, then L(e) := 1,
        otherwise there is a local predecessor e’, thus
        L(e) := L(e’) + 1
       2. e is a receiving event, with a corresponding
        sending-event s: if e has no local predecessor,
        then L(e) = L(s) +1, otherwise there is a local
        predecessor e’, thus L(e) := max{L(s),L(e’)} + 1

Note: Each local counter is incremented with each local
event. In a communication we adjust the involved
counters of the two communicating nodes to be consistent
with the happened-before-relation.
Remark: Same mechanism can be used to adjust clocks on
different nodes. The Lamport time is consistent with the
happened-before-relation, i.e. if x y, then L( x)<L( y),
but not vice versa.
Adjusting Clocks

Without adjusting   With adjusting
local clocks        local clocks
   Limitation on Lamport Clocks

From Lamport time values you cannot conclude whether
two events are in the happen-before relationship,e.g.
e11 and e32.
Total Ordering of Events
   Lamport-time only gives us a partial-
    ordering of distributed events.
   To implement the total ordering:
       Each processor is assigned by a unique id
       Given two events e1 and e2, e1 is ordered
        before e2 if L(e1) < L(e2) or L(e1)=L(e2)
        and Id(e1) < Id(e2)
Holding Back Deliveries
   Delay the delivery of messages that
    arrived “too soon”
       Useful when delivering messages from
        kernel to processes
       Hold back the delivery of M to process P
        until there is a guarantee that no message
        M’ with L(M’) < L(M) will arrive at P in the
   Assumption: messages from a particular
    source arrive in the FIFO order
   Each site maintains a set of message queues,
    one for each other site
       When a message arrives, placed in the
        corresponding queue
       When all queues are non-empty, compare the
        timestamps of the messages at the heads of the
        queues, and deliver the messages with the oldest
   All message queues need be non-
       Normally not true.
       Require multicast to solve the problem.
   With Lamport clock, L(a)<L(b) does not
    mean ab.
       Unnecessarily delay some messages.
       Vector Clock.
     Event Counter
                                 P1   P2   P3
                             2                  2
Event counter at Pi :
Initialized at 0 and                  2         3
incremented for each event            3
Vector Time
   Assumption: n tasks (processes) Pi in DS
   Each Pi has its own local clock being a
    n-dimensional vector (initially zeroed)
   Vi(a) is timestamp of event a at process Pi
   Vi[i] = number of local events at Pi
   Vi[j] is Pi best guess of how many events
    have been on Pj
   There is a DS with n distributed processes. n-
    dimensional vector Vi is vector-time of process Pi if it
    is built according to the following rules:
       (1) Initially, Vi = (0, …, 0) for all 1<=i<=n
       (2) For a local event on process Pi: Vi[i]++
       (3) Pi includes the value t = Vi in every msg m
       (4) When Pj receives a message m with timestamp t, it sets
        Vj[k]= max{t[k], Vj [k]}, for 1<=k<=n and k != j
   Communication cost?
       Little overhead compared to Lamport clock
          P1         P2         P3
       000             000           000
       100             010           001
       300          240
                          250        242
       450                           243
Time   550      273
   We define global V(e) = Vi(e) if event e
    happens in Pi
   We write V(a)  V(b) if
       V(a)[k]  V(b)[k] for all k.
       Here V(a)[k] denotes the kth component of V(a).
   We write V(a) < V(b) if
       V(a)[k]  V(b)[k] for all k, and
       V(a)[j] < V(b)[j] for at least one j
Vector Time Characteristics
   The following inter relationships
    between causality or the happened-
    before relation and vector-time hold:
       A.) ee’ iff V(e) < V(e)
       B.) e||e’ iff V(e) || V(e’)
   The vector-time is the best known
    estimation for global sequencing that is
    based only on local information.
       Proof                        P1   P2        P3

   a b iff V(a) < V(b)
 Proof :
For A fixed b, a  b iff a is in
  shaded area iff each
  component of V(a)                                    a
  corresponding component          t1                    t3
  of V(b).

                                         b    t2
   A message is sent to all the members of
    a group
       Sending video stream to a set of customers
       Implementing a chat program
       Sending updates to a group of replica
IPv4 Multicast Addresses
   Class D (starts with bit sequence1110)
 to (about
    228268 million)
 is for “all systems on this
 ~ are for
    multimedia conference calls
Causal Ordering of Messages
   Suppose m1 and m2 are two messages
    being received at the same node i. A
    set of messages is causally ordered if
    for all pairs <m1, m2> the following
    send(m1)  send(m2) 
        Causality Violation
             Suppose M1’s sending event happened before M3’s sending
             Causality violation occurs if M1 is delivered after M3
            (In particular, non-FIFO delivery is causality violation).

                                   P1     P2         P3
•Delay the delivery    foo to P2
of M3 to P2 until
M1 arrives.
•ISIS system using

Formal Description of ISIS
Clock ICi
   Pi initializes its clock ICi = [0,…,0].
   For each msg sending event by Pi
       ICi[i]++
       Pi attaches ICi to message it sends.
   Upon receiving msg M from Pj with M.ts, Pi checks if
       1) M.ts[j] == ICi[j] + 1 (M is next msg expected from
       2) ICi[k]  M.ts[k] for all other k (all msgs from Pk that
        sender Pj has received have been received by Pi)
       If both are satisfied, Pi delivers M after ICi[j]++
       Otherwise, Pi puts M in hold-back Q until they are satisfied.
             P1       P2      P3
  foo to P2 100       000   001 “Where is foo?”
         101        M1             M2.ts[1] > IC3[1]+1
         201                M2    Put M2 in Hold-back Q
      “foo is at P2”

                                    M1.ts[1] = IC3[1]+1
                                  IC3  101; deliver M1
                                  IC3  201; deliver M2

     • Note: jth component of M.ts is
       sequence number of latest msg sent
       by Pj that is known to sender of M
   Show that msgs are delivered in timestamp order.
   Suppose not
       Let m (m’) be event of sending message M (M’)
       Assume Pi delivered msg M (from Pk) before M’ (from Pj),
        even though
                    M’.ts (= ICj(m’)) < M.ts (=ICk(m)) …….(A)      (1)
    (a) Just before Pi delivered M’:
                    ICi[j]+1 = ICj(m’)[j] hence ICi[j] < ICj(m’)[j] (2)
    (b) Delivery of M would have resulted in
                   ICi[j]* = ICk(m)[j]
         at time of delivery
       (a) and (b) contradict (A) since (b) took place before (a),
        hence ICi[j]*  ICi[j]
   Show the system starvation-free: no
    message will wait forever in the hold-
    back Q
   Assume Q is the hold-back queue in Pi
    and is non-empty. Let M be a msg in Q
    which is not preceded by any other msg
    in Q. Suppose M was sent by Pj.
   Assume ICi[k] < M.ts[k] for some k (!=j), i.e.,
    condition (2) is violated; want to derive contradiction
    from this.
   Let M’ be latest msg from Pk that Pj delivered prior
    to sending M so that M’.ts < M.ts and M’.ts[k] =
   If Pi hasn’t delivered copy of msg M’ from Pk , then
    M’ with M’.ts < M.ts is in holdback Q of Pi,
    contradicting assumption that M is not causally
    preceded by any other msg in holdback Q of Pi.
   So Pi must have delivered copy of msg M’ from
    Pk.Thus ICi[k]  M’.ts[k] = M.ts[k], contradicting
    ICi[k] < M.ts[k]
   Must give up assumption that Pi cannot deliver M.
Proof Illustration

    Pj                             Pk

         M                   M’

                          Pi already delivered M’
                   Pi      ICi[k]  M.ts[k]
Global state of a DS
   Consists of:
       Local state of each node (task, process)
       Messages in transit
   Why interested in a global state?
       Suppose local computation has stopped on each
        node and there are no pending messages, then
            1. Distributed application has terminated successfully? or
            2. Deadlock?
   Problem: lack of global time
       Consequences?
   How: take a snapshot
 Snapshots (taken at 2:00pm
 by local clocks)
A         B               A          B           A          B
$100          $0                         $0        $100
                   1:59pm                     2:00pm

sum = $100                    sum = $0            sum = $200
    (a)                  (b)                          (c)
          Snapshots taken at
Census Taking in Ancient
    Village            Village



Want to take census counting all people,
some of whom may be traveling on
Census Taking Algorithm
   Close all gates into/out of each village
    (process) and count people (record process
    state) in village
   Open each outgoing gate and send official
    with a red cap (special marker message).
   Open each incoming gate and count all
    travelers (record channel state= messages
    sent but not received yet) who arrive ahead
    of official (with a red cap).
   Tally the counts from all villages.
         Chandy/Lamport Snapshot Algorithm
   All processes are initially white: Messages sent by white(red)
    processes are also white (red)
   MSend [Marker sending rule for process P]
        Suspend all other activities until done
        Record P’s state
        Turn red
        Send one marker over each output channel of P.
   MReceive [Marker receiving rule for P]
    On receiving marker over channel C,
        if P is white { Record state of channel C as empty;
                       Invoke MSend; }
        else record the state of C as sequence of white messages received
         since P turned red.
   Stop when marker is received on each incoming channel
   MSend and MReceive are atomic.
   No process failures, no message loss
   Point to point message delivery is
       How to guarantee it? ISIS clock
   Network is strongly connected.
       Why?
Snapshot (1)
 A          B             A            B
                                                msgs arriving
                                                before maker
 $100           $0                         $0     constitute
                     $0                         channel state

 sum = $100                   sum = $100
     (a)                         (b)
     OK                    OK
           Need not use time.
Snapshots (2)
         A         B             A           B

  $100                    $100                   $0

         sum = $200               sum = $100
             (c)                       (d)
     Cannot happen               Will be like this
   Cut C divides all events to PC (those
    which happened in the past relative to
    C) and FC (future events).
   Cut C is consistent if there is no
    message whose sending event is in FC
    and whose receiving event is in PC.
Progress shown by cuts
       p1      p2            p3                         p4

                        q1            q2       q3
         1 2    3   4             5        7        8

    There are 5*4 = 20 possible cuts.

      p1           p2         p3          p4

                  q2     q3        q4     q5
      q1                      M

  R    r1          r2              r3        r4
            consistent                  inconsistent
              cut                           cut

State recorded by SNAPSHOT consistent cut
   Cut C is consistent  C doesn’t
    contradict sequence of events
    experienced by any site  can assume
    it did exist at the same time
   Can use snapshot as checkpoint, from
    which activity in distributed system can
    be resumed after crash