04.1 clocks

					Distributed Time and Clock Synchronization
1.   Time, clocks
2.   UTC
3.   Cristian's algorithm
4.   The Berkeley algorithm
5.   NTP
6.   Logical Clocks
7.   Vector clocks

Time, clocks

            Why Timestamps in Systems?

• Do some “precise” performance measurements
• Guarantee “up-to-date” or recentness of data
• Temporal ordering of events produced by concurrent processes
• Synchronization between senders and receivers of messages
• Coordination of joint activities
• Serialization of concurrent accesses to shared objects
• ……

                          Physical time

• Solar time
   – 1 sec = 1 day / 86400
   – Problem: days are of different lengths (due to tidal friction, etc.)
   – mean solar second: averaged over many days
• International atomic time (TAI)
   – 1 sec = time for a Cesium-133 atom to make 9,192,631,770 state
     transitions
   – TAI time is the number of Cesium-133 transitions since
     midnight on Jan 1, 1958, divided by 9,192,631,770.
   – Accuracy: better than 1 second in six million years
   – Problem: Atomic clocks do not keep in step with solar time

        Coordinated Universal Time (UTC)

• Based on the atomic time (TAI)
• A leap second is occasionally inserted or deleted to keep in
  step with solar time

                           Computer Clocks

• CMOS clock circuit driven by a quartz oscillator
    – battery backup to continue measuring time
      when power is off
• The circuit has a counter and a register. The
  counter decrements by 1 for each oscillation;
  an interrupt is generated when it reaches 0 and
  the number in the register is loaded to the
  counter. Then, it repeats…
• OS catches interrupt signals to maintain a
  computer clock
    – e.g., 60 or 100 interrupts per second
    – Programmable Interrupt Controller (PIC)
    – Interrupt service procedure increments a counter by 1
      for each interrupt

                   Clock drift and clock skew

• Clock Drift
    – Clocks tick at different rates
         • Ordinary quartz clocks drift
           by ~ 1 sec in 11-12 days (drift rate ~ 10^-6)
         • High precision quartz clocks:
           drift rate is ~ 10^-7 or 10^-8
    – Create ever-widening gap in
      perceived time
• Clock Skew (offset)
    – Difference between two clocks
      at one point in time

[Figure: perfect clock; drift with a slow computer clock; drift with a fast computer clock]

                      Dealing with drift

• No good to set a clock backward
   – Illusion of time moving backwards can confuse message ordering
     and software development environments

• Go for gradual clock correction
   – If fast: Make clock run slower until it synchronizes
   – If slow: Make clock run faster until it synchronizes

               Linear compensating function

• OS can do this: Change rate at
  which it requests interrupts
    – e.g.: if the system generates an
      interrupt every 17 ms but the clock
      is too slow, request an
      interrupt every (e.g.) 15 ms instead

• Adjustment changes slope of
  system time: Linear
  compensating function


• After synchronization period is reached
   – Resynchronize periodically
   – Successive application of a second linear compensating function
     can bring clock closer to the true slope

• Keep track of adjustments and apply continuously
   – UNIX adjtime system call:
   int adjtime(const struct timeval *delta, struct timeval *olddelta);
   – adjusts the system's notion of the current time, advancing or
       retarding it, by the amount of time specified in the struct timeval
       pointed to by delta.
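
The gradual correction above can be sketched in a few lines. This is only an illustration, assuming a software clock of the form C = a*H + b layered on a hardware time H; the class name `SlewedClock` and its methods are hypothetical and are not an operating system API.

```python
class SlewedClock:
    """Software clock C = a * H + b that is slewed, never stepped (illustrative)."""

    def __init__(self):
        self.a = 1.0  # rate applied to hardware time H
        self.b = 0.0  # constant offset

    def read(self, hw_time):
        return self.a * hw_time + self.b

    def slew(self, hw_now, correction, period):
        """Absorb `correction` seconds over the next `period` hardware seconds."""
        new_a = self.a + correction / period
        if new_a <= 0:
            raise ValueError("correction too large: clock would stop or run backward")
        now = self.read(hw_now)          # keep the reading continuous at hw_now
        self.a = new_a
        self.b = now - new_a * hw_now


clock = SlewedClock()
# The clock is 2 s fast: lose those 2 s gradually over 10 s of hardware time.
clock.slew(hw_now=100.0, correction=-2.0, period=10.0)
print(clock.read(100.0), clock.read(110.0))
```

The clock keeps advancing during the whole period, only at a lower rate, so applications never see time move backward. A fuller version would restore the nominal rate once the period has elapsed.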

   The issue of Time in distributed systems

• A quantity that we often have to measure accurately
   – necessary to synchronize a node’s clock with an authoritative external
     source of time
       • E.g.: timestamps for electronic transactions
            – both at merchant’s & bank’s computers
            – auditing

• An important theoretical construct in understanding how
  distributed executions unfold
   – Algorithms for several problems depend upon clock synchronization
       • timestamp-based serialization of transactions for consistent updates of
         distributed data
       • Kerberos authentication protocol
       • elimination of duplicate updates

             Clock Synchronization

• When each machine has its own clock, an event
  that occurred after another event may
  nevertheless be assigned an earlier time.

                   Fundamental limits

• The notion of physical time is problematic in distributed systems:
   – limitations in our ability to timestamp events at different nodes
     sufficiently accurately to know the order in which any pair of events
     occurred, or whether they occurred simultaneously.


                        Getting UTC

• Attach GPS receiver to each computer
   – ± 1 ms of UTC
• Attach WWV radio receiver
   – Obtain time broadcasts from Boulder or DC
   – ± 3 ms of UTC (depending on distance)
• Attach GOES receiver (Geostationary Operational
  Environmental Satellites)
   – ± 0.1 ms of UTC
• Not a practical solution for every machine
   – Cost, size, convenience, environment

                        Getting UTC

• Synchronize from another machine
   – One with a more accurate clock

• Machine that provides time information
   – Time server

                Synchronization of physical clocks

• D: synchronization bound
• S: source of UTC time, t ∈ I (I: an interval of real time)
• External synchronization:
   – |S(t) - Ci(t)| < D
   – Clocks are accurate within the bound D
• Internal synchronization:
   – |Ci(t) - Cj(t)| < D
   – Clocks agree within the bound D
• External sync within bound D ⇒ internal sync within bound 2D

                         Correctness of clocks

• Hardware correctness:
   – (1 - p)(t’ - t) <= H(t’) - H(t) <= (1 + p)(t’ - t)
   – There can be no jumps in the value of H/W clocks
• Monotonicity:
   – t’ > t → C(t’) > C(t)
   – A clock only ever advances
   – Even if a clock is running fast, we only need to change the rate at which
     updates are made to the time given to apps
       • can be achieved in software: Ci(t) = a Hi(t) + b
• Hybrid:
   – monotonicity + drift rate bounded bet. sync. points (where clock value can
     jump ahead)

                      Synchronous systems

• P1 sends its local clock value t to P2
   – P2 can set its clock value to (t + Ttransmit)
   – Ttransmit can be variable or unknown
       • resource competition bet. processes
       • network congestion
• u = (max - min)
   – uncertainty in Ttransmit
       • skew can be as large as u if P2 sets its clock to (t + min) or (t + max)
   – If P2 sets its clock value to t + (max+min)/2, then skew <= u/2
• Optimal bound for N processes: u (1 - 1/N)
• In asynchronous systems: Ttransmit = min + x, where x >= 0
   – Only the distribution of x may be measurable, for a given installation
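
The midpoint rule can be checked numerically. A small sketch, assuming the bounds min and max on Ttransmit are known; the function names are illustrative.

```python
def set_from_message(t, t_min, t_max):
    """Return (clock value for P2, worst-case skew) using the midpoint rule."""
    u = t_max - t_min                       # uncertainty in Ttransmit
    return t + (t_min + t_max) / 2, u / 2


def optimal_bound(u, n):
    """Optimal skew bound u * (1 - 1/N) for N processes."""
    return u * (1 - 1 / n)


print(set_from_message(1000, 2, 10))
print(optimal_bound(8, 4))
```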

              Clock Synchronization Algorithms

• The relation between
  clock time and UTC when
  clocks tick at different rates


        Synchronizing Clocks by using RPC

• Simplest synchronization technique
   – Make an RPC to obtain time
   – Set time

[Figure: client asks the server "What's the time?" and sets its clock from the reply]

• Does not count network or processing latency

                    Cristian's Algorithm

• Getting the current time from a time server.

       Time servers: Cristian’s algorithm

[Figure: process p exchanges messages with time server S, a receiver of UTC signals]

• Tround := total round-trip time
• t := time value in message mt
• estimate := (t + Tround/2)
• Time by S’s clock when reply msg arrives: within [t+min, t+Tround-min]
• Accuracy: ±(Tround/2 - min)

             Limitations of Cristian’s algorithm

• Variability in estimate of Tround
   – can be reduced by repeated requests to S & taking the minimum value of Tround
• Single point of failure
   – group of synchronized time servers
       • multicast request & use only 1st reply obtained
• Faulty clocks:
   – f: #faulty clocks, N: #servers
       • N > 3f, for the correct clocks to achieve agreement
• Malicious interference
   – Protection by authentication techniques

                 Cristian’s algorithm

Compensate for network delays (assuming symmetric)
• client sends a request at T0
• server replies with the current clock value Tserver
• client receives response at T1
                                               T T
• client sets its clock to: Tclient  Tserver  1 0

             Cristian’s algorithm: example

• Send request at 5:08:15.100 (T0)
• Receive response at 5:08:15.900 (T1)
   – Response contains 5:09:25.300 (Tserver)
• Round-trip time is T1 − T0
     5:08:15.900 - 5:08:15.100 = 800 ms
• Best guess: timestamp was generated 400 ms ago
• Set time to Tserver + round-trip-time/2
     5:09:25.300 + 400 ms = 5:09:25.700
• Accuracy: ± round-trip-time/2
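
The worked example can be reproduced with a short sketch. It assumes symmetric network delay, as the slide does; the function name `cristian_estimate`, the optional `t_min` parameter and the calendar date are illustrative choices.

```python
from datetime import datetime, timedelta


def cristian_estimate(t0, t1, t_server, t_min=timedelta(0)):
    """Return (new client time, error bound) from one request/response pair."""
    round_trip = t1 - t0
    estimate = t_server + round_trip / 2
    error_bound = round_trip / 2 - t_min    # t_min: minimum one-way travel time
    return estimate, error_bound


day = datetime(2000, 1, 1)                  # arbitrary date, only the time matters
t0 = day.replace(hour=5, minute=8, second=15, microsecond=100_000)
t1 = day.replace(hour=5, minute=8, second=15, microsecond=900_000)
t_server = day.replace(hour=5, minute=9, second=25, microsecond=300_000)

estimate, bound = cristian_estimate(t0, t1, t_server)
print(estimate.time(), bound)
```

With the values from the example, the estimate is 5:09:25.700 with an error bound of 400 ms. Passing a known minimum travel time as `t_min` tightens the bound.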

       Cristian’s algorithm: error bound

Tmin: Minimum message travel time

Accuracy: ± ((T1 - T0)/2 - Tmin)

        Problems with Cristian’s algorithm

• Server might fail
• Subject to malicious interference

The Berkeley algorithm

                   The Berkeley algorithm

• Gusella & Zatti (1989)
   – Co-ordinator (master) periodically polls slaves
      • estimates each slave’s local clock (based on RTT)
      • averages the values obtained (incl. its own clock value)
      • ignores any occasional readings with RTT higher than max
   – Slaves are notified of the adjustment required
      • This amount can be positive or negative
      • Sending the updated current time would introduce further uncertainty,
        due to message transmit delay
   – Elimination of faulty clocks
      • averaging over clocks that do not differ from one another more than a
        specified amount
   – Election of new master, in case of failure
      • no guarantee for election to complete in bounded time

                The Berkeley Algorithm (II)

a)   The time daemon asks all the other machines for their clock values
b)   The machines answer
c)   The time daemon tells everyone how to adjust their clock
                  Berkeley Algorithm

• Gusella & Zatti, 1989
• Aim: clocks of a group of machines as close as possible
• Assumes no machine has an accurate time source (i.e., no
  differentiation of client and server)
• Obtains average from participating computers
• Synchronizes all clocks to average

                    Berkeley Algorithm

One machine is elected (or designated) as the master; others
   are slaves:
1. Master polls all slaves periodically, asking for their time
   – Cristian’s algorithm can be used to obtain more accurate clock
     values from other machines by counting network latency
2. When results are in, compute the average
   – Including master’s time
3. Send each slave the offset by which its clock needs to be adjusted
   – Avoids the network-delay problems of sending a timestamp

                   Berkeley Algorithm

• Algorithm has provisions for ignoring readings from
  clocks whose skew is too great
   – Compute a fault-tolerant average

• Any slave can take over the master if master fails
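
A minimal sketch of the master's computation, assuming the readings have already been compensated for network latency. How the fault-tolerant average picks its clocks is not specified above; here, as an assumption, readings further than `max_skew` from the median are ignored. The name `berkeley_offsets` and the sample values (in minutes) are illustrative.

```python
from statistics import median


def berkeley_offsets(readings, max_skew):
    """Map each machine to the offset it should apply to its clock.

    readings: machine name -> estimated clock value (master included).
    """
    mid = median(readings.values())
    # Assumed fault-tolerance rule: only clocks close to the median are averaged.
    trusted = [t for t in readings.values() if abs(t - mid) <= max_skew]
    average = sum(trusted) / len(trusted)
    # Send offsets, not the new time, so transmit delay adds no extra error.
    return {name: average - t for name, t in readings.items()}


offsets = berkeley_offsets({"master": 180, "A": 205, "B": 170, "C": 600}, max_skew=60)
print(offsets)
```

Machine C is far from the others, so it does not influence the average, but it still receives an offset that brings it back in line.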

Berkeley Algorithm: example


             Network Time Protocol (NTP)

• NTP is the most commonly used Internet time protocol (RFC 1305).
• Computers often include NTP software in the OS. The client software
  periodically gets updates from one or more servers (and averages them).
• Time servers listen for NTP requests on port 123, and reply with a UDP/IP
  data packet in NTP format, which carries 64-bit timestamps in UTC
  seconds since Jan 1, 1900 with a resolution of about 200 picoseconds.
• Much NTP client software for PCs gets the time from a single server only
  (no averaging). Such a client implements SNTP (Simple Network Time
  Protocol, RFC 2030), a simple version of NTP.

                    Averaging algorithms

• Divide time into fixed-length re-synchronization
  intervals: [T0 + iR, T0 + (i+1)R]
   – At the beginning of an interval, each machine broadcasts the
     current time according to its clock
      • … and starts a local timer to collect all incoming broadcasts during a
        time interval S
   – When the broadcasts have been received, a new time value is computed:
      • Average
      • Average after discarding the m lowest and the m highest values
          – … tolerate up to m faulty machines
      • May also correct each value based on estimate of propagation time from
        the source machine
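
The "discard the m lowest and m highest" rule can be sketched as follows. The function name `trimmed_average` and the sample values are illustrative.

```python
def trimmed_average(values, m):
    """Average the received clock values after dropping the m lowest and m highest."""
    if len(values) <= 2 * m:
        raise ValueError("need more than 2*m values")
    kept = sorted(values)[m:len(values) - m]
    return sum(kept) / len(kept)


# Two faulty machines report 0 and 500; with m = 1 both are discarded.
print(trimmed_average([100, 101, 99, 500, 0], m=1))
```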

             NTP: An Internet-scale time protocol

• Statistical filtering of timing data
   – discrimination based on quality of data from different servers
• Re-configurable inter-server connections
   – logical hierarchy
• Scalable for both clients & servers
   – Clients can re-sync. frequently to offset drift
• Authentication of trusted servers
   – … and also validation of return addresses

Sync. accuracy: ~10s of milliseconds over Internet paths,
                ~1 millisecond on LANs

        NTP Synchronization Subnets

[Figure: synchronization subnet; primary servers (stratum 1) at the top, stratum 2 and stratum 3 servers below]

• High stratum # → server more liable to be less accurate
• Node-to-root RTT as a quality criterion
• 3 modes of synchronization:
   – multicast: acceptable for high-speed LAN
   – procedure-call: similar to Cristian’s algorithm
   – symmetric: between a pair of servers
• All modes rely on UDP messages.

           Message pairs bet. NTP peers (I)

[Figure: Server A sends m at Ti-3; Server B receives it at Ti-2 and sends m' at Ti-1; Server A receives m' at Ti]

• Each message contains the local times when the previous
  message was sent & received, and the local time when the
  current message was sent.
• There can be a non-negligible delay bet. the arrival of one
  message & the dispatch of the next.
• Messages may be lost
• Offset oi: estimate of the actual offset bet. two clocks,
  as computed from a pair of messages
• Delay di: total transmission time for the message pair

             Message pairs bet. NTP peers (II)

       Ti-2 = Ti-3 + t + o, where o is the true offset and
                            t, t’ are the transmission times of m, m’
       Ti = Ti-1 + t’ - o

       di = t + t’ = Ti-2 - Ti-3 + Ti - Ti-1

       o = oi + (t’ - t)/2
       oi = (Ti-2 - Ti-3 + Ti-1 - Ti) / 2

       oi - di/2 <= o <= oi + di/2

Delay di is a measure of the accuracy of the estimate of offset

             NTP data filtering & peer selection

• Retain 8 most recent <oi, di > pairs
   – compute “filter dispersion” metric
       • higher values → less reliable data
       • The estimate of offset with min. delay is chosen
• Examine values from several peers
   – look for relatively unreliable values
• May switch the peer used primarily for sync.
• Peers with low stratum # are more favored
   – “closer” to primary time sources
• Also favored are peers with lowest sync. dispersion:
   – sum of filter dispersions bet. peer & root of sync. subnet
• May modify local clock update frequency wrt observed
  drift rate
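
The per-peer filtering step can be sketched as below. Only the "keep the 8 most recent pairs and pick the offset with minimum delay" rule is shown; the filter dispersion metric is left out, and the class name `PeerFilter` is illustrative.

```python
from collections import deque


class PeerFilter:
    """Keeps the most recent <offset, delay> pairs measured against one peer."""

    def __init__(self, size=8):
        self.samples = deque(maxlen=size)   # oldest pair is dropped automatically

    def add(self, offset, delay):
        self.samples.append((offset, delay))

    def best_offset(self):
        """Offset measured with the smallest delay, i.e. the most trustworthy one."""
        return min(self.samples, key=lambda pair: pair[1])[0]


f = PeerFilter()
for offset, delay in [(-320, 70), (-325, 50), (-310, 90)]:
    f.add(offset, delay)
print(f.best_offset())
```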
    NTP synchronization subnet

1st stratum: machines connected directly to accurate time source

2nd stratum: machines synchronized from 1st stratum machines


                             NTP goals

• Enable clients across Internet to be accurately
  synchronized to UTC despite message delays
   – Use statistical techniques to filter data and gauge quality of results
• Provide reliable service
   – Survive lengthy losses of connectivity
   – Redundant paths
   – Redundant servers
• Enable clients to synchronize frequently
   – offset effects of clock drift
• Provide protection against interference
   – Authenticate source of data

               NTP Synchronization Modes

• Multicast (for quick LANs, low accuracy)
    – server sends its actual time to its leaves in the LAN

•   Remote Procedure Call (medium accuracy)
    – server responds to requests with its actual timestamp
    – like Cristian’s algorithm

•   Symmetric mode (high accuracy)
    – used to synchronize between the time servers

    All messages delivered unreliably with UDP
                               Symmetric mode

• The delay between the arrival of a request (at server B) and the
  dispatch of the reply is NOT negligible:

                                               Ti-2 Ti-1
                  Server B
                                         m                 m’
                  Server A
                                        Ti-3               Ti      time

• Delay = total transmission time of the two messages
                      di = (Ti – Ti-3 ) – (Ti-1– Ti-2)
• Offset of clock A relative to clock B:
    – Offset of clock A: oi = ((Ti-2 – Ti-3) + (Ti-1 – Ti)) / 2
    – Set clock A: Ti + oi
    – Accuracy bound: di /2

         Symmetric mode (another expression)

                                              Ti-2 Ti-1
                   Server B
                                        m                  m’
                   Server A
                                       Ti-3                 Ti   time
• Delay = total transmission time of the two messages
    di = (Ti – Ti-3 ) – (Ti-1– Ti-2)
• Clock A should set its time to:
  Ti-1 + di/2, which is the same as Ti + oi

     Symmetric NTP example

                Ti-2 =800 Ti-1 =850
 Server B
                m              m’
 Server A
            Ti-3 =1100     Ti =1200   time

Offset oi=((800 – 1100) + (850 – 1200))/2 = – 325
Set clock A to: Ti + oi = 1200 – 325 = 875
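
The same numbers can be run through a short sketch of the offset and delay formulas. The function name `ntp_offset_delay` is illustrative.

```python
def ntp_offset_delay(t_i3, t_i2, t_i1, t_i):
    """Offset oi and delay di from one message pair (m sent at Ti-3, m' received at Ti)."""
    offset = ((t_i2 - t_i3) + (t_i1 - t_i)) / 2
    delay = (t_i - t_i3) - (t_i1 - t_i2)
    return offset, delay


offset, delay = ntp_offset_delay(t_i3=1100, t_i2=800, t_i1=850, t_i=1200)
new_time_a = 1200 + offset          # Ti + oi
same_result = 850 + delay / 2       # Ti-1 + di/2
print(offset, delay, new_time_a, same_result)
```

Both expressions give 875 for clock A, and the accuracy bound di/2 is 25.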

                     Improving accuracy

• Data filtering from a single source
   – Retain the multiple most recent pairs < oi, di >
   – Filter dispersion: oj corresponding to the smallest dj

• Peer-selection: synchronize with lower stratum servers
   – lower stratum numbers, lower synchronization dispersion

Logical Clocks
               Motivation of logical clocks

• Cannot synchronize physical clocks perfectly in distributed
  systems. [Lamport 1978]

• Main function of computer clocks – order events
   – If two processes don’t interact, there is no need to sync clocks.
   – This observation leads to “causality”


• Order events with happened-before (→) relation
   – a → b
       • a could have affected the outcome of b
   – a, b take place on different processes that don’t exchange data
       • Their relative ordering does not matter
       • They are concurrent: a || b

         Formal definition of happened-before

1.   If a and b take place in the same process
     –   a comes before b, then a → b
2.   If a and b take place in different processes
     –   a is a “send” and b is the corresponding “receive”, then a → b
3.   Transitive: if a → b and b → c, then a → c

Partial ordering – unordered events are concurrent

                         Logical clocks

•   A logical clock is a monotonically increasing software
    counter. It need not relate to a physical clock.
    – Corrections to a clock must be made by adding, not subtracting

•   Assign “time” value to each event
    – if a → b then clock(a) < clock(b)

               Event counting example

• Three processes: P0, P1, P2, events a, b, c, …
• Local event counter in each process.
• Processes occasionally communicate with each other,
  where inconsistency occurs, …

               Bad ordering: e → h, f → k

                   Lamport’s algorithm, 1978

•    Each process Pi has a logical clock Li which is used to apply logical
     timestamps to events.
1.   Li is initialized to 0;
2.   Update Li:
     –   LC1: Li is incremented by 1 before each event at process Pi
     –   LC2: when process Pi sends message m, it piggybacks t = Li to m
     –   LC3: when Pj receives (m,t) it sets Lj := max{Lj, t} , and then applies LC1
         to increment Lj for event receive(m)
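
Rules LC1 to LC3 map directly onto a small class. This is a sketch: the class and method names are illustrative, and message transport is left out.

```python
class LamportClock:
    def __init__(self):
        self.time = 0                      # Li is initialized to 0

    def tick(self):
        """LC1: increment before each local event; returns the event's timestamp."""
        self.time += 1
        return self.time

    def send(self):
        """LC1 + LC2: timestamp the send event and return t to piggyback on m."""
        return self.tick()

    def receive(self, t):
        """LC3: take the max with the received t, then apply LC1 for receive(m)."""
        self.time = max(self.time, t)
        return self.tick()


p, q = LamportClock(), LamportClock()
a = p.tick()          # local event at p
t = p.send()          # p sends m carrying t
b = q.receive(t)      # q receives (m, t)
print(a, t, b)
```

The receive event is always stamped later than the matching send, which is what the clock condition requires.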

         Problem: Identical timestamps

Concurrent events (e.g., a, g) may have the same timestamp

      Unique timestamps (total ordering)

Append the process ID (or system ID) to the clock value
after the decimal point:
– e.g. if P1, P2 both have L1 = L2 = 40, make L1 = 40.1, L2 = 40.2
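
The same tie-break can be sketched with (clock value, process ID) pairs. Tuples are used here instead of a decimal fraction because 40.1 and 40.10 would be the same number once there are ten or more processes; the sample values are illustrative.

```python
# Each timestamp is (Lamport clock value, process ID); Python compares tuples
# element by element, so the process ID only matters when clock values are equal.
stamps = [(40, 2), (39, 3), (40, 1)]
ordered = sorted(stamps)
print(ordered)
```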

        Problem: Detecting causal relations

• If a → b, then L(a) < L(b), however:
• If L(a) < L(b), we cannot conclude that a → b
• So Lamport timestamps alone cannot tell causally related events
  from concurrent ones.

              L(g) < L(c), but g || c

• Solution: use a vector clock

              Lamport’s notion of logical time

• For many purposes, it is sufficient that all machines agree on the
  same time
   – … Emphasis on internal consistency
• If two processes do not interact, lack of synchronization will not
  be observable
   – … and thus will not cause problems
• Ordering of events is needed to avoid ambiguities

                        Lamport Timestamps

•   3 processes, each with its own clock. The clocks run at different rates.
•   Lamport's algorithm corrects the clocks.

     Space-Time diagram representation of a
            distributed computation

[Figure: space-time diagram with events a, b, c, d, f and message m1 along a physical time axis]

              a → b
              c → d
              b → c
              d → f
              a → f

               The “happened-before” relation

• We cannot synchronize clocks perfectly across a distributed system
   – cannot use physical time to find out event order
   – Lamport, 1978: “happened-before” partial order
       •   (potential) causal ordering
       •   if e →i e’ (e occurs before e’ at process Pi), then e → e’
       •   send(m) → receive(m), for any message m
       •   if e → e’ and e’ → e’’, then e → e’’
       •   concurrent events: a // b
             – occur at different processes & no chain of messages intervening between them

                    Totally-Ordered Multicasting

• Problem: updating a replicated database & leaving it in an inconsistent state.
• Solution via multicast:
   – Each msg is multicast, with timestamp = current (logical) time
   – Recipient ACKs each message, via multicast
   – Each process puts received messages in its local queue, sorted
     according to the timestamp
   – A process only delivers a msg when it is at the head of the queue and
     it has been ACKed by all processes

                 Lamport’s Logical Clocks (I)

• Per-process monotonically increasing counters
   – Li := Li + 1, before each event is recorded at Pi
   – Clock value, t, is piggy-backed with messages
   – Upon receiving <m ,t>, Pj updates its clock:
        • Lj :=max {Lj , t}, Lj := Lj + 1
• Total order by taking into account process ID:
   – (Ti, i) < (Tj, j) iff (Ti < Tj or (Ti = Tj and i < j) )

                            e → e’ ⇒ L(e) < L(e’)

             Lamport’s Logical Clocks (II)

[Figure: space-time diagram with events a to f and message m1 along a physical time axis]

                   L(b) > L(e), but b // e

         FIFO delivery is not causal delivery

[Figure: space-time diagram with events a to f and message m1 along a physical time axis]

                       Hidden channels

• The → relation captures the flow of data intervening bet. events
• Data can flow in ways other than message passing!

[Figure: space-time diagram of the sensors and the controller, with messages m1 and m2 along a physical time axis]

• a: pipe rupture, detected by sensor #1
• b: pressure drop, detected by sensor #2
• The pipe acts as a comm. channel
• Controller (P3) increases heat (to increase pressure),
  then receives notification of rupture.

Vector clocks

                 A Vector of Timestamps

Suppose there is a group of people and each one needs to keep
   track of events that happened to the other people.
Requirement: Given two events, you can tell if they are sequential
   or concurrent.
Solution: you need to have a vector of timestamps, one for each
   person.

[Figure: example vector timestamps (3,0,0) and (?,?,?)]

                          Vector clocks

Vector clock Vi at process Pi is an array of N integers
• Initialization: for 1 ≤ i ≤ N and 1 ≤ k ≤ N, Vi[k] := 0
• Update Vi :
   – VC1: before Pi timestamps an event it sets Vi[i] := Vi[i] +1
   – VC2: Pi piggybacks t = Vi on every message it sends out
   – VC3: when Pj receives (m,t), for 1 ≤ k ≤ N it sets Vj[k] :=
     max{Vj[k], t[k]}, then applies VC1 to increment Vj[j] for event
     receive(m)
Note: Vi[j] is a timestamp indicating that Pi knows all events
  that happened in Pj up to this time.
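
Rules VC1 to VC3 can be sketched as a small class. The names are illustrative, and processes are numbered from 0 rather than 1 to match Python indexing.

```python
class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n       # Vi[k] := 0 for all k
        self.i = i             # index of the owning process (0-based)

    def tick(self):
        """VC1: increment own component before timestamping an event."""
        self.v[self.i] += 1
        return tuple(self.v)

    def send(self):
        """VC1 + VC2: timestamp the send event; the result is piggybacked on m."""
        return self.tick()

    def receive(self, t):
        """VC3: component-wise max with t, then VC1 for the receive event."""
        self.v = [max(a, b) for a, b in zip(self.v, t)]
        return self.tick()


p0, p1 = VectorClock(3, 0), VectorClock(3, 1)
first = p0.tick()
t = p0.send()
got = p1.receive(t)
print(first, t, got)
```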

Vector timestamps: example

             Comparing vector timestamps

• Define
       V = V’ iff V[i] = V’[i] for i = 1, …, N
       V ≤ V’ iff V[i] ≤ V’[i] for i = 1, …, N
       V < V’ iff V ≤ V’ and V ≠ V’
• V(e): timestamp of an event e
• For any two events e and e’,
   – e → e’ iff V(e) < V(e’), e ≠ e’
   – e || e’ iff neither V(e) ≤ V(e’) nor V(e’) ≤ V(e)
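
These comparisons translate almost literally into code. A sketch with illustrative function names; the sample vectors are taken from the multicast example in this deck.

```python
def leq(v, w):
    """V <= V' iff every component of V is <= the matching component of V'."""
    return all(a <= b for a, b in zip(v, w))


def happened_before(v, w):
    """V < V' iff V <= V' and V != V'."""
    return leq(v, w) and tuple(v) != tuple(w)


def concurrent(v, w):
    """e || e' iff neither V(e) <= V(e') nor V(e') <= V(e)."""
    return not leq(v, w) and not leq(w, v)


print(happened_before((1, 0, 0), (2, 2, 0)))
print(concurrent((2, 2, 0), (1, 0, 1)))
```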

           Summary on vector timestamps

• No need to synchronize physical clocks
• Able to order causal events
• Able to identify concurrent events (but cannot order them)

          An Application of Vector Timestamps:
               causally-ordered multicast
  Multicast: a sender sends a message to a group of receivers. Every
    message in the system must be received by all group members.
  Causally ordered multicast: if m1 → m2, m1 must be received before
    m2 by all receivers.

[Figure: three processes starting at (0,0,0); timestamps shown include (1,0,0), (2,2,0), (1,1,0), (1,0,1) and (1,2,2)]

                        Causally-Ordered Multicast

Each group member keeps a timestamp vector of n components (n group members), all
   initialized to 0.
1. When Pi multicasts a message, it increments i-th component of its time vector Vi and
   attaches Vi to the msg.
2. When Pj (with Vj) receives msg(m, Vi) from Pi, if (Vj [k] ≥ Vi[k] for all k, k≠ i), then
           Vj [i] := Vi [i]; Vj [j] := Vj [j] + 1;
           deliver msg m;
   otherwise delay the delivery of m until the “if” condition is met.
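
The delivery test in step 2 can be sketched as two small functions. They follow the rule exactly as stated on this slide, not any particular library, and use 0-based process indices; the function names are illustrative.

```python
def can_deliver(vj, vi, i):
    """Step 2 condition: Vj[k] >= Vi[k] for every k other than the sender i."""
    return all(vj[k] >= vi[k] for k in range(len(vj)) if k != i)


def deliver(vj, vi, i, j):
    """Update Pj's vector on delivery: Vj[i] := Vi[i]; Vj[j] := Vj[j] + 1."""
    new = list(vj)
    new[i] = vi[i]
    new[j] += 1
    return new


# P3 (index 2) holds (1,0,1) and receives a message stamped (1,2,0) from P2 (index 1).
print(can_deliver([1, 0, 1], [1, 2, 0], i=1))
print(deliver([1, 0, 1], [1, 2, 0], i=1, j=2))
# A process still at (0,0,0) has not seen P1's first message, so it must wait.
print(can_deliver([0, 0, 0], [1, 2, 0], i=1))
```

A message that fails the test stays in a hold-back queue and is re-checked after each later delivery.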

[Figure: example run; timestamps shown include (1,0,0), (2,2,0), (3,2,0), (1,1,0), (3,3,0), (1,0,1), (1,2,2) and one unknown timestamp (?)]

                        Causal-Order Preserved

• If m1 → m2, m1 is received by all recipients before m2.
• If m1 || m2, m1 and m2 can be received in arbitrary order by recipients.
• Total ordering: for case of m1 || m2, m1 and m2 must be received in
  the same order by all recipients (i.e., either all m1 before m2, or all m2
  before m1).

[Figure: example run; timestamps shown include (1,0,0), (2,2,0), (3,2,0), (1,2,0), (1,3,0), (1,1,0), (3,3,0), (1,0,1), (1,2,2) and (3,2,3)]

                             Vector Clocks

• Mattern, 1989 & Fidge, 1991:
   –   clock := vector of N numbers (one per process)
   –   Vi [i] := Vi [i] + 1, before Pi timestamps an event
   –   Clock vector is piggybacked with messages
   –   When Pi receives <m ,t> :
        • Vi [j] := max{ t[j], Vi [j] }, for j=1, …, N
   – Vi [j], j ≠ i: #events that have occurred at Pj and have a (potential) effect on Pi
   – Vi [i]: #events that Pi has timestamped

                         e → e’ ⇒ V(e) < V(e’)

