# CMPT 401

```					   Chapter 7

Synchronization
Topics
   Physical clock synchronization
   Logical clock synchronization
   Causality relation
   Lamport’s logical clock
   Vector logical clock
   Multicast
   ISIS vector clock
   Snapshot
New Issues in DS
   Global time
   Event order
   e1 at 1:00pm on machine m1, e2 at
1:01pm at machine m2. Which event
happens earlier?
   Global state
   snapshot
   Mutual exclusion & Synchronization
Time and Clock
   Two roles of time
- Defines temporal order among events
- Duration (measured by timer)
   UTC (Coordinated Universal Time) is based
on Cesium-133 atom oscillation; the atomic clocks
are located at over 200 labs in the world
   With satellites, 0.5 ms accuracy is possible.
(At 100 MIPS, that is 50,000 instructions in 0.5 ms.)
Clock Skew
   Skew: Clock reading (from single
clock) is location-dependent, e.g.,
distance from satellite or clock
source on a circuit board
   Drift: Multiple clocks tick at slightly
different rates, so their readings diverge
over time.
   t = the real time
   Cp(t) = the reading of a clock p at time t
(Cp(t) = t for an ideal clock)
   dCp(t)/dt = ticking rate (dCp(t)/dt = 1
for an ideal clock)
Consequence: An Example
   With a single clock, a later event always gets
a later time. It is a different story in DS.
   When each machine has its own clock, an
event that occurred after another event may
nevertheless be assigned an earlier time.
Physical Clock Synchronization
   Cristian’s Algorithm
   The Berkeley Algorithm
   Network Time Protocol (NTP)
   OSF DCE
Cristian’s Algorithm: Architecture
   WWV-node receiving UTC-signals, serving as
the central UTC-time server (CUTCS) for the
DS
   Periodically every node in the DS sends a
time request to the central UTC time server
CUTCS.
   The CUTCS responds with its current time
tUTC
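   When the client gets the reply, it simply
sets its clock to tUTC
   Problem: time may run backward (if the
client’s clock was ahead)
   Consider the time for message
propagation.
[Figure: the client sends a request at T0 and receives the reply at T1; each message takes Tp; the server spends I processing and replies with its time t]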
   Estimate Tp (propagation delay) from
T1 – T0 = 2 × Tp + I, where I = processing time.
   Current time = t (server’s time in message) + Tp.
   Assumption?
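Worked Example (illustrative numbers, not from the slides)
   Suppose T0 = 10.000 s, T1 = 10.030 s, I = 0.010 s
   Tp = (T1 – T0 – I) / 2 = (0.030 – 0.010) / 2 = 0.010 s
   If the reply carries t = 12.500 s, the client sets
its clock to t + Tp = 12.510 s
   The estimate is only as good as the assumption that
the request and the reply each take the same time Tp.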
The Berkeley Algorithm
   In Cristian’s algorithm the central time server
was passive
   Now it’s active, i.e. it periodically polls all
other nodes for their current local
times ci(t).
   Based on the answers it calculates a mean
and tells all other machines to advance or
slow down their clocks accordingly.
Relative Clock Synchronization
   Time server periodically sends its time to
the clients
   Clients respond with how far ahead or behind
the server they are
   Time server uses the estimated local times
to compute the arithmetic mean
   Deviations from this arithmetic mean are sent
to the nodes, enabling them to slow down or
speed up accordingly.
The Berkeley Algorithm

[Figure: the time daemon and the other machines, in three panels]
a)   The time daemon asks all the other machines for their clock values
b)   The machines answer
c)   The time daemon tells everyone how to adjust their clock
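Worked Example (illustrative numbers, not from the slides)
   Time daemon reads 3:00; the other two machines
read 2:50 and 3:25
   Offsets relative to the daemon: 0, -10, +25 minutes
   Mean offset = (0 - 10 + 25) / 3 = +5 minutes, i.e. 3:05
   Adjustments sent out: daemon +5, second machine +15,
third machine -20 (a fast clock is slowed down rather
than set back)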
Summary
   Cristian’s method and the Berkeley algorithm
are intended for intranets
   Both may be improved with fault tolerance
methods
   Instead of one UTC server in Cristian’s algorithm,
use n time servers and always take the first reply
   Instead of taking the arithmetic mean from all
clients in the Berkeley algorithm, take the
fault-tolerant mean, i.e. skip deviations beyond a
certain threshold
Network Time Protocol (NTP)
   Goal
   absolute (UTC)-time service in large nets (e.g. Internet)
   high availability (via fault tolerance)
   protection against fabrication (via authentication)
   Architecture
   time-servers build up a hierarchical synchronization subnet
   all primary servers have a UTC receiver
   secondary servers are synchronized by their parent primary
server
   all other stations are leaves on level 3 being synchronized by
level 2 time servers
   accuracy of clocks decreases with increasing level number
   the net is able to reconfigure
NTP

Reliability from redundant paths, scalability,
authenticated time sources
Synchronization of Servers (NTP)
   Synchronization subnet can reconfigure if failures
occur, e.g.
   Primary having lost its UTC source can become a secondary
   Secondary having lost its primary can use another one
   Modes of synchronization:
   Multicast mode (for quick LANs, low accuracy)
   A server within a LAN periodically multicasts time to other
leaves in the LAN which set their clocks assuming some delay
   Procedure-call mode (Cristian’s algorithm with medium
accuracy)
   A server responds to requests with its actual timestamp
   Symmetric mode (high accuracy)
   Pairs of servers exchange messages containing timing information
OSF DCE
   Time is an interval [t-e, t+e].

[Figure: time intervals from servers TS1, TS2, TS3, TS4; labels: Reject, New time interval]

Two intervals overlap ⇒ cannot say which time is earlier
(In case of overlap, Unix make should recompile).
Logical Clock Synchronization
   A powerful building block in DS
   Duplicate detection
   Cache consistency
   Leases
   Commitment
   …
Leslie Lamport

Time, Clocks, and the Ordering of
Events in a Distributed System

Best known for his work on
1) Temporal logic
2) LaTeX
Affiliation: Microsoft
The Paper
   Handles problems of clock drift in a DS
   Identifies main function of computer
clocks, i.e. ordering of events
   Indicates which conditions clocks must
satisfy to fulfill their role
   Introduces logical clocks
   Benefits of logical clocks?
   Needed for determining causality
Logical Time
   In many cases it’s sufficient just to order the
related relevant events, i.e. we want to be
able to position these events relatively, but
not absolutely.
   Interesting: Relative position of an event on
the time axis ⇒ no need for any scaling on
this time axis
Ordering Events
   Event ordering linked with concept of
causality:
   Saying that event a happened before
event b is the same as saying that event a
could have affected the outcome of event
b
   If events a and b happen on processes that
do not exchange any data, their exact
ordering is not important.
Causality Relationship
[Figure: time lines of processes P (events p1 to p4), Q (q1 to q5) and R (r1 to r4), with messages between them]

p1 → p2, p1 → q2, p1 → r3 (transitive)

• Event changes state of process.
• State remains same till next event occurs.
Formal Definition
   a → b is defined by:
   If a and b are in the same process, and a occurs
earlier than b, then a → b.
   If a is a sending event and b is the receiving event of
the same message, then a → b.
   If a → b and b → c, then a → c. [Transitive]
   If a → b, then a causally precedes (or
happened before) b; a and b are causally
related
   a and b are concurrent if neither a → b nor
b → a.
Message-Related Events
   Sending event
   Receiving event
   Message arrival (at kernel) and delivery
(to user process): Kernel can control
timing of delivery after arrival.
Example
[Figure: time lines with events e11, e12, e21, e22, e31, e32]

e11 → e12 → e21 → e22 → e32; furthermore e31 → e32,
whereas e31 is not related by happened-before to
e11, e12, e21, or e22.
e31 is concurrent with e11, e12, e21, and e22.
Lamport Clock
   Suppose E= {events}, each e in E gets a
Lamport time stamp L(e), as follows:
   1. e is a pure local event or a sending-event: if e
has no local predecessor, then L(e) := 1,
otherwise there is a local predecessor e’, thus
L(e) := L(e’) + 1
   2. e is a receiving event, with a corresponding
sending-event s: if e has no local predecessor,
then L(e) := L(s) + 1, otherwise there is a local
predecessor e’, thus L(e) := max{L(s), L(e’)} + 1
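Worked Example (illustrative, not from the slides)
   P1: local event a is its first event, L(a) = 1;
P1 then sends message m, L(send) = L(a) + 1 = 2
   P2: three local events b, c, d with L = 1, 2, 3
   P2 then receives m: L(recv) = max{2, 3} + 1 = 4
   If receiving m had instead been P2’s first event:
L(recv) = L(send) + 1 = 3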
Example

Note: Each local counter is incremented with each local
event. In a communication we adjust the involved
counters of the two communicating nodes to be consistent
with the happened-before relation.
Remark: The same mechanism can be used to adjust clocks on
different nodes. The Lamport time is consistent with the
happened-before relation, i.e. if x → y, then L(x) < L(y),
but not vice versa.
[Figure: local clocks on the communicating nodes]
Limitation on Lamport Clocks

From Lamport time values you cannot conclude whether
two events are in the happened-before relation, e.g.
e11 and e32.
Total Ordering of Events
   Lamport-time only gives us a partial-
ordering of distributed events.
   To implement the total ordering:
   Each processor is assigned a unique id
(integer)
   Given two events e1 and e2, e1 is ordered
before e2 if L(e1) < L(e2), or L(e1) = L(e2)
and Id(e1) < Id(e2), where Id(e) is the id of
the processor on which e occurs
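Worked Example (illustrative, not from the slides)
   e1 on processor 1 with L(e1) = 3, e2 on processor 2
with L(e2) = 3, e3 on processor 2 with L(e3) = 2
   e3 comes first (smallest L); e1 and e2 tie on L,
so the smaller processor id wins: e3, e1, e2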
Holding Back Deliveries
   Delay the delivery of messages that
arrived “too soon”
   Useful when delivering messages from
kernel to processes
   Hold back the delivery of M to process P
until there is a guarantee that no message
M’ with L(M’) < L(M) will arrive at P in the
future.
Implementation
   Assumption: messages from a particular
source arrive in FIFO order
   Each site maintains a set of message queues,
one for each other site
   When a message arrives, it is placed in the
corresponding queue
   When all queues are non-empty, compare the
timestamps of the messages at the heads of the
queues, and deliver the message with the smallest
timestamp.
Limitation
   All message queues need to be non-empty.
   Normally not true.
   Requires multicast to solve the problem.
   With Lamport clock, L(a) < L(b) does not
mean a → b.
   Some messages are delayed unnecessarily.
   Remedy: vector clock.
Event Counter
[Figure: time lines of P1, P2, P3 with each event labelled by its local event count]
Event counter at Pi:
initialized at 0 and
incremented for each event
Vector Time
   Assumption: n tasks (processes) Pi in DS
   Each Pi has its own local clock being a
n-dimensional vector (initially zeroed)
   Vi(a) is timestamp of event a at process Pi
   Vi[i] = number of local events at Pi
   Vi[j] is Pi’s best guess of how many events
have occurred on Pj
Rules
   There is a DS with n distributed processes. n-
dimensional vector Vi is vector-time of process Pi if it
is built according to the following rules:
   (1) Initially, Vi = (0, …, 0) for all 1<=i<=n
   (2) For a local event on process Pi: Vi[i]++
   (3) Pi includes the value t = Vi in every msg m
   (4) When Pj receives a message m with timestamp t, it sets
Vj[k]= max{t[k], Vj [k]}, for 1<=k<=n and k != j
   Communication cost?
   Little overhead compared to Lamport clock
Example
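[Figure: vector timestamps along each process's time line, time running downward]
P1: 000, 100, 200, 300, 450, 550
P2: 000, 010, 220, 230, 240, 250, 260, 273
P3: 000, 001, 242, 243, 264
M1 is sent by P1 at 200 and received by P2 at 220.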
Notation
   We define global V(e) = Vi(e) if event e
happens in Pi
   We write V(a) ≤ V(b) if
   V(a)[k] ≤ V(b)[k] for all k.
   Here V(a)[k] denotes the kth component of V(a).
   We write V(a) < V(b) if
   V(a)[k] ≤ V(b)[k] for all k, and
   V(a)[j] < V(b)[j] for at least one j
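Worked Example (timestamps written as (x, y, z); illustrative)
   (2,0,0) < (2,2,0): every component is ≤ and
the second is strictly smaller
   (3,0,0) and (2,4,0): 3 > 2 in the first component
but 0 < 4 in the second, so neither V(a) ≤ V(b)
nor V(b) ≤ V(a) holds; the timestamps are incomparable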
Vector Time Characteristics
   The following relationships between
causality (the happened-before relation)
and vector time hold:
   A.) e → e’ iff V(e) < V(e’)
   B.) e || e’ iff V(e) || V(e’)
   Vector time is the best known
estimation for global sequencing that is
based only on local information.
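Proof
   a → b iff V(a) < V(b)
   Proof sketch: for a fixed b, a → b iff a is in
the causal past of b, in which case each
component of V(a) ≤ the corresponding component
of V(b).
[Figure: time lines of P1, P2, P3 with events a and b and times t1, t2, t3]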
Multicasting
   A message is sent to all the members of
a group
   Sending video stream to a set of customers
   Implementing a chat program
   Sending updates to a group of replica
managers
   Class D IP addresses (start with bit sequence 1110;
2^28 ≈ 268 million addresses)
   224.0.0.1 is for “all systems on this
subnet”
   224.2.0.0 ~ 224.2.127.253 are for
multimedia conference calls
Causal Ordering of Messages
   Suppose m1 and m2 are two messages
being received at the same node i. A
set of messages is causally ordered if
for all pairs <m1, m2> the following
holds:
send(m1) → send(m2) ⇒ receive(m1) → receive(m2)
Causality Violation
     Suppose M1’s sending event happened before M3’s sending
event.
     Causality violation occurs if M1 is delivered after M3
(In particular, non-FIFO delivery is a causality violation).

[Figure: time lines of P1, P2, P3; P1 sends “Migrate foo to P2”]
• Delay the delivery of M3 to P2 until M1 arrives.
• ISIS system using multicast
Formal Description of ISIS
Clock ICi
   Pi initializes its clock ICi = [0,…,0].
   For each msg sending event by Pi
   ICi[i]++
   Pi attaches ICi to message it sends.
   Upon receiving msg M from Pj with M.ts, Pi checks if
   1) M.ts[j] == ICi[j] + 1 (M is next msg expected from
Pj)
   2) ICi[k] ≥ M.ts[k] for all other k (all msgs from Pk that
Pj had delivered before sending M have been delivered at Pi)
   If both are satisfied, Pi delivers M after ICi[j]++
   Otherwise, Pi puts M in hold-back Q until they are satisfied.
Example
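[Figure: time lines of P1, P2, P3, time running downward]
P1 sends M1 = “Migrate foo to P2” with M1.ts = 100 (P2 is at 000).
P3 sends “Where is foo?” with ts = 001; P1 delivers it and is at 101.
P1 answers with M2 = “foo is at P2”, M2.ts = 201.
At P3, M2 arrives before M1:
   M2.ts[1] > IC3[1]+1: put M2 in hold-back Q
   M1 arrives: M1.ts[1] = IC3[1]+1
   IC3 ← 101; deliver M1
   IC3 ← 201; deliver M2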

• Note: jth component of M.ts is
sequence number of latest msg sent
by Pj that is known to sender of M
Safety
   Show that msgs are delivered in timestamp order.
   Suppose not
   Let m (m’) be event of sending message M (M’)
   Assume Pi delivered msg M (from Pk) before M’ (from Pj),
even though
M’.ts (= ICj(m’)) < M.ts (= ICk(m)) ……. (A)
(a) Just before Pi delivered M’:
ICi[j] + 1 = ICj(m’)[j], hence ICi[j] < ICj(m’)[j]
(b) Delivery of M would have resulted in
ICi[j]* ≥ ICk(m)[j]
at time of delivery
   Since (b) took place before (a), ICi[j]* ≤ ICi[j],
hence ICk(m)[j] < ICj(m’)[j], i.e. M.ts[j] < M’.ts[j].
This contradicts (A).
Liveness
   Show the system starvation-free: no
message will wait forever in the hold-
back Q
   Assume Q is the hold-back queue in Pi
and is non-empty. Let M be a msg in Q
which is not preceded by any other msg
in Q. Suppose M was sent by Pj.
Proof
   Assume ICi[k] < M.ts[k] for some k (!=j), i.e.,
condition (2) is violated; want to derive contradiction
from this.
   Let M’ be latest msg from Pk that Pj delivered prior
to sending M so that M’.ts < M.ts and M’.ts[k] =
M.ts[k].
   If Pi hasn’t delivered copy of msg M’ from Pk , then
M’ with M’.ts < M.ts is in holdback Q of Pi,
contradicting assumption that M is not causally
preceded by any other msg in holdback Q of Pi.
   So Pi must have delivered copy of msg M’ from
Pk. Thus ICi[k] ≥ M’.ts[k] = M.ts[k], contradicting
ICi[k] < M.ts[k]
   Must give up assumption that Pi cannot deliver M.
Proof Illustration

[Figure: Pk sends M’ to Pj and to Pi; Pj then sends M to Pi, with M.ts > M’.ts; once Pi has delivered M’, ICi[k] ≥ M.ts[k]]
Global state of a DS
   Consists of:
   Local state of each node (task, process)
   Messages in transit
   Why interested in a global state?
   Suppose local computation has stopped on each
node and there are no pending messages, then
   1. Distributed application has terminated successfully? or
   Problem: lack of global time
   Consequences?
   How: take a snapshot
Snapshots (taken at 2:00pm
by local clocks)
[Figure: three panels showing accounts A and B holding \$100 or \$0, at 1:59pm, 2:00pm and 2:01]
(a) sum = \$100
(b) sum = \$0
(c) sum = \$200
Census Taking in Ancient
Kingdom
[Figure: four villages connected by highways]

Want to take census counting all people,
some of whom may be traveling on
highways.
Census Taking Algorithm
   Close all gates into/out of each village
(process) and count people (record process
state) in village
   Open each outgoing gate and send official
with a red cap (special marker message).
   Open each incoming gate and count all
travelers (record channel state = messages)
who arrive ahead of the official (with a red cap).
   Tally the counts from all villages.
Chandy/Lamport Snapshot Algorithm
   All processes are initially white. Messages sent by white (red)
processes are also white (red)
   MSend [Marker sending rule for process P]
   Suspend all other activities until done
   Record P’s state
   Turn red
   Send one marker over each output channel of P.
   MReceive [Marker receiving rule for P]
On receiving marker over channel C,
   if P is white { Record state of channel C as empty;
Invoke MSend; }
   else record the state of C as sequence of white messages received
since P turned red.
   Stop when marker is received on each incoming channel
   MSend and MReceive are atomic.
Assumptions
   No process failures, no message loss
   Point to point message delivery is
ordered
   How to guarantee it? ISIS clock
   Network is strongly connected.
   Why?
Snapshot (1)
[Figure: two panels with accounts A and B; (a) sum = \$100, OK; (b) sum = \$100, OK]
Msgs arriving before the marker constitute the channel state.
Need not use time.
Snapshots (2)
[Figure: two panels with accounts A and B]
(c) sum = \$200: cannot happen
(d) sum = \$100: will be like this
Cuts
   Cut C divides all events into PC (those
which happened in the past relative to
C) and FC (future events).
   Cut C is consistent if there is no
message whose sending event is in FC
and whose receiving event is in PC.
Progress shown by cuts
[Figure: time lines of P (events p1 to p4) and Q (events q1 to q3), with successive cuts numbered along the time axis]

There are 5*4 = 20 possible cuts.
Example
[Figure: time lines of P (events p1 to p4), Q (q1 to q5) and R (r1 to r4) with a message M; one consistent cut and one inconsistent cut are drawn]

State recorded by SNAPSHOT ⇒ consistent cut
Checkpoint
   Cut C is consistent  C doesn’t