Chapter 7: Synchronization

Topics
- Physical clock synchronization
- Logical clock synchronization
- Causality relation
- Lamport's logical clock
- Vector logical clock
- Multicast
- ISIS vector clock
- Snapshot

New Issues in DS
- Global time
- Event order: e1 at 1:00pm on machine m1, e2 at 1:01pm on machine m2. Which event happened earlier?
- Global state snapshot
- Mutual exclusion & synchronization

Time and Clock
- Two roles of time:
  - Defines temporal order among events
  - Duration (measured by a timer)
- UTC (Coordinated Universal Time) is based on cesium-133 atom oscillation, with clocks located at over 200 labs around the world.
- With satellites, 0.5 ms accuracy is possible. (A 100 MIPS machine executes 50,000 instructions in 0.5 ms.)

Clock Skew
- Skew: the clock reading (from a single clock) is location-dependent, e.g., it depends on the distance from the satellite or from the clock source on a circuit board.
- Drift: multiple clocks, each ticking at its own rate.
- t = the real time
- Cp(t) = the reading of clock p at time t (Cp(t) = t for an ideal clock)
- dCp(t)/dt = ticking rate (dCp(t)/dt = 1 for an ideal clock)

Consequence: An Example
- When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.
- On a single machine with one clock this cannot happen, but it is a different story in a DS.

Physical Clock Synchronization
- Cristian's Algorithm
- The Berkeley Algorithm
- Network Time Protocol (NTP)
- OSF DCE

Cristian's Algorithm: Architecture
- A WWV node receives UTC signals and serves as the central UTC time server (CUTCS) for the DS. WWV is a shortwave radio station in Colorado.
- Periodically, every node in the DS sends a time request to the CUTCS.
- The CUTCS responds with its current time tUTC.

Adjust Time
- When the client gets the reply, it simply sets its clock to tUTC.
- Time may run backward if the client's clock was fast, so introduce the change gradually.
- Consider the time for message propagation.

Adjust Time (continued)
[Figure: the client sends a request at T0; the request takes Tp to reach the server; the server spends I processing it and replies with its time t; the reply takes Tp and reaches the client at T1.]
- Estimate Tp (propagation delay) from T1 - T0 = 2 x Tp + I, where I = processing time, i.e., Tp = (T1 - T0 - I) / 2.
- Current time = t (server's time in message) + Tp.
- Assumption?
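The estimate on the "Adjust Time" slides can be tried out with a few lines of Python. This is only a sketch: the function names, the millisecond units, and the 10 ms tick with a 1 ms slew are illustrative assumptions, not part of the slides. The formula also presumes that the request and the reply take the same time Tp, which is one possible answer to the "Assumption?" question.

```python
def cristian_estimate(t0, t1, server_time, processing_time=0.0):
    """Estimate the current time at the moment the reply arrives.

    t0, t1          -- client clock readings at request send / reply receipt
    server_time     -- the time t carried in the server's reply
    processing_time -- the server's handling time I (assumed known, else 0)
    """
    round_trip = t1 - t0
    # T1 - T0 = 2*Tp + I  =>  Tp = (T1 - T0 - I) / 2
    # (assumes both directions take the same time Tp)
    tp = (round_trip - processing_time) / 2.0
    return server_time + tp


def tick_increment(offset, nominal=10.0, slew=1.0):
    """Amount added to the clock per timer tick while correcting gradually.

    offset = estimated_time - local_time. The clock never jumps backward:
    a fast clock just advances by less than the nominal tick for a while.
    The values of nominal and slew are illustrative assumptions.
    """
    if offset > 0:
        return nominal + slew   # clock is behind: speed up
    if offset < 0:
        return nominal - slew   # clock is ahead: slow down
    return nominal


# Example in milliseconds: request sent at 1000, reply received at 1020,
# server reports t = 5000 and took I = 4 ms to process the request.
estimate = cristian_estimate(1000, 1020, 5000, processing_time=4)
print(estimate)
```

If the processing time I is unknown, passing 0 simply attributes the whole round trip to propagation.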
The Berkeley Algorithm
- In Cristian's algorithm the central time server is passive.
- Now it is active, i.e., it periodically polls all other nodes for their current local times ci(t).
- Based on the answers it calculates a mean and tells all other machines to advance or slow down their clocks accordingly.

Relative Clock Synchronization
- The time server periodically sends its time to the clients and asks for theirs.
- Clients respond with how far ahead of or behind the server they are.
- The time server uses the estimated local times to compute the arithmetic mean.
- Deviations from this arithmetic mean are sent to the nodes, enabling them to slow down or speed up, respectively.

The Berkeley Algorithm
[Figure in three panels:]
a) The time daemon asks all the other machines for their clock values
b) The machines answer
c) The time daemon tells everyone how to adjust their clock

Summary
- Cristian's method and the Berkeley algorithm are intended for intranets.
- Both may be improved with fault-tolerance methods:
  - Instead of one UTC server in Cristian's algorithm, use n time servers and always take the first answer, from whichever time server responds first.
  - Instead of taking the arithmetic mean over all clients in the Berkeley algorithm, take a fault-tolerant mean, i.e., skip deviations beyond a certain threshold.

Network Time Protocol (NTP)
- Goals:
  - absolute (UTC) time service in large networks (e.g., the Internet)
  - high availability (via fault tolerance)
  - protection against fabrication (via authentication)
- Architecture:
  - time servers build up a hierarchical synchronization subnet
  - all primary servers have a UTC receiver
  - secondary servers are synchronized by their parent primary server
  - all other stations are leaves on level 3, synchronized by level-2 time servers
  - the accuracy of clocks decreases with increasing level number
  - the net is able to reconfigure

NTP
- Reliability from redundant paths, scalability, authenticated time sources

Synchronization of Servers (NTP)
- The synchronization subnet can reconfigure if failures occur, e.g.:
  - A primary that has lost its UTC source can become a secondary
  - A secondary that has lost its primary can use another one
- Modes of synchronization:
  - Multicast mode (for fast LANs, low accuracy): a server within a LAN periodically multicasts the time to the other leaves in the LAN, which set their clocks assuming some delay
  - Procedure-call mode (like Cristian's algorithm, medium accuracy): a server responds to requests with its current timestamp
  - Symmetric mode (high accuracy): pairs of servers exchange messages containing times

OSF DCE
- Time is an interval [t-e, t+e].
[Figure: time intervals reported by servers TS1 to TS4; one interval is marked "Reject", and a "New time interval" is derived from the others.]
- If two intervals overlap, one cannot say which time is earlier. (In case of overlap, Unix make should recompile.)

Logical Clock Synchronization
- A powerful building block in DS:
  - Duplicate detection
  - Cache consistency
  - Leases
  - Commitment
  - …

Leslie Lamport
- "Time, Clocks, and the Ordering of Events in a Distributed System"
- Best known for his work on 1) temporal logic and 2) LaTeX
- Microsoft

The Paper
- Handles problems of clock drift in a DS
- Identifies the main function of computer clocks, i.e., ordering of events
- Indicates which conditions clocks must satisfy to fulfill their role
- Introduces logical clocks
- Benefits of logical clocks? They are needed for determining causality.

Logical Time
- In many cases it is sufficient just to order the relevant related events, i.e., we want to be able to position these events relative to one another, but not absolutely.
- Interesting: only the relative position of an event on the time axis matters, so there is no need for any scaling on this time axis.

Ordering Events
- Event ordering is linked with the concept of causality: saying that event a happened before event b is the same as saying that event a could have affected the outcome of event b.
- If events a and b happen on processes that do not exchange any data, their exact ordering is not important.

Causality Relationship
[Figure: timelines of processes P (events p1 to p4), Q (q1 to q5), and R (r1 to r4), connected by messages.]
- p1 → p2
- p1 → q2
- p1 → r3 (transitive)
- An event changes the state of a process.
- The state remains the same until the next event occurs.
Formal Definition
- a → b is defined by:
  1. If a and b are in the same process, and a occurs earlier than b, then a → b.
  2. If a is the sending event and b is the receiving event of the same message, then a → b.
  3. If a → b and b → c, then a → c. [Transitive]
- If a → b, then a causally precedes (or happened before) b; a and b are causally related.
- a and b are concurrent if neither a → b nor b → a.

Message-Related Events
- Sending event
- Receiving event
- Message arrival (at the kernel) and delivery (to the user process): the kernel can control the timing of delivery after arrival.

Example
[Figure: event timelines along a Time axis.]
- e11 → e12 → e21 → e22 → e32; furthermore e31 → e32, whereas e31 is not related by happened-before to e11, nor to e12, nor to e21, nor to e22.
- e31 is concurrent with e11, e12, e21, and e22.

Lamport Clock
Suppose E = {events}; each e in E gets a Lamport timestamp L(e) as follows:
1. e is a purely local event or a sending event:
   - if e has no local predecessor, then L(e) := 1;
   - otherwise there is a local predecessor e', and L(e) := L(e') + 1.
2. e is a receiving event, with a corresponding sending event s:
   - if e has no local predecessor, then L(e) := L(s) + 1;
   - otherwise there is a local predecessor e', and L(e) := max{L(s), L(e')} + 1.

Example
- Note: each local counter is incremented with each local event. On a communication, the receiver's counter is adjusted so that the timestamps on the two communicating nodes stay consistent with the happened-before relation.
- Remark: the same mechanism can be used to adjust clocks on different nodes.
- Lamport time is consistent with the happened-before relation, i.e., if x → y, then L(x) < L(y), but not vice versa.

Adjusting Clocks
[Figure: two panels, "Without adjusting local clocks" and "With adjusting local clocks".]

Limitation on Lamport Clocks
- From Lamport time values alone you cannot conclude whether two events are in the happened-before relationship, e.g., e11 and e32.

Total Ordering of Events
- Lamport time only gives us a partial ordering of distributed events.
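The two timestamp rules of the Lamport clock, and the limitation just stated, can be illustrated with a small Python sketch. The class and method names are hypothetical, and the event pattern is an assumption chosen to mirror the chain e11 → e12 → e21 → e22 → e32 with e31 → e32: here e12 and e22 are taken to be sends, and e21 and e32 the matching receives.

```python
class LamportProcess:
    """One process with a Lamport counter (names are illustrative)."""

    def __init__(self, name):
        self.name = name
        self.clock = 0  # 0 stands for "no local predecessor yet"

    def local_event(self):
        # Rule 1: L(e) := L(e') + 1, or 1 if there is no predecessor.
        self.clock += 1
        return self.clock

    def send(self):
        # A send is stamped like a local event; the stamp travels with the msg.
        return self.local_event()

    def receive(self, sent_stamp):
        # Rule 2: L(e) := max{L(s), L(e')} + 1 (L(e') counts as 0 if absent).
        self.clock = max(sent_stamp, self.clock) + 1
        return self.clock


p1, p2, p3 = LamportProcess("P1"), LamportProcess("P2"), LamportProcess("P3")

e11 = p1.local_event()   # 1
e12 = p1.send()          # 2, message to P2
e21 = p2.receive(e12)    # max(2, 0) + 1 = 3
e22 = p2.send()          # 4, message to P3
e31 = p3.local_event()   # 1
e32 = p3.receive(e22)    # max(4, 1) + 1 = 5

print(e11, e12, e21, e22, e31, e32)
# e31 and e21 are concurrent, yet L(e31) < L(e21): a smaller timestamp
# does not imply happened-before.
print(e31 < e21)
```

Every causally ordered pair in this run gets increasing timestamps, while the concurrent pair e31 and e21 is ordered by timestamp only by accident.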
- To implement a total ordering:
  - Each processor is assigned a unique id (an integer).
  - Given two events e1 and e2, e1 is ordered before e2 if L(e1) < L(e2), or if L(e1) = L(e2) and Id(e1) < Id(e2).

Holding Back Deliveries
- Delay the delivery of messages that arrived "too soon".
- Useful when delivering messages from the kernel to processes.
- Hold back the delivery of M to process P until there is a guarantee that no message M' with L(M') < L(M) will arrive at P in the future.

Implementation
- Assumption: messages from a particular source arrive in FIFO order.
- Each site maintains a set of message queues, one for each other site.
- When a message arrives, it is placed in the corresponding queue.
- When all queues are non-empty, compare the timestamps of the messages at the heads of the queues, and deliver the message with the oldest (smallest) timestamp.

Limitation
- All message queues need to be non-empty, which is normally not true. Multicast is required to solve the problem.
- With a Lamport clock, L(a) < L(b) does not mean a → b, so some messages are delayed unnecessarily.
- Vector clock.

Event Counter
[Figure: timelines of P1, P2, P3 with each event labeled by its per-process event count.]
- Event counter at Pi: initialized at 0 and incremented for each event.

Vector Time
- Assumption: n tasks (processes) Pi in the DS.
- Each Pi has its own local clock, an n-dimensional vector (initially zeroed).
- Vi(a) is the timestamp of event a at process Pi.
- Vi[i] = number of local events at Pi.
- Vi[j] is Pi's best guess of how many events have occurred at Pj.

Rules
There is a DS with n distributed processes. An n-dimensional vector Vi is the vector time of process Pi if it is built according to the following rules:
(1) Initially, Vi = (0, …, 0) for all 1 <= i <= n
(2) For a local event on process Pi: Vi[i]++
(3) Pi includes the value t = Vi in every msg m
(4) When Pj receives a message m with timestamp t, it sets Vj[k] = max{t[k], Vj[k]} for 1 <= k <= n and k != j (the receive is itself an event at Pj, so rule (2) also applies)

- Communication cost?
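Rules (1) to (4) for vector time can be sketched in Python as follows. The class and function names are hypothetical, indices are 0-based (so P1 is index 0), and the componentwise comparison used to test causality is the usual one for vector timestamps rather than something defined on the slides up to this point.

```python
class VectorClock:
    """Vector time of process i among n processes (0-based index)."""

    def __init__(self, n, i):
        self.i = i
        self.v = [0] * n                 # rule (1): initially all zero

    def local_event(self):
        self.v[self.i] += 1              # rule (2): Vi[i]++
        return tuple(self.v)

    def send(self):
        # Rule (3): a send is an event; its timestamp travels with the msg.
        return self.local_event()

    def receive(self, t):
        # Rule (4): componentwise max over all entries except our own ...
        for k in range(len(self.v)):
            if k != self.i:
                self.v[k] = max(t[k], self.v[k])
        # ... and the receive itself counts as a local event (rule (2)).
        return self.local_event()


def happened_before(a, b):
    """True if timestamp a <= b in every component and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a, b):
    return a != b and not happened_before(a, b) and not happened_before(b, a)


p1, p2, p3 = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)

p2_first = p2.local_event()      # (0, 1, 0)
p1_first = p1.local_event()      # (1, 0, 0)
p1_send = p1.send()              # (2, 0, 0)
p2_recv = p2.receive(p1_send)    # max gives (2, 1, 0), then own entry: (2, 2, 0)
p3_first = p3.local_event()      # (0, 0, 1)

print(p2_recv)
print(happened_before(p1_send, p2_recv))   # the send precedes its receive
print(concurrent(p3_first, p2_recv))       # no causal path in either direction
```

Unlike the Lamport timestamps, these vectors let the reader tell concurrent events apart from causally ordered ones, at the price of carrying n counters per message.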
- Little overhead compared to the Lamport clock (each message carries an n-component vector instead of a single counter).

Example
[Figure: timelines of P1, P2, P3 along a Time axis, each event labeled with its vector timestamp (components written without separators); M1 labels a message.]
- P1: 000, 100, 200, 300, 450, 550
- P2: 000, 010, 220, 230, 240, 250, 260, 273
- P3: 000, 001, 242, 243, 264

Notation
- We define the global V(e) = Vi(e) if event e happens in Pi.
- We write V(a) <= V(b) if V(a)[k] <= V(b)[k] for all k. Here V(a)[k] denotes the kth component of V(a).
- We write V(a) < V(b) if V(a)[k] <= V(b)[k] for all k, and V(a)[j] < V(b)[j] for at least one j.

Vector Time Characteristics
The following relationships between causality (the happened-before relation) and vector time hold:
A) e → e' iff V(e) < V(e')
B) e || e' iff V(e) || V(e')
- Vector time is the best known estimation for global sequencing that is based only on local information.

Proof
[Figure: timelines of P1, P2, P3; the shaded area is the set of events that precede b, bounded by the points t1, t2, t3 on the three timelines.]
- a → b iff V(a) < V(b)
- Proof: for a fixed b, a → b iff a is in the shaded area iff each component of V(a) <= the corresponding component of V(b).

Multicasting
- A message is sent to all the members of a group:
  - Sending a video stream to a set of customers
  - Implementing a chat program
  - Sending updates to a group of replica managers

IPv4 Multicast Addresses
- Class D (starts with bit sequence 1110)
- 224.0.0.1 to 239.255.255.255 (about 2^28 = 268 million addresses)
- 224.0.0.1 is for "all systems on this subnet"
- 224.2.0.0 ~ 224.2.127.253 are for multimedia conference calls

Causal Ordering of Messages
- Suppose m1 and m2 are two messages received at the same node i. A set of messages is causally ordered if for all pairs <m1, m2> the following holds: send(m1) → send(m2) ⇒ receive(m1) → receive(m2)

Causality Violation
- Suppose M1's sending event happened before M3's sending event.
- A causality violation occurs if M1 is delivered after M3. (In particular, non-FIFO delivery is a causality violation.)
[Figure: timelines of P1, P2, P3 along a Time axis, with the message "Migrate foo to P2".]
- Delay the delivery of M3 to P2 until M1 arrives.
- The ISIS system does this using multicast.

Formal Description of ISIS Clock ICi
- Pi initializes its clock ICi = [0, …, 0].
- For each msg-sending event by Pi: ICi[i]++
- Pi attaches ICi to each message it sends.
- Upon receiving msg M from Pj with timestamp M.ts, Pi checks whether
  1) M.ts[j] == ICi[j] + 1 (M is the next msg expected from Pj)
  2) ICi[k] >= M.ts[k] for all other k (all msgs from Pk that the sender Pj had received have been received by Pi)
- If both are satisfied, Pi delivers M after ICi[j]++.
- Otherwise, Pi puts M in the hold-back Q until they are satisfied.

Example
[Figure: timelines of P1, P2, P3 along a Time axis with the messages "Migrate foo to P2", "Where is foo?", and "foo is at P2"; timestamps shown are 100, 000, 001, 101 (M1), and 201 (M2).]
- M2 reaches P3 first: M2.ts[1] > IC3[1] + 1, so P3 puts M2 in the hold-back Q.
- M1 reaches P3: M1.ts[1] = IC3[1] + 1, so IC3 ← 101 and P3 delivers M1.
- Then IC3 ← 201 and P3 delivers M2.
- Note: the jth component of M.ts is the sequence number of the latest msg sent by Pj that is known to the sender of M.

Safety
- Show that msgs are delivered in timestamp order. Suppose not.
- Let m (m') be the event of sending message M (M').
- Assume Pi delivered msg M (from Pk) before M' (from Pj), even though M'.ts (= ICj(m')) < M.ts (= ICk(m)) ... (A)
- (a) By condition (1), just before Pi delivered M': ICi[j] + 1 = ICj(m')[j], hence ICi[j] < ICj(m')[j].
- (b) By condition (2), delivery of M requires ICi[j]* >= ICk(m)[j] at the time of that delivery.
- (a) and (b) contradict (A): since (b) took place before (a), ICi[j]* <= ICi[j], hence ICk(m)[j] <= ICi[j]* <= ICi[j] < ICj(m')[j], which is impossible if M'.ts < M.ts.

Liveness
- Show that the system is starvation-free: no message will wait forever in the hold-back Q.
- Assume Q is the hold-back queue in Pi and is non-empty. Let M be a msg in Q that is not causally preceded by any other msg in Q. Suppose M was sent by Pj.

Proof
- Assume ICi[k] < M.ts[k] for some k (!= j), i.e., condition (2) is violated; we want to derive a contradiction from this.
- Let M' be the latest msg from Pk that Pj delivered prior to sending M, so that M'.ts < M.ts and M'.ts[k] = M.ts[k].
- If Pi hasn't delivered its copy of msg M' from Pk, then M' with M'.ts < M.ts is in the hold-back Q of Pi, contradicting the assumption that M is not causally preceded by any other msg in the hold-back Q of Pi.
- So Pi must have delivered its copy of msg M' from Pk. Thus ICi[k] >= M'.ts[k] = M.ts[k], contradicting ICi[k] < M.ts[k].
- We must give up the assumption that Pi cannot deliver M.
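The ISIS delivery test (conditions 1 and 2) and the hold-back queue can be sketched in Python. Everything named here is hypothetical, indices are 0-based (P1 is index 0, P3 is index 2), and the scenario is an assumed reading of the foo example: P3 multicasts a question, P1 delivers it and then multicasts M1 and M2, and P3 happens to receive M2 before M1.

```python
class IsisProcess:
    """Causal delivery with an ISIS-style vector clock (illustrative names)."""

    def __init__(self, n, i):
        self.i = i
        self.ic = [0] * n        # ICi, initially all zero
        self.holdback = []       # messages that arrived "too soon"
        self.delivered = []      # payloads in delivery order

    def multicast(self, payload):
        self.ic[self.i] += 1     # ICi[i]++ for each sending event
        return (self.i, tuple(self.ic), payload)

    def _deliverable(self, j, ts):
        if ts[j] != self.ic[j] + 1:          # condition 1: next msg from Pj
            return False
        # condition 2: we have seen everything the sender had seen
        return all(self.ic[k] >= ts[k] for k in range(len(ts)) if k != j)

    def receive(self, msg):
        self.holdback.append(msg)
        progress = True
        while progress:                      # retry held-back msgs after each delivery
            progress = False
            for m in list(self.holdback):
                j, ts, payload = m
                if self._deliverable(j, ts):
                    self.ic[j] += 1          # ICi[j]++, then deliver
                    self.delivered.append(payload)
                    self.holdback.remove(m)
                    progress = True


p1, p3 = IsisProcess(3, 0), IsisProcess(3, 2)

question = p3.multicast("question")   # timestamp (0, 0, 1)
p1.receive(question)                  # P1 delivers it: IC1 = [0, 0, 1]
m1 = p1.multicast("M1")               # timestamp (1, 0, 1)
m2 = p1.multicast("M2")               # timestamp (2, 0, 1)

p3.receive(m2)                        # too soon: held back
held_after_m2 = list(p3.holdback)
p3.receive(m1)                        # M1 delivered, then M2 released

print(p3.delivered, p3.ic)
```

The loop in receive re-scans the queue after every delivery because delivering one message may be exactly what makes another one deliverable.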
Proof Illustration
[Figure: Pk sends M' to Pj and to Pi; Pj then sends M with M.ts > M'.ts; Pi has already delivered M', so ICi[k] >= M.ts[k].]

Global State of a DS
- Consists of:
  - the local state of each node (task, process)
  - the messages in transit
- Why are we interested in a global state? Suppose local computation has stopped on each node and there are no pending messages. Then:
  1. Has the distributed application terminated successfully? or
  2. Is it a deadlock?
- Problem: lack of global time. Consequences?
- How: take a snapshot.

Snapshots (taken at 2:00pm by local clocks)
[Figure: three panels showing accounts A and B while $100 moves between them, with each snapshot taken at 2:00pm by local clocks that disagree (1:59pm, 2:00pm, 2:01): (a) sum = $100, (b) sum = $0, (c) sum = $200.]

Census Taking in Ancient Kingdom
[Figure: four villages connected by highways.]
- We want to take a census counting all people, some of whom may be traveling on the highways.

Census Taking Algorithm
- Close all gates into/out of each village (process) and count the people in the village (record the process state).
- Open each outgoing gate and send out an official with a red cap (special marker message).
- Open each incoming gate and count all travelers who arrive ahead of the official with a red cap (record the channel state = messages sent but not yet received).
- Tally the counts from all villages.

Chandy/Lamport Snapshot Algorithm
- All processes are initially white. Messages sent by white (red) processes are also white (red).
- MSend [marker sending rule for process P]:
  - Suspend all other activities until done
  - Record P's state
  - Turn red
  - Send one marker over each output channel of P
- MReceive [marker receiving rule for P]: on receiving a marker over channel C,
  - if P is white { record the state of channel C as empty; invoke MSend; }
  - else record the state of C as the sequence of white messages received since P turned red.
  - Stop when a marker has been received on each incoming channel.
- MSend and MReceive are atomic.

Assumptions
- No process failures, no message loss
- Point-to-point message delivery is ordered. How to guarantee it? ISIS clock
- The network is strongly connected. Why?
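The marker rules can be exercised on the two-account example with a small single-threaded simulation. This is a sketch under stated assumptions: the class, the `net` dictionary of FIFO queues, and the `deliver` helper are all hypothetical, channels are reliable and ordered as the Assumptions slide requires, and the scenario (A sends $100, then B starts the snapshot while the money is in transit) is made up for illustration.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, name, balance, peers):
        self.name = name
        self.balance = balance
        self.peers = peers            # names of the other processes
        self.red = False              # white until the snapshot reaches us
        self.recorded_state = None
        self.channel_state = {}       # incoming channel -> recorded msgs
        self.marker_seen = set()      # incoming channels already closed

    def msend(self, net):
        """Marker sending rule: record state, turn red, send markers."""
        self.recorded_state = self.balance
        self.red = True
        for q in self.peers:
            net[(self.name, q)].append(MARKER)

    def on_message(self, src, msg, net):
        if msg == MARKER:             # marker receiving rule
            if not self.red:
                self.channel_state[src] = []      # channel recorded as empty
                self.msend(net)
            else:
                self.channel_state.setdefault(src, [])
            self.marker_seen.add(src)
        else:
            # A white message arriving after we turned red, but before the
            # marker on this channel, belongs to the channel state.
            if self.red and src not in self.marker_seen:
                self.channel_state.setdefault(src, []).append(msg)
            self.balance += msg

    def send_money(self, dst, amount, net):
        self.balance -= amount
        net[(self.name, dst)].append(amount)


def deliver(net, src, dst, procs):
    """Deliver the oldest message on the FIFO channel src -> dst."""
    procs[dst].on_message(src, net[(src, dst)].popleft(), net)


net = {("A", "B"): deque(), ("B", "A"): deque()}
procs = {"A": Process("A", 100, ["B"]), "B": Process("B", 0, ["A"])}

procs["A"].send_money("B", 100, net)   # $100 is now in transit
procs["B"].msend(net)                  # B initiates the snapshot
deliver(net, "B", "A", procs)          # A gets the marker, records, replies
deliver(net, "A", "B", procs)          # the $100 arrives at red B
deliver(net, "A", "B", procs)          # A's marker closes channel A -> B

total = sum(p.recorded_state for p in procs.values()) + sum(
    sum(msgs) for p in procs.values() for msgs in p.channel_state.values()
)
print(total)
```

The recorded process states are both $0 here; the $100 shows up only in the recorded state of channel A → B, which is why channel states have to be part of the snapshot.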
Snapshot (1)
[Figure: two panels for accounts A and B: (a) sum = $100, OK; (b) sum = $100, OK, where the msgs arriving before the marker constitute the channel state.]
- Need not use time.

Snapshots (2)
[Figure: two panels for accounts A and B: (c) sum = $200, cannot happen; (d) sum = $100, will be like this.]

Cuts
- A cut C divides all events into PC (those which happened in the past relative to C) and FC (future events).
- Cut C is consistent if there is no message whose sending event is in FC and whose receiving event is in PC.

Progress shown by cuts
[Figure: timelines of P (events p1 to p4) and Q (events q1 to q3) along a Time axis, with numbered cuts.]
- There are 5*4 = 20 possible cuts.

Example
[Figure: timelines of P (p1 to p4), Q (q1 to q5), and R (r1 to r4) along a Time axis, with message M; one cut is labeled consistent, one inconsistent, and the state recorded by SNAPSHOT is a consistent cut.]

Checkpoint
- Cut C is consistent ⇒ C doesn't contradict the sequence of events experienced by any site ⇒ one can assume it did exist at the same time.
- A snapshot can be used as a checkpoint, from which activity in the distributed system can be resumed after a crash.
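The consistency condition for cuts can be checked mechanically. The sketch below uses a hypothetical representation: a cut is a mapping from process name to the number of that process's events lying in the past PC, and a message is a pair of (process, event index) endpoints with 1-based indices. The single message from p2 to q3 is an invented example, not one taken from the figures.

```python
import math


def is_consistent(cut, messages):
    """Return True if no message is received in PC but sent in FC.

    cut      -- dict: process -> number of its events in the past PC
    messages -- list of ((sender, send_idx), (receiver, recv_idx)),
                event indices counted from 1 on each process
    """
    for (sp, si), (rp, ri) in messages:
        sent_in_future = si > cut[sp]
        received_in_past = ri <= cut[rp]
        if sent_in_future and received_in_past:
            return False
    return True


def number_of_cuts(event_counts):
    """A process with n events can be cut at n + 1 positions."""
    return math.prod(n + 1 for n in event_counts.values())


# Hypothetical run: P's 2nd event sends a message that Q receives as its 3rd.
messages = [(("P", 2), ("Q", 3))]

print(is_consistent({"P": 2, "Q": 2}, messages))  # sent, still in transit: fine
print(is_consistent({"P": 1, "Q": 3}, messages))  # received before sent: not fine
print(number_of_cuts({"P": 4, "Q": 3}))           # 5 * 4
```

A cut with the send in the past and the receive in the future is still consistent; the message simply counts as in transit, which corresponds to the channel state recorded by the snapshot.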
