Synchronization
Kenneth Chiu

Reporting Performance with Percentages
• Suppose A takes 100 s and B takes 80 s. Which is correct?
  – B is 25% faster than A.
  – B is 20% faster than A.
• Notes: expressed as rates, 1/100 = 0.01 and 1/80 = 0.0125 per second. B's rate is 25% higher; B's time is 20% lower.

Introduction
• Introduction
• Architectures
  – Network
  – Bus
• Simple Approaches
  – Spin on test-and-set
  – Spin on read
• New Software Alternatives
  – Delays
  – Queuing
• Hardware Solutions
• Summary

Mutual Exclusion
• Pure software solutions are inefficient
  – Dekker’s algorithm
  – Peterson’s algorithm
• Most ISAs provide atomic instructions
  – Test-and-set
  – Atomic increment
  – Compare-and-swap
  – Load-linked/store-conditional
  – Atomic exchange

Spinning (Busy-Waiting)
• What is a spin-lock?
  – Test
  – Test-Lock
  – Release
• Isn’t busy-waiting bad?
  – Blocking is expensive
  – Adaptive
• What should happen on a uniprocessor?

Architectures
• Multistage interconnection network
  – Without coherent private caches
  – With invalidation-based cache coherence using remote directories
• Bus
  – Without coherent private caches
  – With snoopy write-through invalidation-based cache coherence
  – With snoopy write-back invalidation-based cache coherence
  – With snoopy distributed-write cache coherence

Hardware Support for Mutex
• Atomic RMW (read-modify-write). Requires:
  – Read
  – Write
  – Arbitration
  – Locking
• Usually collapsed into one or two transactions.

Multistage Networks
• Requests are forwarded through a series of switches.
• Cached copies (recorded in the remote directory) are invalidated.
• Computation can be done remotely.
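
The atomic instructions listed under Mutual Exclusion correspond roughly to operations in C11's <stdatomic.h>. The following is a minimal sketch, assuming a C11 compiler. The wrapper names (test_and_set, read_and_increment, compare_and_swap, swap_value) and the demo function are hypothetical and chosen only to match the slide vocabulary. Load-linked/store-conditional is left out because C11 does not expose it directly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-set: atomically set the flag and return its previous value. */
bool test_and_set(atomic_flag *flag) {
    return atomic_flag_test_and_set(flag);
}

/* Atomic increment: add 1 and return the value held before the add. */
int read_and_increment(atomic_int *counter) {
    return atomic_fetch_add(counter, 1);
}

/* Compare-and-swap: store `desired` only if the current value equals
   `expected`; returns true when the store happened. */
bool compare_and_swap(atomic_int *obj, int expected, int desired) {
    return atomic_compare_exchange_strong(obj, &expected, desired);
}

/* Atomic exchange: store `value` and return the value it replaced. */
int swap_value(atomic_int *obj, int value) {
    return atomic_exchange(obj, value);
}

/* Single-threaded walk-through of the four wrappers (illustration only). */
void atomic_primitives_demo(void) {
    atomic_flag flag = ATOMIC_FLAG_INIT;
    atomic_int n;
    atomic_init(&n, 5);

    assert(!test_and_set(&flag));          /* flag was clear, now set */
    assert(test_and_set(&flag));           /* second attempt sees it set */
    assert(read_and_increment(&n) == 5);   /* n is now 6 */
    assert(compare_and_swap(&n, 6, 10));   /* matches, n becomes 10 */
    assert(!compare_and_swap(&n, 6, 11));  /* stale expectation fails */
    assert(swap_value(&n, 3) == 10);       /* n becomes 3 */
    assert(atomic_load(&n) == 3);
}
```

Each wrapper is a single read-modify-write, which is the property the Hardware Support for Mutex slide describes: the read and the write cannot be separated by another processor's access.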
Bus
• The bus is used for arbitration.
• A processor acquires the bus and raises a line.

Simple Approaches

Three Concerns
• To minimize delay before reacquisition, processors should retry often.
• Retrying often creates a lot of bus activity, which degrades performance.
• Complex algorithms may address the first two concerns, but will then have high latency for uncontended locks.

Performance Under Contention Unimportant?
• Could be argued:
  – A highly parallel application by definition has no lock with contention.
  – If you have contention, you’ve designed it wrong.
• Specious argument
  – Non-linear behavior. Contention may be worse with bad locks.
  – Can’t always redesign the program.

Spin on Test-And-Set
• Code
  – Initial condition
      lock := CLEAR;
  – Lock
      while (TestAndSet(lock) = BUSY);
  – Unlock
      lock := CLEAR;
• Performance
  – Contention on the lock datum.
    • Architectures don’t allow the lock holder to have priority.
    • What about priority inheritance?
  – General subsystem contention.

Spin on Read
• Pseudo-Code
  – Lock
      while (lock = BUSY or TestAndSet(lock) = BUSY);
    • Assume short-circuiting.
• Performance
  – When busy, reads out of cache.
  – Upon release, either each copy is updated (distributed-write) or each processor takes a read miss.
  – Better?

Better?
P1: W(i)
P2: R(m) T(i)
P3: R(m) T(i) R R
P4: R(m) R(m) R R

Spin on Read
• Upon release, each processor will incur a read miss. Each processor will then read the value. One processor will read the new value and do the first test-and-set.
• Processors that miss the unlocked window will resume spinning. Processors that see the lock unlocked will try to test-and-set.
• Each failing test-and-set will invalidate all other cache copies, causing them to miss again.

Quiescence
• Memory requests issued before quiescence will be slowed down.
  – Memory requests issued after quiescence are unaffected.
• This is a per-critical-section overhead, so short critical sections will suffer.
• Fixed priority bus arbitration
  – Lock holder has highest priority, so it will not be slowed.
  – There are race conditions that will upset the priority.
• If the lock is released before quiescence, some processors will still be contending for it.

Spin on Read Analysis
• Several factors
  – Delay between detecting that the lock has been released and attempting to acquire it with test-and-set.
  – Invalidation occurs during the test-and-set.
  – Invalidation-based cache coherence requires O(P) bus/network cycles to broadcast a value.
    • Can’t they snoop?
• Broadcasting updates exacerbates the problem. (Why?)
  – Since they all try the test-and-set.

Performance
• Sequent shared-memory multiprocessor with twenty 386 processors.
  – Write-back, invalidation-based.
  – Acquiring and releasing normally takes 5.6 microseconds.
• Total elapsed time for all processors to execute a critical section (CS) one million times.
  – Lock, execute critical section, release, delay
  – Mean delay is 5 times the length of the critical section.
    • What’s the purpose of the delay?
  – Lock and data in separate cache lines.
    • What’s “false sharing”?

Performance
• Ideal is the time with free spin-waiting.

Quiescence Time
• How to measure?
  – A critical section that delays, then uses the bus heavily.
• If the delay is long enough, then the time to execute the critical section should be the same on one processor as on all processors.
• Perform a search.
  – What kind of search? (How would you pick the times?)
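
The two simple approaches, spin on test-and-set and spin on read, can be sketched in C11 as follows. This is a minimal illustration under stated assumptions, not the code that was measured: the type SpinLock, the helper lock_test_and_set, and the function names are hypothetical, and TestAndSet is modeled with atomic_exchange_explicit.

```c
#include <assert.h>
#include <stdatomic.h>

enum { CLEAR = 0, BUSY = 1 };

typedef struct { atomic_int state; } SpinLock;

/* Models TestAndSet(lock): write BUSY, return the previous value. */
int lock_test_and_set(SpinLock *l) {
    return atomic_exchange_explicit(&l->state, BUSY, memory_order_acquire);
}

/* Spin on test-and-set: every loop iteration is an atomic write,
   so every waiter keeps generating bus traffic. */
void tas_lock(SpinLock *l) {
    while (lock_test_and_set(l) == BUSY) { }
}

/* Spin on read: poll with ordinary loads (served from the local cache
   while the lock is held) and attempt the test-and-set only when the
   lock looks free. C's || short-circuits, as the slide assumes. */
void ttas_lock(SpinLock *l) {
    while (atomic_load_explicit(&l->state, memory_order_relaxed) == BUSY
           || lock_test_and_set(l) == BUSY) { }
}

/* Unlock: lock := CLEAR. */
void spin_unlock(SpinLock *l) {
    atomic_store_explicit(&l->state, CLEAR, memory_order_release);
}

/* Single-threaded walk-through (illustration only). */
void spin_demo(void) {
    SpinLock l;
    atomic_init(&l.state, CLEAR);

    tas_lock(&l);                            /* uncontended: returns at once */
    assert(lock_test_and_set(&l) == BUSY);   /* a second attempt would fail */
    spin_unlock(&l);

    ttas_lock(&l);
    assert(atomic_load(&l.state) == BUSY);
    spin_unlock(&l);
    assert(atomic_load(&l.state) == CLEAR);
}
```

The only difference between the two lock functions is the leading read. That read is what keeps waiters off the bus while the lock is held, and it is also why all waiters miss and retry together when the lock is released.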
Quiescence Time

New Software Alternatives

Inserting Delays
• Two dimensions
  – Insertion location
    • After release
    • After every access
  – Delay length
    • static: fixed for a particular processor
    • dynamic: changes

CSMA/CD
• Ethernet algorithm
  – Listen for idle line
  – Immediately try to transmit when idle
  – If collision occurs (how do we know?), then wait and retry
  – Increase wait time exponentially
• Bonus question: why does Ethernet have a minimum packet size?

Delay After Release Detection
• Idea is to minimize unsuccessful test-and-set instructions.
• Two kinds of delay:
  – Static
    • Each processor is statically assigned a slot from 0 to N – 1.
    • Number of slots can be adjusted.
      – Few processors
        » Many slots, high latency
        » Few slots, good performance
      – Many processors
        » Many slots, good performance
        » Few slots, high contention
  – Dynamic

Dynamic Delay
• CSMA/CD collision has fundamentally different properties.
  – In locking “collision”, the first locker succeeded.
  – In CSMA/CD collision, no sender succeeded.
• What happens with exponential backoff for 10 lockers?
• Solution is to bound the delay.

Delay between References
• Doesn’t work well with dynamic delay.
  – Backoff continues while the locking processor is in the critical section.
  – Delay should be tied to the number of spinning processors, not the length of the critical section.
• Any possible alternatives to estimating the number of spinning processors?

Performance
• 1 microsecond to execute a test-and-set.
• Queuing done with explicit lock.
• Ideal time subtracted to show overhead only.
• One processor time shows latency.
• Static worse with few processors.
• With many processors, backoff slightly worse.

Spin-Waiting Overhead vs No. of Slots
• Need lots of slots when there are lots of processors.

Queuing
• Use shared counter to keep track of the number of spinning processors.
  – Two extra atomic instructions per critical section.
  – Each spinning processor must read counter.
• Use explicit queue
  – Doesn’t really get anywhere, since the queue itself needs a lock.

Use Array of Flags
• Each processor spins on its own memory location.
• To unlock, signal next memory location.
• Use atomic increment to assign memory locations.
• Use modular arithmetic to avoid infinite arrays.

Queue
• Code
  – Init
      flags[0] := HAS_LOCK; flags[1..P – 1] := MUST_WAIT; queueLast := 0;
  – Lock
      myPlace := ReadAndIncrement(queueLast);
      while (flags[myPlace mod P] = MUST_WAIT);
      …
      flags[myPlace mod P] := MUST_WAIT;
  – Unlock
      flags[(myPlace + 1) mod P] := HAS_LOCK;
• What happens on overflow of queueLast? How to fix?
• Memory barriers needed?

Performance
• Atomic increment is emulated.
• Initial latency is high.

Overhead in Achieving Barrier
• Barrier
  – Timestamp is taken on release.
  – Timestamp is taken when the last processor acquires the lock.

Hardware Solutions
• Networks
  – Combining
  – Hardware queuing
• Bus
  – Invalidate only if lock value changes.
    • Performance still degrades as the number of processors goes up.
  – More snooping
    • Snoop read miss data
    • Snoop test-and-set requests
      – First read miss (snoop miss data)
        » If busy, abort
        » If free, then try locking bus
      – While waiting, monitor the bus.
        » Abort if someone else gets the lock

Summary
• Memory operations are not free.
• Memory operations are not independent on shared-memory machines.
• Writes are expensive.
• Atomic instructions are even more expensive.
• Don’t kill the bus.
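
The bounded dynamic delay from the Dynamic Delay slide can be sketched on top of spin on read. This is one possible arrangement under stated assumptions, not necessarily the variant that was measured: the names BackoffLock, next_delay, pause_for, and the MIN_DELAY and MAX_DELAY bounds (counted in empty loop iterations) are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

enum { CLEAR = 0, BUSY = 1 };

/* Assumed bounds on the backoff, measured in empty loop iterations. */
static const unsigned MIN_DELAY = 1u;
static const unsigned MAX_DELAY = 1024u;

typedef struct { atomic_int state; } BackoffLock;

/* Exponential backoff with an upper bound: double, then clamp. */
unsigned next_delay(unsigned delay) {
    unsigned doubled = delay * 2u;
    return doubled > MAX_DELAY ? MAX_DELAY : doubled;
}

/* Busy-wait for `iterations` loop trips; volatile keeps the loop alive. */
void pause_for(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

/* Delay after release detection: on each failed attempt, spin on read
   until the lock looks free, wait out the current delay, then retry
   with a longer (but bounded) delay. */
void backoff_lock(BackoffLock *l) {
    unsigned delay = MIN_DELAY;
    while (atomic_load_explicit(&l->state, memory_order_relaxed) == BUSY
           || atomic_exchange_explicit(&l->state, BUSY,
                                       memory_order_acquire) == BUSY) {
        while (atomic_load_explicit(&l->state, memory_order_relaxed) == BUSY) { }
        pause_for(delay);
        delay = next_delay(delay);
    }
}

void backoff_unlock(BackoffLock *l) {
    atomic_store_explicit(&l->state, CLEAR, memory_order_release);
}

/* Single-threaded walk-through (illustration only). */
void backoff_demo(void) {
    BackoffLock l;
    atomic_init(&l.state, CLEAR);

    backoff_lock(&l);                  /* free lock: acquired with no delay */
    assert(atomic_load(&l.state) == BUSY);
    backoff_unlock(&l);

    assert(next_delay(1u) == 2u);      /* doubles while under the bound */
    assert(next_delay(MAX_DELAY) == MAX_DELAY);  /* clamped at the bound */
}
```

The clamp in next_delay is the "bound the delay" step: unlike a CSMA/CD collision, a lock collision means some processor did succeed, so an unbounded backoff would only penalize the waiters that kept losing.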
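
The array-of-flags queue from the Queue slide translates to C11 fairly directly. The sketch below is an illustration under assumptions rather than the measured implementation: P is fixed at 4, ReadAndIncrement is modeled with atomic_fetch_add, the names QueueLock, queue_init, queue_lock, and queue_unlock are hypothetical, and the acquire/release orderings are one possible answer to the slide's memory-barrier question.

```c
#include <assert.h>
#include <stdatomic.h>

#define P 4u   /* assumed maximum number of contending processors */

enum { MUST_WAIT = 0, HAS_LOCK = 1 };

typedef struct {
    atomic_int  flags[P];     /* one spin location per waiter */
    atomic_uint queue_last;   /* next place to hand out */
} QueueLock;

void queue_init(QueueLock *q) {
    atomic_init(&q->flags[0], HAS_LOCK);
    for (unsigned i = 1; i < P; i++)
        atomic_init(&q->flags[i], MUST_WAIT);
    atomic_init(&q->queue_last, 0u);
}

/* Take a place in line, spin on that place's flag only, then reset the
   flag for reuse. Returns the place, which the caller passes to unlock.
   queue_last is unsigned, so it wraps modulo a power of two; because P
   is also a power of two here, (place % P) stays consistent across the
   wrap. */
unsigned queue_lock(QueueLock *q) {
    unsigned my_place = atomic_fetch_add(&q->queue_last, 1u);
    while (atomic_load_explicit(&q->flags[my_place % P],
                                memory_order_acquire) == MUST_WAIT) { }
    atomic_store_explicit(&q->flags[my_place % P], MUST_WAIT,
                          memory_order_relaxed);
    return my_place;
}

/* Hand the lock to whoever holds (or will take) the next place. */
void queue_unlock(QueueLock *q, unsigned my_place) {
    atomic_store_explicit(&q->flags[(my_place + 1u) % P], HAS_LOCK,
                          memory_order_release);
}

/* Single-threaded walk-through: go around the flag array twice. */
void queue_demo(void) {
    QueueLock q;
    queue_init(&q);
    for (unsigned i = 0; i < 2u * P; i++) {
        unsigned place = queue_lock(&q);
        assert(place == i);
        assert(atomic_load(&q.flags[place % P]) == MUST_WAIT);
        queue_unlock(&q, place);
        assert(atomic_load(&q.flags[(place + 1u) % P]) == HAS_LOCK);
    }
}
```

Each waiter spins on its own array element, so a release touches only the next waiter's flag. In a real layout each flag would likely sit in its own cache line to avoid false sharing; the sketch omits that padding for brevity.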