Synchronization by pengtt



  Kenneth Chiu
    Reporting Performance with
• Suppose A takes 100 s and B takes 80 s.
  Which is correct?
  – B is 25% faster than A.
  – B is 20% faster than A.
• Notes:
  1/100 = .01
  1/80 = .0125
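The notes above read as rates (jobs per second). As a minimal sketch of where both numbers come from, assuming "faster" compares rates and "less time" compares elapsed times (the function names are hypothetical, not from the slides):

```c
/* Percent by which B is faster than A, comparing rates (jobs per second).
 * time_a and time_b are elapsed times for the same job. */
static double percent_faster(double time_a, double time_b)
{
    double rate_a = 1.0 / time_a;   /* 1/100 = 0.01   */
    double rate_b = 1.0 / time_b;   /* 1/80  = 0.0125 */
    return (rate_b / rate_a - 1.0) * 100.0;
}

/* Percent less elapsed time B needs than A, comparing times directly. */
static double percent_less_time(double time_a, double time_b)
{
    return (1.0 - time_b / time_a) * 100.0;
}
```

With A = 100 s and B = 80 s, the rate comparison gives about 25 and the time comparison about 20, so the answer depends on which quantity is being compared.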
• Introduction
• Architectures
   – Network
   – Bus
• Simple Approaches
   – Spin on test-and-set
   – Spin on read
• New Software Alternatives
   – Delays
   – Queuing
• Hardware Solutions
• Summary
           Mutual Exclusion
• Pure software solutions are inefficient
  – Dekker’s algorithm
  – Peterson’s algorithm
• Most ISAs provide atomic instructions
  – Test-and-set
  – Atomic increment
  – Compare-and-swap
  – Load-linked/store-conditional
  – Atomic exchange
     Spinning (Busy-Waiting)
• What is a spin-lock?
  – Test
  – Test-Lock
  – Release
• Isn’t busy-waiting bad?
  – Blocking is expensive
  – Adaptive
• What should happen on a uniprocessor?
            Architectures
• Multistage interconnection network
  – Without coherent private caches
  – With invalidation-based cache coherence using
    remote directories
• Bus
  – Without coherent private caches
  – With snoopy write-through invalidation-based cache
  – With snoopy write-back invalidation-based cache
  – With snoopy distributed-write cache coherence
  Hardware Support for Mutex
• Atomic RMW. Requires:
  – Read
  – Write
  – Arbitration
  – Locking
• Usually collapsed into one or two
        Multistage Networks
• Requests forwarded through series of switches.
• Cached copies (recorded in remote
  directory) invalidated.
• Computation can be done remotely.
• Bus used for arbitration
• Processor acquires the bus and raises a
            Simple Approaches
           Three Concerns
• To minimize delay before reacquisition,
  should retry often.
• Retrying often creates a lot of bus activity,
  which degrades performance.
• Complex algorithms may address first two,
  but then will have high latency for
  uncontended locks.
  Performance Under Contention
• Could be argued:
  – Highly parallel application by definition has no
    lock with contention.
  – If you have contention, you’ve designed it wrong.
• Specious argument
  – Non-linear behavior. Contention may be
    worse with bad locks.
  – Can’t always redesign the program.
         Spin on Test-And-Set
• Code
  – Initial condition
     lock := CLEAR;
  – Lock
     while (TestAndSet(lock) = BUSY);
  – Unlock
     lock := CLEAR;
• Performance
  – Contention on the lock datum.
     • Architectures don’t allow lock holder to have priority.
     • What about priority inheritance?
  – General subsystem contention.
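The pseudo-code above maps fairly directly onto C11 atomics. This is a sketch rather than the slides' own implementation: `tas_lock`, `tas_try`, `tas_acquire` and `tas_release` are hypothetical names, and the acquire/release memory orders are an assumption about what a critical section needs.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag busy; } tas_lock;   /* hypothetical name */

/* One test-and-set: returns true if this call took the lock. */
static bool tas_try(tas_lock *l)
{
    /* atomic_flag_test_and_set returns the previous value of the flag. */
    return !atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire);
}

/* Lock: while (TestAndSet(lock) = BUSY); */
static void tas_acquire(tas_lock *l)
{
    while (!tas_try(l))
        ;   /* every iteration is an atomic read-modify-write on the lock datum */
}

/* Unlock: lock := CLEAR; */
static void tas_release(tas_lock *l)
{
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}
```

A lock would be declared as `tas_lock l = { ATOMIC_FLAG_INIT };`. Each spinning processor issues a write-like operation on every iteration, which is the source of the contention listed above.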
             Spin on Read
• Pseudo-Code
  – Lock
    while (lock = BUSY or TestAndSet(lock) = BUSY);
    • Assume short-circuiting.
• Performance
  – When busy, reads out of cache.
  – Upon release, either each copy updated
    (distributed-write) or read miss.
  – Better?
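Spin on read (test-and-test-and-set) can be sketched with C11 atomics as follows. The type and function names are hypothetical, and `atomic_exchange` stands in for the test-and-set instruction; the short-circuit in the pseudo-code becomes an explicit inner read loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { CLEAR = 0, BUSY = 1 };
typedef struct { atomic_int state; } ttas_lock;   /* hypothetical name */

static void ttas_init(ttas_lock *l) { atomic_init(&l->state, CLEAR); }

/* Test-and-set via exchange: the old value tells us whether we won. */
static bool ttas_try(ttas_lock *l)
{
    return atomic_exchange_explicit(&l->state, BUSY,
                                    memory_order_acquire) == CLEAR;
}

static void ttas_acquire(ttas_lock *l)
{
    for (;;) {
        /* Spin on plain reads; with coherent caches these hit locally. */
        while (atomic_load_explicit(&l->state, memory_order_relaxed) == BUSY)
            ;
        /* Lock looked free: only now pay for the atomic instruction. */
        if (ttas_try(l))
            return;
    }
}

static void ttas_release(ttas_lock *l)
{
    atomic_store_explicit(&l->state, CLEAR, memory_order_release);
}
```

While the lock is held, waiters generate no bus traffic in this sketch; the cost moves to the moment of release, when every waiter's cached copy is invalidated or updated.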

P1   W(i)

P2          R(m)      T(i)

P3           R(m)            T(i)      R       R

P4                  R(m)            R(m)   R       R
                Spin on Read
• Upon release, each processor will incur a read
  miss. Each processor will then read the value.
  One processor will read the new value and do
  the first test-and-set.
• Processors that miss the unlocked window will
  resume spinning. Processors that see the lock
  unlocked will try to test-and-set.
• Each failing test-and-set will invalidate all other
  cache copies, causing them to miss again.
• Memory requests issued before quiescence will be slowed down.
   – Memory requests issued after quiescence are unaffected.
• This is a per-critical section overhead, so short
  critical sections will suffer.
• Fixed priority bus arbitration
   – Lock holder has highest priority so will not be slowed.
   – Race conditions that will upset the priority.
      • If the lock is released before quiescence, there will be processors
        that are still contending for it.
        Spin on Read Analysis
• Several factors
  – Delay between detecting that the lock has been released
    and attempting to acquire it with test-and-set.
  – Invalidation occurs during the test-and-set.
  – Invalidation-based cache-coherence requires O(P)
    bus/network cycles to broadcast a value.
     • Can’t they snoop?
• Broadcasting updates exacerbates the problem.
  – Since they all try the test-and-set.
• Sequent shared-memory multiprocessor with 20 Intel 386 processors
  – Write-back, invalidation-based cache coherence.
  – Acquiring and releasing a lock normally takes 5.6 microseconds.
• Total elapsed time for all processors to execute
  a CS one million times.
  – Lock, execute critical section, release, delay
  – Mean delay is equal to five times the length of the critical section.
     • What’s the purpose of the delay?
  – Lock and data in separate cache lines.
     • What’s “false sharing”?

Ideal is with free spin-waiting.
            Quiescence Time
• How to measure?
  – A critical section that delays, then uses bus
     • If delay is long enough, then time to execute the
       critical section should be the same on one
       processor as on all processors.
     • Perform a search.
        – What kind of search? (How would you pick the times?)
[Figure: Quiescence Time]
     New Software Alternatives
             Inserting Delays
• Two dimensions
  – Insertion location
     • After release
     • After every access
  – Delay length
     • static: fixed for a particular processor
     • dynamic: changes
• Ethernet algorithm
  – Listen for idle line
  – Immediately try to transmit when idle
  – If collision occurs (how do we know?), then
    wait and retry
  – Increase wait time exponentially
• Bonus question: why does Ethernet have
  a minimum packet size?
 Delay After Release Detection
• Idea is to minimize unsuccessful test-and-set
• Two kinds of delay:
  – Static
     • Each processor statically assigned a slot from 0 to N – 1.
     • Number of slots can be adjusted.
         – Few processors
             » Many slots, high latency
             » Few slots, good performance
         – Many processors
             » Many slots, good performance
             » Few slots, high contention
  – Dynamic
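The static variant can be sketched as a spin-on-read lock that waits out a fixed slot after it sees the release. The names `slot_lock`, `pause_for`, `my_slot` and `slot_len` are hypothetical, and the busy-wait loop is only a stand-in for a delay calibrated to the machine.

```c
#include <stdatomic.h>

enum { CLEAR = 0, BUSY = 1 };
typedef struct { atomic_int state; } slot_lock;   /* hypothetical name */

static void slot_init(slot_lock *l) { atomic_init(&l->state, CLEAR); }

/* Crude busy-wait; a real implementation would calibrate this. */
static void pause_for(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++)
        ;
}

/* my_slot is this processor's statically assigned slot, 0 to N - 1. */
static void slot_acquire(slot_lock *l, unsigned my_slot, unsigned slot_len)
{
    for (;;) {
        /* Spin on read until the release is detected. */
        while (atomic_load_explicit(&l->state, memory_order_relaxed) == BUSY)
            ;
        /* Delay after release detection: wait for our own slot. */
        pause_for(my_slot * slot_len);
        /* Re-check by read; test-and-set only if the lock is still free. */
        if (atomic_load_explicit(&l->state, memory_order_relaxed) == CLEAR &&
            atomic_exchange_explicit(&l->state, BUSY,
                                     memory_order_acquire) == CLEAR)
            return;
    }
}

static void slot_release(slot_lock *l)
{
    atomic_store_explicit(&l->state, CLEAR, memory_order_release);
}
```

If another processor with an earlier slot takes the lock during the delay, the re-check sees BUSY and no test-and-set is issued, which is the point of the scheme.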
            Dynamic Delay
• CSMA/CD collision has fundamentally
  different properties.
  – In locking “collision”, the first locker succeeds.
  – In CSMA/CD collision, no sender succeeded.
• What happens with exponential backoff for
  10 lockers?
• Solution is to bound the delay.
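A bounded exponential backoff might look like the sketch below. It assumes the delay doubles after each failed test-and-set and is capped at `max_delay`; the names, the starting delay of 1, and the busy-wait loop are all hypothetical choices, not taken from the slides.

```c
#include <stdatomic.h>

enum { CLEAR = 0, BUSY = 1 };
typedef struct { atomic_int state; } backoff_lock;   /* hypothetical name */

static void backoff_init(backoff_lock *l) { atomic_init(&l->state, CLEAR); }

static void pause_for(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++)
        ;
}

/* Double the delay, but never exceed max_delay (the bound). */
static unsigned next_delay(unsigned delay, unsigned max_delay)
{
    return delay >= max_delay / 2 ? max_delay : delay * 2;
}

static void backoff_acquire(backoff_lock *l, unsigned max_delay)
{
    unsigned delay = 1;
    for (;;) {
        while (atomic_load_explicit(&l->state, memory_order_relaxed) == BUSY)
            ;
        pause_for(delay);
        if (atomic_load_explicit(&l->state, memory_order_relaxed) != CLEAR)
            continue;   /* someone else won during the delay; no collision */
        if (atomic_exchange_explicit(&l->state, BUSY,
                                     memory_order_acquire) == CLEAR)
            return;
        delay = next_delay(delay, max_delay);   /* failed test-and-set */
    }
}

static void backoff_release(backoff_lock *l)
{
    atomic_store_explicit(&l->state, CLEAR, memory_order_release);
}
```

Without the cap, a processor that loses repeatedly keeps doubling its delay even though every "collision" still handed the lock to someone, so the bound keeps late arrivals from being starved.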
   Delay between References
• Doesn’t work well with dynamic delay.
  – Backoff continues while locking processor in
    critical section.
  – Delay should be tied to number of spinning
    processors, not the length of the critical section.
• Any possible alternatives to estimating the
  number of spinning processors?

• 1 microsecond to execute a test-and-set.
• Queuing done with explicit lock.
• Ideal time subtracted to show overhead only.
• One processor time shows latency.
• Static worse with few processors.
• With many processors, backoff slightly worse
 Spin-Waiting Overhead vs No. of

• Need lots of slots when lots of processors.
• Use shared counter to keep track of no. of
  spinning processors.
  – Two extra atomic instructions per critical section.
  – Each spinning processor must read counter.
• Use explicit queue
  – Doesn’t really get anywhere, since need a
    lock for the queue.
         Use Array of Flags
• Each processor spins on its own memory location.
• To unlock, signal next memory location.
• Use atomic increment to assign memory locations.
• Use modular arithmetic to avoid infinite arrays.
• Code
   – Init
       flags[0] := HAS_LOCK;
       flags[1..P – 1] := MUST_WAIT;
       queueLast := 0;
   – Lock
       myPlace := ReadAndIncrement(queueLast);
       while (flags[myPlace mod P] = MUST_WAIT);
       flags[myPlace mod P] := MUST_WAIT;
   – Unlock
       flags[(myPlace + 1) mod P] := HAS_LOCK;
• What happens on overflow of queueLast? How to fix?
• Memory barriers needed?
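The array-of-flags lock can be sketched in C11 as below. Several choices here are assumptions rather than part of the slides: `P` is fixed at 8, each flag is padded to an assumed 64-byte cache line, `atomic_fetch_add` plays the role of ReadAndIncrement, and acquire/release orderings are one possible answer to the memory-barrier question.

```c
#include <stdatomic.h>

#define P 8u   /* hypothetical processor count; a power of two */

enum { MUST_WAIT = 0, HAS_LOCK = 1 };

typedef struct {
    /* One flag per (assumed 64-byte) cache line: each waiter spins alone. */
    struct { _Alignas(64) atomic_int flag; } flags[P];
    atomic_uint queue_last;
} array_lock;   /* hypothetical name */

static void array_init(array_lock *l)
{
    atomic_init(&l->flags[0].flag, HAS_LOCK);
    for (unsigned i = 1; i < P; i++)
        atomic_init(&l->flags[i].flag, MUST_WAIT);
    atomic_init(&l->queue_last, 0u);
}

/* Returns the caller's place, which must be passed to array_release. */
static unsigned array_acquire(array_lock *l)
{
    /* ReadAndIncrement: take a ticket, then reduce it modulo P. */
    unsigned my_place = atomic_fetch_add(&l->queue_last, 1u) % P;
    while (atomic_load_explicit(&l->flags[my_place].flag,
                                memory_order_acquire) == MUST_WAIT)
        ;   /* spin on our own location only */
    /* Reset our slot so it is ready for its next user. */
    atomic_store_explicit(&l->flags[my_place].flag, MUST_WAIT,
                          memory_order_relaxed);
    return my_place;
}

static void array_release(array_lock *l, unsigned my_place)
{
    /* Signal the next location in the circular array. */
    atomic_store_explicit(&l->flags[(my_place + 1u) % P].flag, HAS_LOCK,
                          memory_order_release);
}
```

Because unsigned arithmetic wraps modulo a power of two and `P` here also is one, the sequence of places stays consistent when `queue_last` overflows; with a `P` that is not a power of two, the wraparound would skip or repeat slots.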

• Atomic increment is emulated.
• Initial latency is high.
    Overhead in Achieving Barrier

• Barrier
   – Timestamp is taken on release.
   – Timestamp taken when last processor acquires the lock.
            Hardware Solutions
• Networks
  – Combining
  – Hardware queuing
• Bus
  – Invalidate only if lock value changes.
        • Performance still degrades as the number of processors goes up.
  – More snooping
        • Snoop read miss data
        • Snoop test-and-set requests
            – First read miss (snoop miss data)
                 » If busy, abort
                 » If free, then try locking bus
            – While waiting, monitor the bus.
                 » Abort if someone else gets the lock
                Summary
• Memory operations are not free.
• Memory operations are not independent
  on shared-memory machines.
• Writes are expensive.
• Atomic instructions are even more expensive.
• Don’t kill the bus.
