Practical Concerns for Scalable Synchronization

Jonathan Walpole (PSU)
Paul McKenney (IBM)
Tom Hart (University of Toronto)
“Life is just one darned thing after another”
    –   Elbert Hubbard

“Multiprocessing is just one darned thing before, after, or simultaneously with another”

“Synchronization is about imposing order”
The problem – race conditions

“i++” is dangerous if “i” is global. It compiles to three instructions:

    load %1,i
    inc %1
    store %1,i

Two CPUs can interleave these steps:

                  CPU 0               CPU 1
                  load %1,i           load %1,i       (each reads i)
                  inc %1              inc %1          (each computes i+1)
                  store %1,i          store %1,i      (each writes i+1)

The final value of i is i+1, not i+2 – one increment is lost.

                 Jonathan Walpole   www.cs.pdx.edu/~walpole    5
SFU Feb 2004
The solution – critical sections

Classic multiprocessor solution: spinlocks
    –   CPU 1 waits for CPU 0 to release the lock

                  spin_lock(&mylock);
                  i++;
                  spin_unlock(&mylock);


Counts are accurate, but locks are not free!



Critical-section efficiency

A locked operation breaks down into three phases:
    –   Lock Acquisition (Ta)
    –   Critical Section (Tc)
    –   Lock Release (Tr)

    Critical-section efficiency = Tc / (Tc + Ta + Tr)

  Ignoring lock contention and cache conflicts in the critical section



Critical section efficiency

[Figure: critical-section efficiency vs. critical section size]
Performance of normal instructions




What’s going on?

Taller memory hierarchies
    –   Memory speeds have not kept up with CPU speeds
    –   1984: no caches needed, since instructions were
        slower than memory accesses
    –   2005: 3-4 level cache hierarchies, since
        instructions are orders of magnitude faster than
        memory accesses




Why does this matter?

Synchronization implies sharing data across CPUs
    –   normal instructions tend to hit in top-level cache
    –   synchronization operations tend to miss

Synchronization requires a consistent view of data
    –   between cache and memory
    –   across multiple CPUs
    –   requires CPU-CPU communication

Synchronization instructions see memory latency!

… but that’s not all!

Longer pipelines
    –   1984: Many clocks per instruction
    –   2005: Many instructions per clock, 20-stage pipelines
Out of order execution
    –   Keeps the pipelines full
    –   Must not reorder the critical section before its lock!

 Synchronization instructions stall the pipeline!



Reordering means weak memory consistency

Memory barriers
    –   Additional synchronization instructions are needed to
        manage reordering
What is the cost of all this?

  Instruction                                Cost
                                     1.45 GHz       3.06GHz
                                     IBM POWER4     Intel Xeon

 Normal Instruction                    1.0              1.0




Atomic increment

  Instruction                                  Cost
                                       1.45 GHz       3.06GHz
                                       IBM POWER4     Intel Xeon

 Normal Instruction                      1.0              1.0
 Atomic Increment                      183.1          402.3




Memory barriers

  Instruction                                  Cost
                                       1.45 GHz       3.06GHz
                                       IBM POWER4     Intel Xeon

 Normal Instruction                      1.0              1.0
 Atomic Increment                      183.1          402.3
 SMP Write Memory Barrier              328.6             0.0
 Read Memory Barrier                   328.9          402.3
 Write Memory Barrier                  400.9             0.0




Lock acquisition/release with LL/SC

  Instruction                                   Cost
                                       1.45 GHz        3.06GHz
                                       IBM POWER4      Intel Xeon

 Normal Instruction                       1.0              1.0
 Atomic Increment                       183.1          402.3
 SMP Write Memory Barrier              328.6              0.0
 Read Memory Barrier                   328.9           402.3
 Write Memory Barrier                  400.9             0.0
 Local Lock Round Trip                 1057.5          1138.8




Compare & swap unknown values (NBS)

  Instruction                                   Cost
                                       1.45 GHz        3.06GHz
                                       IBM POWER4      Intel Xeon

 Normal Instruction                       1.0              1.0
 Atomic Increment                       183.1          402.3
 SMP Write Memory Barrier              328.6              0.0
 Read Memory Barrier                   328.9           402.3
 Write Memory Barrier                  400.9             0.0
 Local Lock Round Trip                 1057.5          1138.8
 CAS Cache Transfer & Invalidate        247.1           847.1




Compare & swap known values (spinlocks)

  Instruction                                   Cost
                                       1.45 GHz        3.06GHz
                                       IBM POWER4      Intel Xeon

 Normal Instruction                       1.0              1.0
 Atomic Increment                       183.1          402.3
 SMP Write Memory Barrier              328.6              0.0
 Read Memory Barrier                   328.9           402.3
 Write Memory Barrier                  400.9             0.0
 Local Lock Round Trip                 1057.5          1138.8
 CAS Cache Transfer & Invalidate        247.1           847.1
 CAS Blind Cache Transfer               257.1          993.9



The net result?

1984: Lock contention was the main issue
2005: Critical section efficiency is a key issue

   Even if the lock is always free when you try to
   acquire it, performance can still suck!




How has this affected OS design?

Multiprocessor OS designers search for “scalable”
 synchronization strategies
    –   reader-writer locking instead of global locking
    –   data locking and partitioning
    –   Per-CPU reader-writer locking
    –   Non-blocking synchronization

The “common case” is read-mostly access to linked
 lists and hash-tables
    –   asymmetric strategies favouring readers are good

Review - Global locking

A symmetric approach (also called “code locking”)
    –   A critical section of code is guarded by a lock
    –   Only one thread at a time can hold the lock

Examples include
    –   Monitors
    –   Java “synchronized” on global object
    –   Linux spin_lock() on global spinlock_t

Global locking doesn’t scale due to lock contention!

Review - Reader-writer locking

Many readers can concurrently hold the lock
Writers exclude readers and other writers
The result?
    –   No lock contention in read-mostly scenarios
    –   So it should scale well, right?
    –   … wrong!




Scalability of reader/writer locking

[Timeline diagram: CPU 0 and CPU 1 each perform lock read-acquire and
memory barriers around short critical sections; the acquisitions and
barriers dominate the timeline.]

 Reader/writer locking does not scale due to critical
 section efficiency!
Review - Data locking

A lock per data item instead of one per collection
    –   Per-hash-bucket locking for hash tables
    –   CPUs acquire locks for different hash chains in
        parallel
    –   CPUs incur memory-latency and pipeline-flush
        overheads in parallel


   Data locking improves scalability by executing
   critical section overhead in parallel


Review - Per-CPU reader-writer locking

One lock per CPU (called brlock in Linux)
    –   Readers acquire their own CPU’s lock
    –   Writers acquire all CPUs’ locks
In read-only workloads CPUs never exchange locks
    –   no memory latency is incurred

   Per-CPU R/W locking improves scalability by
   removing memory latency from read-lock
   acquisition for read-mostly scenarios


Scalability comparison

Expected scalability on read-mostly workloads
    –   Global locking – poor due to lock contention
    –   R/W locking – poor due to critical section efficiency
    –   Data locking – better?
    –   R/W data locking – better still?
    –   Per-CPU R/W locking – the best we can do?




Actual scalability

Scalability of locking
strategies using read-
only workloads in a
hash-table benchmark

Measurements taken
on a 4-CPU 700 MHz
P-III system

Similar results are
obtained on more
recent CPUs

Scalability on 1.45 GHz POWER4 CPUs




Performance at different update fractions
on 8 1.45 GHz POWER4 CPUs




What are the lessons so far?

Avoid lock contention!
Avoid synchronization instructions!
    –   … especially in the read path!




How about non-blocking synchronization?

Basic idea – copy & flip pointer (no locks!)
    –   Read a pointer to a data item
    –   Create a private copy of the item to update in place
    –   Swap the old item for the new one using an atomic
        compare & swap (CAS) instruction on its pointer
    –   CAS fails if current pointer not equal to initial value
    –   Retry on failure

        NBS should enable fast reads … in theory!



Problems with NBS in practice

Reusing memory causes problems
    –   Readers holding references can be hijacked during
        data structure traversals when memory is reclaimed
    –   Readers see inconsistent data structures when
        memory is reused

How and when should memory be reclaimed?




Immediate reclamation?

In practice, readers must either
    –   Use LL/SC to test if pointers have changed, or
    –   Verify that version numbers associated with data
        structures have not changed (2 memory barriers)




  Synchronization instructions slow NBS readers!




Reader-friendly solutions

Never reclaim memory ?

Type-stable memory ?
    –   Needs free pool per data structure type
    –   Readers can still be hijacked to the free pool
    –   Exposes OS to denial of service attacks

Ideally, defer reclaiming memory until it’s safe!
    –   Defer reclamation of a data item until references to
        it are no longer held by any thread

How should we defer reclamation?

Wait for a while then delete?
    –   … but how long should you wait?

Maintain reference counts or per-CPU hazard
 pointers on data?
    –   Requires synchronization in read path!


   Challenge – deferring destruction without using
   synchronization instructions in the read path


Quiescent-state-based reclamation

Coding convention:
    –   Don’t allow a quiescent state to occur in a read-side
        critical section
Reclamation strategy:
    –   Only reclaim data after all CPUs in the system have
        passed through a quiescent state
Example quiescent states:
    –   Context switch in non-preemptive kernel
    –   Yield in preemptive kernel
    –   Return from system call …


Coding conventions for readers

Delineate read-side critical section
    –   Compiles to nothing on most architectures

Don’t hold references outside critical sections
    –   Re-traverse data structure to pick up reference

Don’t yield the CPU during critical sections
    –   Don’t voluntarily yield
    –   Don’t block, don’t leave the kernel …


Overview of the basic idea

Writers create new versions
    –   Using locking or NBS to synchronize with each other
    –   Register call-backs to destroy old versions when safe
    –   Call-backs are deferred and memory reclaimed in
        batches

Readers do not use synchronization
    –   While they hold a reference to a version it will not be
        destroyed
    –   Completion of read-side critical sections inferred from
        observation of quiescent states


Context switch as a quiescent state

[Timeline diagram: CPU 0 and CPU 1 run RCU read-side critical sections
while CPU 1 removes an element. Before its next context switch a CPU
may still hold a reference to the old version (RCU can't tell); after
each CPU passes through a context switch it can no longer hold a
reference to the old version.]
Grace periods

[Timeline diagram: CPU 1 deletes an element during CPU 0's RCU
read-side critical section. The grace period extends until every CPU
has subsequently passed through a context switch, after which no
reader can still hold a reference to the deleted element.]
Quiescent states and grace periods

Example quiescent states
    –   Context switch (non-preemptive kernels)
    –   Voluntary context switch (preemptive kernels)
    –   Kernel entry/exit
    –   Blocking call

Grace periods
    –   A period during which every CPU has gone through a
        quiescent state


Efficient implementation

Choosing good quiescent states
    –   Occur anyway
    –   Easy to count
    –   Not too frequent or infrequent

Recording and dispatching call-backs
    –   Minimize inter-CPU communication
    –   Maintain per-CPU queues of call-backs
    –   Two queues – waiting for grace period start and end


RCU's data structures

[Diagram: a global CPU bitmask and global grace-period number; each
CPU keeps a counter, a counter snapshot, and its own grace-period
number. call_rcu() appends to a per-CPU 'next' callback queue
(waiting for the end of the previous grace period, if any), which
becomes the 'current' queue (waiting for the end of the current
grace period).]
RCU implementations

DYNIX/ptx RCU (data center)
Linux
    –   Multiple implementations (in 2.5 and 2.6 kernels)
    –   Preemptible and nonpreemptible
Tornado/K42 “generations”
    –   Preemptive kernel
    –   Helped generalize usage




Experimental results

How do different combinations of RCU, SMR, NBS
 and Locking compare?
Hash table mini-benchmark running on 1.45 GHz
 POWER4 system with 8 CPUs
Various workloads
    –   Read/update fraction
    –   Hash table size
    –   Memory constraints
    –   Number of CPUs


Scalability with working set in cache




Scalability with large working set




Performance at different update fractions
(8 CPUs)




Performance at different update fractions
(2 CPUs)




Performance in read-mostly scenarios




Impact of memory constraints




ADDITIONAL SLIDES

The following slides relate to a different paper.




Performance and complexity

When should RCU be used?
    –   Instead of simple spinlock?
    –   Instead of per-CPU reader-writer lock?
Under what environmental conditions?
    –   Memory-latency ratio
    –   Number of CPUs
Under what workloads?
    –   Fraction of accesses that are updates
    –   Number of updates per grace period

Analytic results

Compute breakeven update-fraction contours for
  RCU vs. locking performance, against:
    –   Number of CPUs (n)
    –   Updates per grace period (λ)
    –   Memory-latency ratio (r)
Look at computed memory-latency ratio at
  extreme values of λ for n=4 CPUs




Breakevens for RCU worst case
(f vs. r for small λ)




Breakeven for RCU best case
(f vs. r, large λ)




Validation of analytic results

4-CPU 700MHz P-III system (NUMA-Q quad)
Read-only mini-benchmark
    –   For data structures that are almost never modified
         ●   Routing tables, HW/SW configuration, policies
Mixed workload mini-benchmark
    –   Vary fraction of accesses that are updates
    –   See how things change as read-intensity varies
    –   Expect breakeven point for RCU and locking



Benchmark results (read-only)




Benchmark results for mixed workloads




Real-world performance and complexity

SysV IPC
    –   >10x on microbenchmark (8 CPUs)
    –   5% for database benchmark (2 CPUs)
    –   151 net lines added to the kernel
Directory-Entry Cache
    –   +20% in multiuser benchmark (16 CPUs)
    –   +12% on SPECweb99 (8 CPUs)
    –   -10% time required to build kernel (16 CPUs)
    –   126 net lines added to the kernel

Real-world performance and complexity

Task List
    –   +10% in multiuser benchmark (16 CPUs)
    –   6 net lines added to the kernel
         ●   13 added
         ●   7 deleted




Summary and Conclusions (1)

RCU can provide order-of-magnitude speedups for
 read-mostly data structures
    –   RCU optimal when less than 10% of accesses are
        updates over wide range of CPUs
    –   RCU projected to remain useful in future CPU
        architectures
In Linux 2.6 kernel, RCU provided excellent
  performance with little added complexity




How does RCU address overheads?

Lock Contention
    –   Readers need not acquire locks: no contention!!!
    –   Writers can still suffer lock contention
         ●   But only with each other, and writers are infrequent
         ●   Very little contention!!!
Memory Latency
    –   Readers do not perform memory writes
    –   No need to communicate data among CPUs for cache
        consistency
         ●   Memory latency greatly reduced


How does RCU address overheads?

Pipeline-Stall Overhead
    –   On most CPUs, readers do not stall pipeline due to
        update ordering or atomic operations
Instruction Overhead
    –   No atomic instructions required for readers
    –   Readers only need to execute fast instructions




Summary and Conclusions (2)

RCU best when designed in from the start
    –   RCU added late to 2.5 kernel, limited changes feasible
        after Halloween feature freeze
    –   Now doing more sweeping changes
Use of design patterns key to RCU
    –   RCU consistency semantics require transformation of
        some algorithms
    –   Transformational design patterns can be used




Future Work

Formal model of RCU's semantics
Formal model of consistency semantics for
  algorithms
Tools to automatically transform algorithms into a
 form consistent with RCU's semantics
Tools to automatically generate RCU from code
 that uses locking
Apply RCU to other computational environments


Use the right tool for the job!!!




BACKUP




      “Debugging is twice as hard as writing the code in the
      first place. Therefore, if you write the code as
      cleverly as possible, you are, by definition, not smart
      enough to debug it!”
               –   Brian Kernighan




RCU Thesis Publications

McKenney et al. “Making RCU Safe for Deep Sub-Millisecond Response Realtime
  Applications”, USENIX/UseLinux, 6/2004.
McKenney, “Locking performance on different CPUs”, linux.conf.au, 1/2004.
McKenney et al. “Scaling dcache with RCU”, Linux Journal, 1/2004.
McKenney, “Using RCU in the Linux 2.6 kernel”, Linux Journal, 10/2003.
Arcangeli et al. “Using read-copy update techniques for System V IPC in the
  Linux 2.5 kernel”, FREENIX, 6/2003.
Appavoo et al. “Enabling autonomic behavior in systems software with hot
  swapping”, IBM Systems Journal, 1/2003.
McKenney et al. “Read-copy update”, Ottawa Linux Symposium, 6/2002.
McKenney et al. “Read-copy update”, Ottawa Linux Symposium, 7/2001.
24 additional publications, 14 patents, 22 patents pending.

Related work

Maintaining multiple versions [Kung, Herlihy]
    –   Changes problem from inconsistency to staleness
Deferring destruction [Kung,Hennessy]
    –   Garbage collection for multiple versions reduces
        complexity
    –   Batched destruction amortizes overhead




Double-compare-&-swap


          DCAS (addr1, addr2, old1, old2, new1, new2)
          {
               <begin atomic>
               if ((*addr1 == old1) && (*addr2 == old2)) {
                    *addr1 = new1;
                    *addr2 = new2;
                    return(TRUE);
               } else {
                    return(FALSE);
               }
               <end atomic>
          }



Non-blocking synchronization

Basic idea
    –   Data structures have version numbers
    –   Updates committed using a single atomic instruction that fails if
        there are other concurrent updates
    –   Retry on failure

Simple implementations require a double-compare-&-swap
  (DCAS) instruction

No locks/deadlock + synchronization-free readers!

… in theory

In practice …

Correctness of DCAS in failure case requires
  type-stable memory management
    –   How is memory of old elements ever reclaimed?

DCAS instruction not available in most hardware
    –   Software implementations complex and costly
    –   Require readers to use a memory barrier!
    –   Performance worse than locking




So what now?

How can we remove synchronization instructions
 from the common path in read-mostly scenarios?
The good ideas
    –   Asymmetry between readers and writers
    –   Data partitioning and per-CPU locking
    –   Hiding complex updates behind atomic commit points




The problems with NBS in practice

Read-side memory barriers
    –   Why not allow readers to see old versions?
    –   Requires tolerance for small window of inconsistency

Type-stable memory management
    –   Why not reclaim memory safely by garbage collection?
    –   … but can this be done efficiently?

Dependence on obscure hardware
    –   Why not use locking on the write side?
    –   … also fixes write-side performance problems of NBS
Instruction costs
          ( Measurements taken on a 4-CPU 700MHz i386 P-III )

      Operation                             Nanoseconds
      Instruction                                   0.7
      Clock Cycle                                   1.4
      L2 Cache Hit                                 12.9
      Atomic Increment                             58.2
      Cmpxchg Atomic Increment                    107.3
      Atomic Incr. Cache Transfer                 113.2
      Main Memory                                 162.4
      CPU-Local Lock                              163.7
      Cmpxchg Blind Cache Transfer                170.4
      Cmpxchg Cache Transfer and Invalidate       360.9


Actual scalability

Scalability of locking
strategies using read-
only workloads in a
hash-table benchmark

Measurements taken
on a 1.45 GHz POWER
machine




Design patterns for RCU

Design patterns capture the static and dynamic structures
of solutions that occur repeatedly when producing
applications in a particular context.
Because they address fundamental challenges in software
system development, design patterns are an important
technique for improving the quality of software.
Key challenges addressed by design patterns include
communication of architectural knowledge among
developers, accommodating a new design paradigm or
architectural style, and avoiding development traps and
pitfalls that are usually learned only by (painful)
experience.
       - Coplien and Schmidt, 1995


Two Types of Design Patterns For RCU

For situations well-suited to RCU:
    –   Patterns that describe direct use of RCU
For algorithms that do not tolerate RCU's stale-
  and inconsistent-data properties:
    –   Patterns that describe transformations of algorithms
        into forms that can tolerate stale and/or inconsistent
        data




Patterns for Direct RCU Use

Reader/Writer-Lock/RCU Analogy
    –   Routing tables, Linux tasklist lock patch, ...
RCU Readers With WFS Writers
    –   K42 hash tables
RCU Existence Locks
    –   Ensure data structure persists as needed
    –   Linux SysV IPC, dcache, IP route cache, ...
Pure RCU
    –   Dynamic interrupt handlers...
    –   Linux NMI handlers...

Reader/Writer-Lock/RCU Analogy

read_lock()                       rcu_read_lock()
read_unlock()                     rcu_read_unlock()
write_lock()                      spin_lock()
write_unlock()                    spin_unlock()
list_add()                        list_add_rcu()
list_del()                        list_del_rcu()
free(p)                           call_rcu(free, p)
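Applied to a linked list, the substitution reads like the following kernel-style sketch (Linux list/RCU primitives; shown for illustration, not meant to compile outside the kernel):

```c
/* Reader: read_lock()/read_unlock() become rcu_read_lock()/
 * rcu_read_unlock(), and traversal uses the _rcu list walker. */
rcu_read_lock();
list_for_each_entry_rcu(p, &head, list)
        do_something(p);           /* may observe an old version */
rcu_read_unlock();

/* Writer: write_lock() becomes an ordinary spinlock serializing
 * only writers, and free(p) becomes call_rcu(), deferring the
 * free until all pre-existing readers have finished. */
spin_lock(&list_lock);
list_del_rcu(&p->list);
spin_unlock(&list_lock);
call_rcu(&p->rcu, free_entry);
```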


Patterns for Direct RCU Use

Reader/Writer-Lock/RCU Analogy (5)
RCU Readers With WFS Writers (1)
RCU Existence Locks (7)
Pure RCU (4)




Stale and Inconsistent Data

RCU allows concurrent readers and writers
    –   RCU allows readers to access old versions
         ●   Newly arriving readers will get most recent version
         ●   Existing readers will get old version
    –   RCU allows multiple simultaneous versions
         ●   A given reader can access different versions while traversing
             an RCU-protected data structure
         ●   Concurrent readers can be accessing different versions
Some algorithms tolerate this consistency model,
 but many do not

RCU Transformational Patterns

Substitute Copy for Original *
Impose Level of Indirection
Mark Obsolete Objects
Ordered Update With Ordered Read
Global Version Number
Stall Updates




Substitute Copy For Original

In its pure form, RCU relies on atomic updates of a
  single value
    –   Most CPUs support this
If a data structure requires multiple updates that
  must appear atomic to readers
    –   Must hide updates behind a single atomic operation in
        order to apply RCU
To provide atomicity:
    –   Make a copy, update the copy, then substitute the
        copy for the original
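A minimal userspace sketch of the copy/update/substitute sequence, with a C11 atomic pointer store as the single commit point (names are illustrative; safe reclamation of the old copy, which is RCU's job, is deliberately elided):

```c
/* Sketch of "substitute copy for original": copy the structure,
 * update the copy in private, then publish it with one atomic
 * pointer store. Readers dereferencing `current` see either the
 * old version or the new one, never a half-updated mix. */
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct config {
    int a, b;                 /* fields that must change together */
};

static _Atomic(struct config *) current;

struct config *config_read(void)
{
    return atomic_load(&current);
}

void config_update(int a, int b)
{
    struct config *old = atomic_load(&current);
    struct config *new = malloc(sizeof(*new));
    if (!new)
        abort();
    if (old)
        memcpy(new, old, sizeof(*new));   /* 1. copy              */
    new->a = a;                           /* 2. update the copy   */
    new->b = b;
    atomic_store(&current, new);          /* 3. substitute        */
    /* `old` must not be freed until no reader can still hold it */
}
```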

Substitute Copy Animation


[Diagram: ipc_ids points to an 8-slot array (0-7); slots 0, 4, and 6 reference Sem0, Sem4, and Sem6]
Substitute Copy Animation


[Diagram: a larger copy of the array (slots 0-8, ...) has been created and updated to add Sem8, while ipc_ids still directs readers to the original 8-slot array]
Substitute Copy Animation


[Diagram: ipc_ids now points atomically to the enlarged array; the old array is unreachable by new readers and can be reclaimed once pre-existing readers finish]
RCU Transformational Patterns

Substitute Copy for Original (2)
Impose Level of Indirection (~1)
Mark Obsolete Objects (2)
Ordered Update With Ordered Read (3)
Global Version Number (2)
Stall Updates (~1)



