Guilt-free Nonblocking Software Transactional Memory

Virendra J. Marathe
Department of Computer Science
University of Rochester

Single Processor Performance Problems
• Heat wall
• Diminishing ILP gains

http://www.tomshardware.com/2005/11/21/

The Concurrency Revolution is here!
• Multicore chips are already here
  – Intel and AMD dual- and quad-core chips are already on the market
  – 8-core versions are on their way
  – Sun’s Niagara and Rock processors

• Parallel programming is reaching the masses
  – Applications need to become parallel to exploit the performance
    potential of multicore chips




The Parallel Programming Challenges
•   Finding Parallelism
•   Expressing Parallelism
•   Concurrency Control (synchronization)
•   Program Verification & Debugging
•   Performance Debugging




Synchronization

• Traditional Approach: Locks
    • Gives mutual exclusion guarantees
    • Tension between locking granularity and scalability

       Coarse-grain locking          Fine-grain locking
        + Easy to maintain            + Good scaling
        – Poor scaling                – Hard to get right
                                         • Data races
                                         • Deadlocks

    • Locking is not composable

Transactional Memory (TM)

• TM is borrowed from database transactions
  – Memory is the database
  – Marked blocks of code are transactions
• Ensures
  – Atomicity: blocks of code will execute atomically
  – Isolation: blocks of code will not observe concurrent
    mutations of memory by other threads
  – Consistency: blocks of code will ensure program
    invariants are preserved
  – Durability: not supported (memory isn’t durable)

Transactional Memory to the Rescue
• Programmer does not have to worry about
  concurrency control
• Best of both coarse and fine grain locking
  – Simplicity of coarse grain locking
  – Scalability of fine grain locking
• Gives Composability
  class HashTable {
    ...
    Object remove(Key key) {
      atomic { ... } }
    void insert(Key key, Object obj) {
      atomic { ... } }
  }

  HashTable hash1, hash2;
  ...
  atomic {
    Object o1 = hash1.remove(key1);
    hash2.insert(key1, o1);
  }

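The composability point can be tried out in a small Python model. This is only a sketch under loud assumptions: Python has no `atomic` block, so a hypothetical `atomic()` context manager backed by one global re-entrant lock stands in for a TM runtime, and `HashTable`, `lookup`, and `move` are illustrative names that do not come from the slides. It shows two individually atomic operations being wrapped in an outer atomic block without the caller knowing how each one synchronizes.

```python
import threading
from contextlib import contextmanager

# Stand-in for a TM runtime (an assumption): one global re-entrant lock,
# chosen only so that nested atomic blocks compose in this model.
_global_lock = threading.RLock()

@contextmanager
def atomic():
    with _global_lock:
        yield

class HashTable:
    def __init__(self):
        self._items = {}

    def remove(self, key):
        with atomic():
            return self._items.pop(key, None)

    def insert(self, key, obj):
        with atomic():
            self._items[key] = obj

    def lookup(self, key):
        with atomic():
            return self._items.get(key)

def move(src, dst, key):
    # Outer atomic block around two inner ones: no other thread can
    # observe the key missing from both tables.
    with atomic():
        obj = src.remove(key)
        if obj is not None:
            dst.insert(key, obj)
        return obj
```

A real TM would run the outer block speculatively rather than under a global lock; the sketch captures only the programming model, not the scalability.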
A Typical Memory Transaction
• Speculatively executed blocks of code
• Reads and Writes are speculative
• Transactions try to commit updates at the end
   begin   read (A)   read (B) write (C)   write (A)   commit




• The runtime ensures that transactions are atomic and
  isolated
  – Requires conflict detection mechanisms
  – Done entirely in software (STM), hardware (HTM), or a hybrid of
    both (HyTM)
  – Transactions that would violate atomicity or isolation abort and
    re-execute

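The speculate, detect, abort, and re-execute cycle described above can be sketched in Python. This is a single-threaded model, not a real STM: `Memory`, `Txn`, `Conflict`, and `run_atomically` are hypothetical names, conflict detection is done with a per-location version counter (one possible mechanism among many), and the commit step would need to be made atomic in a real runtime.

```python
class Conflict(Exception):
    """Raised when a transaction cannot commit atomically and in isolation."""

class Memory:
    def __init__(self):
        self.values = {}     # location -> value
        self.versions = {}   # location -> number of committed writes

class Txn:
    def __init__(self, mem):
        self.mem, self.reads, self.writes = mem, {}, {}

    def read(self, loc):
        if loc in self.writes:                 # see our own speculative write
            return self.writes[loc]
        self.reads.setdefault(loc, self.mem.versions.get(loc, 0))
        return self.mem.values.get(loc, 0)

    def write(self, loc, value):
        self.writes[loc] = value               # speculative until commit

    def commit(self):
        # Conflict detection: every location we read must be unchanged.
        for loc, seen in self.reads.items():
            if self.mem.versions.get(loc, 0) != seen:
                raise Conflict(loc)
        for loc, value in self.writes.items():
            self.mem.values[loc] = value
            self.mem.versions[loc] = self.mem.versions.get(loc, 0) + 1

def run_atomically(mem, body, max_attempts=10):
    """Run body(txn) until it commits; return (result, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        txn = Txn(mem)
        try:
            result = body(txn)
            txn.commit()
            return result, attempt
        except Conflict:
            continue                           # abort: drop the logs, retry
    raise RuntimeError("too many aborts")
```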
Lock-based Transactions
• Use per-location locks for conflict detection
• Cause unnecessary waiting in some situations

Figure: T1 begins, writes A, reads B, and writes C. T2 begins and reads A,
which conflicts with T1's write, so T2 asks T1 to abort. T2 must then stall
until T1 aborts and releases its locks for A, B, and C; only then can T2
commit.

Nonblocking Transactions

• Nonblocking algorithms
  – Get rid of locks
  – No need to wait

Figure: T1 begins, writes A, reads B, and writes C. T2 begins and reads A,
which conflicts with T1's write. T2 aborts T1 and commits without stalling;
T1 aborts later.

  – Require non-trivial engineering to get right
  – But the payoff is worth it

Nonblocking Progress and STM
• Nonblocking Progress – arbitrary delays in
  some threads do not prevent others from
  making forward progress
• Nonblocking STMs
  – Transactions acquire revocable locks for written
    locations
  – Acquired locations are released at commit/abort time
  – Competing transactions need not wait for the current
    owners of locks
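A revocable lock can be modeled as an ownership word plus a status word in the owner, both updated with compare-and-swap. The Python sketch below is an illustration only: `AtomicCell` simulates a CAS-capable machine word with an internal lock, `Txn` and `acquire` are hypothetical names, and the sketch assumes a committed owner has already finished its cleanup (the delayed-cleanup case is what stealing addresses later in the talk).

```python
import threading

ACTIVE, COMMITTED, ABORTED = 0, 1, 2

class AtomicCell:
    """Models one machine word that supports compare-and-swap (CAS)."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def cas(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

class Txn:
    def __init__(self, name):
        self.name = name
        self.status = AtomicCell(ACTIVE)

def acquire(orec, me):
    """Try to own `orec` for `me`; never waits for the current owner."""
    while me.status.load() == ACTIVE:
        owner = orec.load()
        if owner is me:
            return True
        if owner is not None:
            # Revoke the lock: abort an active owner instead of waiting.
            owner.status.cas(ACTIVE, ABORTED)
        if orec.cas(owner, me):
            return True
    return False        # `me` was itself aborted by a competitor
```

The key design point is that the competitor changes the owner's status rather than spinning on the lock, so a preempted owner cannot hold others up.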

Nonblocking Progress and STM
• TM research began as support for nonblocking concurrent
  algorithms [Herlihy & Moss, ISCA’93]
• Early software TMs (STMs) were nonblocking, but slow
• Recent shift toward blocking STMs
   – Significant performance improvements

• General argument – nonblocking STMs are
  fundamentally slow

• We argue – one can improve the common case
  performance of nonblocking STMs

Remaining Talk
• Why is nonblocking progress important?

• Background on STM Implementations

• What makes nonblocking STMs slow?

• Making nonblocking STMs fast

• Experimental Results

• Conclusions

The Virtues of Nonblocking Progress
• Tolerance of arbitrary delays due to
   – Preemption,
   – Page faults,
   – Thread faults
• External scheduler support mitigates some
  problems, but
   – Not portable
   – Better to contain the problem within the STM
• Environments where blocking is unacceptable
   – TxLinux interrupt handler transactions

STM Speculative Writes
• Two types of implementations for speculative
  writes:
   – Redo Log: writes made to private buffer

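Figure: T1 begins, writes A, reads B, writes C, and commits. Its descriptor
holds a status word and a write set with the entries (A, new value) and
(C, new value). After commit, T1 copies the new values back to A and C in
memory; B is untouched.
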
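The redo-log scheme can be sketched in a few lines of Python. This is a minimal single-threaded illustration, with `RedoLogTxn` and its method names invented for the example; conflict detection and ownership are left out.

```python
class RedoLogTxn:
    def __init__(self, memory):
        self.memory = memory        # shared dict: location -> value
        self.write_set = {}         # private buffer: location -> new value
        self.status = "ACTIVE"

    def write(self, loc, value):
        self.write_set[loc] = value          # memory is not touched yet

    def read(self, loc):
        # A transaction must see its own earlier writes.
        if loc in self.write_set:
            return self.write_set[loc]
        return self.memory[loc]

    def commit(self):
        self.status = "COMMITTED"
        for loc, value in self.write_set.items():   # copyback of new values
            self.memory[loc] = value

    def abort(self):
        self.status = "ABORTED"
        self.write_set.clear()               # nothing to undo in memory
```

Aborting is cheap here because memory was never modified; the cost is the extra write-set lookup on every read and the copyback after commit.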
STM Speculative Writes
• Two types of implementations for speculative
  writes:
   – Undo Log: writes are made directly to memory

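Figure: T1 begins, writes A, reads B, writes C, and aborts. Its descriptor
holds a status word and an undo log with the entries (A, old value) and
(C, old value). On abort, T1 restores the old values of A and C in memory;
B is untouched.
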
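The undo-log scheme is the mirror image: writes go straight to memory and the old values are kept for rollback. The sketch below is again a single-threaded Python illustration with invented names (`UndoLogTxn` and its methods), without conflict detection.

```python
class UndoLogTxn:
    def __init__(self, memory):
        self.memory = memory        # shared dict: location -> value
        self.undo_log = []          # (location, old value), in write order
        self.status = "ACTIVE"

    def write(self, loc, value):
        self.undo_log.append((loc, self.memory[loc]))  # save old value first
        self.memory[loc] = value                        # then update in place

    def read(self, loc):
        return self.memory[loc]     # our own writes are already in memory

    def commit(self):
        self.status = "COMMITTED"
        self.undo_log.clear()       # new values are already in place

    def abort(self):
        self.status = "ABORTED"
        # Restore newest-first so repeated writes to one location unwind
        # back to the value that was there before the transaction began.
        for loc, old in reversed(self.undo_log):
            self.memory[loc] = old
        self.undo_log.clear()
```

Commit is cheap and reads need no log lookup, but an abort has to restore memory before others can safely use the locations.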
STM Speculative Reads
• Reads are invisible
      – Logged in a private read set
      – Read set validated to ensure isolation
          • Several schemes (e.g. incremental, commit counter,
            timestamp, etc.)
Figure: T1 begins, writes A, reads B (logging the entry (B, curr-state) in
its private read set), writes C, and commits. At commit time T1 verifies
that B hasn’t changed, then commits.

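To make the cost of incremental validation concrete, the following Python sketch re-checks the whole read set on every new read. It is a simplified single-threaded model: `Memory`, `Reader`, the per-location version counters, and the `checks` counter are all assumptions made for illustration.

```python
class Memory:
    def __init__(self, values):
        self.values = dict(values)
        self.versions = {loc: 0 for loc in values}

    def commit_write(self, loc, value):
        """Models another transaction committing a write to `loc`."""
        self.values[loc] = value
        self.versions[loc] += 1

class Reader:
    def __init__(self, mem):
        self.mem = mem
        self.read_set = {}          # private log: location -> version seen
        self.checks = 0             # validation work done so far

    def validate(self):
        for loc, seen in self.read_set.items():
            self.checks += 1
            if self.mem.versions[loc] != seen:
                return False
        return True

    def read(self, loc):
        value = self.mem.values[loc]
        self.read_set.setdefault(loc, self.mem.versions[loc])
        # Incremental validation: re-check the whole read set on each read.
        if not self.validate():
            raise RuntimeError("conflict: abort and re-execute")
        return value
```

Reading n distinct locations costs 1 + 2 + ... + n checks in this model, which is why the scheme becomes expensive for large read sets.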
What makes Nonblocking STMs slow?
• Nonblocking STMs require infrastructure to
  avoid waiting during conflicts
  – Indirection (object-based STMs)
  – Copying and cloning
  – Helping
  – Stealing
• Incremental read set validation
  – Extremely costly
• These usually add overhead in the
  (contention-free) common case

What makes Nonblocking STMs slow?
• Indirection (in object-based STMs), and
• Copying and cloning

Figure: a DSTM transactional object (a Start pointer to a Locator with
Owner Txn, Old Data, and New Data fields) next to an RSTM transactional
object (labels: Start, Txn, Old Data, New Data, and per-transaction
Owner Txn / Old Data / New Data entries for Txn 1 and Txn 2).

What makes Nonblocking STMs slow?
• Helping: Help the conflicting transaction to
  finish

Figure: T1 begins, writes A, reads B, writes C, and commits. T2, T3, and T4
each begin and read A while T1 still owns it, and each one helps. The result
is too much contention.

What makes Nonblocking STMs slow?
• Stealing
  – Steal the right to access conflicting location
  – Take over the responsibility of cleanup

Figure: T1 begins, writes A, reads B, writes C, and aborts. T2 begins,
writes A, steals A, and commits. T3 begins, writes A, steals A, and commits.

Stealing [Harris & Fraser approach]
• Needs infrastructure to
  – Handle the case of multiple stealers
     • Reference counters
  – Retrieve correct logical values of stolen locations
     • Storing old and new values
     • Expensive memory management for preserving logical values
       in transaction read/write sets
     • Helping to restore logical values of stolen locations
  – Manage races among stealers
     • Extra atomic operations (2N for N locations)

• Stealing is still promising

What makes Blocking STMs fast?
• Significantly less overhead in the
  common case
  – Simple metadata structure
     • Just 1 word to indicate ownership
  – Streamlined fast path
  – Performance optimizations
     • Timestamp based validation

• We need to incorporate all these features
  in a nonblocking STM to make it
  competitive

Our Contributions
• A novel approach to stealing
• Keep the common case simple
  – Resort to the complicated case only when stealing
    happens
  – More streamlined common case execution path

• Incorporate recent optimizations (timestamp-based
  validation)
  – We are the first to do this in nonblocking STMs

STM Data Structures
• Word-based STM
   – Conflict detection at granularity of contiguous blocks of memory
   – Appropriate for unmanaged languages – C, C++

• A table of ownership records (orecs)
   – Each heap location hashes into a single orec
   – Each orec indicates if currently owned or free, and identifies the
     owner

• Transaction Descriptor
   – Read set
   – Write set (redo log) – a 2D list, each row corresponds to an
     acquired orec
   – Status – Active/Aborted/Committed
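
STM Data Structures

Figure: locX (value 10) in the shared heap hashes to orec o1 in the table of
ownership records (o1 to o5). o1 refers to transaction T1, whose descriptor
shows status COMMITTED, a write set containing locX:11, and a read set.
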
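The data structures above can be laid out roughly as follows. This Python sketch is illustrative only: the table size `NUM_ORECS`, the block size `BLOCK_BYTES`, the hash function, and the field names are assumptions, not details given in the slides.

```python
from dataclasses import dataclass, field

NUM_ORECS = 1024        # table size: an assumption for illustration
BLOCK_BYTES = 8         # conflict-detection granularity: also an assumption

def orec_index(address):
    """Hash a heap address to the index of its single orec."""
    return (address // BLOCK_BYTES) % NUM_ORECS

@dataclass
class Orec:
    owner: object = None        # owning Descriptor, or None when free

@dataclass
class Descriptor:
    status: str = "ACTIVE"                          # ACTIVE/ABORTED/COMMITTED
    read_set: list = field(default_factory=list)    # addresses read
    # Redo log as a "2D list": one row per acquired orec, each row holding
    # the address -> new value entries that hash to that orec.
    write_set: dict = field(default_factory=dict)

    def log_write(self, address, value):
        row = self.write_set.setdefault(orec_index(address), {})
        row[address] = value

orecs = [Orec() for _ in range(NUM_ORECS)]
```

Grouping the write set by orec means that everything guarded by one orec can be found in a single row, which is convenient when ownership of that orec changes hands.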
Common Case Execution
• Algorithm behaves like a blocking STM in the
  absence of contention
  – Log reads, writes of transaction
  – Acquire ownership of write set locations via their
    orecs
  – Ensure that reads are still consistent (timestamp-
    based validation)
  – Copyback updates after commit
  – Release orecs with a plain store instruction (details offline)
     • Ours is the first nonblocking STM with this feature

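The common-case steps listed above can be put together in one commit routine. The Python sketch below is a single-threaded model under several assumptions: `STM`, `Orec`, and `Txn` are hypothetical names, an orec word holds either an integer timestamp (free) or the owning transaction, acquisition would be a CAS in a C implementation, and a transaction simply aborts when it finds an orec owned by someone else (the stealing path is omitted).

```python
class Orec:
    def __init__(self):
        self.word = 0        # an int timestamp when free, else the owner Txn

class STM:
    def __init__(self, heap, num_orecs=8):
        self.heap = heap     # location -> value
        self.clock = 0       # global clock
        self.orecs = [Orec() for _ in range(num_orecs)]

    def orec_for(self, loc):
        return self.orecs[hash(loc) % len(self.orecs)]

class Txn:
    def __init__(self, stm):
        self.stm, self.begin_ts = stm, stm.clock
        self.reads, self.writes, self.status = [], {}, "ACTIVE"

    def read(self, loc):
        if loc in self.writes:
            return self.writes[loc]
        self.reads.append(loc)                     # log the read
        return self.stm.heap[loc]

    def write(self, loc, value):
        self.writes[loc] = value                   # log the write (redo log)

    def commit(self):
        stm, acquired = self.stm, {}               # acquired: orec -> old ts
        for loc in self.writes:                    # 1. acquire (a CAS in C)
            orec = stm.orec_for(loc)
            if orec in acquired:
                continue
            if not isinstance(orec.word, int):     # owned by someone else
                return self._abort(acquired)
            acquired[orec] = orec.word
            orec.word = self
        for loc in self.reads:                     # 2. validate reads
            orec = stm.orec_for(loc)
            ts = acquired.get(orec, orec.word)
            if not isinstance(ts, int) or ts > self.begin_ts:
                return self._abort(acquired)
        self.status = "COMMITTED"                  # 3. commit point
        stm.clock += 1
        for loc, value in self.writes.items():     # 4. copyback
            stm.heap[loc] = value
        for orec in acquired:                      # 5. release: plain store
            orec.word = stm.clock
        return True

    def _abort(self, acquired):
        self.status = "ABORTED"
        for orec, old_ts in acquired.items():
            orec.word = old_ts
        return False
```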
Timestamps and Validation
• A significant optimization to read set
  validation (e.g. TL2)

Figure: a global clock advances from 10 to 11. Each orec (o1 to o5) stores a
timestamp (for example TS: 1, TS: 4, TS: 10). T1 is ACTIVE with Begin_TS: 10
and runs begin, write (A), read (B), write (C), commit, checking TS(loc) on
its accesses to the shared heap locations A, B, and C.

Timestamps and Validation
• Ensures that transactions access mutually
  consistent data
• Validation per memory access takes
  constant time
• Assumes that conflicts will be rare


• Results in major performance difference
  – Prior nonblocking STMs required incremental
    validation

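The constant-time check can be illustrated with a small Python model in the spirit of TL2. It is a sketch, not the authors' code: `TimestampedMemory`, `Txn`, `Conflict`, and the `checks` counter are invented for the example, and timestamps are kept per location rather than per orec to keep it short.

```python
class Conflict(Exception):
    pass

class TimestampedMemory:
    def __init__(self, values):
        self.clock = 0                             # global clock
        self.values = dict(values)
        self.ts = {loc: 0 for loc in values}       # per-location timestamp

    def commit_write(self, loc, value):
        """Models a committing writer: bump the clock, stamp the location."""
        self.clock += 1
        self.values[loc] = value
        self.ts[loc] = self.clock

class Txn:
    def __init__(self, mem):
        self.mem = mem
        self.begin_ts = mem.clock                  # snapshot time
        self.checks = 0

    def read(self, loc):
        self.checks += 1                           # one check per access
        if self.mem.ts[loc] > self.begin_ts:
            raise Conflict(loc)                    # written after we began
        return self.mem.values[loc]
```

Each read compares a single timestamp against the transaction's begin time, so validation work grows linearly with the number of reads instead of quadratically.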
Adding Timestamps
• Recall: orec contains a pointer to the owner
• Superimpose a timestamp on this pointer
• A writer releases orec by storing back the
  current global time




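One way to superimpose a timestamp on the owner pointer is to tag the low bit of the orec word. The encoding below is an assumption made for illustration (the slides do not give the bit layout): a set low bit means the word holds a timestamp and the orec is free, and a clear low bit means the word is the address of an aligned owner descriptor. All function names are hypothetical.

```python
def pack_timestamp(ts):
    return (ts << 1) | 1          # low bit 1: orec is free, holds a timestamp

def pack_owner(descriptor_address):
    assert descriptor_address % 2 == 0, "descriptor must be aligned"
    return descriptor_address     # low bit 0: orec is owned

def is_owned(word):
    return (word & 1) == 0

def timestamp_of(word):
    assert not is_owned(word)
    return word >> 1

def release(orecs, index, global_clock):
    # The writer releases the orec by storing back the current global time.
    orecs[index] = pack_timestamp(global_clock)
```

For example, an orec initialised with `pack_timestamp(0)` is free; after `pack_owner(0x7F00)` it is owned; after `release(orecs, 0, 11)` it is free again and `timestamp_of` returns 11.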
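Common Case Example

Figure (animated in the original slides): locX in the shared heap hashes to
orec o1. An orec word holds either a version number (ver#) or an owner ID
plus flags (S, C). T1 acquires o1 with locX:11 in its write set. While T1 is
ACTIVE, locX's logical value is the 10 stored in the heap; once T1 is
COMMITTED, the logical value is the 11 in its write set. T1 then copies 11
back to locX and releases o1 with a store.
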
Uncommon Case Stealing
• Two flags in the orec for the stealing process
  – stolen_orec: for orec’s stolen/unstolen state
  – copier_exists: indicates if there exists an
    owner in cleanup phase




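Stealing Example

Figure (animated in the original slides): T1 (OWNER, COMMITTED, write set
locX:11) is delayed with its copyback of locX in progress, so the heap still
holds 10. T2 (STEALER 1) steals o1, which sets the S and C flags in the orec;
its write set shows locX:11 and then locX:12, and it goes from ACTIVE to
COMMITTED. T3 (STEALER 2, ACTIVE) steals in turn, with locX:12 in its write
set. Remaining labels: "Clear C" next to the owner, "Redo Copyback" at the
heap, and heap values 10, 11, and 12 for locX.
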
Stealing Complexity
• The stealing mechanism is quite complex
  – Several corner case race conditions need to be
    handled (happy to talk offline)
  – Invariant: at most 1 transaction does a copyback
    for an orec at any given time
     • Simplifies our design significantly
  – Overhead of accessing stolen locations is quite
    high, requiring a lookup in the last stealer’s write
    set
  – However, we can throttle stealing and make it an
    uncommon case

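The lookup cost for stolen locations can be seen in a deliberately simplified Python model. This is not the authors' algorithm: it is single-threaded, ignores every race and the copyback itself, and keeps values inherited from a victim in a separate `inherited` dictionary (an assumption made for clarity) instead of one merged write set. The flag names `stolen_orec` and `copier_exists` come from the slides; `Txn`, `Orec`, `steal`, and `logical_value` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Txn:
    status: str = "ACTIVE"
    write_set: dict = field(default_factory=dict)   # own speculative writes
    inherited: dict = field(default_factory=dict)   # committed values taken
                                                    # over from a victim

@dataclass
class Orec:
    owner: Optional[Txn] = None   # current owner or last stealer
    stolen_orec: bool = False     # orec is in the stolen state
    copier_exists: bool = False   # an owner may still be in cleanup

def steal(orec, stealer):
    """Take over an orec whose committed owner is delayed in copyback."""
    victim = orec.owner
    assert victim is not None and victim.status == "COMMITTED"
    # The stealer becomes responsible for the victim's committed values.
    stealer.inherited.update(victim.inherited)
    stealer.inherited.update(victim.write_set)
    orec.owner, orec.stolen_orec, orec.copier_exists = stealer, True, True

def logical_value(heap, orec, loc):
    """Value a reader should observe for `loc` in this model."""
    owner = orec.owner
    if owner is None:
        return heap[loc]                  # common case: memory is current
    if owner.status == "COMMITTED" and loc in owner.write_set:
        return owner.write_set[loc]       # committed, not yet copied back
    if loc in owner.inherited:
        return owner.inherited[loc]       # lookup in the last stealer's log
    return heap[loc]
```

In the unstolen, unowned case a read goes straight to memory; only after a steal does a reader pay for the extra lookups, which is why throttling stealing keeps the overhead out of the common case.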
Undo Log Variant
• We have developed the first nonblocking
  undo log STM through simple modifications
  to a redo log variant
  – Stealing of orecs happens in the redo log STM
    when a committed owner is delayed
  – In undo log STMs stealing largely happens when
    an aborted owner is delayed
     • Logical values of locations are in aborted owner’s undo
       log




Experimental Platform
• All STMs were implemented in C
• Throughput tests were conducted on microbenchmarks
   – Scalable workloads: hash table, binary search tree
   – Torture tests (no scaling): counter, array of counters
• Tests were conducted on a 16-processor Sun Fire machine
• We compared the following STMs
   – TL2
   – TL2 with schedctl calls to avoid preemption pathologies
   – Harris and Fraser’s word-based nonblocking STM
   – Our Base blocking and nonblocking variants (without the
     store-based release and other optimizations)
   – 3 variants of our Optimized STM (eager redo log, lazy redo
     log, undo log)

Binary Search Tree (32K nodes)

Figure: throughput in Txns/sec (axis 0 to 4,500,000) against thread count
(1 to 64) for our Optimized STMs (Redo Log, Undo Log), TL2 Schedctl, TL2,
HF-STM, and Base NB.

Major performance gap closed

Hash Table (64 buckets, 256 keys)

Figure: throughput in Txns/sec (axis 0 to 8,000,000) against thread count
(1 to 64) for our Optimized STMs (Redo Log, Undo Log), TL2 Schedctl, TL2,
HF-STM, and Base NB.

Array of 16 Counters

Figure: throughput in Txns/sec (axis 0 to 300,000) against thread count
(1 to 64) for Redo Log, Undo Log, TL2 Schedctl, TL2, HF-STM, and Base NB.

Conclusion
• We presented several variants of a new STM that
  – Effectively decouples the common case from
    nonblocking infrastructure
  – Enables a more streamlined fast path (comparable to
    state-of-the-art blocking STMs)
  – Enables integration of key optimizations such as
     • Timestamp-based transaction validation
• We have shown that common case performance
  of nonblocking STMs can be made competitive
  with state-of-the-art blocking STMs

My Work during Ph.D.
• Nonblocking STMs
   –   Comparison of nonblocking STMs [LCR’04, URCS TR 839]
   –   Adaptive STM [DISC’05]
   –   Rochester STM [Transact’06, DISC’06, Transact’07]
   –   Word-based STMs [PODC’05, PPoPP’07, PPoPP’08, URCS TR 932]
• Enabling Cooperation among Transactions
   – Transaction Synchronizers [SCOOL’05]
• Hardware Acceleration of STMs
   – RTM [Transact’06, ISCA’07]
• Programming Model aspects of software transactions
   – Privatization [PODC’07, URCS TR 915, submitted]
   – Bag-of-tasks programming model with Transactions [PPoPP’07]
• Interaction with non-transactional code
   – Transaction Safe Nonblocking Algorithms [DISC’07, URCS TR 924]
• Composite Abortable Locks [IPDPS’06]
Future Goals and Directions
• Short Term
  – Investigate Programming Models centered around TM
     • Language integration and semantics of software transactions
       (a hot research topic)
     • Interaction of TM with data, dataflow parallelism
     • Interaction with traditional lock-based code
  – Investigate workloads to understand usability of TM
  – More aspects of STM runtimes
     • Concurrent nesting, data locality, runtime optimizations, etc.
• Long Term Goal
  – Make parallel programming much more accessible to
    the masses

Thank You!

 Questions?




Array of 16 Counters – Stealing Rate

Figure: stealing rate in % of transactions (axis 0 to 40) against thread
count (1 to 64) for Redo Log Eager, Redo Log Lazy, and Undo Log.

My Work during Ph.D.
• Nonblocking STMs
  – Comparison of nonblocking STMs [LCR’04, URCS TR 839]
     • Identified several design tradeoffs
  – Adaptive STM [DISC’05]
     • Adaptation in levels of indirection and ownership acquisition
       technique
  – Rochester STM [Transact’06, DISC’06, Transact’07]
     • Further reduction in levels of indirection
  – Word-based STMs [PODC’05, PPoPP’07, PPoPP’08, URCS TR 932]
     • Really guilt-free nonblocking STMs

My Work during Ph.D.
• Enabling Cooperation among
  Transactions
  – Transaction Synchronizers [SCOOL’05, ongoing]

Figure: transactions T1 and T2 linked by a communication channel
(Comm. Channel), shown next to T_1_2.

My Work during Ph.D.
• Programming Model aspects of STMs
  – Privatization [PODC’07, URCS TR 915, submitted]
     • Races among transactional and non-transactional memory
       accesses
  – Bag-of-tasks programming model with Transactions
    [PPoPP’07, ongoing]
     • Execute and control large numbers of concurrent
       computations
     • Non-trivial data dependencies among computations




My Work during Ph.D.
• Other work
  – Hardware Acceleration of STMs
    • RTM [Transact’06, ISCA’07]
  – Interaction with non-transactional code
    • Transaction Safe Nonblocking Algorithms [DISC’07, URCS TR
      924]

  – Composite Abortable Locks [IPDPS’06]



STM Implementations
• Transactions execute speculatively
• Reads and writes use STM metadata
• Speculative writes typically acquire ownership of
  locations (using atomic ops. e.g. CAS)
• Reads are typically logged in a private read set
  for validation at commit time
• Post-commit/abort cleanup (this forces waiting in blocking STMs)
   – Make speculative updates non-speculative, or roll back
     speculative updates
   – Release ownership of locations

Stealing is still promising

• Most overheads can be mitigated from the
  common (contention free) case execution
  path

• Our approach is a new way of stealing
  – Simpler,
  – Makes common case execution more streamlined,
  – Reduces atomics by half
