					McRT-STM:
A High Performance Software
Transactional Memory System
For a Multi-core Runtime


                         Bratin Saha, Intel
              Ali-Reza Adl-Tabatabai, Intel
                        Rick Hudson, Intel
                  Chi Cao Minh, Stanford
             Benjamin Hertzberg, Stanford
Talk Outline


#include “TransactionsAreBetterThanLocks.h”
McRT-STM Overview
McRT-STM Mechanisms
    – Flexible conflict detection
    – Ensuring atomicity
    – Example
McRT-STM API
McRT-STM Results
Conclusions




2    03/31/06   McRT-STM
McRT-STM: Overview


Well defined API
Integrates with language & compiler
    – Java compiler & runtime integration (PLDI ’06)
    – Transaction aware malloc/free (ISMM ’06)

Multiple conflict granularities
    – cache line, object, element, …
Nested transactions with partial rollback
Flexible contention management & diagnostics
Commit and abort handlers




 McRT-STM: Example




      …                                        …
      …                                        …
      atomic {                                 stmStart();
        B = A + 5;                               temp = stmRd(A);
      }                                          stmWr(B, temp + 5);
      …                                        stmCommit();
                                               …

STM read & write barriers before accessing memory inside transactions
STM tracks accesses & detects data conflicts




Tracking memory accesses: Transaction Record (TxR)
Tracks data accessed in transaction
Flexible mapping of data to TxR
    – Object-based, cache line based, …
    – Determines conflict detection granularity

Pointer-sized record
2 states
    – Shared: Read-only access by multiple readers
      • Value is odd (low bit is 1)
    – Exclusive: write-only access by single owner
      • Value is aligned pointer to owning transaction’s descriptor
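
The two states above can be told apart by the low bit of the pointer-sized word. The following is a minimal sketch of that encoding, not the McRT-STM source: the `TxnDescriptor` layout and every helper name are hypothetical, and stepping the odd value by 2 to get the next version is an assumption (it matches the odd versions 3, 5, 7 used in the later example slide).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor type; the slides do not show its layout. */
typedef struct TxnDescriptor { int id; } TxnDescriptor;

/* A transaction record is a single pointer-sized word. */
typedef uintptr_t TxR;

/* Shared state: the word holds an odd value (low bit is 1). */
static inline bool txr_is_shared(TxR r) { return (r & 1u) != 0; }

/* Exclusive state: the word is an aligned pointer, so its low bit is 0. */
static inline bool txr_is_exclusive(TxR r) { return (r & 1u) == 0; }

/* Encode ownership by storing the owner's descriptor address. */
static inline TxR txr_make_exclusive(const TxnDescriptor *owner) {
    TxR r = (TxR)owner;
    assert((r & 1u) == 0);  /* descriptor must be at least 2-byte aligned */
    return r;
}

/* Recover the owning descriptor from an exclusive record. */
static inline TxnDescriptor *txr_owner(TxR r) {
    assert(txr_is_exclusive(r));
    return (TxnDescriptor *)r;
}

/* Assumption: odd values act as version numbers and advance by 2. */
static inline TxR txr_next_version(TxR r) {
    assert(txr_is_shared(r));
    return r + 2;
}
```

Because descriptors are aligned, one word can hold either a version or an owner pointer without a separate tag field.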




  Transaction Record: Example


  Every data item has an associated transaction record

  Managed (Java/C#): object granularity
                 Class Foo {
                   int x;
                   int y;
                 }
      – Object layout: hdr, x, y
      – Object layout with extra transaction record field: TxR, hdr, x, y

  Unmanaged (C/C++): cache line granularity
                 struct Foo {
                   int x;
                   int y;
                 }
      – Address based hash into table of TxRs (TxR1, TxR2, TxR3, …, TxRn)
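
The address-based mapping for unmanaged code can be sketched as follows. This is an illustration only: the 64-byte cache line, the 1024-entry table, the shift-and-mask hash, and all names are assumptions, since the slide does not give the actual hash function or table size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SHIFT 6       /* assumption: 64-byte cache lines */
#define TXR_TABLE_SIZE   1024u   /* assumption: power-of-two table size */

typedef uintptr_t TxR;

static TxR txr_table[TXR_TABLE_SIZE];

/* Start every record in the shared state (odd value). */
static inline void txr_table_init(void) {
    for (size_t i = 0; i < TXR_TABLE_SIZE; i++)
        txr_table[i] = 1;
}

/* Drop the offset within the cache line, then fold into the table. */
static inline size_t txr_index(const void *addr) {
    uintptr_t line = (uintptr_t)addr >> CACHE_LINE_SHIFT;
    return (size_t)(line & (TXR_TABLE_SIZE - 1u));
}

/* All addresses on one cache line share one record. */
static inline TxR *txr_for_address(const void *addr) {
    return &txr_table[txr_index(addr)];
}
```

With this mapping, two fields on the same cache line always conflict with each other, and distant addresses can alias to the same record when they hash to the same slot.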


Ensuring Atomicity: Options

    Mode                       Reads                              Writes
    Pessimistic concurrency    Read lock on TxR (inserting into
                               reader list has the same effect)
    Optimistic concurrency     Use versioning on TxR




Ensuring Atomicity: Options

    Mode                       Reads                              Writes
    Pessimistic concurrency    - Caching effects
                               - Lock operations
    Optimistic concurrency     + Caching effects
                               + Avoids lock operations

    Quantitative results in the paper
Ensuring Atomicity: Options

    Mode                       Reads       Writes
    Pessimistic concurrency                Write lock on TxR
    Optimistic concurrency                 Buffer writes & acquire locks at commit



Ensuring Atomicity: Options

    Mode                       Reads       Writes
    Pessimistic concurrency                + In place updates
                                           + Fast commits
                                           + Fast reads
    Optimistic concurrency                 - Slow commits
                                           - Reads have to search for latest value

    Quantitative results in the paper
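
The costs listed for optimistic (buffered) writes can be seen in a small sketch. It is illustrative only: the linear-search buffer, the fixed capacity, and all names are assumptions, and the lock acquisition that would happen at commit is left out.

```c
#include <assert.h>

#define MAX_BUF 32   /* assumption: small fixed-capacity buffer */

typedef struct { int *addr; int value; } BufferedWrite;
typedef struct { BufferedWrite buf[MAX_BUF]; int n; } WriteBuffer;

/* Buffered write: memory stays untouched until commit. */
static inline void buffered_write(WriteBuffer *w, int *addr, int value) {
    for (int i = 0; i < w->n; i++) {
        if (w->buf[i].addr == addr) { w->buf[i].value = value; return; }
    }
    assert(w->n < MAX_BUF);
    w->buf[w->n++] = (BufferedWrite){ addr, value };
}

/* Every read first searches the buffer for the latest value. */
static inline int buffered_read(const WriteBuffer *w, const int *addr) {
    for (int i = 0; i < w->n; i++) {
        if (w->buf[i].addr == addr) return w->buf[i].value;
    }
    return *addr;
}

/* Commit copies each buffered value to memory (locking omitted here). */
static inline void buffered_commit(WriteBuffer *w) {
    for (int i = 0; i < w->n; i++) *w->buf[i].addr = w->buf[i].value;
    w->n = 0;
}
```

The search in `buffered_read` and the copy loop in `buffered_commit` correspond to the two minus entries in the table; in-place updates avoid both, at the price of keeping an undo log for aborts.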
Ensuring Atomicity: McRT-STM Algorithm

    Mode                       Reads       Writes
    Pessimistic concurrency                √
    Optimistic concurrency     √

    Other STMs (Harris et al, PLDI ’06) use the same algorithm
         Other HTMs (LogTM, HPCA ’06) use similar algorithm
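
The combination chosen above (optimistic reads with versioning, pessimistic writes with in-place updates) can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the McRT-STM implementation: the function names follow the slides, but the explicit `Txn` and `TxR` parameters, the fixed-size logs, bumping versions by 2 on release, and returning `false` on a locked record (instead of waiting or consulting a contention manager) are all assumptions. Memory ordering and races on the data words themselves are not addressed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LOG 64   /* assumption: fixed-size logs */

typedef _Atomic uintptr_t TxR;  /* odd = version, even = owner pointer */
typedef struct { TxR *rec; uintptr_t version; } LogEntry;
typedef struct { int *addr; int old; } UndoEntry;

typedef struct Txn {
    LogEntry  reads[MAX_LOG];  int nreads;
    LogEntry  writes[MAX_LOG]; int nwrites;  /* version seen before locking */
    UndoEntry undo[MAX_LOG];   int nundo;
} Txn;

static inline void stmStart(Txn *t) { t->nreads = t->nwrites = t->nundo = 0; }

/* Optimistic read: log the version, take no lock. False means conflict. */
static inline bool stmRd(Txn *t, TxR *rec, const int *addr, int *out) {
    uintptr_t v = atomic_load(rec);
    if (v != (uintptr_t)t) {
        if ((v & 1u) == 0) return false;       /* locked by another writer */
        assert(t->nreads < MAX_LOG);
        t->reads[t->nreads++] = (LogEntry){ rec, v };
    }
    *out = *addr;
    return true;
}

/* Pessimistic write: lock the record, log the old value, update in place. */
static inline bool stmWr(Txn *t, TxR *rec, int *addr, int value) {
    uintptr_t v = atomic_load(rec);
    if (v != (uintptr_t)t) {
        if ((v & 1u) == 0) return false;       /* locked by another writer */
        if (!atomic_compare_exchange_strong(rec, &v, (uintptr_t)t)) return false;
        assert(t->nwrites < MAX_LOG);
        t->writes[t->nwrites++] = (LogEntry){ rec, v };
    }
    assert(t->nundo < MAX_LOG);
    t->undo[t->nundo++] = (UndoEntry){ addr, *addr };
    *addr = value;
    return true;
}

/* Abort: undo in reverse order, then release locks with a new version. */
static inline void stmAbort(Txn *t) {
    for (int i = t->nundo - 1; i >= 0; i--) *t->undo[i].addr = t->undo[i].old;
    for (int i = 0; i < t->nwrites; i++)
        atomic_store(t->writes[i].rec, t->writes[i].version + 2);
    stmStart(t);
}

/* A read is valid if its version is unchanged, or we locked it at that version. */
static inline bool stmValidate(const Txn *t) {
    for (int i = 0; i < t->nreads; i++) {
        uintptr_t now = atomic_load(t->reads[i].rec);
        if (now == t->reads[i].version) continue;
        bool ok = false;
        for (int j = 0; now == (uintptr_t)t && j < t->nwrites; j++)
            if (t->writes[j].rec == t->reads[i].rec)
                ok = (t->writes[j].version == t->reads[i].version);
        if (!ok) return false;
    }
    return true;
}

/* Commit: validate the read set, then publish by bumping write versions. */
static inline bool stmCommit(Txn *t) {
    if (!stmValidate(t)) { stmAbort(t); return false; }
    for (int i = 0; i < t->nwrites; i++)
        atomic_store(t->writes[i].rec, t->writes[i].version + 2);
    stmStart(t);
    return true;
}
```

A caller would typically wrap a transaction body in a retry loop: call `stmStart`, run the barriers, and on any `false` result call `stmAbort` (if the commit did not already do so) and start over.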
McRT-STM: Example

                           Class Foo {
                              int x;
                              int y;
                           };
                           Foo bar, foo;
   T1                                          T2
 atomic {                                    atomic {
   t = foo.x;                                  t1 = bar.x;
   bar.x = t;                                  t2 = bar.y;
   t = foo.y;                                }
   bar.y = t;
 }

     • T1 copies foo into bar
     • T2 reads bar, but should not see intermediate values
McRT-STM: Example




   T1                                          T2
 stmStart();                          stmStart();
   t = stmRd(foo.x);                    t1 = stmRd(bar.x);
   stmWr(bar.x,t);                      t2 = stmRd(bar.y);
   t = stmRd(foo.y);                  stmCommit();
   stmWr(bar.y,t);
 stmCommit();

     • T1 copies foo into bar
     • T2 reads bar, but should not see intermediate values
  McRT-STM: Example


               foo                bar
     hdr       3                  5 → T1 → 7
     x         9                  0 → 9
     y         7                  0 → 7

   T1                                          T2
 stmStart();                          stmStart();
   t = stmRd(foo.x);                    t1 = stmRd(bar.x);
   stmWr(bar.x,t);                      t2 = stmRd(bar.y);
   t = stmRd(foo.y);                  stmCommit();
   stmWr(bar.y,t);
 stmCommit();

 T1 reads:  <foo, 3> <foo, 3>         T2 reads: <bar, 5> <bar, 7>
 T1 writes: <bar, 5>                  T2 waits while T1 owns bar
 T1 undo:   <bar.x, 0> <bar.y, 0>     T2 aborts: its two reads saw different versions of bar
 T1 commits

     • T2 should read [0, 0] or should read [9, 7]

Malloc Integration: Object-based TxR in C/C++

McRT-malloc uses size segregated allocation blocks
Object-based TxR for C/C++ leverages McRT’s memory allocator
    – Allocation blocks aligned on 16K boundaries
    – Block metadata includes size
    – Allocation adds TxR, puts object into its proper sized block

Finding the TxR from an arbitrary ptr into an object
    1. Masking off lower 14 bits provides base of the block
    2. Get size, object base, and TxR

[Figure: two allocation blocks, for object sizes 8 and 16, each starting with "Size, …" metadata and holding TxRs alongside the objects; an arbitrary pointer into an object is mapped back to its TxR]
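
The two lookup steps can be sketched as follows. The 16K alignment and the 14-bit mask come from the slide; everything else is an assumption made for illustration: the `BlockHeader` layout, slots that start right after the header, a TxR stored immediately before each object, and all function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16384u   /* blocks aligned on 16K boundaries */
#define BLOCK_MASK (~(uintptr_t)(BLOCK_SIZE - 1u))

typedef uintptr_t TxR;

/* Hypothetical metadata; the slide only says it includes the size. */
typedef struct BlockHeader {
    size_t slot_size;   /* bytes per slot: one TxR plus the object */
} BlockHeader;

/* Step 1: masking off the lower 14 bits gives the base of the block. */
static inline BlockHeader *block_of(const void *p) {
    return (BlockHeader *)((uintptr_t)p & BLOCK_MASK);
}

/* Step 2: use the size to round the pointer down to its slot start. */
static inline TxR *txr_of(const void *p) {
    BlockHeader *b = block_of(p);
    uintptr_t first = (uintptr_t)b + sizeof(BlockHeader);
    assert((uintptr_t)p >= first && b->slot_size > 0);
    uintptr_t slot = ((uintptr_t)p - first) / b->slot_size;
    return (TxR *)(first + slot * b->slot_size);
}

/* Assumed layout: the object begins right after its TxR. */
static inline void *object_base(const void *p) {
    return (void *)(txr_of(p) + 1);
}
```

Because every object in a block has the same size, an interior pointer needs only one mask, one division, and one multiplication to reach its record, with no per-object lookup table.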
McRT-STM API: Novel Features


 Enumerating and manipulating transactional state
     – Necessary for a garbage collector
Registering commit/abort handlers
     – See our ISMM ’06 paper for example use
RISCfying STM operations
     – Enables compiler optimizations
Partial rollback for nested transactions
     – Enables composition
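
Partial rollback is listed without a mechanism. One common way to get it with in-place updates, shown here purely as an assumption and not as the McRT-STM design, is to remember the undo-log position when a nested transaction starts and to unwind only to that mark when it aborts. All names below are hypothetical.

```c
#include <assert.h>

#define MAX_UNDO 64   /* assumption: fixed-size undo log */

typedef struct { int *addr; int old; } UndoEntry;
typedef struct { UndoEntry log[MAX_UNDO]; int n; } UndoLog;

/* In-place write that records the overwritten value. */
static inline void logged_write(UndoLog *u, int *addr, int value) {
    assert(u->n < MAX_UNDO);
    u->log[u->n++] = (UndoEntry){ addr, *addr };
    *addr = value;
}

/* Entering a nested transaction: note where its undo entries begin. */
static inline int nested_begin(const UndoLog *u) { return u->n; }

/* Aborting only the nested transaction: undo back to its mark, newest first. */
static inline void nested_abort(UndoLog *u, int mark) {
    assert(mark >= 0 && mark <= u->n);
    while (u->n > mark) {
        u->n--;
        *u->log[u->n].addr = u->log[u->n].old;
    }
}
```

Calling `nested_abort` with the mark returned by `nested_begin` leaves the outer transaction's writes in place, while a mark of 0 rolls back everything.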


Can we move to a common TM API?




McRT-STM Results: Hashtable


[Figure: time (seconds, 0 to 4) versus number of processors (0 to 20) for coarse locks, fine locks, and STM; two panels: 64K operations with 80% updates and 64K operations with 20% updates]

Updates include additions & removals into data structure
STM replaces coarse-grain lock acquire/release with STM begin/commit
STM scales like fine-grained locks & beats coarse locks at 2+ processors
STM has overheads at 1 processor
McRT-STM Results: AVLTree


[Figure: time (seconds, 0 to 5) versus number of processors (0 to 20) for coarse locks, fine locks, and STM; two panels: 64K operations with 80% updates and 64K operations with 20% updates]


STM beats coarse locks for 20% updates due to optimistic read concurrency
Frequent tree rebalancing in 80% update case hurts STM performance at >4 processors due to increased conflicts



McRT-STM Results: BTree



[Figure: time (seconds, 0 to 0.5) versus number of processors (0 to 20) for coarse locks, fine locks, and STM; two panels: 64K operations with 80% updates and 64K operations with 20% updates]

STM scales better than locks, beating locks at 4+ processors
STM has overheads (1-2 processors)




McRT-STM Results: Overhead breakdown


[Figure: STM time breakdown, stacked bars from 0% to 100% for Binary tree, Hashtable, Linked list, and Btree; components: Application, TLS access, STM write, STM commit, STM validate, STM read]

Time breakdown on single processor
STM read & validation overheads dominate
→ Good optimization targets
McRT-STM Results: Sendmail

[Figure: mail/spam delivery with sendmail; time (seconds, 0 to 600) versus number of concurrent threads (0 to 10) for locks and STM]

     Converted mutex calls to transaction calls
     10% of time in critical sections
     STM performance similar to lock performance

Conclusions


McRT-STM is a high performance STM
Allows flexible conflict detection
Different mechanisms for ensuring atomicity
Well defined API for compiler and language integration
Efficient first class nested transactions
     – See the paper for details

Allows registering commit & abort handlers
Supports both managed (Java) and unmanaged languages (C)



