Dynamic Software Transactional Memory

Idan Igra

Topics in Reliable Distributed Computing (048961)
Technion, Nov 2008
 •    Motivation
 •    Software Transactional Memory
 •    Dynamic Software Transactional Memory
 •    Fraser’s STM
 •    Dynamic STM vs. Fraser’s STM
 •    A blocking STM implementation
 •    Another obstruction free STM implementation by Fraser
 •    DSTM Contention management

William N. Scherer III, Department of Computer Science, University of Rochester, Rochester, NY 14620, USA, scherer@cs.rochester.edu
Mark Moir, Sun Microsystems Laboratories, 1 Network Drive, Burlington, MA 01803, USA, mark.moir@sun.com
Victor Luchangco, Sun Microsystems Laboratories, 1 Network Drive, Burlington, MA 01803, USA, victor.luchangco@sun.com
Maurice Herlihy, Department of Computer Science, Brown University, Providence, RI 02912, USA, mph@cs.brown.edu
             Multicore history
• Parallel computing was used for HPCs and
  – PRAM & other shared memory models aren’t realistic.
  – BSP & LogP (message passing models) were used.
     • Only for HPC specialists.
     • Demanded complicated per-application system analysis.
• HW constraints force multicore architectures.
• Today’s parallel programming is based on locks.
  – Coarse-grained locking prevents parallelism; fine-grained
    locking is hard to use.
  – Code reuse demands exposing internal locks.
  – No conventional way to connect mutex and its data.
Nonblocking liveness properties
• Wait freedom: Every process that tries to
  do an operation will complete it in a finite
  number of steps.
• Lock freedom: If any process tries to do an
  operation, then some process will complete
  an operation in a finite number of steps.
• Obstruction freedom: A process that runs in
  isolation (without interference from other
  processes) and tries to do an operation will
  complete it in a finite number of steps.
   Atomic hardware primitives
• Load_Linked / Store_Conditional (LL/SC):
  LL(addr) returns the value pointed to by addr.
  The next call to SC(addr, val) writes val into
  addr only if addr was not written since the last LL call.
• Compare And Swap (CAS): The operation
  CAS(addr, e, v) atomically writes v into addr
  if the value at addr equals e, and reports whether it succeeded.
• MCAS: performs m CAS operations atomically
  (special case for m = 2: DCAS).
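
To make the CAS semantics concrete, here is a minimal sketch of a CAS retry loop. Python exposes no hardware CAS, so the `AtomicCell` class below (a hypothetical name, not from the talk) simulates the atomicity with a lock; only the calling pattern is the point.

```python
import threading

class AtomicCell:
    """One memory word with CAS. The lock only simulates hardware atomicity."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Write `new` only if the cell still holds `expected`; report success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def atomic_increment(cell):
    """Classic CAS loop: read, compute, try to install, retry on interference."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return old + 1

def worker(cell, n):
    for _ in range(n):
        atomic_increment(cell)

counter = AtomicCell(0)
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_value = counter.load()
```

No increment is lost even though four threads race, because a CAS that observes a stale value fails and is retried.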
       Helping methodology
• A methodology for non-blocking
• Any process that holds data which another
  process needs is helped by that other process.
• The help is usually recursive.
• In particular, widely used in Transactional
  Memory for software implementations of
  MCAS (known as k-RMW).
Software Transactional Memory
• A transaction first tries to acquire all the data it needs.
• If it succeeds – it computes the transaction and
  releases the data.
• If it fails – it releases everything and retries.
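
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the original algorithm: ownership is modelled with a non-blocking `Lock.acquire(blocking=False)` instead of LL/SC ownership records, there is no helping, and the names `Cell` and `run_static_transaction` are hypothetical.

```python
import threading

class Cell:
    """A shared memory cell; `owner` stands in for an ownership record."""
    def __init__(self, value):
        self.value = value
        self.owner = threading.Lock()

def run_static_transaction(cells, compute, max_retries=1000):
    """Acquire the whole (pre-declared) data set, compute, release.
    If any cell is already owned, release everything and retry."""
    for _ in range(max_retries):
        acquired = []
        try:
            for cell in cells:
                if not cell.owner.acquire(blocking=False):
                    break                      # conflict: give up this attempt
                acquired.append(cell)
            else:
                # Every cell is owned: compute new values and write them back.
                new_values = compute([c.value for c in cells])
                for cell, v in zip(cells, new_values):
                    cell.value = v
                return True
        finally:
            for cell in acquired:              # release on success and on failure
                cell.owner.release()
    return False

a, b = Cell(100), Cell(0)
ok = run_static_transaction([a, b], lambda v: [v[0] - 30, v[1] + 30])
```

Note that the data set `[a, b]` must be known before the transaction starts, which is exactly the restriction that dynamic STM removes.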
Software Transactional Memory
Why Software Transactional Memory?
• Unexpected delays decrease the performance of locking,
  on top of its inherent programming difficulties.
   – Memory allocation and deallocation synchronization conflicts.
• Hardware Transactional Memory lacks platform
  support and portability, and has delay anomalies.
• Methods like translating the code to k-RMW actions is
• Working on a copy of the object is not good for large
  data structures.
• A programmable and flexible non-blocking parallel
  programming method is needed.
Software Transactional Memory
Data set pre-acquiring
• Unintuitive programming.
• Reduces parallelism.
  – Common data structures should be acquired
• Dynamic data structures are impossible.
Software Transactional Memory
Hardware support
• LL/SC is not commonly supported by
• Operating system can support it.
  – Much slower.
  – Reduces parallelism (forces some scheduling).
  – A more useful primitive could be defined.
Software Transactional Memory
Wait freedom cost:
• Complicated acquiring code.
• Not flexible.
• Non-common primitives.
• Long locking time.
             Dynamic STM
• Also enables dynamic transactions, whose
  data set changes while they run.
• Satisfies obstruction freedom.
• A modular contention manager forces progress,
  implements priorities and adapts to the application.
              Dynamic STM
Implementation principles:
• A TM object points to a Locator, which contains an
  old version, a new version and the last transaction
  that opened the object for writing.
• The valid version is determined by that transaction’s status
  (active / aborted / committed).
• All objects are committed at once by changing
  the status.
• Obstruction freedom is obtained by aborting a
  conflicting transaction (subject to the contention
  manager’s agreement).
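
The Locator mechanism described above can be sketched as follows. This is a simplified illustration, not the DSTM library's code: CAS is simulated with locks, reads and validation are omitted, and the names `Transaction`, `Locator`, `TMObject` and the `may_abort` hook (standing in for the contention manager) are assumptions.

```python
import threading

ACTIVE, COMMITTED, ABORTED = "active", "committed", "aborted"

class Transaction:
    def __init__(self):
        self.status = ACTIVE
        self._lock = threading.Lock()      # simulates CAS on the status word

    def _cas_status(self, expected, new):
        with self._lock:
            if self.status == expected:
                self.status = new
                return True
            return False

    def commit(self):                      # one CAS commits every opened object
        return self._cas_status(ACTIVE, COMMITTED)

    def abort(self):
        return self._cas_status(ACTIVE, ABORTED)

class Locator:
    def __init__(self, writer, old, new):
        self.writer, self.old, self.new = writer, old, new

class TMObject:
    def __init__(self, value):
        done = Transaction()
        done.commit()
        self.locator = Locator(done, None, value)
        self._lock = threading.Lock()      # simulates CAS on the locator pointer

    def current(self):
        """New version if the last writer committed, otherwise the old one."""
        loc = self.locator
        return loc.new if loc.writer.status == COMMITTED else loc.old

    def open_write(self, tx, may_abort=lambda enemy: True):
        while True:
            loc = self.locator
            if loc.writer is tx:
                return loc.new             # already opened by this transaction
            if loc.writer.status == ACTIVE:
                if may_abort(loc.writer):  # ask the (assumed) contention manager
                    loc.writer.abort()
                continue
            old = loc.new if loc.writer.status == COMMITTED else loc.old
            fresh = Locator(tx, old, dict(old))   # private copy to modify
            with self._lock:               # CAS(locator, loc, fresh)
                if self.locator is loc:
                    self.locator = fresh
                    return fresh.new

account = TMObject({"balance": 100})
t1 = Transaction()
account.open_write(t1)["balance"] -= 30
seen_before_commit = account.current()["balance"]   # t1 is still active
t2 = Transaction()
account.open_write(t2)["balance"] += 5               # conflict: t1 gets aborted
t2_committed = t2.commit()
t1_committed = t1.commit()
```

In this run `t1`'s tentative write is never visible: the object's value is decided only by the status of the transaction named in the current Locator.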
            Dynamic STM
DSTM properties and results:
• Much more natural to write, and to convert
  sequential code into DSTM code.
• Releases can significantly increase
• Re-using simpler algorithms to build a bigger one
  is easier with DSTM.
• Disadvantage: no way to know that an
  object was opened for reading.
             Dynamic STM
• Obstruction freedom:
  – enables simplicity,
  – is good enough for some applications,
  – enables implementation of priorities,
  – enables separating correctness from progress,
  – and most important, removes the need for a
    helping mechanism.
• However, one can argue that it is not a real
  progress property.
                 Dynamic STM
• DSTM relates to STM as coarse-grained locking relates to fine-grained locking.
• But STM meets a real requirement and not a weakened one
  (obstruction freedom).
• Releases, as an integral part of the mechanism, reduce
  conflicts (compared to locks).

• Non-blocking, and obstruction freedom in particular, is better because
  delayed or failed processes won’t stop the whole system (a very
  strong point for DSTM).
• DSTM’s implementation might lose that gain on real
  parallel systems.
• Letting the contention manager do the work is exactly like
  assuming the scheduler will do it.
              Fraser’s STM
STM should satisfy:
• Small fixed storage overhead per object.
• Few shared memory operations.
• Contention time is short.
  – Reduces the time during which transactions can meet.
Nice to have:
• Supporting varying object sizes.
• Nesting transactions.
             Fraser’s STM
• Every object is represented as a pointer to an
  object handler, which consists of a version
  number and a pointer to the data block.
• Open for read returns the data block.
• Open for write returns a pointer to a
  shadow copy.
• Commit is done by acquiring all the
  opened objects, using MCAS and helping.
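
A rough sketch of the open/commit pattern described above, under loud assumptions: the multi-word CAS is simulated by one global lock, there is no helping, and the class names `Header` and `Tx` are hypothetical. It shows only how shadow copies stay private until a single atomic commit step.

```python
import copy
import threading

_mcas_lock = threading.Lock()   # simulates one atomic multi-word CAS (assumption)

class Header:
    """Object handler: a version number plus a pointer to the data block."""
    def __init__(self, data):
        self.version = 0
        self.data = data

class Tx:
    def __init__(self):
        self.reads = {}    # header -> version seen when opened for reading
        self.writes = {}   # header -> (version seen, private shadow copy)

    def open_read(self, h):
        if h in self.writes:
            return self.writes[h][1]
        self.reads.setdefault(h, h.version)
        return h.data                       # the shared data block itself

    def open_write(self, h):
        if h not in self.writes:
            seen = self.reads.get(h, h.version)
            self.writes[h] = (seen, copy.deepcopy(h.data))
        return self.writes[h][1]            # the shadow copy

    def commit(self):
        with _mcas_lock:
            seen = dict(self.reads)
            seen.update({h: v for h, (v, _) in self.writes.items()})
            if any(h.version != v for h, v in seen.items()):
                return False                # something changed: abort
            for h, (_, shadow) in self.writes.items():
                h.data = shadow             # install the shadow copy
                h.version += 1
            return True

head = Header([1, 2, 3])
t1, t2 = Tx(), Tx()
t1.open_write(head).append(4)
t2.open_write(head).append(5)
r1 = t1.commit()
r2 = t2.commit()    # fails: head's version moved while t2 was running
```

Neither transaction is visible to the other before commit, which is the lazy-acquire behaviour the later comparison with DSTM refers to.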
               Fraser’s STM
• Problem: Acquiring and releasing read-only
  objects blocks non-conflicting transactions.
  – Critical for data structures with a single starting point
    (the head of a linked list).
• Solution: do not acquire read-only objects.
  – Add a read-checking state in which the
    transaction checks all the objects it opened as read-only,
    so that other transactions don’t update them
    during this time.
               Fraser’s STM
• Deadlock prevention: T1 can abort T2 only if:
  – both are in the read-checking state,
  – T2 holds a location that T1 tries to read,
  – T1 < T2 according to a given total order
    between transactions.
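
The abort rule above can be written as a single predicate. This is a minimal sketch; the `Tx` record, its `tid` field (used as the assumed total order) and the status strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Tx:
    tid: int                                  # position in the assumed total order
    status: str = "active"
    held: set = field(default_factory=set)    # locations this transaction has acquired

def may_abort(t1, t2, location):
    """T1 may abort T2 only if all three conditions from the slide hold."""
    return (t1.status == "read-checking"
            and t2.status == "read-checking"
            and location in t2.held
            and t1.tid < t2.tid)

t1 = Tx(tid=1, status="read-checking")
t2 = Tx(tid=2, status="read-checking", held={"x"})
forward = may_abort(t1, t2, "x")    # allowed: T1 < T2
backward = may_abort(t2, t1, "x")   # not allowed: T1 holds nothing and T2 > T1
```

Because the order is total, two read-checking transactions can never each be entitled to abort the other, so a cycle of waiting transactions is always broken by its smallest member.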
          DSTM vs. FSTM
FSTM is much better:
• Lazy acquire exposes a transaction to
  others for a very short time, which reduces
  the number of conflicts.
• Levels of indirection decrease performance
  (mainly for read-only transactions).
• Obstruction freedom’s contention manager
  has a 5-10% overhead and hard for
            DSTM vs. FSTM
DSTM is much better:
• Eager acquire helps catch conflicts earlier.
  – Possible thanks to the weakness of obstruction freedom.
• Fewer CAS’s (N+1 for DSTM vs. 2N+2 for
• Implementation is simpler and more efficient.
• MCAS causes a lot of cache block thrashing.
           DSTM vs. FSTM
DSTM is better for workloads in which:
• Transactions open a lot of locations.
  – Mainly write accesses to the same location.
  – Transactions must be serialized (stack).
FSTM is better for workloads in which:
• Livelocks are common (RBTree).
• Transactions are small.
  – Small conflict probability (IntSetRelease).
          DSTM vs. FSTM
General remarks:
• Not validating repeatedly improves
• How can inconsistent (aborted)
  transactions be avoided?
      Contention Management
Recall – a DSTM contention manager should:
• ensure progress.
• eventually return from every call.
• eventually abort the conflicting transaction.
Management approaches are tested for:
• Various data sets.
• Visible/Invisible reads (optimistic/non-optimistic).
• Eliminating unnecessary aborts.
      Contention Management
• Aggressive – always aborts the enemy
  transaction. A good baseline for comparison.
• Polite – backs off before aborting. Sensitive to
  preemption, page faults…
• Randomized – flips a (balanced) coin to decide whether to abort or wait.
• Eruption – a transaction helps the transaction that blocks it
  by giving it its momentum (Momentum
  = successful open tries + blocked transactions
  – The reasoning is to let transactions which hold critical
    data finish.
       Contention Management
• Karma – the older transaction (in terms of opening tries)
  wins. Tries made in previous aborted runs are also counted.
• Kindergarten – at first, backoff is used before
  aborting. Later, aborting is done in turns.

• KillBlocked – a transaction aborts the transaction that blocks it
  if that transaction is also blocked (or after a fixed time).
• Timestamp – the older transaction wins. A failure detector
  is used.
• QueueOnBlock – blocked transactions
  are released according to a queue when
  the blocking transaction has finished (or after a fixed time).
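
A few of the policies above can be sketched as decision functions that return "abort" or "wait" when a transaction `me` finds `enemy` holding an object on its `attempt`-th try. These are simplified illustrations, not the published managers: the `Tx` fields, the `max_backoffs` bound and the exact Karma comparison are assumptions, and Timestamp's failure detector is left out.

```python
import random

class Tx:
    def __init__(self, start_time):
        self.start_time = start_time   # used by the Timestamp policy
        self.opens = 0                 # open tries, kept across aborted runs (Karma)

def aggressive(me, enemy, attempt):
    return "abort"                     # always abort the enemy

def polite(me, enemy, attempt, max_backoffs=8):
    # Back off a bounded number of times, then abort the enemy.
    return "wait" if attempt < max_backoffs else "abort"

def randomized(me, enemy, attempt, rng=random):
    return "abort" if rng.random() < 0.5 else "wait"   # balanced coin

def karma(me, enemy, attempt):
    # The transaction with more accumulated work wins; waiting adds to our side.
    return "abort" if me.opens + attempt > enemy.opens else "wait"

def timestamp(me, enemy, attempt):
    return "abort" if me.start_time < enemy.start_time else "wait"

old, young = Tx(start_time=1), Tx(start_time=2)
old.opens, young.opens = 10, 2
decisions = {
    "aggressive": aggressive(young, old, 0),
    "polite_first": polite(young, old, 0),
    "polite_last": polite(young, old, 8),
    "karma_young": karma(young, old, 0),
    "karma_young_after_waiting": karma(young, old, 9),
    "timestamp_old": timestamp(old, young, 0),
}
```

Every policy here eventually answers "abort" for a persistent caller except Timestamp without its failure detector, which is why the slide lists the detector as part of that manager.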
      Contention Management
• Most managers, except Timestamp, are good for
  IntSetRelease with invisible reads.
• Aggressive, Randomized, Eruption and Polite perform badly.
• QueueOnBlock and KillBlocked have good performance
  only for RBTree with invisible reads.
• Timestamp is good only for Counter.
• Kindergarten is excellent, except for IntSetRelease with
  visible reads and for RBTree.
• Karma is not good for IntSet and for LFUCache with
  visible reads.
      Contention Management
Visible reads vs. Invisible reads:
• In IntSet and Counter there is no difference, as
  all the accesses are for writing.
• In IntSetRelease visible reads are better (except
  for Kindergarten, which is bad for both).
  – Visible reads make it possible to avoid conflicts on
    short accesses.
• In LFUCache for all managers, and in RBTree for
  all but Karma, invisible reads are much better.
  – Most conflicts are between a reader which scans
    its path and a writer which updates the path to the root.
  Blocking STM implementation
Why not worry about blocking (mainly
  compared to obstruction freedom)?
• Long transactions must be aborted. Obstruction
  freedom is enforced only for a single transaction.
• Context switch is not a problem
  – Temporary.
  – Automatic OS adaptation.
  – Platform support (by priorities, etc.).
• Independent failure
  – Not common in multicore.
  – Sequential programs also fail due to a single failure.
 Blocking STM implementation
Non-blocking is bad because:
• Metadata and the object must be stored
  separately in order to satisfy non-blocking.
  – Doubling the cache misses.
• Assume N active transactions on N
  processors: a new transaction mustn’t be
  blocked, so the number of conflicts increases.
 Blocking STM implementation
• Every transaction keeps, in its private data, a
  descriptor per opened object (consisting of
  the version, a pointer and (maybe) a copy).
• Every object has a lock (with deadlock
  prevention) which is used when trying to
• Accesses wait for the object to be
  unlocked. Read accesses are optimistic.
• Priority mechanism.
 Blocking STM implementation
CPU time for various numbers of processors:
 Blocking STM implementation
CPU time for various contention instances:
 Blocking STM implementation
• Context switch IS a problem because of
  long delays.
• Failures are more common in parallel
  programs than in sequential ones.
• Is delay more interesting than throughput?
             Another STM
• As in DSTM, committing is done by
  changing a state, and the current version is
  determined by the owner transaction’s state.
• But like FSTM, before committing the
  transaction tries to acquire all of its owned
• A wait method is provided in order to wait for
  acquired data before retrying.
               Another STM
• An Ownership-record (orec) contains either the
  version number of one (or more) objects or a
  pointer to the owner transaction descriptor.
• Before committing, any transaction tries to
  acquire its owned data.
• If the data is already acquired, the transaction
  can abort the other transaction, wait for it to
  finish or wake it (if it sleeps).
•   Robert Ennals (Jan 2006). Software Transactional Memory Should Not Be Obstruction-Free.
    Technical Report Nr. IRC-TR-06-052. Intel Research Cambridge Tech Report.
•   K. Fraser. Practical Lock-Freedom. Technical Report UCAM-CL-TR-579, Cambridge University
    Computer Laboratory, February 2004.
•   Tim Harris, Keir Fraser. Language support for lightweight transactions. Proceedings of the
    18th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and
    applications, October 26-30, 2003, Anaheim, California, USA.
•   Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer III. Software
    Transactional Memory for Dynamic-Sized Data Structures. ACM Symposium on Principles of
    Distributed Computing (PODC): 92-101, 2003.
•   Maurice Herlihy , Victor Luchangco. Distributed computing and the multicore revolution. ACM
    SIGACT News, v.39 n.1, March 2008.
•   Virendra J. Marathe and William N. Scherer III and Michael L. Scott (Oct 2004). Design Tradeoffs
    in Modern Software Transactional Memory Systems. In: Proceedings of the 7th Workshop on
    Languages, Compilers, and Run-time Systems for Scalable Computers. Houston, TX.
•   N. Shavit and D. Touitou. Software transactional memory. Distributed Computing, Special
    Issue(10): 99-116, 1997.
•   William N. Scherer III and Michael L. Scott (Jul 2004). Contention Management in Dynamic
    Software Transactional Memory. In: Proceedings of the ACM PODC Workshop on Concurrency
    and Synchronization in Java Programs. St. John's, NL, Canada. In conjunction with PODC'04.
                                 More reading
Ennals’ blocking STM:
•   Robert Ennals. Efficient Software Transactional Memory. Intel Research Cambridge Technical Report: IRC-TR-
    05-051, 2005.
•   S. Fortune and J. Wyllie. Parallelism in Random Access Machines. In Proceedings of the 10th Annual
    Symposium on Theory of Computing, pages 114-118, 1978.
•   Phillip B. Gibbons , Yossi Matias , Vijaya Ramachandran. Can shared-memory model serve as a bridging
    model for parallel computation?. Proceedings of the ninth annual ACM symposium on Parallel algorithms and
    architectures, p.72-83, June 23-25, 1997, Newport, Rhode Island, United States.
•   P. B. Gibbons. A more practical PRAM model. Proceedings of the first annual ACM symposium on Parallel
    algorithms and architectures, p.158-168, June 18-21, 1989, Santa Fe, New Mexico, United States.
Older popular message-passing models:
•   David Culler , Richard Karp , David Patterson , Abhijit Sahay , Klaus Erik Schauser , Eunice Santos , Ramesh
    Subramonian , Thorsten von Eicken. LogP: towards a realistic model of parallel computation. ACM SIGPLAN
    Notices, v.28 n.7, p.1-12, July 1993.
•   Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, v.33 n.8, p.103-111,
    Aug. 1990.
Memory allocation in multi-core:
•   Andrei Gorine, Konstantin Knizhnik. Tackling memory allocation in multicore and multithreaded applications.
    MCObject LLC, May 29 2006. Available on the internet from
•   Voon-Yee Vee , Wen-Jing Hsu. A Scalable and Efficient Storage Allocator on Shared Memory
    Multiprocessors. Proceedings of the 1999 International Symposium on Parallel Architectures, Algorithms and
    Networks (ISPAN '99), p.230, June 23-25, 1999.
•   P.R. Wilson, M.S. Johnstone, M. Neely, and D. Boles. Dynamic storage allocation: A survey and critical
    review. In H.G. Baker, editor, Proceedings of International Workshop on Memory Management (IWMM'95),
    volume 986 of Lecture Notes in Computer Science, pages 1-116, Kinross, Scotland, Sept. 1995.