Dynamic Software Transactional Memory
Idan Igra
Topics in Reliable Distributed Computing (048961)
Technion, Nov 2008

Agenda
• Motivation
• Software Transactional Memory
• Dynamic Software Transactional Memory
• Fraser's STM
• Dynamic STM vs. Fraser's STM
• A blocking STM implementation
• Another obstruction-free STM implementation by Fraser
• DSTM contention management, based on the paper by:
  – William N. Scherer III, Department of Computer Science, University of Rochester, Rochester, NY 14620, USA
  – Mark Moir, Sun Microsystems Laboratories, 1 Network Drive, Burlington, MA 01803, USA
  – Victor Luchangco, Sun Microsystems Laboratories, 1 Network Drive, Burlington, MA 01803, USA
  – Maurice Herlihy, Department of Computer Science, Brown University, Providence, RI 02912, USA

Multicore history
• Parallel computing used to be the province of HPC and networking.
  – PRAM and other shared-memory models are not realistic.
  – BSP and LogP (message-passing models) were used instead.
    • Only by HPC specialists.
    • They demand a complicated system analysis per application.
• Hardware constraints now force multicore architectures.
• Today's parallel programming is based on locks.
  – Coarse-grained locking prevents parallelism; fine-grained locking is hard to use.
  – Code reuse demands exposing internal locks.
  – There is no conventional way to connect a mutex with the data it protects.

Nonblocking liveness properties
• Wait freedom: every process that tries to perform an operation completes it in a finite number of its own steps.
• Lock freedom: if some process tries to perform an operation, then some process succeeds in completing an operation.
• Obstruction freedom: a process that runs alone (without interference) completes its operation.

Atomic hardware primitives
• Load-Linked / Store-Conditional (LL/SC): LL(addr) returns the value pointed to by addr. A subsequent SC(addr, val) writes val into addr only if addr has not been written since the last LL.
• Compare-And-Swap (CAS): CAS(addr, e, v) atomically checks whether the value at addr equals e and, if so, writes v into addr; it reports success or failure.
• MCAS: m CAS operations performed atomically (DCAS is the special case m = 2).

Helping methodology
• A methodology for non-blocking algorithms.
• A process that holds data needed by another process is helped by that other process.
• Helping is usually recursive.
• It is widely used in transactional memory for software MCAS implementations (known as k-RMW).

Software Transactional Memory
• First try to acquire all the data the transaction needs.
• If this succeeds – execute the transaction and release the data.
• If it fails – release everything and retry.

Why Software Transactional Memory?
• Unexpected delays degrade the performance of locking, on top of its inherent programming difficulties.
  – Synchronization conflicts in memory allocation and deallocation.
• Hardware transactional memory lacks platform support and portability, and suffers delay anomalies.
• Methods such as translating code into k-RMW actions are non-trivial.
• Working on a copy of the object is impractical for large data structures.
• A programmable and flexible non-blocking parallel programming method is needed.

Software Transactional Memory – data-set pre-acquiring:
• Unintuitive programming.
• Reduced parallelism.
  – Common data structures must be acquired in their entirety.
• Dynamic data structures are impossible.

Software Transactional Memory – hardware support:
• LL/SC is not commonly supported by hardware.
• The operating system can emulate it, but:
  – It is much slower.
  – It reduces parallelism (it forces some scheduling).
  – On the other hand, more useful primitives can be defined.

Software Transactional Memory – the cost of wait freedom:
• Complicated acquiring code.
• Inflexibility.
• Uncommon primitives.
• Long locking times.

Dynamic STM
• Also supports dynamic transactions – transactions whose data set changes as they run.
• Satisfies obstruction freedom.
• Provides a modular contention manager for forcing progress, implementing priorities, and adapting to the application.
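The static-STM pattern described above – acquire the whole data set, compute, release, and on any failed acquire release everything and retry – can be sketched as follows. This is an illustrative single-threaded simulation, not a real implementation: class and function names are my own, and a plain `owner` field stands in for a CAS on an ownership record.

```python
class Location:
    """One shared word plus an owner field; in a real STM the owner
    field would be claimed with a hardware CAS, not a plain write."""
    def __init__(self, value):
        self.value = value
        self.owner = None

    def try_acquire(self, tx):
        # Stand-in for CAS(owner, None, tx): succeed only if unowned
        # (or already owned by this same transaction).
        if self.owner is None:
            self.owner = tx
            return True
        return self.owner is tx

    def release(self):
        self.owner = None


def run_transaction(tx, locations, compute):
    """Acquire the whole data set; if all acquires succeed, run the
    computation and release; otherwise release what was taken and retry."""
    while True:
        acquired = []
        ok = True
        for loc in locations:
            if loc.try_acquire(tx):
                acquired.append(loc)
            else:
                ok = False
                break
        if ok:
            compute(locations)
        for loc in acquired:       # release everything, success or not
            loc.release()
        if ok:
            return


# Demo: move 3 units from a to b as one transaction.
a, b = Location(10), Location(0)

def transfer(locs):
    locs[0].value -= 3
    locs[1].value += 3

run_transaction("tx1", [a, b], transfer)
assert (a.value, b.value) == (7, 3)
```

Note how the pre-acquiring drawback from the slides shows up directly: the transaction must name its entire data set up front, which is exactly what rules out dynamic data structures.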
Dynamic STM – implementation principles:
• A TM object points to a Locator, which contains an old version, a new version, and the last transaction that opened the object for writing.
• The correct version is determined by that transaction's status (active / aborted / committed).
• All of a transaction's objects are committed at once by changing its status.
• Obstruction freedom is obtained by aborting a conflicting transaction (subject to the contention manager's agreement).

Dynamic STM – properties and results:
• It is more natural to write DSTM code, and to convert sequential code into it.
• Early release can significantly increase performance.
• Reusing simpler algorithms inside a bigger one is easier with DSTM.
• Disadvantage: there is no way to know that an object was opened for reading.

Dynamic STM – obstruction freedom enables:
– simplicity,
– guarantees that are good enough for some applications,
– implementation of priorities,
– separation of correctness from progress,
– and, most important, it removes the need for a helping mechanism.
• However, one can argue it is not a real progress property.

Discussion – DSTM vs. the original STM:
• DSTM relates to STM as coarse-grained locking relates to fine-grained locking.
• But STM meets a real progress requirement, not a weakened one (obstruction freedom).
• Early release as an integral part of the mechanism reduces conflicts (compared with locks).
• Non-blocking progress, and obstruction freedom in particular, is valuable because delayed or failed processes cannot stop the whole system (a strong argument for DSTM).
• DSTM's implementation might lose that gain on truly parallel systems.
• "Let the contention manager do the work" is exactly like assuming the scheduler will do it.

Fraser's STM – an STM should satisfy:
• Small, fixed storage overhead per object.
• Few shared-memory operations.
• Short contention windows.
  – This reduces the time during which transactions can collide.
Nice to have:
• Support for varying object sizes.
• Nested transactions.

Fraser's STM
• Every object is represented as a pointer to an object handle, which consists of a version number and a pointer to the data block.
• Open-for-read returns the data-block pointer.
• Open-for-write returns a pointer to a shadow copy.
• Commit acquires all the opened objects, performs an MCAS, and uses helping.

Fraser's STM
• Problem: acquiring and releasing read-only objects blocks non-conflicting transactions.
  – This is critical for data structures with a single entry point (e.g. the head of a linked list).
• Solution: do not acquire read-only objects.
  – Add a read-checking phase in which the transaction validates all the objects it opened read-only, so that other transactions do not update them during this time.

Fraser's STM – deadlock prevention: T1 may abort T2 only if:
– both transactions' status is read-checking,
– T2 holds a location that T1 tries to read, and
– T1 < T2 according to a given total order on transactions.

DSTM vs. FSTM – FSTM is much better:
• Lazy acquire exposes a transaction to others for a very short time, reducing the number of conflicts.
• DSTM's indirection levels hurt performance (mainly for read-only transactions).
• Obstruction freedom's contention manager has a 5–10% overhead and is hard to design.

DSTM vs. FSTM – DSTM is much better:
• Eager acquire catches conflicts earlier.
  – This is possible thanks to obstruction freedom's weaker guarantee.
• Fewer CASes (N+1 for DSTM vs. 2N+2 for FSTM).
• The implementation is simpler and more efficient.
• FSTM's MCAS causes a lot of cache-block thrashing.

DSTM vs. FSTM – DSTM is better for workloads that:
• Open many locations.
  – Mainly write accesses to the same locations (IntSet).
  – Transactions that must be serialized (stack).
FSTM is better for workloads where:
• Livelocks are common (RBTree).
• Transactions are small.
  – Small conflict probability (IntSetRelease).

DSTM vs. FSTM – general remarks:
• Not validating repeatedly improves performance.
• How can inconsistent (doomed-to-abort) transactions be avoided?

Contention Management – recall that the DSTM contention manager should:
• ensure progress,
• eventually return from every call,
• and eventually abort a conflicting transaction.

Management approaches are tested over:
• various data sets,
• visible/invisible reads (non-optimistic/optimistic),
• elimination of unnecessary aborts.

Contention managers:
• Aggressive – always abort the enemy transaction. A good baseline for comparison.
• Polite – back off before aborting. Sensitive to preemption, page faults, etc.
• Randomized – flip a (balanced) coin to decide whether to abort or to wait (64 ns).
• Eruption – a transaction helps the transaction blocking it by giving it its momentum (momentum = successful open attempts + the momentum of the transactions it blocks).
  – The rationale: let transactions that hold critical data finish.
• Karma – the older transaction (in terms of open attempts) wins; attempts from previous aborted runs are counted as well.
• Kindergarten – back off before the first abort; later aborts are taken in turns.
• KillBlocked – a transaction aborts its blocker if the blocker is itself blocked (or after a fixed time).
• Timestamp – the older transaction wins; a failure detector is used.
• QueueOnBlock – blocked transactions are released from a queue when the blocker finishes (or after a fixed time).

Contention Management – results:
• Most managers, except Timestamp, are good for IntSetRelease with invisible reads.
• Aggressive, Randomized, Eruption, and Polite perform badly.
• QueueOnBlock and KillBlocked perform well only for RBTree with invisible reads.
• Timestamp is good only for Counter.
• Kindergarten is excellent, except for IntSetRelease with visible reads and for RBTree.
• Karma is not good for IntSet, nor for LFUCache with visible reads.

Visible reads vs. invisible reads:
• In IntSet and Counter there is no difference, as all accesses are writes.
• In IntSetRelease visible reads are better (except for Kindergarten, which is bad for both).
  – Visible reads give an option to avoid conflicts on short-lived accesses.
• In LFUCache for all managers, and in RBTree for all but Karma, invisible reads are much better.
  – Most conflicts are between a reader scanning its path and a writer updating the path to the root.

A blocking STM implementation – why not be bothered by blocking (mainly compared with obstruction freedom)?
• Long transactions must be aborted anyway; obstruction freedom is guaranteed only for a transaction running alone.
• Context switches are not a problem:
  – They are temporary.
  – The OS adapts automatically.
  – The platform can help (priorities, etc.).
• Independent failures:
  – Are not common in multicore.
  – Sequential programs also fail because of a single failure.

Non-blocking is bad because:
• Metadata and the object must be stored separately in order to satisfy non-blocking.
  – This doubles the cache misses.
• Assume N active transactions on N processors: since a new transaction must not be blocked, the number of conflicts increases.

The blocking implementation:
• Every transaction keeps, in its private data, a descriptor per opened object (consisting of the version, a pointer, and possibly a copy).
• Every object has a lock (with deadlock prevention), which is taken only when trying to commit.
• Accesses wait for the object to be unlocked; read accesses are optimistic.
• A priority mechanism is provided.

CPU time for various numbers of processors: (graph)
CPU time for various contention settings: (graph)

Discussion:
• Context switches ARE a problem, because of long delays.
• Failures are more common in parallel programs than in sequential ones.
• Is delay more interesting than throughput?

Another STM
• As in DSTM, committing is done by changing a status word, and the current version is determined by the owner transaction's status.
• But as in FSTM, before committing, the transaction tries to acquire all the records it owns.
• A wait method is provided, so a transaction can wait for acquired data instead of immediately retrying.

Another STM
• An ownership record (orec) contains either the version number of one (or more) objects or a pointer to the owning transaction's descriptor.
• Before committing, a transaction tries to acquire all the data it owns.
• If the data is already acquired, the transaction can abort the other transaction, wait for it to finish, or wake it (if it sleeps).

References
• Robert Ennals. Software Transactional Memory Should Not Be Obstruction-Free. Technical Report IRC-TR-06-052, Intel Research Cambridge, January 2006.
• K. Fraser. Practical Lock-Freedom. Technical Report UCAM-CL-TR-579, Cambridge University Computer Laboratory, February 2004.
• Tim Harris and Keir Fraser. Language Support for Lightweight Transactions. In Proceedings of the 18th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Anaheim, CA, USA, October 2003.
• Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer III. Software Transactional Memory for Dynamic-Sized Data Structures. In Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC), pages 92–101, 2003.
• Maurice Herlihy and Victor Luchangco. Distributed Computing and the Multicore Revolution. ACM SIGACT News, 39(1), March 2008.
• Virendra J. Marathe, William N. Scherer III, and Michael L. Scott. Design Tradeoffs in Modern Software Transactional Memory Systems. In Proceedings of the 7th Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, Houston, TX, October 2004.
• N. Shavit and D. Touitou. Software Transactional Memory. Distributed Computing, 10:99–116, 1997.
• William N. Scherer III and Michael L. Scott. Contention Management in Dynamic Software Transactional Memory. In Proceedings of the ACM PODC Workshop on Concurrency and Synchronization in Java Programs, St. John's, NL, Canada, July 2004 (in conjunction with PODC'04).
More reading
Ennals' blocking STM:
• Robert Ennals. Efficient Software Transactional Memory. Technical Report IRC-TR-05-051, Intel Research Cambridge, 2005.
PRAM:
• S. Fortune and J. Wyllie. Parallelism in Random Access Machines. In Proceedings of the 10th Annual ACM Symposium on Theory of Computing, pages 114–118, 1978.
• Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. Can a Shared-Memory Model Serve as a Bridging Model for Parallel Computation? In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 72–83, Newport, RI, USA, June 1997.
• P. B. Gibbons. A More Practical PRAM Model. In Proceedings of the 1st Annual ACM Symposium on Parallel Algorithms and Architectures, pages 158–168, Santa Fe, NM, USA, June 1989.
Popular older message-passing models:
• David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a Realistic Model of Parallel Computation. ACM SIGPLAN Notices, 28(7):1–12, July 1993.
• Leslie G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33(8):103–111, August 1990.
Memory allocation in multicore:
• Andrei Gorine and Konstantin Knizhnik. Tackling Memory Allocation in Multicore and Multithreaded Applications. McObject LLC, May 29, 2006. Available online at http://www.embedded.com/columns/showArticle.jhtml?articleID=188101359
• Voon-Yee Vee and Wen-Jing Hsu. A Scalable and Efficient Storage Allocator on Shared-Memory Multiprocessors. In Proceedings of the 1999 International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN '99), page 230, June 1999.
• P. R. Wilson, M. S. Johnstone, M. Neely, and D. Boles. Dynamic Storage Allocation: A Survey and Critical Review. In H. G.
Baker, editor, Proceedings of the International Workshop on Memory Management (IWMM'95), volume 986 of Lecture Notes in Computer Science, pages 1–116, Kinross, Scotland, September 1995.