Release Consistency

• Slides by Konstantin Shagin, 2002

The need for Relaxed Consistency Schemes
• In any implementation of Sequential Consistency
  there should be some global control mechanism.
• Either writes or reads require memory
  synchronization operations.
   – In most implementations, writes require some kind of
     memory synchronization:

           w(x)      w(y)       w(x)

 The Idea of Relaxed Consistency Schemes

• Relaxed consistency schemes are designed to
  require fewer memory synchronization operations.
  – Writes can be delayed, aggregated, or eliminated.
  – This results in less communication and therefore higher
    performance.

          w(x)      w(y)        w(x)


Software Distributed Shared Memory

 Node 1    Node 2                          Node n

   Mem       Mem                            Mem

               distributed shared memory

page based, permissions, …
single system image,
shared virtual address space, …
              False Sharing
• False sharing is a situation in which two or
  more processes access different variables
  within a page and at least one of the
  accesses is a write.
  – If only one process is allowed to write to a page
    at a time, false sharing leads to unnecessary
    communication, called the “ping-pong” effect.

Understanding False Sharing
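[Figure: process A repeatedly performs w(x) while process B repeatedly
 performs r(y). Top: x and y reside on the same page p, so p moves between
 A and B on every access. Bottom: x resides on page p1 and y on page p2,
 so A and B do not interfere.]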

       False Sharing in Relaxed
        Consistency Schemes
• False sharing has much smaller overhead in
  relaxed consistency models.
• The overhead induced by false sharing can be
  further reduced by the use of multiple-writer
  protocols.
• Multiple-writer protocols allow multiple processes
  to simultaneously modify their local copy of a
  shared page.
   – The modifications are merged at certain points of
     execution.
              Release Consistency
        [Gharachorloo et al. 1990, DASH]*
  • Introduces a special type of variables, called
    synchronization variables or locks.
  • Locks cannot be read or written to. They can be
    acquired and released. For a lock L those
    operations are denoted by acquire(L) and
    release(L) respectively
  • We will say that a process that acquired a lock L
    but has not released it, holds the lock L.
  • No more than one process can hold a lock L. One
    process holds the lock while others wait.
(*)   K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J.L. Hennessy. Memory consistency
      and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual
      International Symposium on Computer Architecture, pages 15--26. IEEE, May 1990.
    Using Release and Acquire to define
    execution-flow synchronization primitives

• Let a set of processes release tokens by reaching the operation
  release in their program order.
• Let another set (possibly overlapping) acquire those tokens by
  performing an acquire operation, where the acquire can
  proceed only when the tokens have arrived from all
  releasing processes.
• 2-way synchronization = lock-unlock, 1 release, 1 acquire
• n-way synchronization = barrier, n releases, n acquires
• PARC’s synch = k-way synchronization
              Model of Atomicity
• A read by Pi is considered performed with respect to
  process Pk at a point in time when the issuing of a write
  to the same address by Pk can not affect the value returned
  by the read.
• A write by Pi is considered performed with respect to
  process Pk at a point in time when an issued read to the
  same address by Pk returns the value defined by this
  write (or a later value).
• An access is performed when it is performed with respect
  to all processes.
• An acquire(L) by Pi is performed when Pi receives
  exclusive ownership of L (before any other requester).
• A release(L) by Pi is performed when Pi gives away
  its exclusive ownership of L.
    Formal Definition of Release Consistency
Conditions for Release Consistency:
(A) Before a read or write access is allowed to perform
    with respect to any other process, all previous acquire
    accesses must be performed, and
(B) Before a release access is allowed to perform with
    respect to any other process, all previous read or write
    accesses must be performed, and
(C) acquire and release accesses are sequentially
    consistent.

                     Understanding RC
[Figure: one process executes w(x)1 rel(L1); another process executes
 r(x)0  r(x)?  r(x)1  acq(L1)  r(x)1.]
• r(x)?: it is undefined what value is read here. It can be any value
  written by some process. Here it can be 0 or 1.
• After rel(L1) is performed, all processes must see the value 1 in x.
• r(x)1 before acq(L1): 1 must be read according to rule (B), but the
  programmer cannot be sure of it.
• r(x)1 after acq(L1): the programmer is sure that this will return 1,
  according to rules (C) and (A).
              Acquire and Release
• release serves as a memory-synch operation, or a flush
  of the local modifications to the attention of all other
  processes.
• According to the definition, the acquire and release
  operations are used not only for synchronization of
  execution, but also for synchronization of memory, i.e. for
  propagation of writes from/to other processes.
   – This allows overlapping the two expensive kinds of
     synchronization.
   – This also turns out to be simpler for the programmer from
     a semantic point of view.

        Acquire and Release (cont.)
• A release followed by an acquire of the same lock
  guarantees to the programmer that all writes previous to
  the release will be seen by all reads following the
  acquire.
• The idea is to let the programmer decide which blocks of
  operations need to be synchronized, and put them between
  a matching pair of acquire-release operations.
• In the absence of release/acquire pairs, there is no
  assurance that modifications will ever propagate between
  processes.

Consistency of synchronization operations
• Note that the relations of the release/acquire
  operations among themselves also define an independent
  memory consistency scheme.
   – Rule (C) defines it to be Sequential Consistency.
• There are other flavors of RC in which the
  consistency of synchronization operations is defined to
  be some consistency x (e.g., Coherence). Such a
  memory model is denoted by RCx.
• RCx is weaker than RCy if x is weaker than y.
• For simplicity, we deal only with RCsc.
  Happened-Before relation induced by
     acquire/release
• Redefine the happened-before relation using
  acquire and release instead of receive
  and send respectively.
• We say that event e happened before event e’ (and
  denote it by e → e’ or e < e’) if one of the
  following properties holds:
   Processor Order: e precedes e’ in the same process
   Release-Acquire: e is a release and e’ is the following
   acquire of the same lock
   Transitivity: exists e’’ s.t. e < e’’ and e’’< e’
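
The three rules above can be sketched as a small closure computation. This is a minimal illustration, not part of the original slides: the event labels, the `events` dictionary and the function name `happened_before` are assumptions made for the example, and the release-acquire pairs are supplied by hand rather than derived from a lock trace.

```python
from itertools import product

def happened_before(events, sync_edges):
    """Return the set of pairs (a, b) such that a happened before b.

    events: dict mapping a process name to its event labels in program order.
    sync_edges: list of (release_event, acquire_event) pairs, where the
    acquire is the one that follows the release of the same lock.
    """
    hb = set()
    # Processor order: earlier events of a process precede its later ones.
    for evs in events.values():
        for i in range(len(evs)):
            for j in range(i + 1, len(evs)):
                hb.add((evs[i], evs[j]))
    # Release-acquire order.
    hb.update(sync_edges)
    # Transitivity: repeat until no new pair can be derived.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(hb), repeat=2):
            if b == c and (a, d) not in hb:
                hb.add((a, d))
                changed = True
    return hb

events = {
    "A": ["A:w(x)", "A:rel(L1)"],
    "B": ["B:acq(L1)", "B:r(x)"],
    "C": ["C:r(x)"],
}
hb = happened_before(events, [("A:rel(L1)", "B:acq(L1)")])
```

In this example A's write is ordered before B's read through the lock L1, while C's read is unordered with respect to A's write in both directions.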
Happened-Before relation induced by
   acquire/release (cont.)

     w(x)   rel(L1)              w(x)            acq(L2)     r(y)

     w(y) rel(L2)                 r(x)   r(x)              rel(L1)
                                r(y) w(y)
                    acq(L2)                 rel(L2)

           Competing Accesses
• Two memory accesses are not synchronized if
  they are independent events according to the
  previously defined happened-before relationship.
• Two memory accesses are conflicting if they are
  accesses to the same memory location, and at least
  one of them is a write.
• Conflicting accesses are said to be competing if
  there exists an execution in which they are not
  synchronized.
• Competing accesses form a race condition as they
  may be executed concurrently.
                   Data Races in RC
• Release Consistency does not guarantee anything about
  propagation of updates without synchronization. Example:
   Initially: grades = oldDatabase; updated = false;

   Thread T.A.                     Thread Lecturer
   grades = newDatabase;           while (updated == false);
   updated = true;

• If the modification of variable updated is passed to
  Lecturer, while the modification of grades is not, then
  Lecturer looks at the old database!
   • This is possible in Release Consistency, but not in
      Sequential Consistency.
   Expressiveness of Release Consistency
        [Gharachorloo 1990]

Let a properly-labeled (PL) program be one that has no
competing accesses.

Theorem: RCsc = SC for PL programs.

The programmer should therefore make sure there are no data races.

             Implementing RC
• The first implementation was proposed by the
  inventors of RC and is called DASH.
• DASH combats memory latency by pipelining
  writes to shared memory.
            w(x)   w(y)   w(z)   rel(L)
               x      y      z

• The processor is stalled only when executing a
  release, at which time it must wait for all its
  previous writes to perform.
        Implementing RC (cont.)
• It is important to reduce the number of messages
  exchanged, because every message has additional
  fixed overhead, independent of its size.
• Another implementation of RC, called Munin,
  reduces the number of messages by buffering
  writes until a release.
          w(x)   w(y)   w(z)   rel(L)
                       Eager Release Consistency
                      [Carter et al. 1991, Munin]*
• Implementation of Release Consistency (not a new
  memory model).
• Postpone sending modifications to the next release.
• Upon a release send all accumulated modifications
  to all caching processes.
• No memory-synchronization operations on an
  acquire.
• Upon a miss (no local caching of the variable) get latest
  modification from latest modifier (need some more
  control to store its identity, no big deal).
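
The buffering idea can be sketched as follows. This is a toy model under stated assumptions, not Munin's actual code: the class `BufferingWriter`, its `sent` log and the peer names are hypothetical, and "sending" is modelled by appending to a list.

```python
class BufferingWriter:
    """Accumulates writes locally and sends them only at a release."""

    def __init__(self, caching_peers):
        self.caching_peers = caching_peers   # processes that cache the data
        self.buffer = {}                     # variable -> latest value written
        self.sent = []                       # log of (peer, updates) messages

    def write(self, var, value):
        # No message is sent here. A later write to the same
        # variable simply replaces the buffered value.
        self.buffer[var] = value

    def release(self):
        # One message per caching peer carries all accumulated modifications.
        if self.buffer:
            for peer in self.caching_peers:
                self.sent.append((peer, dict(self.buffer)))
        self.buffer = {}

w = BufferingWriter(["P2", "P3"])
w.write("x", 1)
w.write("y", 2)
w.write("x", 3)
w.release()
```

Three writes result in one message per caching peer, and the first write to x is never transmitted.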
 (*)   John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and Performance of MUNIN. In
       Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 152--164, October
       1991.
                 Understanding ERC
                           apply changes                       apply changes
   A   r(x)0 r(x)0                         acq(L1) r(y)1                       r(z)1

                     x,y                                       z
       w(x)1 w(y)1                          r(z)0                    r(z)1
   B                       rel(L1)

                     x,y                              z

       acq(L2) w(z)1                       r(x)1
                           apply changes            rel(L2)

• Release operation does not complete (is not performed) until
  the acknowledgements from all the processes are received.
   Supporting Multiple Writers in ERC

• Modifications are detected by twinning.
   – When writing to unmodified page, its twin is created.
   – When releasing, the final copy of a page is compared to
     its twin.
   – The resulting difference is called a diff.
• Twinning and diffing not only allow multiple
  writers, but also reduce communication.
   – Sending a diff is cheaper than sending an entire page.
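
A minimal sketch of twinning and diffing, assuming a page is modelled as a list of words and a diff as a mapping from word offset to new value. The function names are illustrative and not taken from any particular DSM system.

```python
def make_twin(page):
    """Copy the page before the first write so changes can be detected later."""
    return list(page)

def make_diff(twin, page):
    """Return {offset: new_value} for every word that differs from the twin."""
    return {i: new for i, (old, new) in enumerate(zip(twin, page)) if old != new}

def apply_diff(page, diff):
    """Apply a diff to a page in place."""
    for offset, value in diff.items():
        page[offset] = value

# Two writers start from the same page and modify different words.
base = [0, 0, 0, 0]
copy_a, copy_b = list(base), list(base)
twin_a, twin_b = make_twin(copy_a), make_twin(copy_b)
copy_a[0] = 7          # process A writes x
copy_b[3] = 9          # process B writes y
diff_a = make_diff(twin_a, copy_a)
diff_b = make_diff(twin_b, copy_b)

# A third copy receives both diffs and ends up with both modifications.
merged = list(base)
apply_diff(merged, diff_a)
apply_diff(merged, diff_b)
```

Each diff holds a single word rather than the whole page, which illustrates why sending diffs is cheaper. Note that a comparison-based diff cannot see a write that stores the value the word already had.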

           Twinning and Diffing
write P                  twin

                         writable working copy


                                   diff

     Update-based vs. Invalidate-based
     • In update-based protocols the modifications
       are sent whereas in invalidate-based
       protocol only notifications of modifications
       are sent.
           Invalidate-based                             Update-based

      w(x)1 w(y)2 rel(L)                              w(x)1 w(y)2 rel(L)
P1                                               P1
                           “I changed x and y”                             x:=1
P2                                               P2

Update-Based vs. Invalidate-Based
• Invalidations are smaller than the updates.
      – The bigger the coherency unit the bigger is the
        difference.
• In invalidation-based schemes there can be
  significant overhead due to access misses.

       w(x)1 w(y)2 rel(L)
                            inv(x)                     x=1            y=2
                                              get(x)         get(y)
                              inv(y) acq(L)
                                               r(x)           r(y)
Reducing the Number of Messages
• In DASH and Munin systems all processes (or all
  processes that cache the page) see the updates of a process.
• Consider the following example of execution in Munin:

        w(x) rel(L)
                 acq(L) w(x) rel(L)
                                      acq(L) w(x) rel(L)
                                                           acq(L) r(x)

• There are many unneeded messages. In DASH even more.
• This problem exists in invalidation-based schemes as well.
Reducing the Number of Messages
• Logically, however, it suffices to update each processor’s
  copy only when it acquires L.

        w(x) rel(L)
                 acq(L) w(x) rel(L)
                                      acq(L) w(x) rel(L)
                                                           acq(L) r(x)

• Therefore, a new algorithm, called Lazy Release
  Consistency (LRC) for implementing RC was proposed.
• LRC is aimed at reducing both the number of messages
  and the amount of data exchanged.
            Lazy Release Consistency
       [Keleher et al., Treadmarks 1992]*

• The idea is to postpone sending of modifications
  until a remote processor actually needs them.
• Invalidate-based protocol
• The BIG advantage: no need to get modifications
  that are irrelevant, because they are already masked
  by newer ones.
• NOTE: implements a slightly more relaxed
  memory model than RC!
(*)   P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. Treadmarks: Distributed shared memory on
      standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference,
      pages 115--132, Jan. 1994.
      Formal Definition of Lazy
        Release Consistency
Conditions for Lazy Release Consistency:
(A) Before a read or write access is allowed to perform
    with respect to any other process, all previous acquire
    accesses must be performed with respect to that other
    process, and
(B) Before a release access is allowed to perform with
    respect to any other process, all previous read or write
    accesses must be performed with respect to that other
    process, and
(C) acquire and release accesses are sequentially
    consistent.
 Understanding the LRC Memory Model

                 w(x)1   rel(L1)

         r(x)0      r(x)?    r(x)?   acq(L1)   r(x)1

         r(x)0       r(x)?   r(x)?   acq(L2)   r(x)?

• It is guaranteed that the acquirer of the same lock sees the
  modifications that precede the release in program order.

 Understanding the LRC Memory Model:
       w(x)1     rel(L1)

       acq(L2)        acq(L1) w(y)1 rel(L2)               rel(L1)

                                              acq(L2)   r(x)1   r(y)1

• The process C sees the modification of x by A.

        Implementation of LRC
• Satisfying the happened-before relationship
  between all operations is enough to satisfy
  LRC.
  – Maintenance and usage of such a detailed
    ordering would be expensive.
• Instead, the ordering is applied to process
  intervals.
  – Intervals are segments of time in the execution
    of a single process.
  – A new interval begins each time a process
    executes a synchronization operation.
               rel(L1)                          acq(L3)
         1                        2                       3

     acq(L2)            acq(L1)           rel(L2)             rel(L1)
     1         2                      3             4                   5

                   rel(L3)                     acq(L2)
         1                            2                       3

   Happened-before of Intervals
• A happened-before partial order is defined
  between intervals.

  An interval i1 precedes an interval i2
  according to happened-before of intervals
  if all accesses in i1 precede accesses in i2
  according to the happened-before of
  memory accesses.
          Vector Timestamps
• An interval is said to be performed at a
  process if all of the interval’s accesses have been
  performed at that process.
• Each process p has a vector timestamp Vp that
  tracks which intervals have been performed
  at that process.
  – A vector timestamp consists of a set of interval
    indices, one per process in the system.
         Management of Vector
              Timestamps
• Vector timestamps are managed like vector clocks.
   – send and receive events are replaced by release
     and acquire (of the same lock) respectively.
   – A lock grant message (sent from the releaser to the
     acquirer to give the acquirer exclusive ownership)
     contains the current timestamp of the releaser.

1. Just before executing a release or acquire in p:
   Vp[p]:= Vp[p] + 1
2. A lock grant message m is time-stamped with t(m)=Vp.
3. Upon acquire for every q: Vp[q]:= max{ Vp[q], t(m)[q] }
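
The three rules can be sketched as below. This is an illustrative model only: the class `Process` and its method names are assumptions, the increment in rule 1 is taken to apply to the process's own entry (as with vector clocks), and the lock grant message is reduced to the timestamp it carries.

```python
class Process:
    def __init__(self, pid, n_processes):
        self.pid = pid
        self.v = [0] * n_processes   # vector timestamp, one entry per process

    def start_new_interval(self):
        # Rule 1: a synchronization operation advances the own entry.
        self.v[self.pid] += 1

    def release(self):
        self.start_new_interval()
        # Rule 2: the lock grant message carries a copy of the releaser's timestamp.
        return list(self.v)

    def acquire(self, grant_timestamp):
        self.start_new_interval()
        # Rule 3: element-wise maximum with the timestamp in the grant message.
        self.v = [max(a, b) for a, b in zip(self.v, grant_timestamp)]

p0, p1 = Process(0, 3), Process(1, 3)
t = p0.release()      # timestamp attached to the lock grant
p1.acquire(t)         # p1 now knows about p0's first interval
```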
     Vector Timestamps (cont.)
• A process updates its vector timestamp at the end
  of an interval. Therefore during an interval the
  process’ timestamp does not change.
• We denote the vector timestamp of process p at
  interval i by Vpi.
• The entry for process q ≠ p is denoted by Vpi[q].
   – It specifies the most recent interval of process q that has
     been performed at process p.
   – Entry Vpi[p] is always equal to i.
• An interval x of process q is said to be covered by
  Vpi if Vpi[q] ≥ x.
                  Write Notices
• A write notice is an indication that a given page has
  been modified.
• Each process keeps a table of intervals covered by
  its vector timestamp.
   – An entry in this table represents an interval. It contains
     a write notice for every page that was modified during
     the segment of time corresponding to the interval.
• Write notices are sent in the lock grant message
  along with the vector timestamp of the releaser.
           Write Notices (cont.)
• It is not necessary to send the acquirer the write notices
  belonging to intervals covered by its vector
  timestamp.
• In order to let the releaser know which intervals are
  covered by the acquirer, the acquirer sends the releaser
  its timestamp inside the lock request message.
• When the releaser sends a lock grant message to the
  acquirer, it sends only the write notices belonging to
  intervals covered by itself but not covered by the
  acquirer.
• When the acquirer receives the lock grant message, it
  invalidates all the pages for which a write notice is
  included in the message.
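
The selection of write notices can be sketched as follows, assuming that an interval x of process q is covered by a timestamp v when v[q] ≥ x. The data layout (a dictionary from (process, interval) to a set of page names) and all function names are hypothetical choices for this example.

```python
def covered(v, q, x):
    """Interval x of process q is covered by timestamp v if v[q] >= x."""
    return v[q] >= x

def notices_for_grant(releaser_v, acquirer_v, known_notices):
    """Select the write notices to place in the lock grant message.

    known_notices maps (process, interval) to the set of pages modified in it.
    """
    return {
        (q, x): pages
        for (q, x), pages in known_notices.items()
        if covered(releaser_v, q, x) and not covered(acquirer_v, q, x)
    }

def apply_grant(valid_pages, notices):
    """The acquirer invalidates every page named in a received write notice."""
    for pages in notices.values():
        valid_pages -= pages
    return valid_pages

releaser_v = [2, 1]
acquirer_v = [1, 1]       # sent to the releaser inside the lock request
known = {(0, 1): {"p1"}, (0, 2): {"p2"}, (1, 1): {"p3"}}
grant = notices_for_grant(releaser_v, acquirer_v, known)
valid = apply_grant({"p1", "p2", "p3"}, grant)
```

Only the notice for interval 2 of process 0 is sent, so the acquirer invalidates p2 and keeps p1 and p3.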
              Write Notices (cont.)

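[Figure: A executes acq(L) w(x) w(y) rel(L) and generates write notices for
 x and y. B's acq(L) sends a lock request; the lock grant carries the write
 notices for intervals not covered by VCB. B invalidates according to the
 write notices, and on r(y) it requests the diffs and receives them.]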

                   Access Misses
• When accessing an invalidated page, all the
  modifications made to it in the intervals that
  happened before the current interval must be
  applied.
   – Note that this is true even if the access is a write.
• A process can identify those intervals and the
  processes that performed the modification by the
  write notices it has for the page.
   – A write notice is saved along with the id of the process
     from which it was received and its vector timestamp.
• How do we merge modifications performed by
  concurrent writers to a page?
    Tracking Modifications with
         Multiple Writers
• It is possible that several processes make modifications to
  different variables at the same page.
              P1                           P2

               X                           Y

• If the intervals in which the modifications are performed
  are independent (according to happened-before), we cannot
  just bring a page from one of the processes.
• What should we do? Employ the twinning and diffing
  technique again!
Twinning and Diffing (reminder)
write P             twin

                    writable working copy


                              diff

       Tracking Modifications with
         Multiple Writers (cont.)

       w(x)           rel(L1)
                                                              page P
                         acq(L2)           acq(L1) r(x)
 P2                                                           x
                   rel(L2)                                t

• Note that twinning and diffing not only allow multiple
  independent writers but also significantly reduce the
  amount of data sent.
                    Access Misses (cont.)
• Consider the following scenario, in which P3 has a miss on
  a page containing variables x, y and z:
       w(x)   rel
                    acq   inv(x) w(y) rel
                                            acq   inv(x,y) r(z)   mod(x,y)

• When accessing z, P3 sees that according to the locally
  stored write notices there have been two previous
  modifications.
• They are ordered by the happened-before relationship; therefore
  P3 can request both modifications from P2.

            Access Misses (cont.)
• More generally, if processor q modified page P at its
  interval x, then q is guaranteed to have any diffs of P
  created in intervals that “happened-before” the interval x.
• Therefore even if diffs from multiple writers need to be
  retrieved, it is usually only necessary to communicate with
  very few processors.
• How long should a process keep the diffs?
• How long should a process keep the write notices?
• Clearly, not forever! Garbage collection needs to be
  performed.
               Garbage Collection
• A diff needs to be retained until it is clear it will never
  be requested.
   – This happens when a diff has already been sent to every
     process.
• When a process sees it is running out of memory it
  initiates garbage collection, which is invoked at the next
  barrier.
• Garbage collection piggybacks on the barrier to “stop
  the world”. Each process receives all write notices in
  the system and uses them to validate all of its cached
  pages. As a result, all write notices and diffs are
  discarded.
                     Lazy Diffing
1. Don’t diff on every release; do it only if there’s
   communication with another node.
2. Delay generation of a diff until somebody asks for the
   diff.
   – When passing a lock token, send write notices for
      modified pages, but leave pages write-enabled.
   – If somebody asks for diff, diff and mark clean.
       •   Diff may include updates from later intervals (e.g., under the
           scope of other locks).
3. Must also generate diff if a write notice arrives.
   • Must invalidate the page but keep modifications.
        LRC with Lazy Diffing

     acq(L) w(x)        rel(L)
            make twin
                                            apply diff

      Benefits of Lazy Diffing
• The gain is considerable.
  – The eventual diff may include modifications
    that would have been split over several diffs.
  – Lock acquisitions are faster – no need to wait
    for diffs.
  – Reducing the number of diffs reduces overall
    amount of transmitted data.

       Drawbacks of Traditional LRC
1. At access miss a node may have to obtain the
   diffs from several nodes.
   –   This happens when there is a substantial write-write
       false sharing.
2. The same diff may be applied many times.
   –   Once at each node that fetches the diff
3. The need to save all diffs seen by a node
   significantly increases memory consumption.
   –   A node that creates a diff needs to store it locally.
   –   A node stores diffs it fetched from other nodes.
4. Garbage collection is an expensive global
   operation.
  Home-based Lazy Release Consistency*

• HLRC is a simple home-based multiple-writer protocol
  that implements LRC
• Each page has a designated node, called the home node of
  the page, which contains its master copy.
• Diffs are computed at lock transfer time, sent to the home
  nodes of the corresponding pages, and then discarded.
• On access miss (read or write) an entire page is fetched
  from home.
• HLRC solves the mentioned drawbacks of LRC.

 (*)   L. Iftode. Home-based Shared Virtual Memory. PhD thesis, Dept. of Computer Science, Princeton
       University, June 1998.
             Understanding HLRC
• Assume x is a variable on a page p whose home is
  P3.
       acq(L) w(x)        rel(L)
              make twin
                            acq(L)                   r(x)
                                            apply diff

• What happens if P2 tries to fetch p before the diff
  arrives at P3?
Guaranteeing Update Completion
        Before a Fetch
• There are several techniques to ensure that the
  home’s copy of a page contains the required
  updates:
   – Write flushing
   – Page versions (scalar timestamps)
   – Vector timestamps
• All the techniques require that the network
  delivers the messages in the order that they are
  sent.
             Write Flushing
• The simplest approach.
• Delay the completion of the release event
  until all the updates have been propagated to the
  corresponding homes.
  – The completion is ensured by having home
    acknowledge the receipt of diffs.

            Write Flushing (cont.)
       acq(L) w(x)        rel(L)
              make twin
                            acq(L)                       r(x)
                                            apply diff

There are two drawbacks:
  1. Latency increases due to the need to wait for the
     completion of update operation.
  2. Page prefetching: a page fetched from home may contain
     more recent updates than the ones required by LRC.
                            Page prefetching
      w(x) rel(L1)
P2    w(y) rel(L2)

                            (y)                 (x)             inv(p)
                                                                           p                         p
                        apply diff              apply diff
      acq(L1)                                                      r(x)        acq(L2)        r(y)
     acq(L2)                                            rel(L2)                          inv(p)

                                                                                      No need to
                                                                                     bring p again
                Page Versions
• A version number is attached to each page.
• Page version number is incremented at home
  whenever the home receives the update performed
  by a non-home writer within an interval.
• The home sends the page version either in reply to
  a diff message, or in reply to a fetch request along
  with the page itself.
• The page version numbers are included in the write
  notices.
• A local page is not invalidated if the local page
  version is greater than or equal to the required page
  version included in the write notice.
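
The page version rules can be sketched as below. This is a simplified model under stated assumptions: the class `HomeCopy` and the function `must_invalidate` are illustrative names, versions are plain integers starting at 0, and message passing is replaced by direct method calls.

```python
class HomeCopy:
    """Master copy of one page at its home node."""

    def __init__(self):
        self.version = 0

    def receive_diff(self):
        # The version is incremented for each update received from a
        # non-home writer. The new version is returned to that writer,
        # standing in for the reply to the diff message.
        self.version += 1
        return self.version

def must_invalidate(local_version, required_version):
    """A local copy stays valid if it is at least as new as required."""
    return local_version < required_version

home = HomeCopy()
v_first = home.receive_diff()    # version placed in the first writer's write notices
v_second = home.receive_diff()   # version placed in the second writer's write notices
local_version = home.version     # a node that fetches the page now gets this version
```

A node holding the fetched copy can ignore a later write notice that asks only for `v_first`, because its copy already includes that update.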
                     Page Versions (cont.)
      w(x) rel(L1)
P1                                                                               p is not invalidated,
                     diff                                                     because the version of the
P2    w(y) rel(L2)                                                                  local copy is 2
                            (y)                  (x)
                                    1                              inv(p,2)
                        apply diff               apply diff
      acq(L1)                                                        r(x)       acq(L2)          r(y)
     acq(L2)                                             rel(L2)                          inv(p,1)

• Using page versions avoids unnecessary
  invalidations, but still requires waiting for updates
  to complete at home.
         Vector Timestamps
• Vector timestamps represent page versions,
  but avoid the need to wait for completion of
  updates.
• The lock can be transferred immediately,
  because the vector timestamp representing
  the new page version can be calculated
  without cooperation of home.
• The prefetching is detected in the same way it
  is detected with scalar page versions.
     Vector Timestamps (cont.)
• A vector timestamp is attached to a valid page and
  indicates the current version of the page.
   – Such a timestamp is called a flush timestamp.
   – At home, flush timestamps are updated each time
     updates corresponding to a remote interval are
     applied.
   – At a non-home node, flush timestamps are updated either
      • at the end of intervals during which the page was written, or
      • when the page is fetched from home.

     Vector Timestamps (cont.)
• A vector timestamp is attached to an invalid page
  and indicates the page version that the node has to
  fetch from home.
   – Such timestamp is called lock timestamp.
   – The lock timestamp is updated at acquire time as a result of
     applying the write notices received from the last
     releaser.
• The lock timestamp is presented to the home as a
  part of a fetch request.
   – The home delays answering a fetch request if the
     required version is not available (because the
     corresponding updates are not completed).
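
A minimal sketch of how a home node might delay a fetch, assuming the required version is available when the page's flush timestamp is greater than or equal to the lock timestamp in every entry. The class `HomePage`, its `waiting` queue and the function `dominates` are hypothetical names, and network messages are modelled as method calls.

```python
def dominates(flush_ts, lock_ts):
    """True if every entry of flush_ts is >= the matching entry of lock_ts."""
    return all(f >= l for f, l in zip(flush_ts, lock_ts))

class HomePage:
    def __init__(self, n_processes):
        self.flush_ts = [0] * n_processes
        self.waiting = []            # (requester, lock_ts) pairs not yet answered

    def fetch(self, requester, lock_ts):
        """Return True if the page can be sent now, else queue the request."""
        if dominates(self.flush_ts, lock_ts):
            return True
        self.waiting.append((requester, lock_ts))
        return False

    def apply_diff(self, writer, interval):
        """Record that the writer's interval was applied and return the
        requesters whose delayed fetches can now be answered."""
        self.flush_ts[writer] = max(self.flush_ts[writer], interval)
        ready = [r for r, ts in self.waiting if dominates(self.flush_ts, ts)]
        self.waiting = [(r, ts) for r, ts in self.waiting
                        if not dominates(self.flush_ts, ts)]
        return ready

home = HomePage(3)
sent_now = home.fetch("P2", [1, 0, 0])   # the diff for interval 1 of process 0 is missing
released = home.apply_diff(0, 1)         # the diff arrives and is applied
```

The fetch from P2 is queued first and answered only once the missing diff has been applied at the home.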
                     Vector Timestamps (cont.)
[Figure: timeline of w(x) rel(L1) in one process and w(y) rel(L2) in P2, with
 the diffs sent to the home node P3 and applied there. Invalidations carry
 vector timestamps, inv(p,[]), and the page copies are annotated with their
 vector timestamps.
 – The read r(x) is delayed until the home node P3 applies the diff from P1.
 – At the later acq(L2), p is not invalidated.]
    Invalidation of a modified page
• There are situations that require that a modified
  page be invalidated. Example: A and B write to two
  different locations of the page P (no data race).

A     acq(L) w(P) rel(L)
                           inv(P)
B                 w(P)     acq(L)

• After invalidating P at acquire, B cannot discard
  the local copy of P, because it contains B’s recent
  modifications.
    Invalidation of a modified page (2)
• What happens when B fetches P from its home?
  (Let C be the home of P)

A     acq(L) w(P) rel(L)
                  diff(P)   inv(P)
B          w(P)             acq(L)

• How do we merge B’s local copy with the fetched one?
• B’s local copy is combined with the new one by
  two-way diffing.
             Two-way diffing
• Let Pold be the modified invalidated copy and let
  Told be its twin.
• Let Pfetched be the fetched copy.
• The new local copy Pnew is calculated by applying
  the modifications in Pold to Pfetched:
             Pnew := Pfetched + (Pold − Told)
• In addition, the twin of P is replaced by Pfetched:
             Tnew := Pfetched
• Therefore, the next time the diff of P is calculated,
  both old and new modifications are detected.
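
Two-way diffing can be sketched as below, assuming pages are lists of words and that a word counts as locally modified when it differs from the twin. The function name `two_way_diff` and the sample values are illustrative only.

```python
def two_way_diff(p_old, t_old, p_fetched):
    """Merge a locally modified, invalidated page with the fetched copy.

    Returns (p_new, t_new): the new working copy and its new twin.
    """
    p_new = list(p_fetched)
    for i, (old, twin) in enumerate(zip(p_old, t_old)):
        if old != twin:          # word written locally since the twin was made
            p_new[i] = old       # re-apply the local modification
    t_new = list(p_fetched)      # the fetched copy becomes the twin
    return p_new, t_new

t_old = [0, 0, 0]
p_old = [0, 5, 0]          # B wrote word 1 locally
p_fetched = [3, 0, 0]      # the home copy already holds A's write to word 0
p_new, t_new = two_way_diff(p_old, t_old, p_fetched)

# The next diff against the new twin still contains B's earlier write.
next_diff = {i: v for i, (t, v) in enumerate(zip(t_new, p_new)) if t != v}
```

The merged copy holds both A's and B's writes, and because the twin is the fetched copy, B's pending modification is not lost from the next diff.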