          Directory-based Cache Coherence
 Large-scale multiprocessors
            Directory contents
• for each memory block, the directory shows:
  – is it cached?
  – if cached, where?
  – if cached, clean or dirty?


• full directory has complete info for all blocks
  – an n-processor system needs an (n+1)-bit vector per block
    (n presence bits plus one dirty bit)
                  Terminology
• cache line, L
• home node, j: every block has a fixed “home”
• directory vector: a bit vector, V
  – really, V(L)
  – bit 0 set if dirty, reset if clean
  – bit i set => block is cached at processor i
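
To make the vector format concrete, here is a minimal sketch in C (not from the slides) of a per-line directory entry packed into one 64-bit word, following the convention above: bit 0 is the dirty bit and bit i is the presence bit for processor i. All helper names are illustrative.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t dir_entry_t;            /* V(L): one word per memory block */

#define DIRTY_BIT ((dir_entry_t)1)       /* bit 0 */

static int  is_dirty(dir_entry_t v)             { return (v & DIRTY_BIT) != 0; }
static int  is_cached(dir_entry_t v)            { return (v & ~DIRTY_BIT) != 0; }
static int  has_sharer(dir_entry_t v, int i)    { return (int)((v >> i) & 1); }
static void set_sharer(dir_entry_t *v, int i)   { *v |= (dir_entry_t)1 << i; }
static void clear_sharer(dir_entry_t *v, int i) { *v &= ~((dir_entry_t)1 << i); }

/* for a dirty line exactly one presence bit is set: that node is the owner */
static int owner(dir_entry_t v)
{
    for (int i = 1; i < 64; i++)
        if (has_sharer(v, i)) return i;
    return -1;                           /* not cached anywhere */
}

int main(void)
{
    dir_entry_t v = 0;
    set_sharer(&v, 3);                   /* processor 3 caches the line, clean */
    set_sharer(&v, 7);                   /* processor 7 also caches it */
    clear_sharer(&v, 3);                 /* processor 3 drops its clean copy */
    printf("cached=%d dirty=%d first sharer=%d\n",
           is_cached(v), is_dirty(v), owner(v));
    return 0;
}
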
                 The easy cases
• read hit (clean or dirty), or
• write hit (dirty)
  – no directory action required – serve from local
    cache
Handling read misses at processor i
• read miss => send request to home node, j
  – if L is not cached, or cached and clean => send it to
    processor i and set bit i in the vector
  – if L is cached and dirty => find the owner from the vector
     • if the owner is the home node itself, write back to
       memory, reset the dirty bit, send data to i, set bit i
     • if the owner is another node, j asks the owner to write
       back to home memory, resets the dirty bit on receipt,
       then sends the line directly to i and sets bit i
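
A rough sketch of how the home node's read-miss handler could follow the steps above, reusing the bit-vector layout from the earlier sketch; send_data, ask_writeback and recv_writeback are made-up stand-ins for the real network and memory-controller machinery.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t dir_entry_t;                 /* bit 0 = dirty, bit i = sharer i */

static void send_data(int to)       { printf("send line to node %d\n", to); }
static void ask_writeback(int from) { printf("ask node %d to write back\n", from); }
static void recv_writeback(void)    { printf("line written back to home memory\n"); }

static int owner(dir_entry_t v)               /* only meaningful when dirty */
{
    for (int i = 1; i < 64; i++)
        if ((v >> i) & 1) return i;
    return -1;
}

/* home node j handles a read miss for line L from requester i */
void handle_read_miss(dir_entry_t *v, int i, int home)
{
    if (!(*v & 1)) {                          /* not cached, or cached and clean */
        send_data(i);
        *v |= (dir_entry_t)1 << i;            /* set bit i */
        return;
    }
    int o = owner(*v);                        /* cached dirty: find the owner */
    if (o != home)
        ask_writeback(o);                     /* owner writes back to home memory */
    recv_writeback();                         /* home memory is now up to date */
    *v &= ~(dir_entry_t)1;                    /* reset the dirty bit */
    send_data(i);                             /* owner keeps a clean copy; its bit stays */
    *v |= (dir_entry_t)1 << i;                /* set bit i */
}

int main(void)
{
    dir_entry_t v = ((dir_entry_t)1 << 5) | 1;   /* dirty, owned by node 5 */
    handle_read_miss(&v, 2, 1);                  /* node 2 misses; home is node 1 */
    printf("vector is now %#llx\n", (unsigned long long)v);
    return 0;
}
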
    Handling write misses at processor i
• i sends the request for block L to the home node, j
• if L is not cached anywhere, j sends L to i and sets
  bits 0 and i
• if L is cached and clean, j sends invalidation to all
  sharers, resets their bits, sends L to i, and sets bits 0
  and i
• if L is cached and dirty, it can only be dirty at one
  node.
   – if L is cached at the home node, write back to memory,
     invalidate L in cache j, clear bit j, send data to i, set bits 0
     and i
   – if L is cached at node k, request a write back and
     invalidation from k, clear bit k on receipt, send data to i,
     set bits 0 and i
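
The corresponding write-miss handler might look like the sketch below (same assumptions and made-up helpers as the read-miss sketch); whatever the starting state, the vector ends up with only bit i and the dirty bit set.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t dir_entry_t;                  /* bit 0 = dirty, bit i = sharer i */

static void send_data(int to)             { printf("send line to node %d\n", to); }
static void send_inval(int to)            { printf("invalidate at node %d\n", to); }
static void ask_writeback_inval(int from) { printf("node %d: write back and invalidate\n", from); }

/* home node j handles a write miss for line L from requester i */
void handle_write_miss(dir_entry_t *v, int i, int home)
{
    if (*v != 0 && !(*v & 1)) {                /* cached clean somewhere */
        for (int k = 1; k < 64; k++)
            if ((*v >> k) & 1)
                send_inval(k);                 /* invalidate every sharer */
    } else if (*v & 1) {                       /* cached dirty at exactly one node */
        int o = 0;
        for (int k = 1; k < 64; k++)
            if ((*v >> k) & 1) o = k;
        if (o == home)
            printf("write back from the home's own cache and invalidate it\n");
        else
            ask_writeback_inval(o);            /* owner writes back to home memory */
    }
    send_data(i);                              /* line (now current at home) goes to i */
    *v = ((dir_entry_t)1 << i) | 1;            /* only bit i and the dirty bit remain set */
}
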
 Handling write hits at processor i
• if clean, same process as write miss on a clean
  line – except data does not need to be
  forwarded from j to i. So: invalidate, reset bits,
  and set bits 0 and i
• if dirty, it can only be dirty in one cache – and
  that must be cache i – just return the value
  from the local cache
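
The clean write hit is an upgrade: the same invalidation steps with no data transfer. A possible sketch, again with an illustrative helper name:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t dir_entry_t;

static void send_inval(int to) { printf("invalidate at node %d\n", to); }

/* home node handles an upgrade request for a clean line from sharer i;
 * a write hit on a dirty line never reaches the directory at all */
void handle_upgrade(dir_entry_t *v, int i)
{
    for (int k = 1; k < 64; k++)
        if (((*v >> k) & 1) && k != i)
            send_inval(k);                 /* invalidate every other sharer */
    *v = ((dir_entry_t)1 << i) | 1;        /* only i remains and the line is now dirty */
}
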
  Replacement (at any processor)
• if the line is dirty, write back to the home
  node and clear the bit vector
• if the line is clean, just reset bit i
  – this avoids unnecessary invalidations later
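
A small sketch of how the home node could process the two replacement cases (the clean case is the "replacement hint" mentioned again in the overflow discussion later); the function name is made up.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t dir_entry_t;

/* home node processes a replacement notification from node i */
void handle_replacement(dir_entry_t *v, int i, int was_dirty)
{
    if (was_dirty) {
        printf("node %d writes the line back to home memory\n", i);
        *v = 0;                            /* the sole owner is gone: clear the vector */
    } else {
        *v &= ~((dir_entry_t)1 << i);      /* clean replacement hint: just drop bit i */
    }
}
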
           Directory organization
• Centralized vs. distributed
  – centralized directory helps to resolve many races, but
    becomes a bandwidth bottleneck
  – one solution is to provide a banked directory
    structure: associate each directory bank with its
    memory bank
  – but: memory is distributed, so this leads to a
    distributed directory structure where each node
    holds the directory entries corresponding to the
    memory blocks for which it is the home
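
As one example of the distributed organization, a block's home node can be computed directly from its physical address. The sketch below assumes simple block interleaving across 64 nodes; the constants are illustrative, and the read-miss walkthrough later in these slides instead decodes the home from the upper address bits (contiguous partitioning).

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 128u            /* coherence granularity in bytes (assumed) */
#define NUM_NODES  64u

/* block-interleaved home mapping: consecutive blocks round-robin across nodes */
static unsigned home_node(uint64_t paddr)
{
    return (unsigned)((paddr / BLOCK_SIZE) % NUM_NODES);
}

int main(void)
{
    printf("home of address 0x10080 is node %u\n", home_node(0x10080));
    return 0;
}
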
         Hierarchical organization
• organize the processors as the leaves of a logical
  tree (need not be binary)
• each internal node stores the directory entries for
  the memory lines local to its children
• a directory entry indicates which of its children
  subtrees are caching the line and whether a non-
  child subtree is caching the line
• finding a directory entry requires a tree traversal
• inclusion is maintained between level k and k+1
  directory nodes
• in the worst case may have to go to the root
• hierarchical schemes are not used much due to high
  latency and volume of messages (up and down the
  tree); also the root may become a bottleneck
     Format of a directory entry
• many possible variations
• in-memory vs. cache-based are just two
  possibilities
• memory-based bit vector is very popular:
  invalidations can be overlapped or
  multicast
• cache-based schemes incur serialized
  message chain for invalidation
        In-memory directory entries
• directory entry is co-located in the home node
  with the memory line
  – most popular format is a simple bit vector
  – with 128 B lines, storage overhead for 64 nodes is
    6.35%, for 256 nodes 25%, for 1024 nodes 100%
    Cache-based directory entries
• directory is a distributed linked-list where the
  sharer nodes form a chain
   – cache tag is extended to hold a node number
   – home node only knows the ID of the first sharer
   – on a read miss the requester adds itself to the head
     (involves home and first sharer)
   – a write miss requires the system to traverse list and
     invalidate (serialized chain of messages)
   – distributes contention and does not make the home
     node a hot-spot, and storage overhead is fixed; but
     very complex (IEEE SCI standard)
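
A rough sketch of the cache-based organization: the home entry stores only the head of the sharer list and each sharer's extended cache tag stores the next node, so invalidation walks the chain one hop at a time. This is only a singly-linked simplification of SCI, and all names are illustrative.

#include <stdio.h>

#define NO_NODE (-1)

struct home_entry { int head; };             /* home knows only the first sharer */
struct cache_tag  { int next_sharer; };      /* extended tag at each sharing node */

/* read miss: requester i becomes the new head of the list */
void sci_add_sharer(struct home_entry *h, struct cache_tag tags[], int i)
{
    tags[i].next_sharer = h->head;           /* old head becomes i's successor */
    h->head = i;                             /* home now points at i */
}

/* write miss: walk the chain and invalidate one sharer at a time
 * (the serialized message chain mentioned above) */
void sci_invalidate_all(struct home_entry *h, struct cache_tag tags[])
{
    for (int n = h->head; n != NO_NODE; ) {
        int next = tags[n].next_sharer;
        printf("invalidate node %d\n", n);
        tags[n].next_sharer = NO_NODE;
        n = next;
    }
    h->head = NO_NODE;
}

int main(void)
{
    struct cache_tag tags[8];
    struct home_entry h = { NO_NODE };
    sci_add_sharer(&h, tags, 3);
    sci_add_sharer(&h, tags, 5);             /* list is now 5 -> 3 */
    sci_invalidate_all(&h, tags);
    return 0;
}
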
           Directory overhead
• quadratic in number of processors for bit vector
  – assume P processors, each with M bytes of local
    memory (so total shared memory is M*P)
  – let coherence granularity (memory block size) = B
  – number of memory blocks per node = M/B = number
    of directory entries per node
  – size of one directory entry = P + O(1)
  – total size of directory across all processors
         = (M/B) × (P + O(1)) × P = O(P²)
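
As a quick check of this formula and of the percentages quoted on the in-memory-directory slide, the snippet below computes the per-block overhead (P presence bits plus one dirty bit over the 8B data bits of a 128-byte block); it prints roughly 6.35%, 25% and 100%.

#include <stdio.h>

int main(void)
{
    const double block_bits = 128.0 * 8;       /* 128 B lines */
    const int    nodes[] = { 64, 256, 1024 };

    for (int k = 0; k < 3; k++) {
        double entry_bits = nodes[k] + 1;      /* P presence bits + 1 dirty bit */
        printf("P = %4d: directory overhead = %.2f%%\n",
               nodes[k], 100.0 * entry_bits / block_bits);
    }
    return 0;
}
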
 Reducing directory storage overhead
• common to group a number of nodes into a
  cluster and have one bit per cluster – for
  example, all cores on the same chip
• leads to useless invalidations for nodes in the
  cluster that aren’t sharing the invalidated
  block
• trade-off is between precision of information
  and performance
              Overflow schemes
• How can we make the directory size independent
  of the number of processors?
  – use a vector with a fixed number of entries, where
    the entries are now node IDs rather than just bits
  – use the usual scheme until the total number of
    sharers equals the number of available entries
  – when the number of sharers overflows the directory,
    the hardware resorts to an “overflow scheme”
      • Dir_i B: i sharer pointers; broadcast invalidations on overflow
      • Dir_i NB: on overflow, invalidate one current sharer to free a pointer
      • Dir_i CV: on overflow, switch to a coarse vector with one bit per
        group of P/i nodes; broadcast invalidations to a whole group on a write
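
A sketch of a limited-pointer entry in the spirit of Dir_i B: up to a fixed number of sharer IDs are tracked exactly, and once they overflow the entry only remembers that a broadcast invalidation is needed on the next write. Structure and function names are made up, and NPTRS = 4 is an arbitrary choice of i.

#include <stdio.h>

#define NPTRS 4                               /* i in Dir_i B (assumed) */

struct lp_entry {
    int n_sharers;                            /* valid pointers, or -1 on overflow */
    int sharer[NPTRS];
};

void lp_add_sharer(struct lp_entry *e, int node)
{
    if (e->n_sharers < 0) return;             /* already in overflow mode */
    if (e->n_sharers == NPTRS) {              /* no free pointer left */
        e->n_sharers = -1;                    /* remember only "broadcast needed" */
        return;
    }
    e->sharer[e->n_sharers++] = node;
}

void lp_invalidate(struct lp_entry *e, int num_nodes)
{
    if (e->n_sharers < 0) {                   /* Dir_i B: broadcast on overflow */
        for (int n = 0; n < num_nodes; n++)
            printf("broadcast invalidate to node %d\n", n);
    } else {
        for (int k = 0; k < e->n_sharers; k++)
            printf("invalidate node %d\n", e->sharer[k]);
    }
    e->n_sharers = 0;
}
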
              Overflow schemes
• Dir_i DP (Stanford FLASH)
  – DP stands for dynamic pointer
  – allocate directory entries from a free list pool maintained
    in memory
  – how do you size it?
     – may run into reclamation if free list pool is not sized properly at
       boot time
  – need replacement hints
  – if replacement hints are not supported, assume k sharers
    on average per memory block (k=8 is found to be good)
  – reclamation algorithms?
     • pick a random cache line and invalidate it
             Overflow schemes
• Dir_i SW (MIT Alewife)
  – trap to software on overflow
  – software maintains the information about sharers
    that overflow
  – MIT Alewife has directory entry of five pointers plus a
    local bit (i.e. overflow threshold is five or six)
     • remote read before overflow takes 40 cycles and after
       overflow takes 425 cycles
     • five invalidations take 84 cycles while six invalidations take
       707 cycles
              Sparse directory
• Observation: total number of cache lines in all
  processors is far less than total number of
  memory blocks
     • Assume a 32 MB L3 cache and 4 GB memory: less than 1%
       of directory entries are active at any point in time
• Idea is to organize directory as a highly
  associative cache
• On a directory entry “eviction” send invalidations
  to all sharers or retrieve line if dirty
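
A sketch of the sparse-directory idea: entries live in a small set-associative structure rather than beside every memory block, and allocating into a full set forces an eviction that must invalidate the victim line's sharers (or retrieve the line if it is dirty). Sizes and names are illustrative.

#include <stdint.h>
#include <stdio.h>

#define SETS 1024
#define WAYS 16                                  /* highly associative */

struct sdir_entry {
    int      valid;
    uint64_t block_addr;                         /* which memory block this tracks */
    uint64_t vector;                             /* usual bit vector for that block */
};

static struct sdir_entry sdir[SETS][WAYS];

/* find or allocate the entry for a block; evict a victim if the set is full */
struct sdir_entry *sdir_lookup(uint64_t block_addr)
{
    struct sdir_entry *set = sdir[block_addr % SETS];
    int victim = 0;

    for (int w = 0; w < WAYS; w++)
        if (set[w].valid && set[w].block_addr == block_addr)
            return &set[w];                      /* entry already present */

    for (int w = 0; w < WAYS; w++)
        if (!set[w].valid) { victim = w; break; }    /* free way available */

    if (set[victim].valid) {
        /* set was full: evict a victim (way 0 here; real designs choose better),
         * invalidating all of its sharers and retrieving the data if dirty */
        printf("evict entry for block %#llx: invalidate its sharers\n",
               (unsigned long long)set[victim].block_addr);
    }
    set[victim].valid = 1;
    set[victim].block_addr = block_addr;
    set[victim].vector = 0;
    return &set[victim];
}
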
      When is a directory useful?
• you cannot decide what to do before looking up the
  directory, even if you start reading memory speculatively
• directory introduces one level of indirection in every
  request that misses in processor’s cache
• snooping is preferable if the system has enough
  memory controller and router bandwidth to handle
  broadcast messages; AMD Opteron adopted this
  scheme, but it targets small-scale systems
• directory is preferable if number of sharers is small
  because in this case a broadcast would waste
  enormous amounts of memory controller and router
  bandwidth
• in general, directory provides far better utilization of
  bandwidth for scalable MPs compared to broadcast
 Extra slides (not covered in class)
• more details about directory protocols (mostly
  to do with implementation and performance)
  on the following slides for those interested
            Watch the writes!
• frequently written cache lines exhibit a small
  number of sharers; so small number of
  invalidations
• widely shared data are written infrequently;
  so large number of invalidations, but rare
• synchronization variables are notorious:
  heavily contended locks are widely shared and
  written in quick succession generating a burst
  of invalidations; require special solutions such
  as queue locks or tree barriers
                Interventions
• interventions are very problematic because they
  cannot be sent before looking up the directory;
  any speculative memory lookup would be useless
• few interventions in scientific applications due to
  one producer-many consumer pattern
• many interventions for database workloads due
  to migratory pattern; number tends to increase
  with cache size
            Optimizing for sharing
• optimizing interventions related to migratory sharing
  has been a major focus of high-end scalable servers
  – AlphaServer GS320 employs a few optimizations to quickly
    resolve races related to migratory hand-off
  – some research looked at destination or owner prediction
    to speculatively send interventions even before consulting
    the directory (Martin and Hill 2003, Acacio et al 2002)
  – another idea was to generate an early writeback so the
    next request can find it in home memory instead of
    coming to the owner to get it (Lai and Falsafi 2000)
            Path of a read miss
• assume that the line is not shared by anyone
  – load issues from load queue (for data) or fetcher accesses
    icache; looks up TLB and gets PA
  – misses in L1, L2, L3,… caches
  – launches address and request type on local system bus
  – request gets queued in memory controller and registered
    in OTT or TTT (Outstanding Transaction Table or
    Transactions in Transit Table)
  – memory controller eventually schedules the request
  – decodes home node from upper few bits of address
  – local home: access directory and data memory
  – remote home: request gets queued in network interface
           Path of a read miss
• from the network interface onward
  – eventually the request gets forwarded to the router
    and through the network to the home node
  – at the home the request gets queued in the network
    interface and waits for scheduling by the memory
    controller
  – after scheduling, the home memory controller looks
    up directory and data memory
  – reply returns through the same path
            Correctness issues
• serialization to a location
  – schedule order at home node
  – use NACKs, at the cost of extra traffic and possible livelock
  – or smarter techniques like back-off (NACK-free)
• flow control deadlock
  – avoid buffer dependence cycles
  – avoid network queue dependence cycles
  – virtual networks multiplexed on physical networks
              Virtual networks
• consider a two-node system with one incoming and
  one outgoing queue on each node


      [Figure: nodes P0 and P1, each with an incoming and an outgoing
       queue; virtual channels multiplexed over one physical network]
• single queue is not enough to avoid deadlock
  – single queue forms a single virtual network
             Virtual networks
• similar deadlock issues as multi-level caches
  – incoming message may generate another message
    e.g., request generates reply, ReadX generates reply
    and invalidation requests, request may generate
    intervention request
  – memory controller refuses to schedule a message if
    the outgoing queue is full
  – same situation may happen on all nodes: deadlock
  – one incoming and one outgoing queue is not enough
  – what if we have two in each direction? One for
    request and one for reply
  – what about requests generating requests?
               Virtual networks
• what is the length of the longest transaction in
  terms of number of messages?
   – this decides the number of queues needed in each
     direction
   – one type of message is usually assigned to a queue
   – one queue type connected across the system forms a
     virtual network of that type e.g. request network, reply
     network, third party request (invalidations and
     interventions) network
   – virtual networks are multiplexed over a physical network
• sink message type must get scheduled eventually
   – resources should be sized properly so that scheduling of
     these messages does not depend on anything
   – avoid buffer shortage (and deadlock) by keeping reserved
     buffer for the sink queue
          Three-lane protocols
• quite popular due to simplicity
  – let the request network be R, reply network Y,
    intervention/invalidation network be RR
  – network dependence (aka lane dependence) graph
    looks something like this

      [Lane dependence graph: R may generate RR and Y; RR generates Y;
       the reply lane Y is the sink]
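
One way to read the lane dependence graph is as a scheduling rule: a message is consumed from its incoming lane only if every lane it might inject into has space, and replies, being the sink, never inject anything and can always be drained. The sketch below encodes that rule; the enum and the outgoing_has_space probe are illustrative.

#include <stdbool.h>
#include <stdio.h>

enum lane { R, RR, Y, NUM_LANES };             /* request, 3rd-party, reply */

/* which outgoing lanes a message arriving on each lane may need */
static const bool may_need[NUM_LANES][NUM_LANES] = {
    /*        R      RR     Y    */
    /* R  */ { false, true,  true  },          /* request can spawn interventions and a reply */
    /* RR */ { false, false, true  },          /* intervention spawns only a reply */
    /* Y  */ { false, false, false },          /* reply is sunk: spawns nothing */
};

extern bool outgoing_has_space(enum lane l);   /* hypothetical queue probe */

bool can_schedule(enum lane incoming)
{
    for (int l = 0; l < NUM_LANES; l++)
        if (may_need[incoming][l] && !outgoing_has_space((enum lane)l))
            return false;                      /* would block mid-transaction: wait */
    return true;
}
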
           Performance issues
• latency optimizations
  – overlap activities: protocol processing and data
    access, invalidations, invalidation acknowledgments
  – make critical path fast: directory cache, integrated
    memory controller, smart protocol
  – reduce occupancy of protocol engine
• throughput optimizations
  – pipeline the protocol processing
  – multiple coherence engines
  – protocol decisions: where to collect invalidation
    acknowledgments, existence of clean replacement
    hints

				