                EE 382
                Processor Design

                Winter 1998
                Chapter 8 Lectures
                Multiprocessors Part II



EE 382 Processor Design   Winter 98/99   Michael Flynn   1
                Illinois protocol (figure)




EE 382 Processor Design   Winter 98/99   Michael Flynn   2
                 Write-invalidate




EE 382 Processor Design   Winter 98/99   Michael Flynn   3
        Synchronization/coherency
• Synchronization: means to ensure that multiple
  processors have the same (coherent) view of
  critical values in memory
• Coherency: property that ensures that the value
  returned by a read is the same as the value of the
  latest write
• Consistency: degree to which (or part of memory
  over which) coherency is maintained



    EE 382 Processor Design   Winter 98/99   Michael Flynn   4
        Consistency of memory ops
• Sequential consistency (strong ordering)
   – all memory ops execute in some sequential order;
     the memory ops of each processor appear in
     program order
• Processor consistency (buffered writes)
   – LD sequences appear in program order, as do ST
     sequences, but a LD may precede an earlier ST
     (see the sketch below)
   – different processors may see different op orders
   – requires explicit synchronization
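
To make the difference concrete, here is a minimal C/pthreads sketch (illustrative only, not from the original slides) of the classic two-flag test. Under sequential consistency at least one thread must observe the other's ST, so r1 and r2 cannot both be 0; under processor consistency each LD may complete before the preceding buffered ST drains, so r1 = r2 = 0 is possible unless explicit synchronization is added.

    #include <pthread.h>
    #include <stdio.h>

    /* volatile keeps the compiler from caching these in registers;
     * it does NOT restore sequential consistency on the hardware. */
    volatile int x = 0, y = 0;
    volatile int r1, r2;

    void *t1(void *arg) { x = 1; r1 = y; return NULL; }   /* ST x, then LD y */
    void *t2(void *arg) { y = 1; r2 = x; return NULL; }   /* ST y, then LD x */

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Strong ordering forbids r1 == 0 && r2 == 0; buffered writes allow it. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }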



    EE 382 Processor Design   Winter 98/99   Michael Flynn   5
                  Weak consistency
• Other forms are possible, e.g., weak ordering
  – all pending memory ops are completed before a
    synchronization op (forced completion is called
    a fence op; see the sketch below)
  – synch ops are completed before any other
    memory ops
  – synch ops are sequentially consistent
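
A minimal sketch of the fence idea using C11 atomics (illustrative; the producer/consumer framing and variable names are assumptions, not part of the slides):

    #include <stdatomic.h>

    int data;                /* ordinary shared data     */
    atomic_int ready;        /* synchronization variable */

    void producer(void) {
        data = 42;                                          /* pending memory op        */
        atomic_thread_fence(memory_order_release);          /* fence: force completion  */
        atomic_store_explicit(&ready, 1, memory_order_relaxed);  /* synch op            */
    }

    int consumer(void) {
        while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
            ;                                               /* wait on synch variable   */
        atomic_thread_fence(memory_order_acquire);          /* fence before later reads */
        return data;                                        /* guaranteed to see 42     */
    }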



   EE 382 Processor Design   Winter 98/99   Michael Flynn   6
                               Outline
• Partitioning
   – Granularity
   – Overhead and efficiency
• Multi-threaded MP
• Shared Bus
   – Coherency
   – Synchronization
   – Consistency
• Scalable MP
   – Cache directories
   – Interconnection networks
   – Trends and tradeoffs
• Additional References
   – Hennessy and Patterson, CAQA, Chapter 8
   – Culler, Singh, Gupta, Parallel Computer Architecture:
     A Hardware/Software Approach
      • http://HTTP.CS.Berkeley.EDU/~culler/book.alpha/index.html




    EE 382 Processor Design     Winter 98/99       Michael Flynn     7
                            Scalable MP
• Bandwidth of a single bus limits scalability
  – Can use two (or more) buses for even/odd cache lines
  – Extends system size incrementally, but at substantial cost
  – Use low-degree MP on a shared bus as a cluster with a scalable
    interconnect




       EE 382 Processor Design   Winter 98/99   Michael Flynn   8
         Coherency for Scalable MP
• Maintain single, coherent memory address space
   – There is no longer a shared bus accessed by all processors
     for synchronization and communication through memory
   – Use a directory to track processors using memory lines
       • central directory: with memory module
       • distributed directory: with individual caches
   – Shared lines can be invalidated or updated on write
    – Four possible protocols: CD-INV, CD-UP, DD-INV, DD-UP
        • CD-INV and DD-INV (Scalable Coherent Interface) are the most
          common



     EE 382 Processor Design    Winter 98/99      Michael Flynn   9
                Central Directory




EE 382 Processor Design   Winter 98/99   Michael Flynn   10
                        Central Directory
• Typically use a bit vector stored with each line in
  memory
   – Each bit indicates whether the corresponding cluster has cached a copy of
     the line
   – Various optimizations to reduce storage overhead are possible
   – Used in Stanford DASH/FLASH, MIT Alewife, SGI Origin
• When a processor needs to write a line it does not own
   –   It requests the line from memory
   –   The CD sends invalidates to all caches that hold the line
   –   All relevant caches invalidate the line and acknowledge
   –   The requesting processor then takes ownership and modifies the line
       (see the sketch below)
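
A minimal sketch (the names and the 64-cluster size are illustrative assumptions, not from the slides) of the bit-vector bookkeeping the central directory performs on such a write request:

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_CLUSTERS 64              /* assumed system size */

    struct dir_entry {
        uint64_t sharers;                /* bit i set => cluster i caches the line */
        int      owner;                  /* cluster with exclusive copy, or -1     */
    };

    /* Stand-in for the interconnect message. */
    static void send_invalidate(int cluster, unsigned long line) {
        printf("invalidate line %lu in cluster %d\n", line, cluster);
    }

    /* Write request from cluster 'req' for 'line': invalidate every other
     * sharer, then record the requester as owner (acks are assumed to be
     * collected before ownership is actually granted). */
    void dir_handle_write(struct dir_entry *e, unsigned long line, int req) {
        for (int c = 0; c < MAX_CLUSTERS; c++)
            if (c != req && (e->sharers & (1ULL << c)))
                send_invalidate(c, line);
        e->sharers = 1ULL << req;        /* only the writer remains a sharer */
        e->owner   = req;
    }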



        EE 382 Processor Design    Winter 98/99     Michael Flynn   11
            Distributed Directory (Part I)
• Linked-list used to keep track of caches holding a line
   – Singly- or doubly-linked (SCI) lists used
   – Pointer to head of list is stored with line in memory
   – Used in IEEE-SCI and Sequent NUMA-Q

• When a processor (P) needs to write a line it does not own
   – If P holds a shared copy in its cache, P removes itself from the linked list of
     caches for the line
   – P notifies the memory of its intention to write the line and becomes the head
     of the list
   – P sends an invalidation signal to the next cache on the list
        • The next cache invalidates the line and returns an acknowledge to P along with a
          pointer to the next cache on the list
   – When all the caches have been invalidated, P can take ownership and write
     the line
         EE 382 Processor Design       Winter 98/99          Michael Flynn   12
  Distributed Directory (Part II)




EE 382 Processor Design   Winter 98/99   Michael Flynn   13
              Distributed Directory (Part II)
• Performance Issues
  – Linked lists generally short for shared data being modified
  – When data is shared, important to minimize synchronization and
    communication overhead
       • Queue on Lock Bit (QOLB)
           – Hardware maintains a queue of caches waiting on the lock
           – Software spins on a shadow copy of the line in its local cache
             (for contrast, an ordinary locally-spinning software lock is
             sketched below)
           – Lock and data are stored in the same cache line
           – A single line transfer is required for each processor to
             synchronize/communicate
           – “An Analysis of Synchronization Mechanisms in Shared-Memory
             Multiprocessors”, Woest and Goodman, URL:
             http://www.cs.wisc.edu/~arch/www/
       • Efficient algorithms can be quite complex
           – FLASH uses a programmable protocol processor
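
For contrast with QOLB's hardware queue, a plain test-and-test-and-set lock in software (a C11 sketch, not from the slides) already captures the "spin on a locally cached copy" idea: the polling loads hit in the local cache and generate coherence traffic only when the lock line actually changes.

    #include <stdatomic.h>

    atomic_int lock;   /* 0 = free, 1 = held; illustrative only */

    void acquire(atomic_int *l) {
        for (;;) {
            /* Spin on the locally cached copy: these plain loads cause no
             * interconnect traffic while the line stays valid in the cache. */
            while (atomic_load_explicit(l, memory_order_relaxed) != 0)
                ;
            /* Only then attempt the expensive atomic read-modify-write. */
            if (atomic_exchange_explicit(l, 1, memory_order_acquire) == 0)
                return;
        }
    }

    void release(atomic_int *l) {
        atomic_store_explicit(l, 0, memory_order_release);
    }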
      EE 382 Processor Design   Winter 98/99     Michael Flynn   14
               Interconnect Networks
• Each network node consists of a processor, cache, and
  part of the global memory
• A node may also include a switch (direct network)
   – For indirect networks, the switches are separate from
     the nodes
• Networks may be static (fixed links between nodes) or
  dynamic (switches configure the path)
• Only direct static and indirect dynamic networks are
  commonly used


     EE 382 Processor Design   Winter 98/99   Michael Flynn   15
          Interconnect Networks
        Direct                              Indirect




EE 382 Processor Design   Winter 98/99   Michael Flynn   16
           Static, Direct Networks
• Includes ring, linear array, star, mesh, ...
• We consider only hypertorus (k,n) topologies
   – n dimensions, k elements per dimension
   – k-ary n-cubes with end-around connections
• Terms
   – distance
        • smallest no. of links/hops between 2 nodes
   – diameter
        • largest distance between any 2 nodes
   – number of nodes
        • N = k^n for a (k,n) network

  EE 382 Processor Design   Winter 98/99     Michael Flynn   17
         Static, Direct Networks


           Linear Array              Grid (2D Mesh)




                Ring
                                         2D torus



EE 382 Processor Design   Winter 98/99        Michael Flynn   18
     Links (Channels) and Nodes
• Link characteristics
  – cycle time: Tch = 1/BW of a link wire
  – width of link: w = no. of wires in the link
  – directionality: unidirectional or bidirectional
    links
• Node buffering (static networks)
  – Store and forward
  – Wormhole (cut-through) routing


 EE 382 Processor Design   Winter 98/99   Michael Flynn   19
       Links (Channels) and Nodes



Store and Forward




   Wormhole




   EE 382 Processor Design   Winter 98/99   Michael Flynn   20
   Communication Latency for Static Network
Assume a (k,n) network with dimensional closure and
bidirectional links. If a message has H header bits and l “payload”
bits, the number of channel cycles to transmit the message over one
link is (l + H)/w. If the distance between source and destination
nodes is d links and h = H/w, then

       Tstore-and-forward = Tch [ d (l + H)/w ] = Tch [ d (l/w) + d h ]

For wormhole routing, once the message header is received at a
node the message proceeds to an output channel and is
transmitted, so

       Twormhole = Tch [ d h + (l/w) ]

Note: Both formulas give the communication latency in the absence of
contention (i.e., no queuing delay).
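
A quick numerical check of the two formulas (a sketch; the hop count, header size, and link width below are illustrative assumptions):

    #include <stdio.h>

    /* Channel cycles for one message, per the two formulas above.
     * d = hops, l = payload bits, H = header bits, w = wires per link. */
    double store_and_forward(int d, int l, int H, int w) {
        return (double)d * (l + H) / w;
    }

    double wormhole(int d, int l, int H, int w) {
        return (double)d * H / w + (double)l / w;
    }

    int main(void) {
        /* Illustrative: 200-bit payload, 16-bit header, 16-wire link, 6 hops. */
        printf("store-and-forward: %.1f cycles\n", store_and_forward(6, 200, 16, 16)); /* 81.0 */
        printf("wormhole:          %.1f cycles\n", wormhole(6, 200, 16, 16));          /* 18.5 */
        return 0;
    }

Multiplying either result by Tch gives the latency in absolute time.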


      EE 382 Processor Design   Winter 98/99      Michael Flynn   21
       Dynamic, Indirect Networks
• Switches are separate from the nodes and
  centralized as a MIN (Multistage
  Interconnection Network)
  – A switch is a k x k crossbar with no storage
  – An N-node (1 channel/node) network has (N/k)·w
    switches per stage
  – Minimum no. of stages to connect N inputs to N
    outputs is ⌈log_k N⌉




   EE 382 Processor Design   Winter 98/99   Michael Flynn   22
    Dynamic, Indirect Networks




    Multi-Stage Network                       Crossbar Switch




EE 382 Processor Design   Winter 98/99   Michael Flynn   23
       Baseline Dynamic Network
• Destination node address sets the switch routing for
  each stage
• In a simple baseline network messages can block
   – No storage in the switch
• Cost for a baseline network is
  w × (N/k) × ⌈log_k N⌉ k×k switches (worked instance below)
• Assume each switch has a delay of one channel
  cycle = Tch
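
A quick worked instance of the stage and cost formulas (a sketch; the N, k, and w values are illustrative assumptions):

    #include <stdio.h>

    /* Stages needed to connect N inputs to N outputs with k x k switches:
     * the smallest s with k^s >= N, i.e. ceil(log_k N). */
    int min_stages(long N, int k) {
        int s = 0;
        long span = 1;
        while (span < N) { span *= k; s++; }
        return s;
    }

    /* Total k x k switches in a baseline MIN: w * (N/k) * ceil(log_k N). */
    long baseline_switch_count(long N, int k, int w) {
        return (long)w * (N / k) * min_stages(N, k);
    }

    int main(void) {
        /* Illustrative: N = 1024 nodes, 2x2 switches, 16-wire links:
         * 10 stages and 16 * 512 * 10 = 81920 switches. */
        printf("stages   = %d\n", min_stages(1024, 2));
        printf("switches = %ld\n", baseline_switch_count(1024, 2, 16));
        return 0;
    }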

   EE 382 Processor Design   Winter 98/99   Michael Flynn   24
       Baseline Dynamic Network




EE 382 Processor Design   Winter 98/99   Michael Flynn   25
            Other Dynamic Networks
• Other MIN configurations include additional stages
  and switches for less blocking (redundant paths) but
  more cost
• Dynamic networks generally have Uniform Memory
  Access (UMA)
   – Equal time to access any part of memory
      • Can optimize for memory local to processor
   – Static networks are generally NUMA




     EE 382 Processor Design   Winter 98/99   Michael Flynn   26
       Other Dynamic Networks




EE 382 Processor Design   Winter 98/99   Michael Flynn   27
              Network Tradeoffs
• Direct Networks
  + Enables placement for communication affinity
    (NUMA)
  + Low incremental costs for small systems and
    expansion
  - Requires closely-coupled processor/switch design
  - High-dimensional networks have inefficient
    mapping to physical wiring




EE 382 Processor Design   Winter 98/99   Michael Flynn   28
                 Network Tradeoffs
• Indirect Networks
  + Can be built from standard processors and
    switches
  - Large fixed cost in switches, even for small
    systems




 Trend is Toward Direct Networks With Low Dimensionality



   EE 382 Processor Design   Winter 98/99   Michael Flynn   29
        Dynamic Network Analysis
• Time to transmit message without contention (Tc)
   – n is number of stages
   – Tc = n + (l /w) +1 (for h = 1)
        • usually n + (l /w) >>1 so
   – Tc = n + (l /w ) network cycles
• Model contention with MB/D/1
   –   p = m/k (going to k inputs)
   –   λ = m (probability that a processor is sending a message)
   –   ρ = m × (l/w) (service time = l/w)
   –   m = prob(a particular node makes a request in a cycle)

   EE 382 Processor Design     Winter 98/99   Michael Flynn   30
       Dynamic Network Analysis
• Queueing Delay: Tdynamic = Tc + Tw
  – Tw = ρ (l/w)(1 - 1/k) / (2(1 - ρ))
  – Tc = n + l/w
  – All expressed in network cycles = Tch (evaluated in the sketch below)
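
Putting the pieces together, a sketch (parameter values are illustrative assumptions, not from the slides) that evaluates Tdynamic = Tc + Tw for the model above:

    #include <stdio.h>

    /* Latency including queueing delay for the MIN model, in network cycles.
     * n = stages, l = payload bits, w = wires per link, k = switch degree,
     * m = prob. a node issues a request in a cycle. */
    double min_latency(int n, int l, int w, int k, double m) {
        double ts  = (double)l / w;                 /* service time l/w   */
        double rho = m * ts;                        /* rho = m * (l/w)    */
        double tc  = n + ts;                        /* Tc = n + l/w       */
        double tw  = rho * ts * (1.0 - 1.0 / k) / (2.0 * (1.0 - rho));
        return tc + tw;                             /* Tdynamic = Tc + Tw */
    }

    int main(void) {
        /* Illustrative: 1024 nodes, 2x2 switches => n = 10 stages,
         * 200-bit messages, 16-wire links, 5% request rate per cycle. */
        printf("Tdynamic = %.1f cycles\n", min_latency(10, 200, 16, 2, 0.05));
        return 0;
    }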




   EE 382 Processor Design   Winter 98/99   Michael Flynn   31
           Static Network Analysis
• For a static (k,n) network
  – let kd be the average no. of network hops for a message
    to transit a single dimension
       • for a bidirectional network with closure, kd = k/4
         (k even)
• Time to transmit a message without contention (Tc)
  – Tc = n × kd + (l/w) in network cycles (for h = 1)


   EE 382 Processor Design   Winter 98/99   Michael Flynn   32
           Static Network Analysis
• Model contention with M/G/1 for k large
  (k > 8) and M/D/1 for k smaller
  – λ = m n kd (n·kd is the average no. of hops for a
    message)
  – μ = 2nw/l (each node has 2n channels)
  – ρ = λ/μ = m kd (l/2w)
  – For M/G/1
                   Tw = (ρ/(1-ρ)) (l/w) ((kd-1)/kd²) (1 + 1/n)
  – For M/D/1 (used in the sketch below)
                   Tw = (ρ/(2(1-ρ))) (l/w)
   EE 382 Processor Design     Winter 98/99   Michael Flynn   33
 Static vs. Dynamic Network Example
N = 1024 processing elements
l = 200 bits
Pins per switch = 64 (fan-in + fan-out)



                  (Figures: latency with locality vs. without locality)




        EE 382 Processor Design           Winter 98/99   Michael Flynn    34
                     Bisection Width
• Bisection Width is the minimum no. of wires cut when a
  network is divided into two equal halves
• If links (rather than nodes) dominate cost then network
  comparisons should be based on equivalent bisection
  width, B.
   – For a static (k,n) network: B(k,n) = 2wN/k
   – For a dynamic network with k×k = 2×2 switches: B = wN
• So higher-dimensional static networks have shorter
  “virtual” latency (no. hops) than lower-dimensional
  networks, but the planar (or even 3D) realization of
  physical wiring reduces performance
   – w is reduced for the same no. interconnect layers
   – wires are longer/slower

    EE 382 Processor Design   Winter 98/99      Michael Flynn   35
                  Hotspots and Combining
• Network traffic (especially
  synchronization) may be
  directed to a single location
  in memory, creating a hotspot
• Hotspots can be mitigated by
  adding logic to the switch
   – Fetch-and-Add instructions
     directed to a hotspot can be
     combined in the switch; the
     fetched result is later split
     and returned to the individual
     requesters (see the sketch below)



               t = fraction of references going to hotspot
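
A sketch (not from the slides) of why Fetch-and-Add combines cleanly: the switch can merge two requests aimed at the same hotspot into one, then reconstruct both results from the single value returned by memory.

    #include <stdatomic.h>

    /* Plain fetch-and-add: returns the old value and adds 'inc' to memory. */
    int fetch_and_add(atomic_int *addr, int inc) {
        return atomic_fetch_add_explicit(addr, inc, memory_order_relaxed);
    }

    /* What a combining switch does conceptually: forward a single
     * fetch_and_add(addr, a + b) toward memory, remember 'a', and when the
     * old value v comes back, hand v to the first requester and v + a to
     * the second -- the same results as if the requests had gone separately. */
    void combine(atomic_int *addr, int a, int b, int *r1, int *r2) {
        int v = fetch_and_add(addr, a + b);   /* one memory access instead of two */
        *r1 = v;                              /* first requester sees v           */
        *r2 = v + a;                          /* second requester sees v + a      */
    }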


          EE 382 Processor Design        Winter 98/99        Michael Flynn   36
               Multiprocessing Summary
• Multi-Threaded
   – Area of research and potential future practical application
   – Driven by diminishing returns of single-threaded performance/cost and
     emerging programming environments
• Shared-Bus
   – Established mainstream technology for all but the most cost-sensitive
     applications
   – Building block for scalable MP
• Scalable
   – Technology in stages of advanced research and early adoption
   – Static, direct networks with low dimensionality are winning
• Massively Parallel
   – Remains the “holy grail”
      EE 382 Processor Design    Winter 98/99     Michael Flynn   37

				