

Diamonds are a Memory Controller’s Best Friend*

Dennis Abts (Google)
Natalie Enright Jerger (University of Toronto)
John Kim (KAIST)
Dan Gibson (University of Wisconsin)
Mikko Lipasti (University of Wisconsin)

*Also known as: Achieving Predictable Performance through Better Memory Controller
Placement in Many-Core CMPs, from ISCA ’09. Those responsible for the original title have been sacked.
Executive Summary
• On what tiles should memory controllers reside?
  – Three-tiered simulation approach
     • Heuristic-guided search
     • Detailed network simulation
     • Full-system simulation
• Diamond MC placement works well for on-chip
  meshes and tori
  – Diamonds minimize maximum channel load
  – Diamonds deliver lower and more predictable
    runtimes
Background
• Diverse on-chip communication
  – Cache-to-cache
  – LD/ST to Memory
  – Off-chip traffic (e.g., I/O)
• Processors/chip on the rise
  – Pins available for memory not rising as fast: Memory
    bandwidth becomes more precious
  – Reality: Many Cores, Few Memory Controllers
• Tiled architectures gaining popularity
  – Commonly employ on-chip meshes or tori
The Problem
• What Memory Controller placement is best
  overall?
  – Flip-chip packaging allows flexible escape routes
  – n tiles and m ports:
     • Don’t worry, there are only (n choose m) configurations!
     • Slight simplification: assume n = k^2 and m = 2k
       (the count is worked out in the sketch after this slide)
  – What are the characteristics of the best configuration?
     • Performance: Low runtime for a set of objective workloads
     • Throughput: Low latency as a function of offered load
     • Fairness: Similar (low) average memory latency across all nodes
     • Predictability: Low latency and runtime variance
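
As a quick check on the size of that search space, here is a minimal sketch (assuming the slide’s simplification n = k^2 and m = 2k; Python and the k range are purely illustrative):

```python
# Count the candidate MC placements C(n, m) with n = k^2 tiles and m = 2k ports.
from math import comb

for k in range(4, 9):
    n, m = k * k, 2 * k
    print(f"k={k}: C({n},{m}) = {comb(n, m):,} candidate placements")
```

For k = 5 this reproduces the 3,268,760 possibilities quoted on the exhaustive-search slide; by k = 8 the count is in the hundreds of trillions, which is what motivates the heuristic search tier.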
Baseline Placement: row0_7

• Ports to MCs located at top and bottom of chip
• Conceptually similar to real parts:
  – Tilera’s Tile64: 64 cores, 4 MCs (4 ports each, top/bottom of chip)
  – Intel TeraFLOPS: 80 cores, 2 MCs (8 ports each, top/bottom of chip)
• Congestion: X-dimension traffic encounters rows with memory controllers
Three-Tiered Approach

• Three tiers: link contention simulation, detailed network simulation, full-system simulation
  – Link contention simulation: more runs, shorter runtimes
  – Full-system simulation: more detail
Tier 0.5: Exhaustive Search
• It turns out (k^2 choose 2k) is tractable for k < 7
  – (At least on the link contention simulator – only 3,268,760 possibilities for k = 5)
[Figures: placements from the exhaustive search – “Patterns Emerge!” and “Another Contender”]
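
A minimal sketch of what this exhaustive tier amounts to, assuming evaluate_placement is a stand-in cost metric (the paper’s actual tool is its link contention simulator, not this code):

```python
# Enumerate every C(k^2, 2k) placement of MC ports on a k x k grid and keep the
# cheapest one under some cost metric (e.g., max channel load).
from itertools import combinations

def exhaustive_search(k, evaluate_placement):
    """evaluate_placement(placement, k) -> cost; lower is better."""
    tiles = range(k * k)                      # tiles numbered row-major
    return min(combinations(tiles, 2 * k),
               key=lambda placement: evaluate_placement(placement, k))
```

A cost function along the lines of the max-channel-load sketch shown after the link-contention results could serve as evaluate_placement.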
Tier 1: Heuristic-Guided Search
• k>6: Intractable to search all configurations
   – Use search heuristics and random search
• Genetic Algorithm:
   – Represent designs as a population of strings (Bit
     Vectors)
   – Generate new designs by combining members of the
     population via genetic crossover (Bit Selection)
   – Occasionally, mutate new population members (Swap
     adjacent bits)
   – Reduce population size by removing least-fit
     members – Survival of the Fittest
Genetic MC Placement
• Example population members (64-bit placement vectors for an 8x8 mesh, one bit per tile):
    0x00AA550000AA5500
    0x0000FF0000FF0000
    0x00AAF00000F25100
• Mutate (swap adjacent bits): 0x00AAF00000F25100 → 0x00AAF00000F25080
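
A minimal sketch of the encoding and operators these two slides describe: designs are k^2-bit placement vectors, crossover is per-bit selection, and mutation swaps adjacent bits. The repair step that keeps exactly 2k set bits after crossover, the population size, rates, and the cost function are illustrative assumptions, not the paper’s actual settings.

```python
# Genetic search over MC placements encoded as bit vectors (one bit per tile).
import random

K = 8
N_BITS = K * K        # 8x8 mesh: 64 tiles, one bit each
N_PORTS = 2 * K       # a valid design places exactly 2k MC ports

def random_design():
    bits = [1] * N_PORTS + [0] * (N_BITS - N_PORTS)
    random.shuffle(bits)
    return bits

def crossover(a, b):
    """Bit selection: each tile's bit comes from one parent, then the design is
    repaired so it still places exactly N_PORTS ports (repair is an assumption)."""
    child = [random.choice(pair) for pair in zip(a, b)]
    ones = [i for i, bit in enumerate(child) if bit]
    zeros = [i for i, bit in enumerate(child) if not bit]
    random.shuffle(ones)
    random.shuffle(zeros)
    while len(ones) > N_PORTS:            # too many ports: clear one
        child[ones.pop()] = 0
    while len(ones) < N_PORTS:            # too few ports: set one
        i = zeros.pop()
        child[i] = 1
        ones.append(i)
    return child

def mutate(bits):
    """Swap two adjacent bits (preserves the number of ports)."""
    i = random.randrange(N_BITS - 1)
    bits[i], bits[i + 1] = bits[i + 1], bits[i]

def evolve(cost, generations=1000, pop_size=32, mutation_rate=0.1):
    """cost(design) -> lower is better, e.g., max channel load."""
    population = [random_design() for _ in range(pop_size)]
    for _ in range(generations):
        child = crossover(*random.sample(population, 2))
        if random.random() < mutation_rate:
            mutate(child)
        population.append(child)
        # Survival of the fittest: drop the least-fit (highest-cost) member.
        population.remove(max(population, key=cost))
    return min(population, key=cost)
```

A channel-load metric like the one sketched after the next slide is the kind of cost function this search would minimize.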
Link Contention Results k=8
 Config.      Max Channel Load
              Mesh       Torus
 row0_7       13.5       9.25
 X            8.93       7.72
 Diamond      8.90       7.72


• GA selected Diamond as the most fit solution for 8x8
  – Minimizes MCs in a single row/column
  – Spreads DOR load
  – Sanity check: GA also prefers Diamond for 4x4, 5x5, and 6x6
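
To make “max channel load” concrete, here is a simplified sketch that assumes every core sends equal traffic to every MC port over XY dimension-order routing on a mesh; it is a stand-in for the paper’s link contention simulator and its traffic model, so its raw counts will not reproduce the normalized numbers in the table above.

```python
# Simplified DOR link-contention estimate on a k x k mesh.
from collections import defaultdict

def max_channel_load(mc_tiles, k):
    """Max number of core->MC routes crossing any single directed mesh link,
    under XY dimension-order routing (tiles numbered row-major, 0..k*k-1)."""
    load = defaultdict(int)                        # (from_tile, to_tile) -> routes
    for src in range(k * k):
        for dst in mc_tiles:
            x, y = src % k, src // k
            dx, dy = dst % k, dst // k
            while x != dx:                         # traverse X first
                nxt = x + (1 if dx > x else -1)
                load[(y * k + x, y * k + nxt)] += 1
                x = nxt
            while y != dy:                         # then Y
                nxt = y + (1 if dy > y else -1)
                load[(y * k + x, nxt * k + x)] += 1
                y = nxt
    return max(load.values())

# Example: a row0_7-style placement (all 16 ports on the top and bottom rows of an 8x8 mesh).
k = 8
row0_7 = list(range(k)) + [(k - 1) * k + c for c in range(k)]
print(max_channel_load(row0_7, k))
```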
Network Simulation: Open-Loop Evaluation
• Detailed simulation of all network events
  (buffers, links, etc.)
• Cores are Bernoulli injection processes, uniform
  random traffic
• Measure latency vs. offered load
     Parameters           Values
     Router latency       1 cycle (aggressive)
     Inter-router delay   1 cycle
     Buffers              32 flits per port
     Packet size          Request: 1 flit; Reply: 4 flits
     Virtual channels     4 (XY-YX routing)
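
The traffic-generation half of this setup is simple enough to sketch. The example below only models the Bernoulli injection process with uniform-random destinations over all tiles (one reading of “uniform random traffic”); the routers, buffers, and virtual channels live in the detailed network simulator, not here.

```python
# Open-loop traffic generator: Bernoulli injection, uniform-random destinations.
import random

def bernoulli_injections(offered_load, k, n_cycles, seed=0):
    """Yield (cycle, src, dst): each of the k*k cores independently injects a
    packet with probability offered_load on every cycle."""
    rng = random.Random(seed)
    for cycle in range(n_cycles):
        for src in range(k * k):
            if rng.random() < offered_load:
                yield cycle, src, rng.randrange(k * k)
```

Sweeping offered_load and averaging packet latency in the network model yields curves of the kind plotted on the next slide.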
Open-Loop Results

[Figure: Latency (cycles) vs. offered load (flits/cycle) for the row0_7, row2_5, Diamond, and X placements]
Closed-Loop Evaluation

• Each processor executes N memory operations
• Up to r operations outstanding at a time
  – Models MSHRs
• Uniform Random requests, and real request
  streams with ‘hot spot’ behavior
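
A toy model of one core’s side of this closed loop, assuming r MSHR-like slots and a latency callback standing in for the network-plus-memory round trip (the function name and default latencies are illustrative, not from the paper):

```python
# Closed-loop core model: issue n_ops memory operations, at most r in flight.
import heapq
import random

def run_core(n_ops, r, round_trip=lambda: random.randint(20, 60)):
    """Return the cycle at which the last of n_ops operations completes."""
    now, issued = 0, 0
    in_flight = []                        # min-heap of completion times
    while issued < n_ops or in_flight:
        while issued < n_ops and len(in_flight) < r:   # fill free MSHRs
            heapq.heappush(in_flight, now + round_trip())
            issued += 1
        now = heapq.heappop(in_flight)    # advance to the next completion
    return now
```

Running one such loop per core, with round-trip latencies drawn from each placement’s network, is the spirit of the completion-time distributions on the next slide.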
Closed-Loop Results

[Figure: Histogram of completion times (number of processors vs. completion time) for the Diamond and row0_7 placements]
Full System Results

[Figure: Average network latency for requests to the memory controller (cycles) vs. standard deviation, for the JBB, WEB, TPC-W, TPC-H, and TPC-W+H workloads under the Row0_7 and Diamond placements]
• Diamond placement yields lower latency and lower latency variance.
Conclusion
• MC Placement Matters!
  – Diamond reduces contention, improves latency, and
    reduces latency/runtime variance
  – X does fairly well
