Diamonds are a Memory Controller’s Best Friend*

Dennis Abts (Google)
Natalie Enright Jerger (University of Toronto)
John Kim (KAIST)
Dan Gibson (Univ of Wisconsin)
Mikko Lipasti (Univ of Wisconsin)

*Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs, from ISCA ’09. Those responsible for the original title have been sacked.

Executive Summary
• On what tiles should memory controllers reside?
  – Three-tiered simulation approach
    • Heuristic-guided search
    • Detailed network simulation
    • Full-system simulation
• Diamond MC placement works well for on-chip meshes and tori
  – Diamonds minimize maximum channel load
  – Diamonds deliver lower and more predictable runtimes

Background
• Diverse on-chip communication
  – Cache-to-cache
  – LD/ST to Memory
  – Off-chip traffic (e.g., I/O)
• Processors/chip on the rise
  – Pins available for memory not rising as fast: Memory bandwidth becomes more precious
  – Reality: Many Cores, Few Memory Controllers
• Tiled architectures gaining popularity
  – Commonly employ on-chip meshes or tori

The Problem
• What Memory Controller placement is best overall?
  – Flip-chip packaging allows flexible escape routes
  – n tiles and m ports:
    • Don’t worry, there are only $\binom{n}{m}$ configurations!
    • Slight Simplification: Assume $n = k^2$ and $m = 2k$
  – What are the characteristics of the best configuration?
    • Performance: Low runtime for a set of objective workloads
    • Throughput: Low latency as a function of offered load
    • Fairness: Similar (low) average memory latency across all nodes.
    • Predictability: Low latency and runtime variance

Baseline Placement: row0_7
• Ports to MCs located at top and bottom of chip
• Conceptually similar to real parts:
  – Tilera’s Tile64
    • 64 cores, 4 MCs (4 ports each, top/bottom of chip)
  – Intel TeraFLOPs
    • 80 cores, 2 MCs (8 ports each, top/bottom of chip)
[Figure: X-Dimension Traffic Encounters Congestion on Rows with Memory Controllers]

Three-Tiered Approach
• Link Contention Simulation
• Detailed Network Simulation
• Full System
[Figure: the tiers trade off "More Runs, Shorter Runtimes" against "More Detail"]

Tier 0.5: Exhaustive Search
• It turns out $\binom{k^2}{2k}$ is tractable for $k < 7$
  – (At least on the link contention simulator: only 3,268,760 possibilities for k=5)

Patterns Emerge!

Another Contender

Tier 1: Heuristic-Guided Search
• k>6: Intractable to search all configurations
  – Use search heuristics and random search
• Genetic Algorithm:
  – Represent designs as a population of strings (Bit Vectors)
  – Generate new designs by combining members of the population via genetic crossover (Bit Selection)
  – Occasionally, mutate new population members (Swap adjacent bits)
  – Reduce population size by removing least-fit members
  – Survival of the Fittest

Genetic MC Placement
  0x00AA550000AA5500   (parent)
  0x0000FF0000FF0000   (parent)
  0x00AAF00000F25100   (after crossover)
  0x00AAF00000F25080   (after Mutate)

Link Contention Results k=8

  Max Channel Load
  Config.    Mesh    Torus
  row0_7     13.5    9.25
  X          8.93    7.72
  Diamond    8.90    7.72

• GA Selected Diamond as most fit solution for 8x8
  – Minimizes MCs in a single row/column
  – Spreads DOR load
• Sanity Check: GA also prefers Diamond for 4x4, 5x5, and 6x6

Network Simulation: Open-Loop Evaluation
• Detailed simulation of all network events (buffers, links, etc.)
• Cores are Bernoulli injection processes, uniform random traffic
• Measure latency vs. offered load

  Parameters            Values
  Router latency        1 cycle (aggressive)
  Inter-router Delay    1 cycle
  Buffers               32-flit sized per port
  Packet size           Request: 1 flit; Reply: 4 flits
  Virtual Channels      4 (XY-YX routing)

Open-Loop Results
[Figure: Latency (cycles) vs. Offered load (flits/cycle) for row0_7, row2_5, Diamond, and X]

Closed-Loop Evaluation
• Each processor executes N memory operations
• Up to r operations outstanding at a time
  – Models MSHRs
• Uniform Random requests, and real request streams with ‘hot spot’ behavior

Closed-Loop Results
[Figure: Number of Processors vs. Completion Time for Diamond and row0_7]

Full System Results
[Figure: Average Network Latency (cycles) for Request to Memory Controller vs. Standard Deviation, for row0_7 and Diamond on JBB, WEB, TPC-W, TPC-H, and TPC-W+H]
• Diamond placement yields lower latency and lower latency variance.

Conclusion
• MC Placement Matters!
  – Diamond reduces contention, improves latency, and reduces latency/runtime variance
  – X does fairly well
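
Appendix sketch: link contention on a mesh

To make the link contention tier concrete, here is a minimal Python sketch. It is not the authors' simulator. It assumes a k x k mesh, uniform traffic in which every tile sends an equal share of request traffic to each MC and gets an equal-weight reply back, and XY dimension-order routing in both directions. The function names and the exact diamond coordinates are hypothetical choices for this example, so the absolute loads it produces need not match the max channel loads reported in the slides. It also recomputes the k=5 search-space size quoted for the exhaustive search.

```python
from collections import defaultdict
from math import comb

K = 8  # assumed mesh radix: K x K tiles, 2K memory controller (MC) ports


def xy_route(src, dst):
    """Yield the directed channels on an XY dimension-order route.

    Tiles are (row, col) pairs; the route moves along the row (X) first,
    then along the column (Y).
    """
    r, c = src
    dst_r, dst_c = dst
    while c != dst_c:
        nxt = c + (1 if dst_c > c else -1)
        yield ((r, c), (r, nxt))
        c = nxt
    while r != dst_r:
        nxt = r + (1 if dst_r > r else -1)
        yield ((r, c), (nxt, c))
        r = nxt


def max_channel_load(mcs, k=K):
    """Return the load on the busiest directed channel.

    Assumption: each tile spreads one unit of request traffic evenly over
    all MCs, and each MC returns a reply of the same weight.
    """
    load = defaultdict(float)
    share = 1.0 / len(mcs)
    for r in range(k):
        for c in range(k):
            for mc in mcs:
                for ch in xy_route((r, c), mc):  # request: tile -> MC
                    load[ch] += share
                for ch in xy_route(mc, (r, c)):  # reply: MC -> tile
                    load[ch] += share
    return max(load.values())


# Baseline from the slides: all MC ports in the top and bottom rows.
row0_7 = [(r, c) for r in (0, K - 1) for c in range(K)]

# Hypothetical diamond: two MCs per row and per column, widening toward
# the middle rows. The layout in the slides may differ in detail.
half = K // 2
diamond = []
for r in range(K):
    d = r if r < half else K - 1 - r
    diamond += [(r, half - 1 - d), (r, half + d)]

# Size of the exhaustive search space for k = 5 (25 tiles, 10 ports).
configs_k5 = comb(5 * 5, 2 * 5)

if __name__ == "__main__":
    print("configurations for k=5:", configs_k5)
    print("row0_7 max channel load:", max_channel_load(row0_7))
    print("diamond max channel load:", max_channel_load(diamond))
```

Under these assumptions the busiest channel for row0_7 is an X channel in an MC row, which is the congestion the baseline slide describes, and the diamond layout lowers the maximum by placing only two MCs in any row or column.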