Decoupled DIMM: Building High-Bandwidth Memory System Using Low-Speed DRAM Devices

Hongzhong Zheng (1), Jiang Lin (3), Zhao Zhang (2), and Zhichun Zhu (1)
  (1) Department of ECE, University of Illinois at Chicago
  (2) Department of ECE, Iowa State University
  (3) Austin Research Lab, IBM Corp.

                 Outline
Challenges in DRAM memory system designs
  Bandwidth, capacity, thermal and power
Motivation and background
Decoupled DRAM architecture
  Memory performance, cost, and/or power optimization
Experimental methodology
Result analysis
Conclusion



Challenges in DRAM memory system designs
Multi-core processors
  Increasing demands on memory:
    Bandwidth
    Capacity
    Power and thermal
Advancements in memory systems
  DDR/DDR2/DDR3, Rambus XDR
  FB-DIMM, MetaRAM, Registered DIMM

Memory Channel Design Challenges

             Channel   DRAM                                                         Example
             BW/CH     Device   4GB-x4-DR   4GB-x4-DR   I/O Roadmap   I/O Roadmap   Total       Total
             (GB/s)    (MT/s)   (W)         ($)         (1Gb)         (4Gb)         Power (W)   Cost ($)
  DDR2-667   5.3       667      10.8        83          2004          2005          65          498
  DDR2-800   6.4       800      12.9        109         2006          2007          78          654
  DDR3-800   6.4       800      8.0         133         2007          2008          48          800
  DDR3-1066  8.5       1066     9.9         180         2008          2009          59          1080
  DDR3-1333  10.6      1333     11          243         2009          2010          66          1458
  DDR3-1600  12.8      1600     N/A         N/A         2010          2011
  DDR3-2133  17        N/A      N/A         N/A         2012          2013

Example: 3-Channel, 24GB; Xeon 2.66GHz: $1000
Kingston 4GB registered ECC DIMM; power based on 2Gbit-x4 Micron device, 80% channel utilization

Expensive to build a high-bandwidth channel
  High-bandwidth channel -> costly and high power
  High-density DRAM device -> costly and late
Limited by DRAM device technology
  Channel bandwidth evolution ≤ DRAM device evolution

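The bandwidth and example columns of the table follow from simple arithmetic. The sketch below assumes an 8-byte (x64) data bus per channel and that the 24GB example system is built from six of the 4GB DIMMs. The DIMM count is an inference from the totals, not something the slide states, and the function names are illustrative. Some rows of the table differ from this arithmetic by a unit or two, presumably because the source figures are rounded.

```python
BYTES_PER_TRANSFER = 8  # assumption: x64 channel, 8 bytes moved per transfer


def channel_bw_gbs(data_rate_mts: float) -> float:
    """Peak channel bandwidth in GB/s for a data rate given in MT/s."""
    return data_rate_mts * BYTES_PER_TRANSFER / 1000.0


def system_total(per_dimm: float, dimms: int = 6) -> float:
    """Total power (W) or cost ($) for `dimms` identical DIMMs.

    The default of 6 DIMMs (24GB / 4GB) is an inferred assumption.
    """
    return per_dimm * dimms


# (data rate MT/s, per-DIMM power W, per-DIMM cost $), copied from the table
rows = {
    "DDR2-667": (667, 10.8, 83),
    "DDR3-1066": (1066, 9.9, 180),
}

for name, (rate, watts, dollars) in rows.items():
    print(name,
          round(channel_bw_gbs(rate), 1),   # GB/s per channel
          round(system_total(watts)),       # total power, W
          system_total(dollars))            # total cost, $
```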
Conventional Memory Channel Organization
[Figure: memory controller on a 1066MT/s (8.5GB/s) DDRx data and command bus, driving x64 ranks of 1066MT/s devices; 2 DIMMs/channel, 2 ranks/DIMM]
Channel speed is bound to DRAM device speed
  Rank BW = Channel BW
  Not necessary when there are multiple ranks per channel
Multiple ranks per channel
  ∑Ranks BW > Channel BW
  DRAM devices are NOT fully utilized
  Bandwidth bottleneck: the channel
∑Ranks BW (34GB/s) > Channel BW (8.5GB/s)

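The 34GB/s figure is the sum of the per-rank bandwidths on one channel. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def aggregate_rank_bw(rank_bw_gbs: float, dimms_per_channel: int,
                      ranks_per_dimm: int) -> float:
    """Sum of the peak bandwidths of all ranks sharing one channel (GB/s)."""
    return rank_bw_gbs * dimms_per_channel * ranks_per_dimm


channel_bw = 8.5                              # GB/s, DDR3-1066 channel
ranks_bw = aggregate_rank_bw(8.5, 2, 2)       # 2 DIMMs/channel, 2 ranks/DIMM

# Even a saturated channel can keep the ranks busy only this fraction of the time.
max_rank_utilization = channel_bw / ranks_bw

print(ranks_bw, max_rank_utilization)
```

With four ranks behind one channel, a conventional design can use at most a quarter of the aggregate rank bandwidth, which is the headroom the decoupled design targets.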
High Speed I/O Technology Available
[Figure: DRAM I/O bandwidth vs. high-speed I/O bandwidth (ITRS); x-axis: year, 1995-2025; y-axis: Gb/s/pin, 0-16; series: DRAM I/O, High-speed I/O; labeled points: 667Mb/s, 1333Mb/s, 6.4Gb/s, 6.4Gb/s, 11Gb/s, 15Gb/s]
↑High speed I/O > ↑DRAM speed
  Slow evolution of DRAM speed -> bottleneck for building a high-bandwidth memory channel
DRAM is optimized for capacity and cost, NOT for speed

Decoupled DIMM
High-bandwidth channel + low-speed DRAM devices?
  Memory channel design without the DRAM evolution bottleneck
  Benefits in performance, cost and/or power efficiency
Design considerations
  No changes to DRAM devices
Decoupled DIMM
  Adding a bridge chip (Synchronization Buffer) to each DIMM/rank
    Breaking unnecessary bandwidth matching
    Separating two clock domains: channel vs. DRAM
    Decoupling DRAM I/O technology from channel I/O technology

Decoupled DIMM Design
Building high bandwidth channel using low-speed DRAM devices
[Figure: memory controller on a single 2133MT/s DDR2/3 channel (DDRx data and command bus); requests pass through a synchronization buffer (SYB) in front of each x64 rank running at 1066MT/s/rank; Channel BW > Rank BW]
Synchronization buffer (SYB)
  Separating two clock domains
  Buffering data and command
  Introducing small latency penalty
Breaking BW matching
  Channel BW > Rank BW
  DDR3-1066 devices
  2133MT/s/channel
DRAM Freq. : Channel Freq.
  1:m   1:2, 1:3
  n:m   2:3, 3:5

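The DRAM-to-channel frequency ratios can be derived from the data rates. The sketch below relies on the fact that DDR data-rate labels are rounded multiples of 133.33 MT/s (1066 stands for 1066.67, 2133 for 2133.33). The helper names are illustrative, and the 800/1333 pairing is a made-up example of a 3:5 ratio rather than a configuration taken from the slides.

```python
from fractions import Fraction

BASE_MTS = Fraction(400, 3)  # 133.33 MT/s step between DDR speed grades


def nominal_rate(label_mts: int) -> Fraction:
    """Snap a rounded speed label (e.g. 1066) to its exact nominal data rate."""
    return round(Fraction(label_mts) / BASE_MTS) * BASE_MTS


def clock_ratio(dram_mts: int, channel_mts: int) -> tuple[int, int]:
    """DRAM frequency : channel frequency, reduced to lowest terms."""
    ratio = nominal_rate(dram_mts) / nominal_rate(channel_mts)
    return ratio.numerator, ratio.denominator


for dram, channel in [(1066, 2133), (1066, 1600), (800, 1333)]:
    n, m = clock_ratio(dram, channel)
    print(f"D{dram}-B{channel}: {n}:{m}")
```

Using exact fractions avoids the off-by-one artifacts that the rounded labels would otherwise introduce (1066/2133 is not exactly 1/2).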
Significantly Increasing Memory Throughput
[Figure: Channel Throughput and Rank Utilization for swim+applu+art+lucas; bars: D1066-B1066, D1066-B2133; left axis: Average Channel Throughput (GB/s), 0-20; right axis: Average Rank Utilization, 0%-50%]
Example: 2CH-2D-2R, DDR3-1066, Channel 1066MT/s vs. Channel 2133MT/s
Significantly improving memory throughput
  2 x Channel BW
  ↑88% throughput (6.7GB/s)
Increasing rank utilization
  22% (1066MT/s/CH)
  41% (2133MT/s/CH)

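The throughput and utilization numbers on this slide can be cross-checked against each other. The sketch below rests on two assumptions that the slide does not state: that 6.7GB/s is the absolute per-channel throughput gain corresponding to the 88% improvement, and that rank utilization is average channel throughput divided by the combined peak bandwidth of the four ranks on a channel.

```python
RANK_BW_GBS = 8.5        # x64 rank of DDR3-1066 devices
RANKS_PER_CHANNEL = 4    # 2 DIMMs/channel x 2 ranks/DIMM

gain_gbs = 6.7           # assumed: absolute per-channel throughput gain
gain_fraction = 0.88     # 88% improvement

# Back out the baseline and improved per-channel throughput.
base_throughput = gain_gbs / gain_fraction
new_throughput = base_throughput + gain_gbs

# Assumed definition of rank utilization (see lead-in).
peak_ranks_bw = RANK_BW_GBS * RANKS_PER_CHANNEL
base_utilization = base_throughput / peak_ranks_bw
new_utilization = new_throughput / peak_ranks_bw

print(f"{base_throughput:.1f} GB/s -> {new_throughput:.1f} GB/s")
print(f"{base_utilization:.0%} -> {new_utilization:.0%}")
```

Under these assumptions the result lands within about a percentage point of the 22% and 41% utilization figures on the slide, which suggests the numbers are mutually consistent.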
Benefits: Building high bandwidth channel using low-speed DRAM devices
  High performance with high bandwidth
    Channel BW > DRAM BW
  Low cost and high density
    Low-speed DRAM devices -> Low cost and high density
      High BW channel
  Power/energy efficiency
    Operating DRAM at low speed but keeping high channel BW
  More DIMMs per channel
    Reducing electrical load of each DIMM by buffering CMD/data
  Good Reliability
    Using standard voltage supply   High BW channel
Synchronization Buffer Design
[Figure: synchronization buffer block diagram. Bus side (x64): CMD/address from the DDRx bus, data to/from the DDRx bus. DRAM side: CMD/address, clock, and x8 data to/from the DRAM devices.]
Components
  DDRx data interface with bus
  DDRx control interface
  Delay/Phase-Locked Loop (DLL)
  Data interface with DRAM devices
  Control interface with DRAM devices
  Data/CMD entries inside SYB (RD, WR, CMD)

Memory Access Scheduling
[Figure: 2133MT/s channel & DDR3-1066 devices]
Two-level bus with SYB extends the data transfer time
  SYB relays command and data
  For example, DRAM devices : Channel = 1 : 2
  2 device cycles latency penalty = 1 cycle CMD delay + 1 cycle data delay

Power Saving of Decoupled DIMM with Given Channel Bandwidth
[Figure: DIMM Power Break Down of a Memory Intensive Workload (2CH-2D-2R-x8), swim+applu+art+lucas; y-axis: Average Power (Watt), 0-10, for a 2GB DIMM; bars: D1600-B1600 (22GB/s), D800-B1600 (20GB/s); stacked components: background, read/write operation, I/O with channel, SYB overhead; annotated values: ↓15%, 765mW, ↓24%, ↓31%, ↓23%]
Background
  Related to power state transition and power management policies
Operation
  Activation + precharge
  Read/write
I/O power
  Driving output + termination
SYB overhead

Energy Saving by Decoupled DIMM

                            1600MT/s Channel   1600MT/s Channel
                            & DDR3-1600        & DDR3-800         Comments
BW (MB/s/channel)           12800              12800              Same channel BW
Devices Freq. (MHz)         800                400                DRAM devices operating at low speed
Tpre, Tact, Tcol (ns)       13.75              15                 Small change in operation delay
Operating Cur. (mA)         120                90                 25% power reduction on each operation
Background: Active
  Standby Cur. (mA)         65                 50                 >23% power reduction on background, which applies most of the time
Tbl Data Burst Time (ns)    5                  10                 2 x data burst time with low-speed devices
Read/Write Cur. (mA)        250                130                Nearly half of read/write power
SYB Latency Overhead (ns)   0                  2.5                SYB latency overhead for one more I/O
SYB Power Overhead (mW)     0                  382/rank           SYB power overhead for one more I/O

Operation energy saving
  25% power reduction + slight change in operation delay
Background energy saving
  >23% power reduction, which applies most of the time

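Several entries in the table can be reproduced from the device data rate. The sketch below assumes the DDR3 burst length of 8 transfers and takes the latency settings listed in the simulation parameters (DDR3-1600: 11-11-11, DDR3-800: 6-6-6); the helper names are illustrative.

```python
def device_clock_mhz(data_rate_mts: float) -> float:
    """DDR moves two transfers per clock, so the clock is half the data rate."""
    return data_rate_mts / 2


def cycles_to_ns(cycles: int, data_rate_mts: float) -> float:
    """Convert a latency in device clock cycles to nanoseconds."""
    return cycles * 1000.0 / device_clock_mhz(data_rate_mts)


def burst_time_ns(data_rate_mts: float, burst_length: int = 8) -> float:
    """Time to move one burst; burst_length=8 is the assumed DDR3 burst."""
    return burst_length * 1000.0 / data_rate_mts


def reduction(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


print(cycles_to_ns(11, 1600), cycles_to_ns(6, 800))   # Tpre/Tact/Tcol
print(burst_time_ns(1600), burst_time_ns(800))        # data burst time
print(f"operating current:  {reduction(120, 90):.1%}")
print(f"standby current:    {reduction(65, 50):.1%}")
print(f"read/write current: {reduction(250, 130):.1%}")
```

The slower devices need fewer cycles for the same operation, which is why the operation delay changes only slightly (13.75ns to 15ns) while every current figure drops.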
Experimental Methodology
M5 + detailed memory performance and power simulator
Multiprogramming workloads formed from SPEC CPU2000
Power model based on the Micron power calculator
Power management policy
  Transitioning to low-power mode when the rank has had no pending requests for 7.5ns
  CC-Slow: cache line interleaving, close page mode, with precharge power-down slow as the low-power mode (128mW, 11.25ns exit latency)
  PO-Fast: page interleaving, open page mode, with active power-down as the low-power mode (578mW, 7.5ns exit latency)

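The power management policy can be sketched as a per-rank idle timer. This is a simplified illustration, not the simulator's actual implementation. It assumes the idle gap is measured from the moment the rank has no pending requests until the next request arrives, and that the exit latency is charged to that next request. The class and function names are hypothetical.

```python
from dataclasses import dataclass

IDLE_THRESHOLD_NS = 7.5  # idle time before entering the low-power mode


@dataclass(frozen=True)
class LowPowerMode:
    name: str
    power_mw: float         # power while in the low-power mode
    exit_latency_ns: float  # delay before the rank can serve requests again


CC_SLOW = LowPowerMode("precharge power-down slow", 128.0, 11.25)
PO_FAST = LowPowerMode("active power-down", 578.0, 7.5)


def idle_gap(gap_ns: float, mode: LowPowerMode) -> tuple[float, float]:
    """Return (ns spent in low-power mode, wake-up penalty in ns) for one gap."""
    if gap_ns <= IDLE_THRESHOLD_NS:
        return 0.0, 0.0  # the rank never powers down, so no penalty
    return gap_ns - IDLE_THRESHOLD_NS, mode.exit_latency_ns


print(idle_gap(5.0, CC_SLOW))    # short gap: stays active
print(idle_gap(20.0, CC_SLOW))   # long gap: deeper mode, longer exit latency
print(idle_gap(20.0, PO_FAST))   # long gap: shallower mode, shorter exit latency
```

The two configurations trade low-power-mode power against exit latency: CC-Slow saves more power per idle nanosecond but costs more on wake-up.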
Major Simulation Parameters

Parameters                Values
Processor                 4 cores, 3.2 GHz, 4-issue per core, 16-stage pipeline
Functional units          4 IntALU, 2 IntMult, 2 FPALU, 1 FPMult
IQ, ROB and LSQ size      IQ 64, ROB 196, LQ 32, SQ 32
Physical register num     228 Int, 228 FP
Branch predictor          Hybrid, 8K global + 2K local, 16-entry RAS, 4K-entry and 4-way BTB
L1 caches (per core)      64KB Inst/64KB Data, 2-way, 64B line, hit latency: 1-cycle Inst / 3-cycle Data
L2 cache (shared)         4MB, 4-way, 64B line, 15-cycle hit latency
MSHR entries              Inst: 8, Data: 32, L2: 64
Memory                    4/2/1 channels, 2 DIMMs/channel, 2 ranks/DIMM, 8 banks/rank, 1GB/rank
Memory controller         128-entry buffer, 15ns overhead
DDR3 channel bandwidth    800/1066/1333/1600 MT/s (mega transfers/s), 8 bytes/channel
DDR3 DRAM latency         DDR3-800: 6-6-6, DDR3-1066: 8-8-8, DDR3-1333: 10-10-10, DDR3-1600: 11-11-11

Workloads

Workload   Applications
MEM-1      swim, applu, art, lucas
MEM-2      fma3d, mgrid, galgel, equake
MEM-3      swim, applu, galgel, equake
MEM-4      art, lucas, mgrid, fma3d
MDE-1      ammp, gap, wupwise, vpr
MDE-2      mcf, parser, twolf, facerec
MDE-3      apsi, bzip2, ammp, gap
MDE-4      wupwise, vpr, mcf, parser
ILP-1      vortex, gcc, sixtrack, mesa
ILP-2      perlbmk, crafty, gzip, eon
ILP-3      vortex, gcc, gzip, eon
ILP-4      sixtrack, mesa, perlbmk, crafty

Multiprogramming workloads randomly selected from SPEC CPU2000
  MEM (memory-intensive)
  MDE (moderate)
  ILP (compute-intensive)
Simulation points are picked by SimPoint
Performance metrics
  Weighted Speedup
  Harmonic mean of normalized IPCs

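Both performance metrics are built from each program's normalized IPC, that is, its IPC in the multiprogrammed run divided by its IPC when running alone. The sketch below uses the commonly cited definitions and assumes they match the ones used in the evaluation; the IPC values are made up for illustration.

```python
def weighted_speedup(ipc_multi: list[float], ipc_single: list[float]) -> float:
    """Sum of per-program normalized IPCs (IPC_multi / IPC_single)."""
    return sum(m / s for m, s in zip(ipc_multi, ipc_single))


def harmonic_mean_normalized_ipc(ipc_multi: list[float],
                                 ipc_single: list[float]) -> float:
    """Harmonic mean of per-program normalized IPCs."""
    n = len(ipc_multi)
    return n / sum(s / m for m, s in zip(ipc_multi, ipc_single))


# Hypothetical IPCs for a four-program workload (not measured data).
ipc_single = [1.0, 2.0, 1.0, 2.0]    # each program running alone
ipc_multi = [0.5, 1.0, 0.75, 1.5]    # same programs sharing the memory system

print(weighted_speedup(ipc_multi, ipc_single))
print(harmonic_mean_normalized_ipc(ipc_multi, ipc_single))
```

Weighted speedup rewards total throughput, while the harmonic mean is pulled down by the slowest program and so also reflects fairness.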
Average Performance of Decoupled DIMM with Given DRAM Device
[Figure: Average Performance Impact of Decoupled DIMM with Different Memory Configurations; y-axis: Normalized Weighted Speedup, 0-2.5; series: D1066-B1066, D1066-B2133, D2133-B2133; groups: MEM, MDE, ILP for each of 1CH-2D-2R, 2CH-2D-2R, 4CH-2D-2R; annotated values: 1CH-2D-2R: MEM 79% and -10%, MDE 12%; 2CH-2D-2R: MEM 55% and -9%, MDE 5%; 4CH-2D-2R: MEM 25% and -8%, MDE 5%]

Trade-offs of Decoupled DIMM Design
[Figure: Performance Comparison of Decoupled DIMM Design with Conventional DDR3-1066/1333/1600 Design (2CH-2D-2R); y-axis: Normalized Weighted Speedup, 0-2.5; series: D1066-B1066, D1333-B1333, D1600-B1600, D1066-B2133, D1333-B2667; groups: MEM-1, MDE-1, ILP-1, MEM-AVG, MDE-AVG, ILP-AVG; annotated values near MEM-1: 111%, 83%, 47%, 19%; near MEM-AVG: 76%, 55%, 37%, 16%; toward MDE-AVG/ILP-AVG: 7%, "Small impact"]
D1066-B2133 vs. D1333-B1333: 36%
D1333-B2667 vs. D1600-B1600: 28%

Power and Performance Impact with Given Channel Bandwidth
[Figure: Power of Decoupled DIMM with Given Channel Bandwidth (2CH-2D-2R); y-axis: Power (Watt), 0-30; series: D1600-B1600, D1333-B1600, D1066-B1600, D800-B1600; groups: MEM-AVG, MDE-AVG, ILP-AVG; annotated values in group order: 16%, 10%, 8%]
[Figure: Performance of Decoupled DIMM with Given Channel Bandwidth (2CH-2D-2R); y-axis: Normalized Weighted Speedup, 0.8-1; same series and groups; annotated values in group order: -8.1%, -2.5%, -0.7%]

Performance Impact with Given System Bandwidth
[Figure: Performance Impact of Decoupled DIMM with 34GB/s System Bandwidth; y-axis: Normalized Weighted Speedup, 0.9-1; series: 34GB/s 4CH-2D-1R D1066-B1066, 34GB/s 2CH-2D-2R D1066-B2133; groups: MEM-AVG, MDE-AVG, ILP-AVG; annotated values in group order: -4.4%, -4.1%, -3.5%]

Related Work on Decoupled DIMM
 Novel memory architecture (most related work)
   Mini-Rank [Zheng:MICRO2008], Threaded Memory Module
   [Ware:ICCD2006], Fully-Buffered DIMM [Intel2005], Registered DIMM,
   MetaRAM [http://www.metaram.com]
 Memory system performance evaluation and analysis
   DRAM/RAMBUS [Burger:ISCA1996, Cuppu:ISCA1999,
   Cuppu:ISCA2001], FBD [Ganesh:HPCA2007]
 Memory access scheduling for performance and fairness
   Memory access reordering [McKee:HPCA1995, Rixner:ISCA2000,
   Hur:MICRO2004, Zhu:HPCA2005, Nesbit:MICRO2006,
   Mutlu:MICRO2007, Mutlu:ISCA2008, Ipek:ISCA2008]
 DRAM low-power mode optimizations
   Low-power mode management for optimizing background power
   [Lebeck:ASPLOS2000, Delaluz:HPCA2001, Fan:ISLPED2001,
   Delaluz:DAC2002, Huang:USENIX2003, Li:ASPLOS2004,
   Zhou:ASPLOS2004, Pandey:HPCA2006]

Decoupled DIMM Summary
Cost-effective high-bandwidth memory system design
 Building a high-bandwidth memory channel using low-speed DRAM devices
Significant benefits in performance, cost and power efficiency
 Given DRAM devices -> high bandwidth channel
 Given channel bandwidth -> power/energy saving
 Given system bandwidth -> cost effectiveness with fewer channels
Small changes
 Synchronization Buffer on DIMM
 DRAM device design untouched
 Small changes to memory request scheduling

				