

A Performance Comparison of Contemporary DRAM Architectures
Vinodh Cuppu, Bruce Jacob    Brian Davis, Trevor Mudge
    University of Maryland       University of Michigan
About the Authors

                              Trevor Mudge
               •Professor of EE and CS at University of
               •Ph.D: University of Illinois
                    •Comp. Systems design
                    •Parallel Processing
                    •Comp. Aided Design
                    •Impact of Technology on Comp.
About the Authors

                              Brian Davis
               •Professor of E & C Engineering at
               Michigan Technological University
               •Ph.D: University of Michigan, Nov 2000
               •M.S. in CE at University of Michigan, Nov
                    •New types of Hardware Description
                    Language; specifically to enable more
                    systematic methods for designing
                    powerful DRAM architectures.
About the Authors

                              Bruce Jacob
               •Professor of E & C Engineering at Institute
               for Advanced Comp. Studies at University
               of Maryland
               •Ph.D: University of Michigan, 1997
               •M.S. in CS & E at University of Michigan,
               Nov 1995
               •A.B. in Math, cum laude at Harvard
               University, 1988
               •Current Research:
                    •Energy usage and voltage scaling in
                    embedded systems
About the Authors

                              Vinodh Cuppu
               •Digital IC Logic Designer at
               Xtremespectrum, Inc.
               •M.S. in E & C Engineering at University of
               Maryland, Aug 2000
               •B.E. in E & Communication Engineering at
               University of Madras, India, May 1997
                    •Has published many well-regarded
                    papers on DRAM and continues to
                    model DRAM in different environments,
                    specifically to see if it could be used in
                    embedded applications
In response to the growing gap between processor speed and main
memory access time, many new DRAM architectures have been created.
This paper tests the performance of a representative set of these
architectures to see how well they respond to this trend.

The architectures tested are:
• Fast Page Mode
• Extended Data Out
• Synchronous
• Enhanced Synchronous
• Synchronous Link
• Rambus
• Direct Rambus
Conventional DRAM
1. What is the effect of improvements in DRAM technology on
   the memory latency and bandwidth problems?
2. Where is time spent in the primary memory system? What is
   the performance benefit of exploiting the page mode of
   contemporary DRAM?
3. How much locality is there in the address stream that reaches
   the primary memory system?
1. There is a one-time tradeoff between cost, bandwidth and latency…
• multiple DRAMs on same bus with bus optimizations (|request| ~> |transfer|)
   anything better requires faster bus and core

2. future bus technologies will expose row access time as the
   primary performance bottleneck…
• widening buses present a clearer view of locality, so row hits are vital

3. buses …cannot halve the latency of a bus half as wide
• even though the best latencies are seen from buses as wide as the L2
   cache, they aren’t quite cost effective

4. ...critical word first does not mix well with burst mode
• burst mode is likely to deliver unneeded data using a starting block out of
   address order

5. …the refresh mechanism used can significantly alter the average
   memory access time
• can add wait cycles to row and column access
Architectures: Fast Page Mode
                                • Holds row open after
                                first column is sent, in
                                optimistic hope that the
                                next access will be for
                                a different column in
                                the same row.
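
The bet that page mode makes can be illustrated with a toy model of a single bank. This is only a sketch: the cycle counts `t_rcd`, `t_cas` and `t_rp` are made-up values, not timings from the paper, and the closed-row policy is assumed to hide its precharge behind idle time.

```python
def total_latency(rows, policy, t_rcd=3, t_cas=3, t_rp=3):
    """Sum access latency (in cycles) for a sequence of row numbers in one bank.

    policy="open":   keep the row open after each access (page mode).
    policy="closed": precharge right after each access; the precharge is
                     assumed to be hidden, so every access pays t_rcd + t_cas.
    All timing values are hypothetical.
    """
    open_row = None
    total = 0
    for row in rows:
        if policy == "open":
            if row == open_row:
                total += t_cas                  # row hit: column access only
            else:
                if open_row is not None:
                    total += t_rp               # must close the old row first
                total += t_rcd + t_cas          # open new row, then column access
            open_row = row
        else:
            total += t_rcd + t_cas              # row always starts closed
    return total


same_row = [5, 5, 5, 5]
different_rows = [1, 2, 3, 4]
print(total_latency(same_row, "open"), total_latency(same_row, "closed"))
print(total_latency(different_rows, "open"), total_latency(different_rows, "closed"))
```

With these assumed timings the open-row policy wins when accesses stay in one row and loses when every access moves to a new row, which is the tradeoff page mode relies on.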
Architectures: Extended Data Out

• Added data latch holds column
data immediately after sensing. This
allows another transaction or a
refresh to begin as soon as the
column access is done
Architectures: Synchronous DRAM

• Often has a buffer so it can return data over multiple cycles per
request, making data available every clock cycle.

• Transmits on clock cycles, making timing strobes from the memory
controller unnecessary.
Architectures: Enhanced SDRAM and
                            Synchronous Link DRAM

Enhanced SDRAM
• faster internal timing
• SRAM row caches added to allow EDO-like behavior, namely the
ability to satisfy requests for the cached row while freeing the bank
up to do other things.

Synchronous Link DRAM
• open architecture, supplied by IEEE
• uses a packetized split request/response protocol
• most significantly, it can support multiple concurrent transactions (if
they reference unique banks)
Architectures: Rambus DRAM
• Uses a multiplexed address/data bus, so it limits communication to once every 4 cycles.
• Transmits on both the rising and falling clock edges, reaching a theoretical maximum of
600 Megabytes per second.
• Due to internal division of banks, up to 4 rows can remain open at once.
Architectures: Direct Rambus DRAM
•Faster core and transmission on both clock edges yields a
theoretical maximum bandwidth of 1.6 Gigabytes per second.
• Divided into 16 banks, employing 17 half-row buffers shared
between each pair, limiting the number of banks that can process
transactions in parallel but also reducing the product size.
•Uses a 3-byte-wide channel as opposed to Rambus’ single-byte-
wide channel and sends instructions over one byte width and
data over the other two.
• Most importantly, Direct Rambus does not multiplex its bus and
has its internal structures arranged in such a manner that it can
service up to 3 transactions at the same time.
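
The two peak bandwidth figures quoted for Rambus and Direct Rambus follow from simple arithmetic. The sketch below assumes clock rates of 300 MHz and 400 MHz; these are not stated on the slides and are chosen here because they reproduce the quoted peaks. The data widths (one byte for Rambus, two data bytes for Direct Rambus) come from the bullets above.

```python
def peak_mb_per_s(clock_mhz, data_bytes, transfers_per_clock=2):
    """Theoretical peak bandwidth in MB/s.

    transfers_per_clock=2 models transmission on both clock edges.
    """
    return clock_mhz * transfers_per_clock * data_bytes


# Clock rates are assumptions, not figures from the slides.
rdram_peak = peak_mb_per_s(clock_mhz=300, data_bytes=1)
drdram_peak = peak_mb_per_s(clock_mhz=400, data_bytes=2)
print(rdram_peak, drdram_peak)  # 600 1600
```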
  Methodology: Basis
• Extensions written for SimpleScalar, an aggressive out-of-order processor
simulator, so that it would model the DRAM architectures described.
• Much of the memory access time is overlapped with instruction execution in
SimpleScalar, so two extra simulations were run: one where bus transmission
was instantaneous and another where memory operation was instantaneous. The
following formulae were then applied to the results:
Tp = time processing, Tl = memory latency stalls, To = overlapped mem. access
Tu = exec. time with instantaneous bandwidth, Tm = total mem. access time
Tb = memory bandwidth stalls, T = total real execution time
    • Tl = Tu – Tp
    • Tb = T – Tu
     • To = Tp – (T-Tm)
Now memory access time can be
separated out into different categories of
stalls and the amount of time bandwidth
and latency were overlapped.
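
The three formulae can be applied mechanically to the outputs of the three simulation runs. The following is a minimal sketch; the function name, field names and the sample numbers are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass


@dataclass
class Breakdown:
    latency_stalls: float    # Tl
    bandwidth_stalls: float  # Tb
    overlapped: float        # To


def decompose(t_real, t_processing, t_unlimited_bw, t_mem):
    """Split execution time using the slide's formulae.

    t_real         = T   (total real execution time)
    t_processing   = Tp  (time processing, memory instantaneous)
    t_unlimited_bw = Tu  (execution time with instantaneous bandwidth)
    t_mem          = Tm  (total memory access time)
    """
    return Breakdown(
        latency_stalls=t_unlimited_bw - t_processing,   # Tl = Tu - Tp
        bandwidth_stalls=t_real - t_unlimited_bw,       # Tb = T - Tu
        overlapped=t_processing - (t_real - t_mem),     # To = Tp - (T - Tm)
    )


# Made-up numbers, in arbitrary time units.
b = decompose(t_real=100.0, t_processing=60.0, t_unlimited_bw=80.0, t_mem=55.0)
print(b)
```

In this example the pieces are consistent with each other: Tp + Tl + Tb equals T, and Tl + Tb + To equals Tm.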
Methodology: Simulated architecture
• Timing information for DRAM parts was found in technical reports.
• Ran the simulated L2 cache at speeds of 100ns, 10ns and 1ns, scaling the
CPU speed to match (CPU speed = 10x L2 speed).
• Simulated architecture:
         Processor: eight-way superscalar, out of order
         Caches: L1: Lockup-free split (64K/64K), 2-way set
                 associative with 64-byte linesizes
                  L2: unified 1MB, 4-way set associative with a 128-byte
                  linesize and write back, lockup-free, but only allows one
                  outstanding request at a time
• This represents a common workstation
of the time (1999).
Methodology: Balancing the architectures
• Since the request size is 8 times the transfer size in the simulated organization
chosen, DRAM access is a pipelined operation. The other DRAMs would gain an
unfair advantage over FPM and EDO DRAM, since neither is interleaved. The
authors therefore separately modeled interleaved versions that fill the memory
data bus as much as possible. These versions are labeled FPM3 and EDO2.
• FPM1 is ‘pessimistic’: it closes the accessed row and precharges immediately.
• FPM2 is ‘optimistic’: it holds the accessed row open and delays precharge.
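
The factor of 8 between request size and transfer size can be checked with a short calculation. This sketch assumes the 128-byte L2 line from the simulated architecture and the 100MHz 128-bit bus mentioned in the quoted parallel-channel result; the function names are illustrative.

```python
def transfers_per_request(line_bytes, bus_bits):
    """Number of bus transfers needed to move one cache line."""
    bus_bytes = bus_bits // 8
    return -(-line_bytes // bus_bytes)  # ceiling division


def burst_time_ns(line_bytes, bus_bits, bus_mhz):
    """Time to stream one cache line at one transfer per bus cycle."""
    cycle_ns = 1000.0 / bus_mhz
    return transfers_per_request(line_bytes, bus_bits) * cycle_ns


print(transfers_per_request(128, 128))   # 8
print(burst_time_ns(128, 128, 100))      # 80.0
```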

Bus Structure:
• SLDRAM, RDRAM, and DRDRAM all use narrower, higher-speed buses and
are simulated on a single-width bus in serial. This incurs an extra bit of latency
since the simulated memory controller has to coalesce bus packets into properly-
sized blocks to send over the common bus used for the rest of the simulations,
which is wider. To ameliorate this, transfer time over the narrow channel is taken
to be instantaneous.
Preliminary Results: Refresh Handling
• DRAM refresh can affect
performance dramatically
• All DRAMs but Rambus
have 64ms refresh time
• Rambus has a 33ms
refresh time and can
refresh internal banks
individually rather than an
entire matrix at a time.
• This is the basis for
observation 5.

• Since the time-interspersed scheme is so much better, it
was used for all the DRAMs. This puts all the architectures
on a more even footing.
Results: Total Execution time
                                                  • Interleaved DRAMs do
                                                  much better (FPM3 &
                                                  • Pessimistic FPM1 does
                                                  better than Optimistic
                                                  FPM2 since refresh takes
                                                  a little longer than row
                                                  • Are newer DRAMS
                                                  having trouble keeping up
                                                  with CPU speed?
                                                  • Is memory bandwidth
                                                  really the biggest
                                                  contributor to DRAM

A lot has been done to increase memory bandwidth, but what about latency?
Results: Performance breakdown
• FPM is the slowest
• Interleaving is good, as is pessimistic
• EDO uses basically the same technology as FPM, but is faster due
to better architecture
• SDRAM is faster still and ESDRAM is even better since it tweaks timing
and adds a SRAM cache to improve concurrency
• SLDRAM and Rambus have higher access time compared with
SDRAM and ESDRAM due to bus
• SLDRAM and RDRAM make twice as many data transfers as
DRDRAM, and if “…they had been organized… to put them on an even
footing with DRDRAM… their latencies would be 20 to 30% lower.”

“The parallel-channel results demonstrate the
failure of a 100MHz 128-bit bus to keep up with
today’s fastest parts.”
Results: Parallel channel DRAM and bandwidth
                                               • The parallel bus
                                               architectures (SLDRAM,
                                               RDRAM and DRDRAM)
                                               have a much larger
                                               proportion of their access
                                               time tied up in Bus
                                               Transmission Time.
                                               • Speeding up the bus
                                               would make these run
                                               faster, and has been done
                                               fourfold since this paper’s

• What is the effect though?
    • With Bus Transmission Time decreased,
    latency becomes the largest proportional
    slowdown… and fixing it is much harder.
Conclusion: Questions answered
1. Effect of DRAM improvements?
    • Bandwidth problem is being addressed, as newer architectures support
    multiple concurrent transactions, multiple concurrent accesses and/or
    multiple bus channels.
    • Latency is not being addressed and will become more of a problem.
2. Where is time spent?
    • Most time is spent in bus transmission, which needs to be improved.
  How much does Page Mode help?
    • There is a significant degree of locality in addresses accessed, so
    DRAMs that are internally multi-banked (so they have more than one row
    buffer) seem to do better… Page Mode then is useful, but other factors can
    get in the way. (FPM1 vs FPM2)
Conclusion: Questions answered
3. How much locality is there in the address stream?
    • Quite a lot, actually, but the effect doesn’t scale well with large buffers.

[Figure: Hits in Victim-Row FIFO Buffer for FPM DRAM]
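
One way to measure this kind of locality is to keep a small FIFO of recently closed rows and count how often a newly requested row is found in it. The sketch below is an assumption about how such a counter could work, not the authors' simulator code; the function name and the sample address stream are hypothetical.

```python
from collections import deque


def victim_fifo_hits(rows, depth):
    """Count accesses that miss the open row but hit a FIFO of closed rows."""
    fifo = deque(maxlen=depth)   # oldest victim row is dropped when full
    open_row = None
    hits = 0
    for row in rows:
        if row == open_row:
            continue             # ordinary row-buffer hit, not counted here
        if row in fifo:
            hits += 1
            fifo.remove(row)     # the row becomes the open row again
        if open_row is not None:
            fifo.append(open_row)  # the closed row becomes a victim
        open_row = row
    return hits


stream = [1, 2, 1, 2, 3, 1]
print(victim_fifo_hits(stream, depth=1))  # 2
print(victim_fifo_hits(stream, depth=2))  # 3
```

Running the counter over the same stream with increasing `depth` shows how quickly the benefit of a larger buffer flattens out.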
Conclusion: Brass Tacks
• Bandwidth is a major slowdown for modern DRAM, but today we know that
this is easily fixable, as we have system buses of 400MHz. According to the
results in this paper, DRAM latency is now the big problem and it isn’t very easy
or cheap to fix. This paper’s prophecy has come to pass.
• The box of tricks is getting empty… Techniques like interleaving, multiple
transactions, multiple channels and other such bandwidth-dependent speedups
are becoming harder to find.
• Need to devise ways to improve latency now:
    • different technology?
    • better exploitation of locality (address prediction) ?
    • more internal division?
    • multilevel internal caches?
    • a different type of storage matrix?
The…. End?
Epilogue: The bleeding edge
        One of the newest made-for-mainstream memories, GDDR3

• Micron delivered the first samples of GDDR3 to Nvidia and ATI on August 8,
• The word is that ATI and Nvidia will have new top of the line graphics cards
out, exploiting the new GDDR3 DRAM, by Q4 2003 or Q1 2004.
• So, what does it do better?

    • .11 micron process (!)
    • On-Die Termination
    • Posted CAS
    • Variable Write Latency
    • 6.4 GBs data rate
    • operating voltage is half that of
Epilogue: Revolutionary or evolutionary?
Warning… high data rate, ready skepticism…

• On-Die Termination: drops out reflection caused by signals hitting their
terminals… safe.
• .11 micron process: fit more logic in a small area and drive it with less
power… safe
• Low operating voltage: pleasant effect of small process size… safe
• 6.4 GBs data rate, Variable Write Latency, Posted CAS, and clock rate of up
to 800 MHz… danger!
Epilogue: Revolutionary or evolutionary?
Posted CAS:
Adds latency cycles to the Column Select so READ/ACT commands (Row
Select) don’t collide with Column Selects to allow faster internal clocking.

Variable Write Latency:
Adds latency to the Write operation so that it doesn’t corrupt a Read operation
(RAW dependency). Write latency = CAS latency + AL (Posted CAS) - 1. This is not
quite as bad as it sounds: thanks to interleaving, it may only triple Write
latency instead of sextupling it!
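
The write latency rule above is easy to tabulate. The cycle counts in this sketch are hypothetical examples, not values from a GDDR3 datasheet.

```python
def write_latency(cas_latency, additive_latency):
    """Write latency in clock cycles: CAS latency + AL (Posted CAS) - 1."""
    return cas_latency + additive_latency - 1


# Hypothetical settings: (CAS latency, additive latency) in cycles.
for cl, al in [(3, 0), (5, 2), (5, 4)]:
    print(cl, al, write_latency(cl, al))
```

Each extra cycle of additive latency from Posted CAS adds one cycle to every write.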

High DRAM clock:
Makes all this additive latency necessary. Now
architects are intentionally adding latency.
Epilogue: Verdict?

Evolutionary… at best

• GDDR3 is apparently sending us in the wrong direction, pumping up
bandwidth at the expense of latency by upping the clock speed and adding
tweaks to make sure the data stays consistent.
• Bandwidth may be preferable over latency in a graphics processor to deliver
increased frame rate, but the GDDR2 architecture acted as a roadmap for the
DDR2 primary memory architecture.

• Are there any low-latency primary memory
architectures in development… and if so do any of
them have a chance at survival in the market?
• Ask me after December 8th.
  The End
(and this time I mean it)
