Cached DRAM for ILP Processor Memory Access Latency Reduction

Cached DRAM adds a small cache onto a DRAM chip to reduce average DRAM access latency. The authors compare cached DRAM with other advanced DRAM techniques for reducing memory access latency in instruction-level-parallelism processors.


Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang
College of William and Mary

As the speed gap between processor and memory widens, data-intensive applications such as commercial workloads increase demands on main memory systems. Consequently, memory stall time (both latency time and bandwidth time) can increase dramatically, significantly impeding the performance of these applications. DRAM latency, or the minimum time for a DRAM to physically read or write data, mainly determines latency time. The data transfer rate through the memory bus determines bandwidth time. Burger, Goodman, and Kägi show that memory bandwidth is a major performance bottleneck in memory systems.1 More recently, Cuppu et al. indicate that with improvements in bus technology, the most advanced memory systems, such as synchronous DRAM (SDRAM), enhanced SDRAM, and Rambus DRAM, have significantly reduced bandwidth time.2 However, DRAM speed has improved little. DRAM speed is a major factor in determining memory stall time, which significantly affects the performance of data-intensive applications such as commercial workloads.

In a cached DRAM, a small on-memory cache is added onto the DRAM core. The on-memory cache exploits the locality that appears on the main memory side. The DRAM core can transfer a large block of data to the on-memory cache in one DRAM cycle. This data block can be several dozen times larger than an L2 cache line. The on-memory cache takes advantage of the DRAM chip's high internal bandwidth, which can be as high as a few hundred gigabytes per second.

Hsu and Smith classify cached DRAM organizations into two groups: those where the on-memory cache contains only a single large line buffering an entire row of the memory array, and those where the on-memory cache contains multiple regular data cache lines organized as direct-mapped or set-associative structures.3 In a third class combining these two forms, the on-memory cache contains multiple large cache lines buffering multiple rows of the memory array, organized as direct-mapped or set-associative structures. Our work and other related studies belong to this third class.4-5


Cached DRAM improves memory access efficiency for technical workloads on a relatively simple processor model with small data caches (and in some cases, even without data caches).4-7 In a modern computer system, the CPU is a complex instruction-level parallelism (ILP) processor, and caches are hierarchical and large. Thus, a cached DRAM's architectural context has dramatically changed and evolved. Koganti and Kedem investigated the performance potential of cached DRAM in systems with ILP processors and found it effective.4 To further investigate the ILP effects and compare cached DRAM with other advanced DRAM organizations and interleaving techniques, we studied cached DRAM in the context of processors with full ILP capabilities and large data caches.

Conducting simulation-based experiments, we compare cached DRAM with several commercially available DRAM schemes. Our results show that the cached DRAM outperforms other DRAM architectures for these applications. Cached DRAM is effective not only with simple processors but also with modern ILP processors. Our focus is on further investigating the ILP effects.

DRAM background

Basic DRAM structure is a memory cell array, where each cell stores one bit of data as the charge on a single capacitor. Compared with static RAMs, DRAMs have several technical limitations.8 The simple DRAM cell structure with one capacitor and one transistor makes the row access latency longer than that of SRAMs, which use multiple transistors to build each cell. Also, read operations are destructive to the original signals in DRAM cells, so the signals must be written back to the selected memory cells. In contrast, the signals in SRAM cells restore themselves after read operations. Finally, each DRAM cell must be refreshed periodically to recharge the capacitor. In contrast, SRAMs hold their data bits using flip-flop gate circuits, retaining memory content as long as the power is on.

SRAMs are fast but expensive due to their low density. DRAMs are relatively slow but offer high density and low cost. Designers have widely used DRAMs to construct the main memory for most types of computer systems. The only exceptions are some vector computer systems that use SRAMs. All contemporary workstations, multiprocessor servers, and PCs use DRAMs for main memory modules and SRAMs for cache construction.

DRAM access consists of row and column access stages. During row access, a row (also called a page) containing the desired data is loaded into the row buffer. During column access, data is read or written according to its column address. Repeatedly accessing data in the same row buffer requires only column access. However, if the next access goes to a different row in the memory array, the DRAM bank must be precharged before the next access. Periodically reading and writing back each row refreshes the DRAM bank. The DRAM is not accessible during precharge or refresh. The access time of a request is therefore not a constant: it depends on whether the access is a page hit, whether a precharge is needed, and whether the DRAM is being refreshed (see the latency sketch after the following list).

Several recent commercial DRAM variants can reduce latency and/or improve the data transfer rate.

• SDRAM. The data access operations are synchronized with the processor under an external clock. This variant supports burst-mode data access that reads or writes contiguously allocated data blocks in the same row sequentially, without idle intervals. SDRAMs normally have two or four independent data banks, providing an opportunity to overlap concurrent data accesses.
• Enhanced SDRAM. A single-line SRAM cache is integrated with each SDRAM memory bank's row buffer. If an access hits the buffer, the access time is equivalent to that of accessing the fast SRAM cache. ESDRAM can overlap memory precharging and refreshing operations with cache accesses.
• Rambus DRAM. A high-speed but narrow (1-byte-wide) bus connects the processor and multiple memory banks. This bus is multiplexed for transferring addresses and data. Both edges of the bus clock signal transfer data to double the data transfer rate. RDRAM memory banks can be independently accessed in a pipelining mode. Currently, RDRAMs support 8 or 16 banks.
• Direct Rambus DRAM. This is an improved version of RDRAM that provides a 1-byte-wide address bus and a 2-byte-wide data bus. Currently, DRDRAMs support 16 or 32 memory banks. Adjacent banks share half-page row buffers.
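To make the dependence of access time on row-buffer state concrete, the C sketch below classifies an access as a row-buffer hit, a row miss requiring precharge, or an access delayed by refresh, and adds up the corresponding delays. It is only an illustrative model: the timing constants are the ones listed later in Table 3, and the type and function names are ours, not part of any particular DRAM specification.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative DRAM timing model (values from Table 3, in nanoseconds). */
#define T_PRECHARGE  36
#define T_ROW_ACCESS 36
#define T_COL_ACCESS 24

/* Hypothetical per-bank state kept by a memory controller model. */
typedef struct {
    int  open_row;       /* row currently held in the row buffer, -1 if none */
    bool refreshing;     /* bank is busy with a refresh cycle                */
    int  refresh_left;   /* nanoseconds of refresh remaining                 */
} dram_bank;

/* Estimated latency of one access to `row` in `bank`. */
int dram_access_latency(dram_bank *bank, int row)
{
    int latency = 0;

    if (bank->refreshing)            /* wait out an ongoing refresh */
        latency += bank->refresh_left;

    if (bank->open_row == row) {     /* page hit: column access only */
        latency += T_COL_ACCESS;
    } else {
        if (bank->open_row != -1)    /* a different row is open: precharge first */
            latency += T_PRECHARGE;
        latency += T_ROW_ACCESS      /* load the new row into the row buffer */
                 + T_COL_ACCESS;     /* then perform the column access       */
        bank->open_row = row;
    }
    return latency;
}

int main(void)
{
    dram_bank bank = { .open_row = -1, .refreshing = false, .refresh_left = 0 };
    printf("first access, row 5 : %d ns\n", dram_access_latency(&bank, 5));  /* row miss, idle bank */
    printf("second access, row 5: %d ns\n", dram_access_latency(&bank, 5));  /* page hit            */
    printf("access to row 9     : %d ns\n", dram_access_latency(&bank, 9));  /* precharge + row access */
    return 0;
}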





Cached DRAM adds a small SRAM cache onto the DRAM core. It takes advantage of the huge internal bandwidth existing in the DRAM core, so the cache block can be as large as a page. The processor usually has large caches, but the cache block is much smaller because of the limited bandwidth between the processor and the main memory. In general, a small cache with a large block size can have a miss rate comparable to that of a large cache with a small block size.7 The cached DRAM has a higher cache hit rate than other DRAM architectures because of its fully associative cache organization. Figure 1 shows the general concept of cached DRAM.

[Figure 1. General concept of cached DRAM: the lower box shows the cached DRAM, in which an on-memory cache connected to the DRAM core interfaces with the memory bus. The hierarchy shown is CPU, L1 cache, L2 cache, memory bus, and then the on-memory cache and DRAM core inside the cached DRAM.]

Contemporary processors aggressively exploit ILP by using superscalar, out-of-order execution, branch prediction, and nonblocking caches. As a result, the processor may issue multiple memory requests before the previous request is finished. Although the processor can keep running before the memory requests are finished, its ability to tolerate long memory latency is limited. Cached DRAM can provide a fast response for a single memory request and pipeline multiple requests to achieve high throughput.

Cached-DRAM operations

A cached DRAM is an integrated memory unit consisting of an on-memory cache and a DRAM core. Inside the cached DRAM, a wide internal bus connects the on-memory cache to the DRAM core.

The processor sends memory requests to the memory controller when an L2 cache miss happens. To take advantage of the low access latency of the on-memory cache, the memory controller maintains the on-memory cache tags and compares each tag with the tag portion of the address for every memory request. For each on-memory cache block, the memory controller also maintains a dirty flag, which indicates whether the block has been modified after it is loaded from the DRAM core.

For the on-memory cache, a memory request is one of four types. In a read hit, the memory controller sends the on-memory cache a read command along with the block index and the column address via the address/command bus. This takes one bus cycle. The row address is not needed because the DRAM core is not accessed. The on-memory cache outputs the data in one bus cycle and sends it back to the memory controller and processor after another bus cycle. If the memory controller receives consecutive memory requests that are read hits, it processes them in a pipelined mode.

A read miss takes two processing steps. First, the row in the DRAM core that contains the data is read and transferred to the on-memory cache. Next, the data is read from the on-memory cache as if it were a read hit (a read miss can be overlapped with read hits). For the first step, the memory controller sends the DRAM a read command along with the row address and the replaced block index on the command/address bus. This step activates the row access of the DRAM core. The memory controller uses a modified least recently used (LRU) policy to find the block to replace.

A write hit employs a modified write-back policy in which data is only written into the on-memory cache. The memory controller sends the write command along with the block index and the column address onto the address/command bus, and sends the data onto the data bus. At the same time, the dirty flag is set for that block on the memory controller. The row address is not needed, because the DRAM core is not accessed. The processing of each write hit in a sequence can overlap with one another and with the processing of read requests.

In a write miss, the memory controller uses the modified LRU policy to select a block for replacement. The write-allocate policy is carried out in two steps. The memory controller sends the DRAM read command along with the block index and the row address onto the address/command bus. The row in the DRAM core that contains the write address is first read from the DRAM core and transferred to the on-memory cache. The next step writes the data into the on-memory cache, operated as a write hit. These two steps can overlap with other read or write requests. The sketch below summarizes how the memory controller classifies and handles these four request types.
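The dispatch logic just described can be restated in a short C sketch. The structure, function, and constant names here are ours and purely illustrative; the article does not define a concrete controller interface, and bus timing, request queuing, and the modified LRU policy (discussed below) are omitted or simplified.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS 16   /* number of on-memory cache blocks tracked (illustrative) */

/* Controller-side bookkeeping for one on-memory cache block. */
typedef struct {
    uint32_t tag;    /* DRAM row currently held in this block */
    bool valid;
    bool dirty;      /* modified since it was loaded from the DRAM core */
} omc_block;

static omc_block omc[NBLOCKS];
static int next_victim;   /* placeholder; the article uses a modified LRU policy */

/* Stub "bus commands"; a real controller would issue bus transactions here. */
static void cache_read(int b, uint32_t col)   { printf("cache read  blk=%d col=%u\n", b, (unsigned)col); }
static void cache_write(int b, uint32_t col)  { printf("cache write blk=%d col=%u\n", b, (unsigned)col); }
static void row_to_cache(uint32_t row, int b) { printf("DRAM row %u -> blk %d\n", (unsigned)row, b); }

/* Dispatch one L2-miss request to the cached DRAM. */
void handle_request(uint32_t row, uint32_t col, bool is_write)
{
    /* Tag comparison is done by the memory controller, not by the DRAM chip. */
    for (int b = 0; b < NBLOCKS; b++) {
        if (omc[b].valid && omc[b].tag == row) {               /* read or write hit */
            if (is_write) { cache_write(b, col); omc[b].dirty = true; }
            else          { cache_read(b, col); }
            return;
        }
    }
    /* Read or write miss: load the whole row into a victim block (step 1),
       then access the on-memory cache as if it were a hit (step 2). */
    int b = next_victim;                       /* round-robin stand-in for modified LRU */
    next_victim = (next_victim + 1) % NBLOCKS;
    row_to_cache(row, b);
    omc[b].tag = row; omc[b].valid = true; omc[b].dirty = false;
    if (is_write) { cache_write(b, col); omc[b].dirty = true; }
    else          { cache_read(b, col); }
}

int main(void)
{
    handle_request(7, 3, false);   /* read miss: row 7 loaded, then read    */
    handle_request(7, 4, false);   /* read hit in the on-memory cache       */
    handle_request(7, 5, true);    /* write hit: cache only, dirty flag set */
    return 0;
}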


Table 1 shows an example of pipelining three consecutive read hits; Table 2 shows an example of a read miss overlapped with two read hits. The pipelining operations of write hits and write misses are similar.

Table 1. Example of pipelining three consecutive read hits (R1, R2, and R3).*

                      Bus cycle
Location            |  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8
Address/command bus | R1 |    | R2 |    | R3 |    |    |    |
Cached DRAM data    |    |    | D1 | D1 | D2 | D2 | D3 | D3 |
Processor data      |    |    |    | D1 | D1 | D2 | D2 | D3 | D3

*An R1 on the address/command bus indicates that the first read's block index and column address are sent on the address/command bus. A D1 on the cached DRAM data indicates that a block of data for the first read is available in the cached DRAM. A D1 on the processor data means that a block of data for the first read is available for the processor. R2 and R3 correspond to the second and third read commands/addresses. D2 and D3 correspond to the data items for the second and third reads.

Table 2. Example of pipelining a read miss (R1) and two read hits (R2 and R3).*

                      Bus cycle
Location            |  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9
Address/command bus | R1 | R2 |    | R1 |    | R3 |    |    |    |
Cached DRAM data    |    |    |    | D2 | D2 | D1 | D1 | D3 | D3 |
Processor data      |    |    |    |    | D2 | D2 | D1 | D1 | D3 | D3

*The first R1 on the address/command bus indicates that the DRAM read command, the row address, and the block index are sent on the address/command bus. The second R1 indicates that a cache read command, the column address, and the block index are sent; R2 and R3 are signals of the two read hits on the bus. D1, D2, and D3 on the cached DRAM data indicate that the data for the reads are available on the cached DRAM. D1, D2, and D3 on the processor data mean the data for these reads are available for the processor.

The replacement policy of the on-memory cache is a modified LRU policy that avoids choosing a dirty block for replacement. If a dirty block were chosen for replacement, the block would have to be written back into the DRAM core first, which would increase the latency of the memory request that causes the replacement. To increase the number of clean blocks available for replacement, the memory controller schedules write-back requests for the dirty cache blocks as soon as the DRAM core is not busy.

ILP processors do not stall for a single L2 miss, so they may issue more memory requests while a previous request is being processed. An outstanding read request may prevent dependent instructions in the instruction window from being issued to execution units, which is likely to reduce the ILP or fill the instruction window. On the other hand, outstanding writes do not influence ILP processors as long as the write buffer is not full. Therefore, the memory controller should schedule read requests ahead of write requests. A sketch of this victim selection and request scheduling follows.
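The following C sketch captures those two policies: a modified LRU victim choice that prefers clean blocks, and a scheduler that serves reads before writes and writes dirty blocks back whenever the DRAM core is idle. It is an illustration under our own naming, not the authors' implementation; queue data structures and bus arbitration are abstracted away.

#include <stdbool.h>

#define NBLOCKS 16

typedef struct {
    bool valid, dirty;
    unsigned last_use;   /* timestamp of the most recent access */
} omc_state;

/* Modified LRU: choose the least recently used block, but prefer clean
   blocks so the replacement does not have to wait for a write-back. */
int pick_victim(const omc_state blk[NBLOCKS])
{
    int victim = -1;
    for (int pass = 0; pass < 2 && victim < 0; pass++) {
        bool allow_dirty = (pass == 1);     /* fall back to dirty blocks only if all are dirty */
        for (int b = 0; b < NBLOCKS; b++) {
            if (!blk[b].valid) return b;    /* free block: use it immediately */
            if (blk[b].dirty && !allow_dirty) continue;
            if (victim < 0 || blk[b].last_use < blk[victim].last_use)
                victim = b;
        }
    }
    return victim;
}

typedef struct { int pending_reads, pending_writes, dirty_blocks; } ctrl_state;

/* Which request goes to the on-memory cache next?  Reads come first because
   outstanding reads stall dependent instructions; writes only occupy the
   write buffer. */
enum cache_op { OP_READ, OP_WRITE, OP_NONE };
enum cache_op next_cache_op(const ctrl_state *s)
{
    if (s->pending_reads)  return OP_READ;
    if (s->pending_writes) return OP_WRITE;
    return OP_NONE;
}

/* Eager write-back: whenever the DRAM core is idle, a dirty block is written
   back so clean blocks stay available; a write-back uses only the DRAM core,
   so it can overlap with read or write hits in the on-memory cache. */
bool start_write_back(const ctrl_state *s, bool dram_core_idle)
{
    return dram_core_idle && s->dirty_blocks > 0;
}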
Experimental environment

We used SPEC95 and TPC-C benchmarks as the workloads, and SimpleScalar as the base simulator.9 We used the PostgreSQL (version 6.5) database system to support the TPC-C workload. This is the most advanced open-source database system for basic research. Researchers and manufacturers extensively use the SPEC95 benchmark to study processor, memory system, and compiler performance. This benchmark is representative of computing-intensive applications. We ran the complete set of SPECint95 and SPECfp95 in our experiment, using the precompiled SPEC95 programs in the SimpleScalar tool set.

TPC benchmarks represent commercial workloads, which are widely used by computer manufacturers and database providers to test, evaluate, and demonstrate their products' performance. TPC-C is an online transaction processing benchmark. It is a mixture of read-only and update-intensive transactions that simulate a complex computing environment where a population of terminal operators executes transactions against a database.

Simulations and architecture parameters

The SimpleScalar tool set is a group of simulators for studying interactions between application programs and computer architecture. In particular, the sim-outorder simulator emulates typical ILP processors with the features of out-of-order execution, branch prediction, and nonblocking caches. It produces comprehensive statistics of the program execution. We have modified sim-outorder to include cached DRAM and other types of DRAM architecture simulations. In addition to the on-memory cache, we emulated the memory controller and a bus with contention. We also considered DRAM precharge, DRAM refresh, and processor/bus synchronization in the simulation.





Table 3. Architectural parameters of our simulation.

Structure                        Parameter
CPU clock rate                   500 MHz
L1 instruction cache             32 Kbytes, two-way, 32-byte block
L1 data cache                    32 Kbytes, two-way, 32-byte block
L1 cache hit time                6 ns
L2 cache                         2 Mbytes, two-way, 64-byte block
L2 cache hit time                24 ns
Memory bus width                 32 bytes
Memory bus clock rate            83 MHz
On-memory cache block number     1 to 128
On-memory cache block size       2 to 8 Kbytes
On-memory cache associativity    Direct-mapped to fully associative
On-memory cache access time      12 ns
On-memory cache hit time         36 ns
On-memory cache miss time        84 ns
DRAM precharge time              36 ns
DRAM row access time             36 ns
DRAM column access time          24 ns

We used sim-outorder to configure an eight-way processor, set the load/store queue size to 32, and set the register update unit size to 64 in the simulation. The processor allows up to eight outstanding memory requests, and the memory controller accepts up to eight concurrent memory requests. Table 3 gives other architectural parameters of our simulation. We used the processor and data bus parameters of a Compaq Workstation XP1000. The on-memory cache access time is the same as that in Hart's paper.7 The on-memory cache hit time is the sum of the time for transferring the command/address to the cached DRAM (one bus cycle), the on-memory cache access time, and the time for the first chunk of data to be sent back (one bus cycle). The on-memory cache miss time is the sum of the time for transferring the command/address to the cached DRAM, the DRAM precharge time if the accessed memory bank needs precharge, the DRAM row access time, the time to transfer a row from the DRAM core to the on-memory cache (one bus cycle), the on-memory cache access time, and the time for the first chunk of data to be sent back.
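As a check on these definitions, the small program below adds up the components using the Table 3 values (one 83-MHz bus cycle is about 12 ns). The breakdown is our own reading of the text, shown only to make the 36-ns hit time and 84-ns miss time concrete; the miss case assumes no precharge is needed.

#include <stdio.h>

int main(void)
{
    /* Timing parameters from Table 3, in nanoseconds. */
    const int bus_cycle  = 12;   /* one 83-MHz memory bus cycle (~12 ns) */
    const int cache_acc  = 12;   /* on-memory cache access time          */
    const int row_access = 36;   /* DRAM row access time                 */

    /* Hit: command/address transfer + cache access + first chunk back. */
    int hit_time = bus_cycle + cache_acc + bus_cycle;

    /* Miss (no precharge): command/address transfer + row access +
       row transfer to the on-memory cache + cache access + first chunk back. */
    int miss_time = bus_cycle + row_access + bus_cycle + cache_acc + bus_cycle;

    printf("on-memory cache hit time  = %d ns\n", hit_time);   /* 36 ns */
    printf("on-memory cache miss time = %d ns\n", miss_time);  /* 84 ns */
    return 0;
}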
The RDRAM connects to the processor by a 1-byte-wide, high-speed bus. The DRDRAM connects by a 1-byte-wide address bus and a 2-byte-wide data bus, and the bus speed is 400 MHz. Data is transferred on both edges of the clock signal. For single-channel DRDRAM, the effective bandwidth is 1.6 Gbytes/s, which is not as large as the 2.6-Gbytes/s bandwidth of the bus used in our simulation. To make a fair comparison, we simulate the internal structure of the RDRAM and the DRDRAM, but set their bus simulation the same as for the other DRAMs. We will show that the advantage of cached DRAM lies in its on-memory cache structure, not in its bus connection. In fact, the cached DRAM could also be connected to the processor by a high-speed, narrow bus.

Overall performance comparison

We measured the memory access portion of cycles per instruction (CPI) of the TPC-C workload and all the SPEC95 programs. To show the memory stall portion in each benchmark program, we used a method similar to the one that Burger and Cuppu presented.1-2 We simulated a system with an infinitely large L2 cache to eliminate all memory accesses. The application execution time on this system is called the base execution time. We also simulated a system with a perfect memory bus as wide as the L2 cache line, which separates out the portion of the memory stall caused by the limited bandwidth. The CPI thus has three portions: the base, or number of cycles spent for CPU operations and cache accesses; the latency, or number of cycles spent accessing main memory; and the bandwidth, or number of cycles lost due to the limited bus bandwidth. The memory access portion of the CPI is the sum of the latency and bandwidth portions.
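The decomposition can be computed directly from the three simulated runs described above. The sketch below is an illustration of that bookkeeping (the run names and numbers are placeholders of ours, not measured results); it simply subtracts cycle counts, which is how we read the methodology borrowed from Burger and Cuppu.

#include <stdio.h>

/* CPI decomposition from three simulation runs of the same program:
 *   cpi_infinite_l2  - infinitely large L2 cache (no main-memory accesses)
 *   cpi_perfect_bus  - real memory, but a perfect bus as wide as an L2 line
 *   cpi_real         - the full system being evaluated
 * The numbers below are placeholders, not measured results.              */
int main(void)
{
    double cpi_infinite_l2 = 0.50;
    double cpi_perfect_bus = 0.72;
    double cpi_real        = 0.80;

    double base      = cpi_infinite_l2;                   /* CPU + cache cycles      */
    double latency   = cpi_perfect_bus - cpi_infinite_l2; /* main-memory access time */
    double bandwidth = cpi_real - cpi_perfect_bus;        /* lost to limited bus     */
    double mem_part  = latency + bandwidth;               /* memory access portion   */

    printf("base=%.2f latency=%.2f bandwidth=%.2f memory portion=%.2f\n",
           base, latency, bandwidth, mem_part);
    return 0;
}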
                                hit time is the sum of the time for transferring    CPI is the sum of the latency and bandwidth
                                the command/address to the cached DRAM              portions.
                                (one bus cycle), the on-memory cache access            We compared the cached DRAM with four
                                time, and the time for the first chunk of data       DRAM architectures: SDRAM, ESDRAM,
                                to be sent back (one bus cycle). The on-            RDRAM, and DRDRAM. We used the
                                memory cache miss time is the sum of the            TPC-C workload and eight SPECfp95 pro-
                                time for transferring the command/address to        grams: tomcatv, swim, su2cor, hydro2d,
                                the cached DRAM, the DRAM precharge                 mgrid, applu, turb3d, and wave5. We found
                                time if the accessed memory bank needs              that the memory access portions of the CPI
                                precharge, the DRAM row access time, the            of all SPECint95 programs and the two other
                                time to transfer a row from the DRAM core           SPECfp95 programs are very small. As a
                                to the on-memory cache (one bus cycle), the         result, the programs’ performances are not
                                on-memory cache access time, and the time           sensitive to the improvement of the main
                                for the first chunk of data to be sent back.         memory system. Although the memory access
                                   The RDRAM connects to the processor by           time reduction from using cached DRAM is


[Figure 2. Cycles per instruction for the TPC-C workload and selected SPECfp95 programs (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, and wave5). For each program, one bar shows the SDRAM and one the CDRAM, each decomposed into base, latency, and bandwidth portions.]



On-memory cache organizations

We investigated the effects of changing the cache size and the cache associativity on the performance of the TPC-C workload and the eight selected SPECfp95 programs. Our experiments show that a small cache block size is not effective for the on-memory cache. The miss rates for the TPC-C workload on a fully associative on-memory cache of 32 Kbytes with cache block sizes of 128 bytes, 256 bytes, and 512 bytes are 62, 36, and 22 percent, respectively. On the other hand, a small number of blocks is also not effective for the on-memory cache. The miss rate for su2cor on a fully associative on-memory cache having four blocks of 4 Kbytes is as high as 80 percent. When the number of cache blocks increases to eight, the miss rate is still more than 40 percent. Only after the number of cache blocks increases to 16 is the miss rate effectively reduced, to 5 percent. The experiments also show that the advantage of full associativity is significant: direct-mapped on-memory caches, even with many blocks, have high miss rates. This finding is important because most commercial DRAMs use the direct-mapped structure. Our study confirms the results reported by Koganti and Kedem.4 Because increasing the block size and the number of blocks increases the on-memory cache's space requirement on the memory chip, there is a trade-off between performance and cost. A fully associative on-memory cache of 16 × 4 Kbytes is very effective for all workloads. The on-memory cache miss rates of the TPC-C workload and six of the SPECfp95 programs are below 5 percent, and the miss rates of the two other SPECfp95 programs are below 20 percent. Therefore, we used this on-memory cache configuration in the following experiments.
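To make the organization trade-off concrete, the sketch below contrasts the address mapping of the chosen fully associative organization (16 blocks of 4 Kbytes, so a whole DRAM row is one block and any row may occupy any block) with a direct-mapped organization of the same size, in which each row can live in exactly one block. The field widths follow from the 4-Kbyte block size; the function names and placement rule are ours, purely for illustration.

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 4096u   /* 4-Kbyte on-memory cache block (one DRAM row) */
#define NUM_BLOCKS 16u     /* 16 x 4 Kbytes, the configuration used here   */

/* The block-sized row that a physical address belongs to. */
static uint32_t row_of(uint32_t addr) { return addr / BLOCK_SIZE; }

/* Fully associative: a row may be placed in any of the 16 blocks, so a hit
   requires comparing the row number against every block's tag.            */
bool fa_hit(const uint32_t tag[NUM_BLOCKS], const bool valid[NUM_BLOCKS], uint32_t addr)
{
    uint32_t row = row_of(addr);
    for (unsigned b = 0; b < NUM_BLOCKS; b++)
        if (valid[b] && tag[b] == row)
            return true;
    return false;
}

/* Direct-mapped: the low bits of the row number pick the only block the row
   may occupy, so rows sharing those bits evict each other (the behavior of
   conventional row buffers that leads to the high miss rates noted above). */
bool dm_hit(const uint32_t tag[NUM_BLOCKS], const bool valid[NUM_BLOCKS], uint32_t addr)
{
    uint32_t row = row_of(addr);
    unsigned b   = row % NUM_BLOCKS;        /* fixed placement */
    return valid[b] && tag[b] == row;
}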
Cached DRAM and SDRAM

Figure 2 presents the CPIs and their decompositions for the TPC-C workload and the eight SPECfp95 programs on both the cached DRAM and the SDRAM. CPI reductions using the cached DRAM range from 10 to 39 percent. The effectiveness of the cached DRAM for reducing CPI is mainly determined by the percentage of the memory access portion in each program's total CPI, which Table 4 lists. We show that the CPI reduction by the cached DRAM mainly comes from reducing the latency portion of the CPIs; the bandwidth portion of the CPIs is almost unchanged in each program.






Table 4. Memory access portion in CPI (cycles per instruction).

Program     Memory portion percentage
TPC-C       27
tomcatv     39
swim        47
su2cor      21
hydro2d     52
mgrid       37
applu       35
turb3d      20
wave5       15

Table 5. Reduction rate of the latency portion of cycles per instruction from using cached DRAM.

Program     Reduction rate percentage
TPC-C       62
tomcatv     84
swim        83
su2cor      75
hydro2d     83
mgrid       79
applu       87
turb3d      71
wave5       72



Table 5 presents the reductions of the CPI latency portions from using cached DRAM. For all selected SPECfp95 programs, cached DRAM reduces the CPI latency portion by more than 71 percent. The reduction rate for the TPC-C workload is 62 percent. The programs' latency reduction rates are mainly determined by the hit rates to the on-memory cache in the cached DRAM and to the row buffer in the SDRAM. For example, the on-memory cache hit rate of the cached DRAM for the TPC-C workload is 93 percent, while the row-buffer hit rate of the SDRAM is 50 percent. For tomcatv, the on-memory cache hit rate of the cached DRAM is 98 percent, whereas the row-buffer hit rate of the SDRAM is as low as 8 percent.

[Figure 3. Comparison of memory stall times for the TPC-C workload, tomcatv, swim, hydro2d, mgrid, and applu on SDRAM, CDRAM, ESDRAM, RDRAM, and DRDRAM. The performance values are normalized to the memory access portion of cycles per instruction in the SDRAM configuration.]

Cached DRAM and other DRAM architectures

Figure 3 shows the memory access portions of the CPIs of the TPC-C workload and five selected SPECfp95 programs on the cached DRAM and the DRAM variants. Cached DRAM outperforms the other DRAMs significantly on all programs. ESDRAM performs better than RDRAM and DRDRAM because of the low latency in accessing its on-memory cache. The cached DRAM outperforms the ESDRAM because the large number of blocks and the fully associative structure in the cached DRAM make the hit rate very high.

RDRAM and DRDRAM support high bandwidth by overlapping accesses among different banks. However, this is not very helpful in reducing access latency. Although both RDRAM and DRDRAM have many row buffers, the hit rates are still low because of the direct-mapped structure. In contrast, when the number of accesses to the DRAM core is very low, the cached DRAM acts almost as an SRAM main memory, providing both low latency and high bandwidth. As a result, the performance differences between the cached DRAM and the other DRAM architectures are large.

Our study shows that performance on RDRAM or DRDRAM can differ significantly among the programs shown in Figure 3. The worst performance is for mgrid, and the best is for applu. Both DRAMs perform better for programs that have many concurrent memory requests, because these DRAMs can effectively overlap this type of memory access. Figure 4 compares the distributions of concurrent memory requests by mgrid and applu on the SDRAM when the memory system is busy.


Our experiments show that the memory access concurrency degree of applu is much higher than that of mgrid. For example, 27 percent of the memory accesses in applu have a concurrency degree of 8, and the percentages of memory accesses with concurrency degrees of 3 to 7 range from 5 to 13 percent. In contrast, the majority of memory accesses in mgrid have concurrency degrees of 1 (48 percent of the total memory accesses) or 2 (29 percent of the total memory accesses). This explains why RDRAM and DRDRAM are more effective for applu than for mgrid, although both programs have comparable memory access portions in their CPIs, as shown in Table 4.

[Figure 4. Distribution of the number of concurrent memory requests (1 to 8) on the SDRAM system when the memory is busy: the mgrid program (a) has much lower concurrency than applu (b).]

Increasing the ILP degree

Commercial computer architectures extensively use wide-issue processors. When the ILP degree increases, the processor increases its demand on the main memory system. Thus, it is important to understand how cached DRAM performs as the ILP degree changes. We compared the performance of cached DRAM and SDRAM with four-, eight-, and 16-way-issue processors.

Figure 5 shows the CPI as the ILP degree changes. The CPI's base portion decreases proportionally for all programs as the ILP degree increases. This means that, with an ideal main memory system, the performance always improves as the ILP degree increases.



[Figure 5. Cycles per instruction and their decompositions (base, latency, and bandwidth) for the TPC-C workload and five SPECfp95 programs (tomcatv, swim, hydro2d, mgrid, and applu) as the ILP degree changes. For each program, SDRAM and CDRAM are compared at ILP degrees of 4, 8, and 16, with execution time per instruction normalized.]






memory access latency due to queuing effects. Consequently, the rate of instructions retiring from the instruction window decreases, reducing the speed at which new instructions enter the instruction window. For example, the instruction dispatch rate on the 16-way issue processor is 30 percent lower than that of the eight-way issue processor for hydro2d. The average time an instruction stays in the instruction window is 43 processor cycles for the eight-way issue processor, but it increases to 88 processor cycles for the 16-way issue processor.

In contrast, the cached DRAM performs consistently well as the ILP degree increases. Because the on-memory cache hit rate is high, the cached DRAM can support high-bandwidth and low-latency accesses. Thus, congestion at the memory system is not as severe as on SDRAM, and the memory stall time does not increase as the ILP degree increases from 8 to 16.

The processor's inability to effectively tolerate the long memory access latency caused by SDRAM will eventually offset the benefit of the computing time saved by increasing the ILP degree. In contrast, cached DRAM's effectiveness will continue to grow as the ILP degree increases much further.

Exploiting row buffer locality

Researchers have made efforts to exploit the locality in row buffers to reduce DRAM access latency and improve DRAM bandwidth utilization. For example, most contemporary DRAMs support both open and close page modes. In open page mode, the data in the row buffer is kept valid after the current access finishes, and only a column access is necessary for the next access as long as the data is still in the row buffer (a row buffer hit). The effectiveness of open page mode depends mainly on the hit rate to the row buffers. The structure of the row buffer is comparable to that of a direct-mapped cached DRAM.

Contemporary DRAM chips support increasingly many memory banks and row buffers. However, the row buffer hit rate is still low. Studies have shown that the low hit rate is directly related to the memory-interleaving scheme—that is, how physical memory addresses are mapped onto DRAM banks.10-11 To exploit row buffer locality, the memory space is usually interleaved page by page; this is known as page interleaving. All conventional interleaving schemes use a portion of the address bits as the bank index to determine to which bank an address is mapped. The bank index used by the page-interleaving scheme is a portion of the address bits used for the cache set index. Because of this connection between memory accesses and cache accesses, cache conflict misses will result in row buffer conflicts, and write backs will compete with current reads for the same row buffer.10

A permutation-based interleaving scheme uses two portions of the address bits to generate new bank indices. One portion is the bank index used in the conventional page-interleaving scheme, and the other portion comes from the address bits used for cache tags.11-12 Using these two portions as inputs, an XOR operator outputs a new bank index for the address mapping. The result is still a page-interleaving scheme, but the portion taken from the tags permutes the mapping of pages to banks. Consequently, accesses that cause row buffer conflicts in the conventional page-interleaving scheme are distributed to different banks without changing the locality in the row buffer. The results show that the scheme can significantly improve the row buffer hit rate.10-11
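The bank-index computation just described is simple enough to sketch in code. The C fragment below is a minimal illustration, not the authors' implementation; the page size, bank count, and tag-bit position are assumed values chosen only to make the example concrete. It shows how XORing the conventional bank index with an equally wide slice of the cache-tag bits spreads conflicting addresses across banks while leaving page (row) locality untouched.

/* Minimal sketch of conventional page interleaving versus the
 * permutation-based scheme described above. Field widths and bit
 * positions are illustrative assumptions, not the configuration
 * used in this article. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS  11   /* assume a 2-Kbyte row (page) per bank        */
#define BANK_BITS   5   /* assume 32 banks                             */
#define TAG_SHIFT  22   /* assume the cache-tag bit field starts here  */

/* Conventional page interleaving: the bank index comes directly from the
 * address bits just above the page offset, which overlap the cache set
 * index and therefore inherit cache conflicts. */
static unsigned bank_conventional(uint64_t paddr)
{
    return (unsigned)(paddr >> PAGE_BITS) & ((1u << BANK_BITS) - 1);
}

/* Permutation-based interleaving: XOR the conventional bank index with an
 * equally wide slice of the tag bits, so addresses that conflict in one
 * bank are spread over different banks without changing which bytes share
 * a row. */
static unsigned bank_permuted(uint64_t paddr)
{
    unsigned bank = bank_conventional(paddr);
    unsigned tag  = (unsigned)(paddr >> TAG_SHIFT) & ((1u << BANK_BITS) - 1);
    return bank ^ tag;
}

int main(void)
{
    /* Two addresses that map to the same bank under conventional page
     * interleaving but differ in their tag bits. */
    uint64_t a = 0x0040A800ULL, b = 0x1240A800ULL;
    printf("conventional banks: %u %u\n", bank_conventional(a), bank_conventional(b));
    printf("permuted banks:     %u %u\n", bank_permuted(a), bank_permuted(b));
    return 0;
}

With these sample addresses, the conventional mapping places both in the same bank, so alternating accesses to them would repeatedly evict each other's row from the row buffer, whereas the permuted mapping sends them to different banks.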


[Figure 6 chart: CPI reduction (percentage), from 0 to 40, for cached DRAM versus permutation-based page interleaving, across TPC-C, tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, and wave5.]

Figure 6. Cycles per instruction reduction for cached DRAM and the XOR-based interleaving scheme.



Because cached DRAM and the permutation-based interleaving scheme share the same objective of reducing conflicts in the on-memory cache or row buffers, it is necessary to compare their performance potential. Cached DRAM usually uses full or high associativity. The permutation-based interleaving scheme uses a special mapping to reduce row buffer conflicts without changing the direct-mapped structure. The cached DRAM approach has several advantages over the permutation-based interleaving scheme. Accesses to the on-memory cache are faster than accesses to the row buffer. Due to its high associativity, the on-memory cache hit rate is higher than the row buffer hit rate under the permutation-based scheme. The on-memory cache can also be accessed independently of the DRAM core: even if a DRAM bank is in precharge or in a row access, its cached data in the SRAM can be accessed simultaneously, whereas the data in the row buffer is lost as soon as the precharge starts. Finally, the cached DRAM organization allows more on-memory cache blocks than the number of banks, which is beneficial to DRAM systems with a limited number of banks. However, the permutation-based interleaving scheme requires little additional cost and does not require any change to the DRAM chip. In contrast, cached DRAM requires additional chip area for the SRAM cache and additional circuits for cache management.
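The contrast between the on-memory SRAM cache and a per-bank row buffer can be made concrete with a small model. The sketch below is illustrative only and does not reflect the actual cached DRAM circuit; the bank count, set count, and associativity are assumed values. Its point is that a row-buffer lookup is useful only while the bank still holds an open row, whereas the on-memory cache lookup ignores the bank's state entirely and can also provide more blocks than there are banks.

/* Illustrative model contrasting a per-bank row buffer with an on-memory
 * SRAM cache; all parameters are assumptions chosen for the sketch. */
#include <stdbool.h>
#include <stdint.h>

#define NBANKS 32
#define NSETS  16                 /* NSETS * WAYS blocks > NBANKS        */
#define WAYS    4

enum bank_state { BANK_IDLE, BANK_ROW_ACCESS, BANK_PRECHARGE };

struct dram_bank {
    enum bank_state state;
    int64_t open_row;              /* -1 once precharge destroys the row  */
};

struct on_memory_cache {
    uint64_t tag[NSETS][WAYS];
    bool     valid[NSETS][WAYS];
};

/* A row buffer can serve an access only if the bank is not precharging
 * and the requested row is the one currently open. */
static bool row_buffer_hit(const struct dram_bank *b, int64_t row)
{
    return b->state != BANK_PRECHARGE && b->open_row == row;
}

/* The SRAM cache is probed without consulting bank state, so a hit can be
 * served even while the corresponding bank precharges or opens a new row. */
static bool on_memory_cache_hit(const struct on_memory_cache *c,
                                uint64_t block_addr)
{
    unsigned set = (unsigned)(block_addr % NSETS);
    uint64_t tag = block_addr / NSETS;
    for (int w = 0; w < WAYS; w++)
        if (c->valid[set][w] && c->tag[set][w] == tag)
            return true;
    return false;
}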
Figure 6 compares the performance improvements of cached DRAM and the permutation-based interleaving scheme over SDRAM systems, using the same simulation configuration as in Table 3 with the following differences: the cached DRAM has a 16 × 4-Kbyte on-memory cache, whereas the permutation-based interleaving scheme uses 32 banks with 32 × 2-Kbyte row buffers. We show that cached DRAM reduces the CPI by 23 percent on average, while the average CPI reduction from the permutation-based scheme is 10 percent.

Our study provides three new findings: cached DRAM consistently improves performance as the ILP degree increases; contemporary DRAM schemes do not exploit memory access locality as effectively as cached DRAM; and compared with a highly effective permutation-based DRAM interleaving technique, cached DRAM still substantially improves performance.

Acknowledgments

We thank Jean-Loup Baer and Wayne Wong at the University of Washington for reading a preliminary version of this article and for their insightful and constructive comments and discussions. We also thank our colleague Stefan A. Kubricht for reading this article and providing comments and corrections. Finally, we thank the anonymous referees for their constructive comments on our work. This work is supported in part by the National Science Foundation under grants CCR-9812187, EIA-9977030, and CCR-0098055; the Air Force Office of Scientific Research under grant AFOSR-95-1-0215; and Sun Microsystems under grant EDUE-NAFO-980405.






References
1. D. Burger, J.R. Goodman, and A. Kägi, "Memory Bandwidth Limitations of Future Microprocessors," Proc. 23rd Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1996, pp. 78-89.
2. V. Cuppu et al., "A Performance Comparison of Contemporary DRAM Architectures," Proc. 26th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1999, pp. 222-233.
3. W.-C. Hsu and J.E. Smith, "Performance of Cached DRAM Organizations in Vector Supercomputers," Proc. 20th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1993, pp. 327-336.
4. R.P. Koganti and G. Kedem, WCDRAM: A Fully Associative Integrated Cached-DRAM with Wide Cache Lines, tech. report CS-1997-03, Dept. of Computer Science, Duke Univ., Durham, N.C., 1997.
5. W. Wong and J.-L. Baer, DRAM On-Chip Caching, tech. report UW CSE 97-03-04, Dept. of Computer Science and Engineering, Univ. of Washington, 1997.
6. H. Hidaka et al., "The Cache DRAM Architecture: A DRAM with an On-Chip Cache Memory," IEEE Micro, vol. 10, no. 2, Mar. 1990, pp. 14-25.
7. C.A. Hart, "CDRAM in a Unified Memory Architecture," Proc. 39th Int'l Computer Conf. (COMPCON 94), IEEE CS Press, Los Alamitos, Calif., 1994, pp. 261-266.
8. Y. Katayama, "Trends in Semiconductor Memories," IEEE Micro, vol. 17, no. 6, Nov./Dec. 1997, pp. 10-17.
9. D.C. Burger and T.M. Austin, The SimpleScalar Tool Set, Version 2.0, tech. report CS-TR-1997-1342, Dept. of Computer Sciences, Univ. of Wisconsin, Madison, 1997.
10. Z. Zhang, Z. Zhu, and X. Zhang, "A Permutation-Based Page Interleaving Scheme to Reduce Row-Buffer Conflicts and Exploit Data Locality," Proc. 33rd Ann. IEEE/ACM Int'l Symp. Microarchitecture (MICRO-33), IEEE CS Press, Los Alamitos, Calif., 2000, pp. 32-41.
11. W.-F. Lin, S. Reinhardt, and D.C. Burger, "Reducing DRAM Latencies with an Integrated Memory Hierarchy Design," Proc. 7th Int'l Symp. High-Performance Computer Architecture (HPCA-7), IEEE CS Press, Los Alamitos, Calif., 2001, pp. 301-312.

Zhao Zhang is a PhD candidate in computer science at the College of William and Mary. His research interests include computer architecture and parallel systems. He has a BS and an MS in computer science from Huazhong University of Science and Technology, China. He is a student member of the IEEE and the ACM.

Zhichun Zhu is a PhD candidate in computer science at the College of William and Mary. Her research interests include computer architecture and computer system performance evaluation. She has a BS in computer science from Huazhong University of Science and Technology, China. She is a student member of the IEEE and the ACM.

Xiaodong Zhang is a professor of computer science at the College of William and Mary. He is also the program director of the Advanced Computational Research Program at the National Science Foundation, Washington, D.C. His research interests include parallel and distributed systems, computer system performance evaluation, computer architecture, and scientific computing. He has a BS in electrical engineering from Beijing Polytechnic University, China, and an MS and PhD in computer science from the University of Colorado at Boulder. He is an associate editor of IEEE Transactions on Parallel and Distributed Systems and has chaired the IEEE Computer Society Technical Committee on Supercomputing Applications. He is a senior member of the IEEE.

Direct questions and comments about this article to Xiaodong Zhang, Dept. of Computer Science, College of William and Mary, Williamsburg, VA 23187-8795; zhang@cs.wm.edu.
