A Burst Scheduling Access Reordering Mechanism
Jun Shao and Brian T. Davis
Department of Electrical and Computer Engineering
Michigan Technological University
Abstract

Utilizing the nonuniform latencies of SDRAM devices, access reordering mechanisms alter the sequence of main memory access streams to reduce the observed access latency. Using a revised M5 simulator with an accurate SDRAM module, the burst scheduling access reordering mechanism is proposed and compared to conventional in order memory scheduling as well as existing academic and industrial access reordering mechanisms. With burst scheduling, memory accesses to the same rows of the same banks are clustered into bursts to maximize bus utilization of the SDRAM device. Subject to a static threshold, memory reads are allowed to preempt ongoing writes for reduced read latency, while qualified writes are piggybacked at the end of bursts to exploit row locality in writes and prevent write queue saturation. Performance improvements contributed by read preemption and write piggybacking are identified. Simulation results show that burst scheduling reduces the average execution time of selected SPEC CPU2000 benchmarks by 21% over conventional bank in order memory scheduling. Burst scheduling also outperforms Intel's patented out of order memory scheduling and the row hit access reordering mechanism by 11% and 6% respectively.

1. Introduction

Memory performance can be measured in two ways: bandwidth and latency. Memory bandwidth can be largely addressed by increasing resources: higher frequencies, wider busses, and dual or quad channels. However, reducing memory latency often requires reducing the device size. Although caches can hide the long main memory latency, cache misses may require hundreds of CPU cycles and cause pipeline stalls. Therefore main memory access latency remains a factor limiting system performance.

Due to their 3-D (bank, row, column) structure, modern SDRAM devices have nonuniform access latencies [3, 4]. The access latency depends upon the location of the requested data and the state of the SDRAM device. Two adjacent memory accesses directed to the same row of the same bank can be completed faster than two accesses directed to different rows, because the accessed row can be kept active for faster service of subsequent same-row accesses. In addition, accesses to unique banks can be pipelined, thus two accesses directed to unique banks may have shorter latency than two accesses directed to the same bank.

Main memory access streams are comprised of cache misses from the lowest level cache and have been shown to retain significant spatial and temporal locality even after being filtered by the cache(s). These characteristics of main memory traffic create a design space where parallelism and locality can be exploited by access reordering mechanisms to reduce the main memory access latency.

With aggressive out of order execution processors and non-blocking caches, multiple main memory accesses can be issued and outstanding while the main memory is serving a previous access. Compared with conventional in order access scheduling, memory access reordering mechanisms execute these outstanding memory accesses in an order which attempts to reduce execution time. By exploiting SDRAM row locality and bank parallelism, access reordering mechanisms significantly reduce the observed main memory access latency and improve the effective memory bandwidth. Access reordering mechanisms do not require a large amount of chip area and only need modifications to the memory controller.

This paper makes the following contributions:

• Studies and identifies performance contributions made by access reordering mechanisms.

• Introduces burst scheduling, which creates bursts by clustering accesses directed to the same rows of the same banks to achieve a high data bus utilization. Reads are allowed to preempt ongoing writes, while qualified writes are piggybacked at the end of bursts.

• Compares burst scheduling with conventional bank in order memory scheduling as well as existing access reordering mechanisms, including published academic and industrial out of order scheduling.

• Explores the design space of burst scheduling by using a static threshold to control read preemption and write piggybacking, and determines the threshold that yields the shortest execution time by experiment.

The rest of the paper is organized as follows. Section 2 discusses the background of modern SDRAM devices, memory access scheduling and related work. Section 3 introduces burst scheduling. Section 4 and Section 5 present the experimental environment and simulation results. Conclusions are drawn in Section 6 based on these results. Section 7 briefly discusses future work.
2. Background

Throughout the rest of this paper, the term access refers to a memory read or write issued by the lowest level cache. An access may require several transactions depending upon the state of the SDRAM devices.

Modern SDRAM devices store data in arrays (banks) which are indexed by row address and column address [3, 4]. An access to the SDRAM device may require three transactions besides the data transfer: bank precharge, row activate and column access. A bank precharge charges and prepares the bank. A row activate copies an entire row of data from the array to the sense amplifiers, which function like a cache. One or more column accesses can then access the row data.

Depending on the state of the SDRAM device, a memory access can be a row hit, row conflict or row empty, each experiencing a different latency. A row hit occurs when the bank is open and an access is directed to the same row as the last access to that bank. A row conflict occurs when an access goes to a different row than the last access to the same bank. If the bank is closed (precharged), a row empty occurs. Row hits only require column accesses, while all three transactions are required for row conflicts; therefore row hits have shorter latencies than row conflicts.

After completing a memory access, the bank can be left open or closed by a bank precharge, subject to the memory controller policy. One of two static controller policies, Open Page (OP) and Close Page Autoprecharge (CPA), typically makes this decision. Table 1 summarizes possible SDRAM access latencies given idle SDRAM busses, where tRP, tRCD and tCL are the timing constraints associated with bank precharge, row activate and column access respectively.

Table 1. Possible SDRAM access latencies

  Controller policy   OP                  CPA
  Row hit             tCL                 N/A
  Row empty           tRCD + tCL          tRCD + tCL
  Row conflict        tRP + tRCD + tCL    N/A

Multiple internal banks allow accesses to unique banks to be executed in parallel. A set of SDRAM devices concatenated to populate the system memory bus is known as a rank, which shares the address bus and data bus with other ranks. For systems that support dual channels, different channels have unique busses. Therefore parallelism exists in the memory subsystem between banks and/or channels.

2.1. Memory Access Scheduling

The memory controller, usually located on the north bridge or on the CPU die, generates the required transactions for each access and schedules them on the SDRAM busses. SDRAM busses are split transaction busses, therefore transactions belonging to different accesses can be interleaved.

[Figure 1. Memory access scheduling: (a) in order scheduling without interleaving; (b) out of order scheduling with interleaving. P: bank precharge, R: row activate, C: column access, Dx: data transfer of access x]

How the memory controller schedules the transactions of accesses impacts performance, as illustrated in Figure 1. In this example four accesses are to be scheduled. Access0 and access1 are row empties; access2 and access3 are row conflicts. The SDRAM device has timing constraints of 2-2-2 (tCL-tRCD-tRP) and a burst length of 4 (2 cycles with double data rate). In Figure 1(a), the controller performs these accesses strictly in order and does not interleave any transactions. It takes 28 memory cycles to complete the four accesses. While in order scheduling is easy to implement, it is inefficient because of the low bus utilization.

In Figure 1(b), the same four accesses are scheduled out of order. Access3 is scheduled prior to access1, which turns access3 from a row conflict into a row hit. The transactions of different accesses are also interleaved to maximize SDRAM bus utilization. As a result, only 16 memory cycles are needed to complete the same four accesses. It is possible for some accesses to see increased latency due to access reordering, however the average access latency should be reduced.
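To make Table 1 concrete, the following minimal sketch (illustrative Python, not the authors' code; the 5-5-5 timing values are the DDR2 PC2-6400 constraints used later in Section 4) classifies an access and returns its latency in memory cycles, assuming idle busses:

    # Hypothetical latency model following Table 1; bank_open_row is the row
    # currently held in the sense amplifiers, or None if the bank is precharged.
    T_CL, T_RCD, T_RP = 5, 5, 5       # column access, row activate, precharge (cycles)

    def access_latency(policy, bank_open_row, row):
        if policy == 'CPA' or bank_open_row is None:
            return T_RCD + T_CL        # row empty: activate, then column access
        if bank_open_row == row:
            return T_CL                # row hit: column access only
        return T_RP + T_RCD + T_CL     # row conflict: precharge, activate, access

Under these timings a row hit costs 5 cycles while a row conflict costs 15, which is the gap that access reordering mechanisms try to exploit.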
2.2. Related Work

Other access reordering mechanisms exist; they are usually designed for specific applications or hardware. Scott Rixner exploited the features of Virtual Channel SDRAM devices [11] and proposed various access reordering policies to achieve high memory bandwidth and low memory latency for modern web servers [12]. Proposed by Ibrahim Hur et al., the adaptive history-based memory scheduler tracks the access pattern of recently scheduled accesses and selects memory accesses matching the program's mixture of reads and writes [7]. Zhichun Zhu et al. proposed fine-grain priority scheduling, which splits and maps memory accesses onto different channels and returns critical data first, to fully utilize the bandwidth and concurrency provided by Direct Rambus DRAM systems [24]. Sally McKee et al. described a Stream Memory Controller system for streaming computations, which combines compile-time detection of streams with execution-time access reordering [9]. SDRAM access reordering and prefetching were also proposed by Jahangir Hasan et al. to increase row locality and reduce the row conflict penalty for network processors [6].

Other SDRAM related techniques, such as SDRAM address mapping, change the distribution of memory blocks in the SDRAM address space to exploit parallelism [19, 5, 23, 16]. A dynamic SDRAM controller policy predictor proposed by Ying Xu reduces main memory access latency by using a history based predictor, similar to a branch predictor, to decide whether or not to leave the accessed row open after each access [22].

3. Burst Scheduling

In a packet switching network, large packets can improve network throughput because the fraction of bandwidth used to transfer packet headers (overhead) is reduced. Considering the three SDRAM transactions as the overhead and the data transfer as the payload, the same idea can be applied to access scheduling to improve bus utilization.

[Figure 2. Burst scheduling: a single bank precharge (P0) and row activate (R0) form the overhead for a burst of back-to-back column accesses C0-C4 belonging to Access0-Access4 (the payload)]

As illustrated in Figure 2, the proposed burst scheduling clusters outstanding accesses directed to the same rows of the same banks into bursts. Accesses within a burst, except for the first one, are row hits and only require column access transactions. Data transfers of these accesses can be performed back to back on the data bus, resulting in a large payload and therefore improving data bus utilization. Increasing the row hit rate and maximizing the memory data bus utilization are the major design goals of burst scheduling.

Figure 3 shows the structure of burst scheduling. Outstanding accesses are stored in unique read queues and write queues based on their target banks. The read and write queues share a global access pool. A write data pool is used to store the data associated with writes. The queues can be implemented as linked lists. Depending upon the row index of the access address, a new read will either join an existing burst, or a new burst containing the single access will be created and appended at the end of the read queue. Bank arbiters select an ongoing access from the read or write queue of each bank. At each memory cycle a global transaction scheduler selects an ongoing access from all bank arbiters and schedules the next transaction of that access.

[Figure 3. Structure of burst scheduling: memory accesses enter per-bank queues drawn from a shared access pool, with write data held in a shared write data pool; per-bank arbiters (Bank0 ... BankN) feed the SDRAM transaction scheduler, which issues transactions to the devices]

Newly arrived accesses can join existing bursts while bursts are being scheduled. Bursts within a bank are sorted by the arrival time of their first access to prevent starvation of single access bursts or small bursts in the same bank. However, a large or growing burst can still delay small bursts from other banks. Therefore burst scheduling interleaves bursts from different banks to give relatively equal opportunity to all bursts. High data bus utilization can be maintained during burst interleaving; however, care must be taken to avoid bubble cycles due to certain SDRAM timing constraints, e.g. the rank-to-rank turnaround cycles introduced by DDR2 devices.

The following sections present the three subroutines of the algorithm used in burst scheduling, which can be transformed into finite state machines for incorporation into the SDRAM controller.
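Before detailing the subroutines, the sketch below models one bank's state from Figure 3 in Python. The class and field names are illustrative assumptions, not the authors' implementation, which would be realized in controller hardware:

    from collections import deque

    class Access:
        def __init__(self, is_read, bank, row, addr, arrival):
            self.is_read, self.bank = is_read, bank
            self.row, self.addr, self.arrival = row, addr, arrival

    class Burst:
        """Accesses to one row of one bank; all but the first are row hits."""
        def __init__(self, first):
            self.row = first.row
            self.accesses = deque([first])   # linked-list-like FIFO of accesses
            self.arrival = first.arrival     # bursts are ordered by first arrival

    class BankState:
        def __init__(self):
            self.read_bursts = []            # bursts kept sorted by arrival time
            self.write_queue = deque()       # writes kept in issue order
            self.ongoing = None              # access chosen by the bank arbiter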
3.1. Access Enter Queue Subroutine

Figure 4 shows the access enter queue subroutine, which is called when new accesses enter the queues. Because the write queue serves as a write buffer, all reads must search the write queue for possible hits, although write queue hits happen infrequently due to the small queue size. When a write queue hit occurs, a read is requesting the data at the same location as a preceding write. The data from the latest write (if there are multiple) will be forwarded to the read such that the read can complete immediately. Missed reads enter the read queue. If a read is directed to the same row as an existing burst, the read will be appended to that burst. Otherwise, a new burst composed of the read will be created and appended to the read queue. All writes enter the write queue in order and are completed from the view of the CPU.

subroutine AccessEnterQueue(access)
 1: if access is a read then
 2:   if hit in the write queue then
 3:     forward the latest write data to access
 4:     send response to access
 5:   else if found an existing burst in read queue then
 6:     append access to that burst
      else
 7:     create a new burst
 8:     append the new burst to read queue
      end if
    else
 9:   append access to the write queue
10:   send response to access
    end if

Figure 4. Access enter queue subroutine
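A behavioral sketch of this subroutine, continuing the illustrative Python model above (assumed names; a write queue hit returns the matching write so its data can be forwarded):

    def access_enter_queue(access, bank):
        """bank: a BankState; returns the forwarded write on a write queue hit."""
        if access.is_read:
            for w in reversed(bank.write_queue):     # latest matching write wins
                if w.addr == access.addr:
                    return w                         # forward data; read completes
            for burst in bank.read_bursts:           # join a burst to the same row
                if burst.row == access.row:
                    burst.accesses.append(access)
                    return None
            bank.read_bursts.append(Burst(access))   # new single-access burst
            return None
        bank.write_queue.append(access)              # writes complete immediately
        return None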
3.2. Bank Arbiter Subroutine

Each bank has one ongoing access: the access whose transactions are currently being scheduled but have not yet completed. The bank arbiter selects the ongoing access from either the read queue or the write queue, generally prioritizing reads over writes. Writes are selected only when there are no outstanding reads in the read queue, when the write queue is full, or when doing write piggybacking. The algorithm is given in Figure 5.

subroutine BankArbiter(ongoing access)
 1: if ongoing access == NULL then
 2:   if write queue is full then
 3:     ongoing access = oldest write in write queue
 4:   else if write queue length > threshold and
          last access was an end of burst and
          any row hit in write queue then
 5:     ongoing access = oldest row hit write
 6:   else if write queue is not empty and
          read queue is empty then
 7:     ongoing access = oldest write in write queue
      else
 8:     ongoing access = first read in next burst
      end if
 9: else if ongoing access is a write and
        read queue is not empty and
        write queue length < threshold then
10:   reset ongoing access
11:   ongoing access = first read in next burst
    end if

Figure 5. Bank arbiter subroutine

Two options, read preemption and write piggybacking, are available to the bank arbiter. Read preemption allows a newly arrived read to interrupt an ongoing write. The read becomes the ongoing access and starts immediately, reducing the latency of the read. Read preemption does not affect the correctness of execution; the preempted write will restart later.

The major functionality of the write queue, besides hiding the write latency and reducing write traffic [17], is to allow reads to bypass writes. When the write queue reaches its capacity, the main memory cannot accept any new access, causing a possible CPU pipeline stall. Write piggybacking is designed to speed up write processing by appending qualified writes at the end of bursts. The writes being appended must be directed to the same row as the burst, so that they will not disturb the continuous row hits created by the burst. If there are no qualified writes available, the next burst will start. Write piggybacking reduces the probability of write queue saturation and exploits the row locality of writes as well.

Read preemption and write piggybacking may conflict with each other, e.g. a piggybacked write may be preempted by a new read. A threshold is introduced to allow the bank arbiter to switch dynamically between read preemption and write piggybacking. When the write queue occupancy is less than the threshold, read preemption is enabled; otherwise, write piggybacking is enabled. Section 5.4 presents a detailed study of the threshold.
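In the same illustrative Python model (assumed names; the write queue capacity of 64 is taken from the baseline configuration in Section 4), the arbiter's decision could be sketched as:

    def bank_arbiter(bank, last_row, end_of_burst, threshold, capacity=64):
        if bank.ongoing is None:
            row_hits = [w for w in bank.write_queue if w.row == last_row]
            if len(bank.write_queue) == capacity:
                bank.ongoing = bank.write_queue[0]       # full: drain oldest write
            elif len(bank.write_queue) > threshold and end_of_burst and row_hits:
                bank.ongoing = row_hits[0]               # write piggybacking
            elif bank.write_queue and not bank.read_bursts:
                bank.ongoing = bank.write_queue[0]       # no reads outstanding
            elif bank.read_bursts:
                bank.ongoing = bank.read_bursts[0].accesses[0]
        elif (not bank.ongoing.is_read and bank.read_bursts
              and len(bank.write_queue) < threshold):
            bank.ongoing = bank.read_bursts[0].accesses[0]   # read preemption
        return bank.ongoing

The single threshold parameter is what lets one mechanism degenerate into pure read preemption (threshold at capacity) or pure write piggybacking (threshold of zero), matching the Burst_RP/Burst_WP equivalence noted in Section 5.4.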
3.3. Transaction Scheduler Subroutine

A transaction is considered unblocked when all required timing constraints are met. At each memory cycle the transaction scheduler selects, from all banks, one ongoing access whose next transaction is unblocked and schedules that transaction.

A static priority, shown in Table 2, is used to select the ongoing access containing the next unblocked transaction to be scheduled. Among all unblocked transactions, column accesses within the same rank as the last scheduled access have the highest priorities. Column accesses from different banks within the same rank are interleaved, so that bursts from different banks are equally served. High data bus utilization is maintained because interleaved accesses are still row hits. Bank precharge and row activate have the next highest priorities: they do not require data bus resources and can therefore be overlapped with column access transactions. The scheduler's priorities are set to finish all bursts within a rank before switching to another rank, to avoid the rank-to-rank turnaround cycles required by DDR2 devices. Column accesses from different ranks thus have the lowest priority. Within each category, read transactions always have higher priorities than write transactions. An oldest first policy is used to break ties.

Table 2. Transaction priority table (1: the highest, 8: the lowest)

                          Same bank   Same rank   Other ranks
  Read   Bank precharge       5           5            5
         Row activate         5           5            5
         Column access        1           2            7
  Write  Bank precharge       6           6            6
         Row activate         6           6            6
         Column access        3           4            8

Based on the scheduling priority in Table 2, the subroutine of the transaction scheduler is shown in Figure 6. When there are no unblocked transactions from any access, the scheduler will switch to the bank which has the oldest access and initiate an access from that bank in the next memory cycle.

subroutine TransactionScheduler(last bank, last rank)
 1: if last bank has unblocked col access then
 2:   schedule the unblocked col access
 3: else if any unblocked col access in last rank then
 4:   schedule the oldest unblocked col access
 5: else if any unblocked precharge or row activate then
 6:   schedule the oldest precharge or row activate
 7: else if any unblocked col access in other ranks then
 8:   schedule the oldest unblocked col access
    end if
 9: if access scheduled then
10:   if scheduled access has completed then
11:     send response to that access
      end if
12:   last bank = scheduled access's target bank
13:   last rank = scheduled access's target rank
    else
14:   last bank = the bank having the oldest access
15:   last rank = the rank having the oldest access
    end if

Figure 6. Transaction scheduler subroutine

The priority table and transaction scheduler are the core of burst scheduling: they maintain the structure of the bursts created by the bank arbiters while maximizing data bus utilization by aggressively interleaving transactions between accesses.
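One way to read Table 2 is as a key function over the set of unblocked candidate transactions. The sketch below (illustrative Python, not the authors' code) selects the transaction one cycle would schedule, breaking priority ties oldest first:

    PRI = {('column', True,  'same_bank'): 1, ('column', True,  'same_rank'): 2,
           ('column', False, 'same_bank'): 3, ('column', False, 'same_rank'): 4,
           ('precharge', True, None): 5, ('activate', True, None): 5,
           ('precharge', False, None): 6, ('activate', False, None): 6,
           ('column', True,  'other_rank'): 7, ('column', False, 'other_rank'): 8}

    def pick_transaction(candidates):
        """candidates: (kind, is_read, locality, arrival) tuples, all unblocked."""
        def key(c):
            kind, is_read, locality, arrival = c
            pri = PRI[(kind, is_read, locality if kind == 'column' else None)]
            return (pri, arrival)           # smaller arrival = older: breaks ties
        return min(candidates, key=key) if candidates else None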
3.4. Validation

Burst scheduling does not affect program correctness. Reads are checked against the write queues for hits before entering the read queues. If a read hits in the write queue, the latest data will be forwarded from the write to the read, so read after write (RAW) hazards are avoided. Within bursts, writes are always piggybacked after reads, which have previously checked the write queue for hits, avoiding write after read (WAR) hazards. When performing write piggybacking, the oldest qualified write is selected first. Writes are scheduled in program order within the same rows, therefore write after write (WAW) hazards are also avoided.

4. Experimental Environment

A revised M5 simulator [1] and the SPEC CPU2000 benchmark suite [18] are used in the studies of access reordering mechanisms. The M5 simulator is selected mainly because it supports a nondeterministic memory access latency. The revisions made to the M5 simulator include a detailed DDR2 SDRAM module, a parameterized memory controller, and the addition of the access reordering mechanisms described in this paper.

4.1. Benchmarks and Baseline Machine

Due to the page limitation, results from 16 of the 26 SPEC CPU2000 benchmarks are shown in Section 5, using the criterion that a selected benchmark exhibits more than 2% performance difference between in order scheduling and any out of order access reordering mechanism studied in this paper. While excluding non-memory-intensive benchmarks provides a better illustration of the impacts contributed by access reordering mechanisms, results from the complete suite can be found in [15]. Simulations are run through the first 2 billion instructions with reference input sets and precompiled little-endian Alpha ISA SPEC2000 binaries [2].

Table 3 lists the configuration of the baseline machine, which represents a typical desktop workstation in the near future, using bank in order memory access scheduling (BkInOrder). With BkInOrder, accesses within the same bank are scheduled in the same order as they were issued, while accesses from different banks are selected in a round robin fashion.
Table 3. Baseline machine configuration

  CPU                  4GHz, 8-way, 32 LSQ, 196 ROB
  L1 I-cache           128KB, 2-way, 64B cache line
  L1 D-cache           128KB, 2-way, 64B cache line
  L2 cache             2MB, 16-way, 64B cache line
  FSB                  64-bit, 800MHz (DDR)
  Main memory          4GB DDR2 PC2-6400 (5-5-5), 64-bit, burst length 8
  Channel/Rank/Bank    2/4/4 (a total of 32 banks)
  SDRAM row policy     Open Page
  Address mapping      Page Interleaving
  Access reordering    Bank in order (BkInOrder)
  Memory access pool   256 (maximal 64 writes)

4.2. Simulated Access Reordering Mechanisms

Besides BkInOrder scheduling, three existing access reordering mechanisms, RowHit, Intel and Intel_RP, are simulated and compared with burst scheduling. Table 4 summarizes all simulated access reordering mechanisms.

Table 4. Simulated access reordering mechanisms

  BkInOrder   In order intra banks, round robin inter banks
  RowHit      Row hit first intra bank, round robin inter banks [13]
  Intel       Intel's memory scheduling [14]
  Intel_RP    Intel's scheduling with read preemption
  Burst       Burst scheduling
  Burst_RP    Burst scheduling with read preemption
  Burst_WP    Burst scheduling with write piggybacking
  Burst_TH    Burst scheduling with threshold (52)

Proposed by Scott Rixner et al., RowHit scheduling uses a unified access queue for each bank. A row hit first policy selects the oldest access directed to the same row as the last access to that bank. Accesses from different banks are performed in a round robin fashion [13].

Intel's patented out of order memory scheduling features unique read queues per bank and a single write queue for all banks. Reads are prioritized over writes to minimize read latency. Once an access is started, it receives the highest priority so that it can finish as quickly as possible, reducing the degree of reordering [14]. Although not proposed in the patent, Intel's scheduling with read preemption (Intel_RP) allows reads to interrupt ongoing writes in a manner similar to the read preemption described in Section 3.2.

RowHit and Intel's scheduling both attempt to prioritize row hits; however, they employ a "best effort" mechanism in grouping row hit accesses. Without consideration of SDRAM timing constraints, bubble cycles can easily be introduced, leading to performance degradation. With the transaction priority table given in Table 2, which encompasses timing constraints and is extensible, burst scheduling guarantees that row hits within a burst are scheduled back to back, maximizing bus utilization. Additionally, different information is employed to make scheduling decisions at the memory access level and at the transaction level, making burst scheduling a two-level scheduler.

Burst scheduling with the various optimizations discussed in Section 3 is also evaluated. Burst_RP allows reads to preempt writes. Burst_WP piggybacks writes at the end of bursts. Burst_TH uses an experimentally selected static threshold to control read preemption and write piggybacking. A threshold of 52 obtains the best performance across the simulated benchmarks, as will be shown in Section 5.4.

5. Simulation Results and Analysis

Increasing row hits and reducing access latency are the major design goals of burst scheduling. Studies of access latency and SDRAM row hit rate illustrate the impacts of access reordering mechanisms and inspire further improvements to burst scheduling.

5.1. Access Latency

When a memory read access is issued to the main memory, all in-flight instructions dependent upon this read request are blocked until the requested data is returned. Write accesses, however, can complete immediately because no data needs to be returned. Therefore, one of the design goals of access reordering mechanisms is to reduce the read latency by postponing writes.

Figure 7 shows the average read latency and write latency obtained by all simulated access reordering mechanisms. All out of order access reordering mechanisms reduce the read latency by 26% to 47% compared to BkInOrder, while all write latencies except for RowHit are increased.

RowHit treats reads and writes equally, thus it reduces both read and write latency and achieves the lowest write latency among all access reordering mechanisms. Burst_RP has the lowest read latency because reads are not only prioritized over writes but also allowed to interrupt ongoing writes. Read preemption helps Intel's scheduling reduce read latency as well. Intel and Burst postpone writes, thus they both have long write latencies; read preemption makes the write latency of Intel and Burst even longer. On the other hand, write latency is greatly reduced by write piggybacking, because more row hits from writes are exploited.
[Figure 7. Access latency in memory cycles: (a) Read Latency, (b) Write Latency]

[Figure 8. Distribution of outstanding memory accesses for benchmark swim: (a) Outstanding Reads, (b) Outstanding Writes; x-axis: number of accesses, y-axis: percentage of time]

To better understand the relationship between read and write latency, Figure 8 shows the distribution of outstanding memory accesses for the benchmark swim, defined as the percentage of time that a given number of accesses are outstanding in the main memory.

RowHit slightly increases the number of outstanding accesses compared to BkInOrder to allow row hits to be served first. Intel and Burst have a large number of outstanding writes in the write queue due to postponed writes. Burst is more aggressive in prioritizing reads over writes than Intel. As a result, Intel and Burst cause write queue saturation 24% and 46% of the time respectively for the swim benchmark. Read preemption reduces the number of outstanding reads but causes the write queue to saturate more frequently; e.g., Burst_RP causes write queue saturation 70% of the time.

Prioritizing reads over writes can improve system performance; however, write queue saturation may result in CPU pipeline stalls. Write piggybacking is employed to empty writes from the write queue without causing an undue increase in read latency. Burst_WP causes write queue saturation only 2% of the time. Burst_TH with a threshold of 52 achieves a tradeoff between reducing read latency and preventing write queue saturation, resulting in a 9% write queue saturation rate. Burst_TH yields the best performance, as the following sections will show.

5.2. Row Hit Rate and Bus Utilization

Row hits require fewer transactions and have shorter latencies than row conflicts, as discussed in Section 2. Access reordering mechanisms usually select row hits first and turn potential row conflicts into row hits. More row hits may also become available as new accesses arrive. Figure 9(a) shows the row hit, row conflict and row empty rates averaged across all simulated benchmarks.

[Figure 9. Average row hit, row conflict and row empty rate and SDRAM bus utilization: (a) Row Hit/Conflict/Empty (percentage of accesses), (b) Bus Utilization (percentage of time, address bus vs. data bus)]

Out of order access reordering mechanisms increase the row hit rate. Among them, RowHit, Burst_WP and Burst_TH have the highest row hit rates. Intel and Burst without write piggybacking have lower row hit rates, although they are still better than BkInOrder. The reason is that, in contrast to Intel and Burst, which only search the read queues for row hits, RowHit, Burst_WP and Burst_TH seek row hits in both the read queues and the write queues.

With the static open page policy, most row empties happen after SDRAM auto refreshes, as banks are precharged. With read preemption, an ongoing write interrupted by a read may have precharged the bank while not yet having initiated the row activate. This causes the preempting read to be a row empty. Therefore Intel_RP, Burst_RP and Burst_TH have increased row empty rates.

The SDRAM bus utilization, the percentage of time that the bus is occupied, is shown in Figure 9(b). There is only a 3% difference in address bus utilization among all access reordering mechanisms, while the data bus utilization varies over a range from 31% to 42%. This confirms that the data bus is more critical than the address bus. Given the simulated DDR2 PC2-6400 SDRAM, Burst_TH achieves the highest data bus utilization of 42%. The effective memory bandwidth is increased from 2.0GB/s (BkInOrder) to 2.7GB/s (Burst_TH), a 35% improvement.
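These bandwidth figures line up with the utilizations above if the utilizations are read against a single PC2-6400 channel's 6.4 GB/s peak; the check below is an assumption-laden sanity calculation, not data from the paper:

    peak = 6.4                          # GB/s, peak rate of one PC2-6400 channel (assumed basis)
    print(round(peak * 0.31, 1))        # 2.0 GB/s at 31% data bus utilization
    print(round(peak * 0.42, 1))        # 2.7 GB/s at 42% data bus utilization
    print(round((2.7 - 2.0) / 2.0, 2))  # 0.35, i.e. the reported 35% improvement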
[Figure 10. Execution time of access reordering mechanisms, normalized to BkInOrder]

[Figure 11. Outstanding accesses for benchmark swim under various thresholds: (a) Outstanding Reads, (b) Outstanding Writes; series WP, TH16, TH32, TH40, TH48, TH56, RP; x-axis: number of accesses, y-axis: percentage of time]

5.3. Execution Time

Previous sections examined the access latency, row hit rate and bus utilization of the various access reordering mechanisms. Results in this section examine the execution time of each individual benchmark under these access reordering mechanisms. Execution times are normalized to BkInOrder and shown in Figure 10.

RowHit achieves an average 17% reduction in execution time compared to BkInOrder. Intel and Burst reduce the execution time by 12% and 14% respectively. Read preemption alone contributes another 3% improvement on top of Intel and Burst. Write piggybacking alone contributes a 5% improvement on top of Burst, resulting in a total 19% reduction in execution time for Burst_WP. Burst_TH, which combines read preemption and write piggybacking using a static threshold of 52, yields the best performance among all access reordering mechanisms, achieving a 21% reduction in execution time across all simulated benchmarks; this translates into a 6% improvement over RowHit, and 11% and 7% improvements over Intel and Intel_RP respectively.

Read preemption and write piggybacking have varied impact depending upon benchmark characteristics. For mcf, parser, perlbmk and facerec, read preemption contributes much greater performance improvement than write piggybacking. For the remainder of the benchmarks, write piggybacking generally results in more improvement than read preemption; notably, for gcc and lucas, Burst_WP achieves 14% and 18% reductions in execution time respectively. It is desirable to take advantage of both read preemption and write piggybacking to achieve maximal performance improvement. A static threshold is employed to control read preemption and write piggybacking; the next section shows how this threshold affects performance and how the optimal threshold is determined.

5.4. Burst Threshold

Read preemption and write piggybacking have been shown to perform well on some benchmarks but not on all benchmarks. Which technique has the greater impact on performance depends largely on the memory access patterns of the benchmark. For example, allowing a critical read with many dependent instructions to preempt an ongoing write may improve performance; however, completing the ongoing write may prevent CPU pipeline stalls due to a saturated write queue.

When the write queue has low occupancy, read preemption is desired to reduce read latency by allowing reads to bypass writes. When the write queue approaches capacity, write piggybacking can keep the write queue from saturation. Read preemption and write piggybacking can be switched dynamically based on the write queue occupancy: when the occupancy is less than a certain threshold, read preemption is enabled; otherwise, write piggybacking is enabled.

Using the same swim benchmark as in Section 5.1, the distribution of outstanding accesses under various thresholds is shown in Figure 11. Note that Burst_RP and Burst_WP are equivalent to Burst_TH64 and Burst_TH0 respectively, given that the write queue size is 64.

From Figure 11, Burst_RP has fewer outstanding reads than the other thresholds; however, the read latency of Burst_RP is slightly higher than the others. This is because when there are fewer reads in the read queue, there are fewer opportunities for row hits to occur. In order for burst scheduling to create larger bursts and increase row hits, the read queue should contain a number of outstanding reads, and these reads should be served at a rate that will not deplete the read queue too quickly.
As the threshold increases from 0 to 64, the peak number of outstanding writes increases as well. The write buffer saturation rate is below 7% when the threshold is less than 48. The saturation rate increases to 14% at threshold 56, then jumps to 70% at threshold 64 (Burst_RP). The earlier write piggybacking is enabled, the less frequently the write queue saturates.

[Figure 12. Access latency and execution time under various thresholds: (a) Execution Time (normalized to Burst), (b) Read Latency, (c) Write Latency]

To determine the threshold that yields the best performance, simulations with various thresholds are performed; the results are shown in Figure 12. The execution times are averaged across all benchmarks and normalized to Burst. As the threshold increases, read latency first decreases, because more reads experience shorter latencies by preempting writes. From threshold 40 onward, read latency starts increasing, mainly due to CPU pipeline stalls caused by increased occurrences of write queue saturation. Write latency increases with the threshold, as expected. Execution time is determined by both read latency and write latency. According to the results, a threshold of 52 yields the lowest execution time across the 16 simulated SPEC CPU2000 benchmarks.
6. Conclusions

Memory scheduling techniques improve system performance by changing the sequence of memory accesses to increase the row hit rate and reduce memory access latency. Inspired by existing academic and industrial access reordering mechanisms, the burst scheduling access reordering mechanism is proposed with the goal of improving upon existing access reordering mechanisms and addressing their shortcomings.

Using the M5 simulator with a detailed SDRAM module, burst scheduling is examined and compared to row hit scheduling and Intel's out of order memory scheduling. The performance contributions of read preemption and write piggybacking are studied and identified. The threshold that yields the best performance is determined by experiment. On selected SPEC CPU2000 benchmarks, burst scheduling with a threshold of 52 achieves an average 21% reduction in execution time across the 16 simulated benchmarks over in order scheduling. Burst scheduling also outperforms row hit scheduling and Intel's out of order scheduling by 6% and 11% respectively.

As SDRAM devices evolve, the timing parameters (tCL-tRCD-tRP) do not improve as rapidly as the bus frequency. For instance, DDR PC-2100 devices (133MHz) have typical timings of 2-2-2 (15ns-15ns-15ns), while DDR2 PC2-6400 devices (400MHz) have typical timings of 5-5-5 (12.5ns-12.5ns-12.5ns). From DDR PC-2100 to DDR2 PC2-6400, the bus frequency, and with it the bandwidth, improves by 200%, while the timing parameters are reduced by only 17%. Access latency, in terms of memory cycles, therefore increases; e.g., the row conflict latency grows from 6 cycles to 15 cycles. Increased main memory access latency leads to more performance improvement opportunities for memory optimization techniques. As the number of cycles consumed by the timing parameters increases in the future, the performance improvement provided by access reordering mechanisms will be even more significant than the simulation results presented in this paper.

Access reordering mechanisms will play a more important role with chip level multiprocessors, as the memory controller will have a larger number of outstanding main memory accesses from which to select. Access reordering mechanisms may also benefit from integrating the memory controller onto the CPU die. Due to the tighter connection between an integrated memory controller and the CPU, more instruction level information, such as the number of dependent instructions, is available to the controller. This information may be utilized to make intelligent scheduling decisions. An integrated memory controller can also run at the same speed as the CPU, making complex scheduling algorithms feasible.
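The quoted frequency and timing figures can be checked directly; the following is a worked sanity check of the numbers above, not new data:

    ddr_cycle_ns, ddr2_cycle_ns = 1e3 / 133, 1e3 / 400   # ~7.5 ns vs. 2.5 ns per bus cycle
    print(2 * ddr_cycle_ns, 5 * ddr2_cycle_ns)           # ~15.0 ns vs. 12.5 ns per parameter
    print(400 / 133 - 1)                                 # ~2.0, i.e. ~200% frequency gain
    print((15.0 - 12.5) / 15.0)                          # ~0.17, i.e. 17% timing reduction
    print(2 + 2 + 2, 5 + 5 + 5)                          # row conflict: 6 vs. 15 cycles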
7. Future Work

Burst scheduling with a static threshold works well on average; however, benchmarks have unique access patterns and therefore favor different thresholds. A dynamic threshold, calculated on the fly from critical parameters such as the read/write ratio, would match the access patterns of different benchmarks for further performance improvement.

Currently, reads inside bursts are scheduled in the same order as they were issued. Changing the order of accesses within a burst will not affect the effective memory bandwidth or the total time to complete the burst; however, critical accesses (i.e. those with many dependent instructions) may benefit from being scheduled first inside the burst. Integrating the SDRAM controller onto the CPU die makes more instruction level information, which can be used to perform intra-burst scheduling, obtainable to the scheduler. Similarly, the sequence of bursts within banks can also be changed to reduce the latency of critical data. Bursts could be sorted by criteria other than the arrival time of the first access of each burst as proposed here, e.g. by the size of bursts. However, care is required to prevent starvation when performing inter-burst scheduling.

Other SDRAM techniques such as address mapping [19, 5, 23, 16] change the distribution of memory accesses to increase the row hit rate. Access reordering mechanisms will benefit from the increased row hit rate. Studies of access reordering mechanisms working in conjunction with SDRAM address mapping are ongoing [15].

8. Acknowledgments

The work of Jun Shao and Brian Davis was supported in part by NSF CAREER Award CCR 0133777. We would also like to thank the reviewers for their comments and suggestions.

References

[1] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt. Network-Oriented Full-System Simulation using M5. In Proceedings of the Sixth Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), 2003.
[2] Chris Weaver. Pre-compiled Little-endian Alpha ISA SPEC CPU2000 Binaries.
[3] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. High-Performance DRAMs in Workstation Environments. IEEE Trans. Comput., 50(11):1133–1153, 2001.
[4] B. T. Davis. Modern DRAM Architectures. PhD thesis, Department of Computer Science and Engineering, the University of Michigan, 2001.
[5] W. fen Lin. Reducing DRAM Latencies with an Integrated Memory Hierarchy Design. In HPCA '01: Proceedings of the 7th International Symposium on High-Performance Computer Architecture, page 301, Washington, DC, USA, 2001. IEEE Computer Society.
[6] J. Hasan, S. Chandra, and T. N. Vijaykumar. Efficient Use of Memory Bandwidth to Improve Network Processor Throughput. In ISCA '03: Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 300–313, New York, NY, USA, 2003. ACM Press.
[7] I. Hur and C. Lin. Adaptive History-Based Memory Schedulers. In MICRO 37: Proceedings of the 37th Annual International Symposium on Microarchitecture, pages 343–354, Washington, DC, USA, 2004. IEEE Computer Society.
[8] J. Janzen. DDR2 Offers New Features and Functionality. DesignLine, 12(2), Micron Technology, Inc., 2003.
[9] S. A. McKee, W. A. Wulf, J. H. Aylor, M. H. Salinas, R. H. Klenke, S. I. Hong, and D. A. B. Weikle. Dynamic Access Ordering for Streamed Computations. IEEE Trans. Comput.
[10] Micron Technology, Inc. Micron 512Mb: x4, x8, x16 DDR2 SDRAM Datasheet, 2006.
[11] NEC. 64M-bit Virtual Channel SDRAM, October 1998.
[12] S. Rixner. Memory Controller Optimizations for Web Servers. In MICRO 37: Proceedings of the 37th Annual International Symposium on Microarchitecture, pages 355–366, Washington, DC, USA, 2004. IEEE Computer Society.
[13] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory Access Scheduling. In ISCA '00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 128–138, New York, NY, USA, 2000. ACM Press.
[14] H. G. Rotithor, R. B. Osborne, and N. Aboulenein. Method and Apparatus for Out of Order Memory Scheduling. United States Patent 7127574, Intel Corporation, October 2006.
[15] J. Shao. Reducing Main Memory Access Latency through SDRAM Address Mapping Techniques and Access Reordering Mechanisms. PhD thesis, Department of Electrical and Computer Engineering, Michigan Technological University, 2006.
[16] J. Shao and B. T. Davis. The Bit-reversal SDRAM Address Mapping. In SCOPES '05: Proceedings of the 9th International Workshop on Software and Compilers for Embedded Systems, pages 62–71, September 2005.
[17] K. Skadron and D. W. Clark. Design Issues and Tradeoffs for Write Buffers. In HPCA '97: Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture, page 144, Washington, DC, USA, 1997. IEEE Computer Society.
[18] Standard Performance Evaluation Corporation. SPEC CPU2000 V1.2, December 2001.
[19] R. Tomas. Indexing Memory Banks to Maximize Page Mode Hit Percentage and Minimize Memory Latency. Technical Report HPL-96-95, Hewlett-Packard Laboratories, June 1996.
[20] A. Wong. Breaking Through the BIOS Barrier: The Definitive BIOS Optimization Guide for PCs. Prentice Hall, 2004.
[21] W. A. Wulf and S. A. McKee. Hitting the Memory Wall: Implications of the Obvious. SIGARCH Comput. Archit. News, 23(1):20–24, 1995.
[22] Y. Xu. Dynamic SDRAM Controller Policy Predictor. Master's thesis, Department of Electrical and Computer Engineering, Michigan Technological University, April 2006.
[23] Z. Zhang, Z. Zhu, and X. Zhang. A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality. In MICRO 33: Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 32–41, New York, NY, USA, 2000. ACM Press.
[24] Z. Zhu, Z. Zhang, and X. Zhang. Fine-grain Priority Scheduling on Multi-channel Memory Systems. In HPCA '02: Proceedings of the 8th International Symposium on High-Performance Computer Architecture, page 107, Washington, DC, USA, 2002. IEEE Computer Society.