Design and Evaluation of an Optical CPU-DRAM Interconnect

					                                                                    Amit Hadke
                                                                  December 2009
                                                                Computer Science

  Design and Evaluation of an Optical CPU-DRAM Interconnect


                                    Abstract
   Four decades ago Amdahl proposed a set of rules of thumb for computer ar-
chitects that have withstood the test of time. One such rule of thumb is that a
balanced computing system should be capable of providing one byte of memory
and one byte per second of memory bandwidth for each instruction per second
of computation. Building balanced computing systems in the multicore era with
hundreds of processing cores per die is challenging because of the pin limitations
and poor scalability of bandwidth and memory capacity with off-chip electrical
interconnects between the CPU and memory subsystem.
   We propose using Wavelength Division Multiplexing (WDM)-based optical in-
terconnects between the CPU and the memory subsystem to overcome the prob-
lems of pin limitations and provide both high bandwidth and high memory capacity
simultaneously. We make use of concepts studied widely by long distance optical
networks such as dynamic wavelength allocation and bandwidth management in
our design. The main contributions of this thesis are: (a) a prototype design of
an optical interconnect for the CPU-DRAM interface that requires no modifications
to commodity DRAM devices; (b) a frame-based protocol for interfacing the CPU and
DRAM over WDM-based optical interconnects; and (c) algorithms for dynamically
allocating optical resources for better utilization and more concurrent operations.
We show that significant improvements in memory bandwidth and memory capacity
can be achieved by exploiting the wavelength-domain concurrency offered by
WDM-based interconnects.
Design and Evaluation of an Optical CPU-DRAM
                 Interconnect
                                  by
                             Amit Hadke
 B.E., Computer Science and Engineering (University of Pune, India) 2004


                                  THESIS
   Submitted in partial satisfaction of the requirements for the degree of

                       MASTER OF SCIENCE
                               in
                       COMPUTER SCIENCE

                            in the
                OFFICE OF GRADUATE STUDIES
                            of the
                 UNIVERSITY OF CALIFORNIA
                           DAVIS

                  Approved by the Committee in Charge:


                      DR. MATTHEW K. FARRENS

                     Dr. Matthew K. Farrens
           Committee Chair and Professor of Computer Science


                       DR. VENKATESH AKELLA

                          Dr. Venkatesh Akella
             Professor of Electrical and Computer Engineering


                            DR. S. FELIX WU

                            Dr. S. Felix Wu
                      Professor of Computer Science

                                   2009


To my parents: Mr. Ashok Hadke and Mrs. Manik Hadke




                          Acknowledgments

   I would like to express my gratitude to my advisors, Dr. Matthew Farrens
and Dr. Venkatesh Akella, for guiding this work with utmost interest, care and
patience. I am grateful to them for introducing me to the subject of memory
systems and giving me the freedom to explore my ideas. I also thank them for
teaching excellent courses on Computer Architecture that laid the foundation for
my research work.
   I would like to thank Dr. S. Felix Wu for finding the time to serve on my thesis
committee and providing his valuable feedback.
   I am thankful to my colleague, Tony Benavides, for his assistance and support.
It was a very nice experience working with him on our research papers. I would
also like to thank Dr. Rajeevan Amritharajah for his advice from time to time
and his contribution to our work in memory systems. I extend my thanks to Dr.
Ben Yoo and his students for explaining the latest developments in Silicon
Photonics and sharing information with us. Special thanks to Christopher Nitta
for sharing the data from his research work and helping us understand memory
system behavior. I would like to thank Prof. Bruce Jacob of the University of
Maryland and his team for their extraordinary textbook on Memory Systems and
their DRAM simulator (DRAMSim), which helped this project immensely.
   I owe special gratitude to my parents, Mr. Ashok Hadke and Mrs. Manik Hadke,
for supporting and motivating me during tough academic times. I would like to
thank my brother, Tejas Hadke, for being supportive during my two years of
graduate studies. Finally, I would like to express my gratitude to my friend
Ankush Garg, who helped me format this thesis.



Contents

  Abstract
  Acknowledgments
  Contents
  List of Figures
  List of Tables

1 Introduction
  1.1   High Performance Memory Systems
  1.2   Silicon Photonics
  1.3   Research Contribution
  1.4   Organization

2 FBDIMM - A brief overview
  2.1   FBDIMM Architecture
  2.2   FBDIMM Working
  2.3   FBDIMM Performance Key points

3 OCDIMM Architecture and Protocols
  3.1   OCDIMM Architecture
  3.2   OCDIMM Designs and Protocol
        3.2.1   OCDIMM-BASE
        3.2.2   OCDIMM Static Wavelength Assignment (SWA)
        3.2.3   OCDIMM Dynamic Wavelength Assignment (DWA)

4 Experimental Setup
  4.1   Experimental Methodology
        4.1.1   DRAM system simulator
        4.1.2   DRAM memories
        4.1.3   Workloads

5 Results and Discussions
  5.1   Theoretical Limitation on Maximum Memory Bandwidth
  5.2   Sustained Bandwidth Study
  5.3   Sensitivity Analysis
        5.3.1   Capacity Sensitivity Analysis
        5.3.2   Latency Sensitivity Analysis
  5.4   Latency Impact with DWA
        5.4.1   Why DWA improves latency?
  5.5   C-DWA and RIFF
  5.6   Power Analysis of OCDIMM

6 Related Work

7 Conclusion

References
List of Figures

 1.1   Optical ring resonator

 2.1   FBDIMM Internals

 3.1   OCDIMM Architecture
 3.2   Transaction Queue Insight
 3.3   Optical bus protocol design and variations

 5.1   Sustained Bandwidth Comparison
 5.2   Capacity Sensitivity Analysis: 16 wavelengths
 5.3   Capacity Sensitivity Analysis: 32 wavelengths
 5.4   Capacity Sensitivity Analysis: 64 wavelengths
 5.5   Capacity Sensitivity Analysis: 96 wavelengths
 5.6   Capacity Sensitivity Analysis: 128 wavelengths
 5.7   Latency Sensitivity Analysis: 16 wavelengths
 5.8   Latency Sensitivity Analysis: 32 wavelengths
 5.9   Latency Sensitivity Analysis: 64 wavelengths
 5.10  Latency Sensitivity Analysis: 96 wavelengths
 5.11  Latency Sensitivity Analysis: 128 wavelengths
 5.12  Latency impact on SPEC'06 benchmarks and OpenOffice session
 5.13  Latency impact on PARSEC benchmarks
 5.14  Latency impact on SPLASH-2 benchmarks
 5.15  RIFF and C-DWA
 5.16  Comparison of OCDIMM (OCD) and FBDIMM (FBD) Power Consumption
List of Tables

 4.1   Bochs Setup
 4.2   Simics Setup
 4.3   Benchmark Summary

 5.1   Summary of Capacity Sensitivity Analysis
 5.2   OCDIMM Power Model




Chapter 1

Introduction

Over the last five years, high performance computing has been moving toward the
many-core, Teraflop-scale computing era. According to a corollary of Moore's law,
the number of cores is expected to double every 18 months [1]. Today industry is
at 64 cores (Tilera Tile64 [2]), while academic research on multicore systems is
focusing on 256 or more cores. These numbers are expected to reach 256 for
industry and 1024 for research by 2011. Even everyday desktop computing is built
on a dual- or quad-core CPU (Central Processing Unit) with 4-8 Gigabytes of DRAM
(Dynamic Random Access Memory), while a typical medium-sized file server is
built with a quad-core CPU and 8-16 Gigabytes of DRAM. As the gap between
desktops and servers closes, trends in software design are changing. Even
desktop applications are now designed specifically for multicore systems with
large memory. Emerging Recognition, Mining, and Synthesis (RMS) applications
[3], such as virtual travel, financial analytics, 3D video, and video mining,
have data sizes in Gigabytes. In cloud computing, data sizes grow to Terabytes.
As we add more computing power, applications scale themselves to achieve peak
performance. In the future, we will see increased demand for memory bandwidth as
well as capacity. Some applications require one of the two, and some require
both. For example, the gaming and entertainment industry requires fast memory,
i.e. more bandwidth; the storage industry needs large capacity but low power
consumption; and supercomputing applications such as financial analysis and
modeling require both. In this work we focus on using an optical interconnect
with off-the-shelf DRAM devices to scale bandwidth and capacity simultaneously,
without compromising on access time and power consumption.



1.1     High Performance Memory Systems

Future high performance computing will be Tera-flops scale computing,
undoubtedly built around a manycore chip. To be a balanced computing system
[4, 5] under a flat programming model, the IO system must provide
Terabytes/second of bandwidth (one byte per second per flop) and Terabytes of
storage capacity (one byte of storage per flop). These Tera-flops computing
platforms package many simple in-order cores on a die, connected by a high
performance interconnect for communication among themselves and with the IO
system. As the number of cores goes beyond 256, the performance of on-chip and
off-chip communication becomes a bottleneck. In this work we focus on off-chip
communication with the DRAM system. Next we discuss current high performance
memory systems.
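
As a rough illustration of the balance rule quoted above (one byte per second
of bandwidth and one byte of capacity per flop), the following sketch computes
the IO targets for a given compute rate. The function name `balanced_targets`
and the 1 Tera-flops example value are illustrative assumptions and are not
taken from the evaluation in this thesis.

```python
def balanced_targets(flops, bytes_per_sec_per_flop=1.0, bytes_per_flop=1.0):
    """Return (bandwidth in bytes/s, capacity in bytes) for a balanced system.

    Applies the rule of thumb quoted in the text: one byte per second of
    memory bandwidth and one byte of memory capacity per flop.
    """
    bandwidth = flops * bytes_per_sec_per_flop
    capacity = flops * bytes_per_flop
    return bandwidth, capacity


TERA = 10**12

# Hypothetical 1 Tera-flops manycore chip (assumed value for illustration).
bw, cap = balanced_targets(1 * TERA)
print(f"bandwidth: {bw / TERA:.1f} TB/s, capacity: {cap / TERA:.1f} TB")
```

The two ratios are kept as parameters so that a less strict balance target can
be explored without changing the function.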
   Traditionally, higher bandwidth requirements for the DRAM system have been
met by adding more electrical wires (wider buses) and increasing clock
frequencies. Issues such as crosstalk, signal termination and power consumption
limit the performance of high speed buses. This approach does not scale, as
there is a limit on the number of pins on a package and hence on the pin
bandwidth. Recent innovations such as 3D stacking [6] try to alleviate some of
these problems by reducing block-to-block distances and power consumption.
However, 3D stacking will not scale with the number of cores and the amount of
memory, due to limits of physical materials, heat dissipation and the inherent
complexity of the design and manufacturing process. It provides immediate relief
from the memory wall, but sooner or later we will hit that wall.
   Larger capacities are met by using denser DRAM chips, mounting more such
chips on a DIMM (Dual Inline Memory Module), or adding more DIMMs per channel.
However, due to electrical signaling constraints, the maximum number of DIMMs
per channel is decreasing in successive generations of DRAM. SDRAM channels
supported up to 8 DIMMs, while DDR3 now supports a maximum of 3 DIMMs per
channel.
   The state-of-the-art technology today for achieving high bandwidth and
reasonably high capacity is the FBDIMM (Fully Buffered Dual Inline Memory
Module) architecture [7]. It uses a narrow, high speed point-to-point interface
between the memory controller and the memory modules (DIMMs), based on a
split-bus mechanism. FBDIMM allows a maximum of 8 DIMMs per channel, typically
giving 16-32 GB of total capacity per channel. Even though FBDIMM promises a
significant improvement over DDR2/DDR3 based interfaces, it is not scalable
beyond 4 or 5 DIMMs per channel: because of the store-and-forward protocol
employed within FBDIMM, latency increases significantly as more DIMMs are added
to a channel. FBDIMMs are used in Niagara II and Intel Xeon server processors
and in the high-end 8-core server (Mac Pro) by Apple. FBDIMM was initially
welcomed by industry, but heat dissipation in the 'Advanced Memory Buffer' or
AMB (described in Section 2.1) and the additional latency of the
store-and-forward network have limited its adoption in future memory
implementations, especially with DDR3.
   RDIMM (Registered DIMM) allows servers to use higher-density memory modules.
These modules often carry 18 or 36 DRAM chips (compared to 8 or 16 in a typical
desktop configuration). All DIMMs are connected to the memory controller
through a register, which can be seen as a stripped-down version of FBDIMM's
AMB. The register provides temporary storage between the DRAM devices and the
memory controller. The memory controller drives the register on each DIMM,
which in turn drives the DRAM chips on that DIMM, significantly reducing the
electrical load on the bus. The register adds to the access latency, since one
more clock cycle is needed to read data from it, which limits the performance
of read operations. RDIMMs connect to the memory controller using a 'stub bus'
connection: a single electrical connection with branches that connect to each
device. They cannot all connect to the memory controller in parallel because of
the limit on pin count and wires. The stub bus limits the number of DIMMs per
channel, since beyond a certain number of devices signal quality deteriorates,
causing errors and data corruption. As DRAM clock and data rates increase, the
number of devices allowed on each memory channel decreases due to impedance
discontinuities associated with the stub bus topology.
   Most RDIMM configurations use only a few DIMMs per channel but add more
density per DIMM. This is a major commercial trade-off between RDIMM and
FBDIMM. FBDIMM provides more bandwidth and capacity with DRAM chips that are
allowed to be slower and less dense. To match FBDIMM's performance, RDIMM uses
denser chips or adds more ranks per DIMM. Up to 4-5 DIMMs can be mounted on
each channel while keeping latency within an acceptable range. RDIMM uses more
channels to get the same capacity and bandwidth. FBDIMM is deployed with DDR2
memories, while it has been almost discontinued by Intel and AMD for DDR3
memories. Industry is shifting towards RDIMM for DDR3 memories, which offer
very high density and higher clock speeds (1.3 GHz and more). RDIMMs are
supported by almost all server vendors who support DDR2/DDR3 memories. RDIMMs
are used by the Intel Nehalem platform, which supports 192 Gigabytes of total
DRAM capacity.



1.2     Silicon Photonics

Optical communication links are known to offer significantly higher bandwidth
over longer distances, and during the past few decades they have made rapid
inroads into applications such as local area networks, storage networks, and
rack-to-rack and board-to-board communication links. Optical signals are free
from electrical problems such as crosstalk, impedance matching and inductance
effects. However, cost, optical losses, and incompatibility with nanoscale CMOS
have kept them away from inter-chip and intra-chip applications. Fortunately,
recent developments in CMOS-compatible nanoscale silicon photonic integrated
circuits [8, 9] make optical interconnects a viable alternative in future high
performance computing systems. With Wavelength Division Multiplexing (WDM),
multiple wavelengths can be transmitted over a single optical waveguide without
interfering with each other. WDM greatly increases optical bandwidth and allows
the network to be flexible in terms of bandwidth management.
   The idea behind using silicon photonics is to stack optical components on
top of a generic die (as shown in Figure 3.1(a)). These days most memory
controllers are integrated on the chip itself. Adding analog circuitry on top of
a die enables the use of optics in on-chip interconnects, eliminating electrical
wires. Soref [10, 11, 12] and others have demonstrated that the free-carrier
plasma dispersion effect can change the refractive index of silicon. This effect
can be exploited in resonators to build compact electro-optical modulators and
demodulators using ring structures [13, 14, 9], with the wavelengths of light
generated externally and fed into the system. A schematic of a ring resonator is
shown in Figure 1.1(b). The modulator is designed as a PIN diode, with the
waveguide core acting as the undoped intrinsic region of the diode, charged
under a high-injection regime to realize the free-carrier plasma effect.

      (a) Microscope image of a microresonator-based ring.     (b) Ring resonator details.

                          Figure 1.1: Optical ring resonator

   Figure 1.1(a) shows a scanning electron microscope image of a
microresonator-based ring fabricated at UC Davis. The rings are just a few
microns in diameter and have extremely low power consumption. A key challenge
with ring-based resonators is their sensitivity to temperature; however,
researchers at UC Davis [15], among others, have recently developed techniques
that overcome this problem. An external mode-locked laser (described in
[16, 15]) can be used to generate a comb of phase-coherent, equally spaced
wavelengths. Multiple bits can be transmitted simultaneously without
interfering with each other by modulating them on different wavelengths,
providing a wider data bus over a single waveguide or fiber. At the receiving
end, the rings can be tuned to the appropriate wavelengths to demultiplex the
data.
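
The bandwidth benefit of WDM described above can be estimated with simple
arithmetic, since each wavelength acts as an independent bit lane on the same
waveguide. The sketch below is only an illustration: the per-wavelength rate of
10 Gb/s and the function name `wdm_aggregate_gbps` are assumptions, and the
wavelength counts are the ones used for the sensitivity analyses listed in
Chapter 5.

```python
def wdm_aggregate_gbps(num_wavelengths, gbps_per_wavelength):
    """Raw aggregate bandwidth of one waveguide carrying several wavelengths."""
    if num_wavelengths < 0 or gbps_per_wavelength < 0:
        raise ValueError("inputs must be non-negative")
    return num_wavelengths * gbps_per_wavelength


ASSUMED_GBPS_PER_WAVELENGTH = 10  # assumption, for illustration only

for n in (16, 32, 64, 96, 128):
    total = wdm_aggregate_gbps(n, ASSUMED_GBPS_PER_WAVELENGTH)
    # Divide by 8 to convert gigabits per second to gigabytes per second.
    print(f"{n:3d} wavelengths -> {total} Gb/s ({total / 8} GB/s)")
```

This counts raw signalling bandwidth only; protocol overhead and DRAM timing
constraints would reduce the bandwidth actually delivered.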
   Researchers at MIT [17] have demonstrated the feasibility of monolithic
silicon photonics technology suitable for integration with standard bulk CMOS
processes, which reduces costs and improves opto-electrical coupling.
Researchers at HP [18] demonstrated an optical multidrop bus with an external
laser guided through a hollow metal waveguide. The bus supports 8 modules at
10 Gb/s per channel. Corona [19] is a from-scratch 3D many-core architecture
that uses WDM-based nanophotonic communication for both inter-core
communication and off-stack communication to memory or I/O devices. It is a
Tera-scale computing platform providing 10 Tera-flops of peak floating-point
performance and 10 TB/s of memory bandwidth.
   The majority of research groups focus on the use of optics for on-chip
communication and extend the concept to off-chip communication without
describing what that communication should look like. This is the first study
describing how to use optical bandwidth specifically for CPU-DRAM
communication. In this work, we use knowledge of how DRAM devices work and how
the memory controller schedules requests to achieve maximum throughput and low
latency, and to scale memory bandwidth and capacity with optical bandwidth.


1.3     Research Contribution

In this thesis, we propose a design of an optical interconnect for the CPU-DRAM
interface. We then improve our basic design to make use of the large bandwidth
offered by Wavelength Division Multiplexing (WDM). This work provides three
major contributions for future memory systems:
   a) A prototype design and a working protocol for using optics in off-chip
memory communication. The design proposes minimal changes to hardware and
firmware. It reuses most of the components of FBDIMM, such as the frame-based
protocol, command scheduling, and the AMB design logic and interface. It also
strictly uses off-the-shelf DRAM devices, which makes it commercially viable
and easy to adopt.
   b) Scaling bandwidth and capacity at the same time with acceptable latency
and power consumption. The new design is scalable: as more resources are added,
the system can provide more bandwidth and allow more capacity.
   c) Better utilization by dynamically allocating optical resources to
different memory modules, thereby achieving maximum performance and lower
latencies for DRAM devices.
   We also briefly analyze the power consumption of the new design and compare
it with traditional FBDIMM memories. Parts of this work were published at the
16th IEEE Symposium on High Performance Interconnects (HOTI) in Palo Alto, CA
in August 2008 [20] and at the 26th IEEE International Conference on Computer
Design in Lake Tahoe, CA in October 2008 [21].


1.4      Organization

This thesis is organized as follows.
   Chapter 2 gives a brief overview of the FBDIMM architecture and protocol.
   Chapter 3 explains the proposed architecture of OCDIMM (Optically Connected
Dual Inline Memory Module) with three incremental designs: a) Base: equivalent
to the FBDIMM design; b) Static: wavelengths are allocated to a group of memory
modules to allow parallelism; c) Dynamic: wavelengths are allocated dynamically
for better utilization and bandwidth shaping. We also discuss the design of a
new frame-based protocol for dynamic wavelength allocation and the changes to
the memory controller.
   In Chapter 4 we explain the experimental methodology, the simulation setup
and the benchmarks selected for analyzing memory system performance.
   Chapter 5 presents the results obtained from simulation and compares
different designs with different configurations. We perform four kinds of
analysis: i) a sustained bandwidth study, ii) a sensitivity study, iii) a
latency analysis, and iv) a power consumption comparison.
   Chapter 6 reviews ongoing research on memory systems and optical
interconnects. We conclude the thesis in Chapter 7 with a summary of results
and future improvements.




Chapter 2

FBDIMM - A brief overview


2.1     FBDIMM Architecture

Next we present a very short overview of the FBDIMM architecture. The reader
is referred to the excellent textbook [22] and tutorial paper [23] for more details
on the FBDIMM architecture and its performance evaluation.




        (a) FBDIMM topology.                       (b) AMB basic blocks

                         Figure 2.1: FBDIMM Internals.


   Figure 2.1(a) shows the FBDIMM architecture. Each DIMM is connected through
a point-to-point pair of differential links rather than the traditional
multidrop bus of DDR2/DDR3 memories. The Northbound (NB) link is used for
reading data from DIMMs, and the Southbound (SB) link is used for sending DRAM
commands and data to DIMMs. Each DIMM has a chip called the 'Advanced Memory
Buffer' (AMB). Figure 2.1(b) shows the internal block diagram of the AMB. The
passthrough logic on the southbound bus relays a frame to the next DIMM and
also forwards it to the de-serializer and decoder. The DDRx interface takes the
decoded commands or data and sends them on to the DRAM devices. It synchronizes
all communication with the DRAM devices according to the DRAM clock domain. The
passthrough and merging logic on the northbound bus is a parallel-to-serial
converter. The AMB's DDRx interface has internal buffers to hold data going to
and coming from the DRAM devices. Each DIMM is also connected to a System
Management Bus (SMB), which is used for writing the AMB's configuration
registers.



2.2      FBDIMM Working

FBDIMM replaces multidrop shared bus with a multihop store and forward in-
terconnect network. It runs multihop links at a higher frequency than DRAM
frequency (generally 6 times faster than DRAM speed). Memory controller and
AMB use same reference clock to drive these high speed buses and control slower
DRAM devices. Thus the system runs in mesochronous manner where clock do-
mains are synchronized but not the phases. FBDIMM deploys asymmetric read
and write bandwidth. Southbound bus uses 10 bits while northbound bus uses 14
bits for data transfers. Since there are two separate buses in two different directions
deadlocks are avoided. FBDIMM uses a frame based protocol for communicating
2.2 FBDIMM Working                                                             12


with AMBs. Each frame lasts one DRAM cycle, but since a frame is transferred
on a differential bus 6 times faster than DRAM speed, it can carry more data
in that time period. A frame can carry DRAM commands or just data in the
southbound direction, while it can carry only data in the northbound direction.
The memory controller packs commands or data with a DIMM ID and relays them
through the southbound channel. An AMB reads a frame, relays it to the next
DIMM, and decodes it. If the command/data is addressed to DRAM devices on that
DIMM, the AMB controls the DRAM devices on behalf of the controller. Similarly,
when the AMB receives data from a DRAM device, it encapsulates the data in one
or more frames and relays it back to the controller through the northbound
channel. Frame and command scheduling is controlled by the memory controller,
as the AMB has very little knowledge of the DRAM devices' state and timings.
The AMB responds to commands in a predictable, predefined number of cycles. It
is the job of the memory controller to send commands in such a way that there
are no conflicts on the northbound channel when receiving data. The southbound
channel is free of conflicts since data is sent only by the memory controller.
The northbound channel, however, is shared by all AMBs sending data back. The
FBDIMM memory controller solves this problem by adding a fixed delay (the
latency of the last DIMM) before reading data from an AMB. This allows time for
previous transactions to complete before starting the next one. This is called
Fixed Latency mode. Another mode of operation is Variable Latency mode, where
more than one AMB can send data at the same time as long as they do not share
the same link. For example, suppose DIMM-3 is sending 2 frames and DIMM-8 wants
to send 2 frames. DIMM-8 is allowed to send because by the time its data frame
reaches DIMM-3, DIMM-3 will have finished sending its frames and all links from
DIMM-3 to the memory controller will be free. Thus variable latency mode
utilizes the northbound channel to the maximum. Fixed-latency mode is simple
to implement but does not scale well as more DIMMs are added to the channel.
Variable latency mode is complex to implement, as the controller has to maintain
the state of each link on the northbound channel. Typically, FBDIMM channels are
not populated with more than 4 DIMMs, and hence the fixed latency mode is preferred.
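
   The DIMM-3/DIMM-8 example above can be checked with a small sketch. It is
only an illustration, not the FBDIMM specification: it assumes that every hop
between neighbouring DIMMs costs one frame time (the hypothetical hop_delay),
that link k joins DIMM k to DIMM k-1 (link 0 reaches the controller), and the
function names are ours.

```python
def link_busy_intervals(dimm, start, n_frames, hop_delay=1):
    """Map each northbound link used by `dimm` to its busy interval.

    Link k joins DIMM k to DIMM k-1; link 0 joins DIMM 0 to the controller.
    Times are in frame periods; hop_delay is an assumed per-hop relay delay.
    """
    intervals = {}
    for link in range(dimm, -1, -1):
        begin = start + (dimm - link) * hop_delay
        intervals[link] = (begin, begin + n_frames)
    return intervals


def can_overlap(a, b):
    """True if bursts a and b never occupy the same link at the same time."""
    for link in a.keys() & b.keys():
        (s1, e1), (s2, e2) = a[link], b[link]
        if s1 < e2 and s2 < e1:
            return False
    return True


dimm3 = link_busy_intervals(3, start=0, n_frames=2)
dimm8 = link_busy_intervals(8, start=0, n_frames=2)
dimm4 = link_busy_intervals(4, start=0, n_frames=2)
print(can_overlap(dimm3, dimm8))  # DIMM-8 reaches link 3 after DIMM-3 is done
print(can_overlap(dimm3, dimm4))  # DIMM-4 collides with DIMM-3 on link 3
```

   Under these assumptions a variable latency controller would have to run a
check of this kind for every link, which is the per-link state the text refers
to; fixed latency mode avoids it by always waiting for the farthest DIMM.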



2.3     FBDIMM Performance Key points

FBDIMM provides reasonably high capacity and high bandwidth. With 2GB of memory
on each DIMM, 16GB of capacity can be achieved on a single channel. It also
boosts peak throughput to 1.5 times the DRAM device bandwidth: the memory
controller can write to one DIMM while reading from another, and since southbound
channel frames carry half as much data as northbound channel frames, southbound
bandwidth is half of northbound bandwidth. FBDIMM also reduces the pin count to
less than a third: the DDR2 DIMM pin count is ~240 while the FBDIMM pin count is
~70 per channel.




Chapter 3

OCDIMM Architecture and
Protocols


3.1     OCDIMM Architecture


   [Figure 3.1: OCDIMM Architecture.
    (a) System Architecture. Labels: off-chip laser source, optical fiber,
        modulators, detectors, integrated memory controller, southbound
        waveguide/fiber, northbound waveguide/fiber, OMCs on DIMM 0, DIMM 1,
        ..., DIMM n-1, shared L2 cache, cores + L1$, TSV (through-silicon
        vias), heat sink.
    (b) Optical Memory Controller (OMC).]


   OCDIMM is derived from the FBDIMM architecture. Figure 3.1(a) shows the
top-level physical architecture of a computing system that uses OCDIMM. We
assume the CPU is a 3D stacked die with the electronics on one layer and optical
transceivers (modulators and demodulators) on a different layer, with an
off-chip laser source powering the optical interconnect. The northbound and
southbound buses
in the FBDIMM architecture are replaced by optical fibers. Furthermore, we as-
sume that each fiber can transport multiple wavelengths up to a maximum of 64.
However, as we will show, it is not necessary to have 64 wavelengths to reap the
benefits of an optical interconnect. Even with much fewer wavelengths, significant
benefits can be realized, as described in the results section. We replace the
AMB on each DIMM with an Optical Memory Controller (OMC) (details shown in
figure 3.1(b)), which is responsible for communication between the bus and the DRAM.
As in the FBDIMM protocol, the commands and write data will be injected via
the southbound fiber to the OMC and the read data is received on the northbound
fiber. We assume a dedicated wavelength for clock distribution. The clock will
be extracted from its transportation wavelength and the data will be de-serialized
and put out as commands and data to the DRAM device. Microresonator based
modulators convert the electrical signals of the on-chip integrated memory con-
troller to the optical domain for transport over the optical fibers. The modulators
and demodulators are quite compact (order of tens of microns) and capable of
data rates of 10Gbps or higher. The optical signals are demodulated at the OMC
using microresonators and processed by the traditional electrical signal chain in
a FBDIMM AMB as shown in figure 3.1(b). So, basically we have a single-hop
(broadcast) bus instead of a multi-hop store and forward network. The read and
write data is organized as a packetized frame relay protocol which is realized using
multiple transmissions on the optical bus. The number of transmissions can be
reduced if there are more wavelengths available. Writing data to the DRAM is
simple. Depending on the DRAM mode, a RAS (Row Access Strobe) command,
followed by some number of CAS (Column Access Strobe) commands are sent
to the target DIMM, interspersed with the data on the write channel. Once the
commands and data are converted to the electrical domain via the E/O/E inter-
face in the target domain, the operation is exactly the same as with a standard
DDR2/DDR3 device and FBDIMM. No modification is necessary to the existing
DRAM devices. The only additional hardware is at the DIMM level, in the form
of the modulators and demodulators to convert the data from electrical to optical
domain and vice versa as shown in figure 3.1(b). Reading data from a given DIMM
is more complicated because the read subchannel is shared. Once a DIMM is ready
to send data, it has to acquire the read-subchannel because another DIMM could
be using it to send data back to the memory controller. In general, this requires
an arbiter, but to keep the design simple, initially, like FBDIMM, we assume that
the memory controller statically schedules the read transactions such that there
are no bus conflicts.
   If extra wavelengths are supported, OCDIMM can apply certain optimizations to
keep only a small number of DIMMs in the active state. OCDIMM uses these
extra wavelengths as a chip select signal to activate a DIMM before sending out
the command/data frame on the southbound bus. Only the destination OMCs will
read the data on the southbound bus. The additional cost of preselecting is
negligible in terms of delay, but as the number of DIMMs on a channel increases,
the demand for such additional wavelengths increases.
3.2     OCDIMM Designs and Protocol

3.2.1    OCDIMM-BASE

The OCDIMM-BASE design is identical to FBDIMM, except that we can have 32
wavelengths going in the southbound direction and 32 in the northbound
direction, and can connect more than 8 DIMMs on the channel.
   Since the optical speed is 10 GHz, we simply have more data bits available per
frame. OCDIMM-BASE can be seen as FBDIMM running at 10 GHz with a bus at most 32
bits wide. The issue with this design is that the FBDIMM access protocol is not
able to take advantage of the increased bandwidth. The peak bandwidth of an
optical link is quite large. For example, consider a WDM channel with 64 wavelengths
and an optical data rate of 10 Gbps. The peak bandwidth is 10x64 = 640 gigabits
per second, or 80 GB/s per channel. However, given that we would like to use con-
ventional DRAM-based DIMMs for cost reasons, we face an interesting challenge.
It is well-known that DRAM latency improvements are very modest, around 7%
per year [24]. Given that the DRAM read/write latency is the sum of the time to
access a DRAM module and the time to ship the data across the optical link, the
DRAM access becomes the critical bottleneck. In other words, though the optical
interconnect is capable of delivering 80 GB/s on a single physical channel, it may
be idle most of the time. To utilize the given optical bandwidth we must either
change access protocol or change the topology.


3.2.2    OCDIMM Static Wavelength Assignment (SWA)

In this design we change the topology of the OCDIMM. Typically server systems
use multichannel FBDIMM configuration to control latency. Given the optical
bandwidth and more wavelengths, we can create a similar configuration by creating
virtual optical channels. The memory controller partitions the available
wavelengths into an arbitrary number of subchannels. For example, the 64
wavelengths can be partitioned into 8 subchannels of 8 wavelengths each, or 4
subchannels of 16 wavelengths each, or 32 subchannels of 2 wavelengths each, etc.
Furthermore,
each subchannel can have more than one DIMM, since optical links do not have
the same signal degradation issues as their electrical counterparts. Each group of
DIMMs share their wavelengths without interfering with other groups. We can
have asymmetric configurations depending upon the capacity of DIMM, speed of
DIMM or its distance from the memory controller. With fewer bus conflicts we
expect to achieve the same degree of parallelism as in an electrical multichannel
system but without increasing the pin footprint.
   An important point to notice is that we have not changed the OCDIMM-BASE
protocol, which is the same as FBDIMM. We simply created multiple channels within
the waveguide and handed those channels to the memory controller. The design is
simple and does not need any significant changes to firmware.
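
   A minimal sketch of static partitioning, under stated assumptions: it only
covers the symmetric case (equal-width subchannels), and the round-robin
placement of DIMMs on subchannels is a hypothetical policy of ours; the
asymmetric configurations mentioned above are not modelled.

```python
def partition_wavelengths(n_wavelengths, n_subchannels):
    """Split wavelength indices 0..n-1 into equal, disjoint subchannels."""
    if n_subchannels <= 0 or n_wavelengths % n_subchannels:
        raise ValueError("wavelengths must divide evenly among subchannels")
    width = n_wavelengths // n_subchannels
    return [list(range(i * width, (i + 1) * width))
            for i in range(n_subchannels)]


def assign_dimms(n_dimms, n_subchannels):
    """Round-robin placement of DIMMs on subchannels (an assumed policy)."""
    groups = [[] for _ in range(n_subchannels)]
    for dimm in range(n_dimms):
        groups[dimm % n_subchannels].append(dimm)
    return groups


subchannels = partition_wavelengths(64, 8)
dimm_groups = assign_dimms(16, 8)
print(len(subchannels), len(subchannels[0]))  # 8 subchannels of 8 wavelengths
print(dimm_groups[0])                         # DIMMs sharing subchannel 0
```

   Because the subchannels are disjoint, each group of DIMMs can run the
unmodified BASE protocol on its own wavelengths.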


3.2.3    OCDIMM Dynamic Wavelength Assignment (DWA)

After partitioning bandwidth statically, the natural thought is to do it dynamically
at run time to improve optical bandwidth utilization. To do so we have to change
the access protocol so as to keep track of the owner of each wavelength, the
duration for which each wavelength is free or in use, and the state of each
command/transaction. Given such a protocol, we can shape traffic, for example by
giving more bandwidth to read transactions or giving priority to the traffic from
a mission-critical CPU or application. To design such a protocol, we start by
proposing different


                      Figure 3.2: Transaction Queue Insight

algorithms for wavelength assignment and identify design/protocol changes with
respect to each such algorithm.
   Figure 3.2 shows what a transaction queue looks like inside the memory
controller. The memory controller orders transactions according to transaction
ordering policies such as Read or Instruction Fetch First (RIFF), Open Bank First
(OBF), Bank Round Robin (BRR) or First Come First Served (FCFS). Each transaction
is broken into a sequence of DRAM commands. For example, a read transaction opens
the row of a bank (RAS) and then issues a column access (CAS). In the case of DDR3
memories, data is sent back as soon as the CAS is finished. In the FBDIMM protocol
a special command, DRIVE, is used to initiate data transfer from the AMB to the
controller on the northbound channel. The entire data (a cacheline, typically 64
bytes) is sent in multiple contiguous frames. Since OCDIMM with a large number of
wavelengths can fit all of the data in one frame, the controller might issue just
one DRIVE command. Similarly, for write transactions RAS and CAS-WRITE are used to
access the row/column. FBDIMM uses the DATA command for sending data bytes to the
AMB. The controller will send data using multiple such commands even before
sending CAS-WRITE. The AMB/OMC will buffer these commands until the DRAM device is
ready to write the data. Thus each
memory transaction is broken into micro-ops or sub-transactions. We will in-
vestigate two different basic schemes for dynamic wavelength assignment, called
Transaction based assignment T-DWA and Command based assignment C-DWA.
   Algorithm T-DWA allocates wavelengths equally to the mutually independent
transactions requesting the bus. It can be perceived as a simple fair queuing
scheme where the goal is to maximize throughput (transactions/cycle). If
a transaction needs fewer wavelengths than allocated, T-DWA tries to redistribute
the surplus wavelengths amongst the other transactions. By doing so it tries to
minimize internal fragmentation and maximize the optical bandwidth utilization.
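
   The equal-share-plus-redistribution idea can be sketched as follows. This is
an illustrative reading of T-DWA, not the simulator code: the function name is
ours, demands are expressed directly as wavelength counts, and the surplus is
handed out front-of-queue first, which is one possible redistribution order.

```python
def t_dwa(demands, total_wavelengths):
    """Allocate wavelengths to independent transactions.

    demands: wavelengths wanted by each transaction, in queue order.
    Returns the wavelengths granted to each transaction.
    """
    if not demands:
        return []
    share = total_wavelengths // len(demands)
    # Equal share first, capped at what each transaction actually needs.
    alloc = [min(d, share) for d in demands]
    surplus = total_wavelengths - sum(alloc)
    # Redistribute what is left, front of the queue first.
    for i, demand in enumerate(demands):
        if surplus == 0:
            break
        extra = min(demand - alloc[i], surplus)
        alloc[i] += extra
        surplus -= extra
    return alloc


print(t_dwa([40, 8, 30], 64))  # [35, 8, 21]
```

   In the example the second transaction needs only 8 of its 21 wavelengths,
so the unused ones (plus the remainder of the integer division) go to the
first transaction instead of being wasted.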
   Algorithm C-DWA is a fine grained algorithm working on DRAM commands
with priorities. DRAM commands follow a certain order. For example, a row ac-
cess is always applied before column access. This order can be used as a priority to
decide which command should be sent first. C-DWA tries to schedule the most crit-
ical section of the transaction first, which increases throughput of DRAM system.
For example, consider a case where two non-conflicting transactions request an ac-
cess to the bus and one of them wants to send a column access (CAS) while second
wants to send a row access (RAS). C-DWA will schedule the latter one. In all our
discussion a memory request's latency is dominated primarily by DRAM latency
and secondarily by queuing delays. To reduce queuing delays we must make sure
that, whenever possible, the DRAM system is not kept idle. Refresh commands get
the highest priority, then RAS, CAS, PRECHARGE, DRIVE, DATA. We intentionally
give DRIVE commands priority over DATA commands to bias resources towards read
transactions. As we will see in table 4.3, most real-world applications have more
reads than writes. Another important reason for issuing DRIVE before DATA is to
reduce queuing delays and power consumption. The OMC has already received data
from the DRAM
device in the case of DRIVE and hence is holding data in its buffers. In the case
of DATA commands the OMC might not yet have received the RAS or CAS command, i.e.
the DRAM device may not yet be ready to accept data. Thus by allowing DRIVEs to go
in front of DATA we try to reduce waiting time and load on the OMC buffers. In
C-DWA the memory controller works in three stages. In the first stage, it finds
out how many wavelengths are available. It then finds out how many wavelengths are
required for REFRESH commands, RAS commands, CAS commands, and so on. In this
process it makes sure that no transactions, or commands inside a transaction,
conflict with each other. For example, if two transactions want to access the same
DIMM, the one that is ahead in the queue will be chosen. Similarly, for a
transaction no two conflicting commands will be sent together. In the second
stage, the controller assigns wavelengths to each category, which in turn get
assigned to the commands chosen in the first stage. In the final stage the
commands are packed into a frame and the frame is sent on the southbound channel.
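
   The three stages can be sketched as below. This is a simplified model under
our own assumptions, not the controller implementation: each transaction offers
only its next command, "conflict" is reduced to "two commands target the same
DIMM", and the Command fields (in particular the per-command wavelength
demand) are hypothetical.

```python
from dataclasses import dataclass

# Command classes from highest to lowest priority, as listed in the text.
PRIORITY = ["REFRESH", "RAS", "CAS", "PRECHARGE", "DRIVE", "DATA"]


@dataclass
class Command:
    kind: str          # one of PRIORITY
    dimm: int          # target DIMM
    queue_pos: int     # position of the owning transaction in the queue
    wavelengths: int   # wavelengths this command needs (hypothetical field)


def c_dwa_frame(pending, free_wavelengths):
    """Return the commands packed into the next southbound frame."""
    # Stage 1: drop conflicts; per DIMM, the transaction ahead in the queue wins.
    chosen = {}
    for cmd in sorted(pending, key=lambda c: c.queue_pos):
        chosen.setdefault(cmd.dimm, cmd)
    # Stage 2: hand out wavelengths by class priority, then by queue order.
    ordered = sorted(chosen.values(),
                     key=lambda c: (PRIORITY.index(c.kind), c.queue_pos))
    frame = []
    for cmd in ordered:
        if cmd.wavelengths <= free_wavelengths:
            free_wavelengths -= cmd.wavelengths
            frame.append(cmd)
    # Stage 3: the frame (here just a list) goes out on the southbound channel.
    return frame


pending = [Command("CAS", dimm=1, queue_pos=0, wavelengths=4),
           Command("RAS", dimm=2, queue_pos=1, wavelengths=4),
           Command("DATA", dimm=1, queue_pos=2, wavelengths=8)]
print([c.kind for c in c_dwa_frame(pending, free_wavelengths=8)])
```

   With these inputs the RAS is placed ahead of the CAS, matching the example
in the text, and the DATA command to DIMM 1 is held back because a transaction
ahead of it in the queue already targets that DIMM.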
   An important difference between C-DWA and T-DWA is that the original
transaction order set by the transaction ordering policy is preserved in C-DWA.
In C-DWA, when there is more than one command of the same type or class, priority
is given to the command whose transaction is ahead in the queue order implied by
the transaction scheduling policy. This lets C-DWA coexist with current
transaction scheduling schemes. T-DWA tries to preserve the order while
redistributing surplus wavelengths using a First-Fit method, giving as many
wavelengths as required to the transactions at the front.


           Figure 3.3: Optical bus protocol design and variations:
           (a) Frame Design, (b) Single fiber, (c) Dual fiber.


Frame Design for Optical CPU-DRAM Interface with DWA

In order to be able to assign any wavelength to any DIMM we need to make
changes to the basic frame design. Figure 3.3(a) shows a single wavelength with an
8 cycle frame. The first 3 cycles carry the DIMM-ID, the 4th cycle carries a
direction bit, and the next 4 cycles carry data bits. Figures 3.3(b) and 3.3(c)
show two different physical designs. In figure 3.3(b) a single fiber carries the
wavelength assignment (WS) preamble and data, while in figure 3.3(c) two different
fibers are used to carry the preamble and data. Figures 3.3(b) and 3.3(c) also
capture a point-in-time snapshot of
DWA. Each shaded box/band denotes set of wavelengths assigned to a particular
DIMM. Left arrow denotes wavelengths carrying data from a DIMM to controller
while right arrow denotes data carried from controller to a DIMM.
   We inherit semi-synchronous frame based protocol from FBDIMM where con-
troller can send or receive commands/data at every frame. Frame cycle is selected
to match the DRAM cycle. But unlike FBDIMM, any wavelength can be driven in
any direction, inbound or outbound. Figure 3.3(a) shows a simple scheme where
each wavelength carries a DIMM-ID. Each DIMM listens to all wavelengths for
the first few cycles of the frame and extracts the wavelength allocations. A DIMM
can simultaneously retrieve data/commands (by listening to the write bus) and
transmit data (on the read bus). After the preamble of 4 cycles, a DIMM listens
only to the wavelengths assigned to it, and the remaining frame cycles are used to
extract command/data. The preamble takes up some part of the frame, which is the overhead
for DWA and may reduce net optical bandwidth available to memory transactions.
However, given that a WDM-based optical bus has a high raw bandwidth, this
might not be such a disadvantage (especially if the DIMM latency prevents full
bandwidth utilization in any case).
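
   The 8-cycle frame of figure 3.3(a) can be mimicked in a few lines. This is a
sketch only: the bit order of the DIMM-ID (MSB first) and the meaning of the
direction bit (1 = controller to DIMM) are assumptions, as are the function
names. On a wavelength assigned to a read, the four data cycles would be driven
by the DIMM rather than the controller.

```python
def encode_frame(dimm_id, to_dimm, data_bits):
    """Build the 8 cycles carried by one wavelength in one frame.

    Cycles 0-2: DIMM-ID (MSB first, an assumed bit order)
    Cycle 3:    direction bit (1 = controller to DIMM, an assumed encoding)
    Cycles 4-7: data bits
    """
    if not 0 <= dimm_id < 8:
        raise ValueError("a 3-cycle DIMM-ID addresses DIMMs 0..7")
    if len(data_bits) != 4:
        raise ValueError("expected 4 data bits")
    id_bits = [(dimm_id >> shift) & 1 for shift in (2, 1, 0)]
    return id_bits + [1 if to_dimm else 0] + list(data_bits)


def decode_preamble(frame):
    """Return (dimm_id, to_dimm) from the first 4 cycles of a frame."""
    dimm_id = (frame[0] << 2) | (frame[1] << 1) | frame[2]
    return dimm_id, bool(frame[3])


def wavelengths_for(dimm_id, frames):
    """Wavelength indices whose preamble names `dimm_id`."""
    return [w for w, frame in enumerate(frames)
            if decode_preamble(frame)[0] == dimm_id]


frames = [encode_frame(5, True, [1, 0, 1, 1]),
          encode_frame(2, False, [0, 0, 0, 0]),
          encode_frame(5, True, [0, 1, 1, 0])]
print(frames[0])                   # [1, 0, 1, 1, 1, 0, 1, 1]
print(wavelengths_for(5, frames))  # [0, 2]
```

   Here wavelengths_for plays the role of a DIMM listening to every wavelength
during the preamble and then narrowing down to the ones addressed to it.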
   Figure 3.3(b) shows a single fiber design. It would be possible to use a separate
fiber or dedicate a subset of the wavelengths to convey wavelength assignments
to DIMMs. Figure 3.3(c) shows a dual fiber design where wavelength selection
overhead is carried by a second fiber. This means the first fiber can be fully
utilized for command/data transfers. It has another advantage: DIMMs listen to
all wavelengths only for the first few cycles, and after the preamble they use
only their assigned wavelengths. Thus in figure 3.3(c) only the second fiber,
which carries the overhead, needs to be a broadcast bus fitted with splitters for
tapping out optical power, while the first fiber does not need splitters.
   The challenge in using the dual fiber configuration is that it would require
synchronization between the fiber carrying the commands and the fiber carrying
wavelength assignment information, which is difficult to implement in the optical
domain. Apart from synchronization, it would also double the optical modulators
and demodulators on a DIMM.


Command Scheduling for DWA

DWA algorithms with semi-synchronous frame design establish a groundwork to
schedule commands to more than one DIMM. As pointed out earlier we inherit
some part of design from FBDIMM protocol. In DDR DRAM systems, commands
are assigned a pin. An FBDIMM-based DDR system serializes commands by putting the
DIMM-ID in front and packs these packets into frames for transportation. We
deploy the same method to packetize commands and put them in the frame. But
we do not need to put the DIMM-ID before the command, since the wavelengths
assigned to a DIMM already carry a DIMM-ID in the preamble.
   The controller schedules on each frame. On each frame certain commands are
selected according to the algorithm being used, C-DWA or T-DWA. Each DIMM
is assigned certain wavelengths inside a frame. Thus it can receive certain amount
of bytes in a frame, which we call a bucket. Each command is converted to a
byte stream (a packet). If more than 2 commands go to the same DIMM, then
we merge them into one packet. If all commands to a DIMM cannot fit into
the bucket, then the last command is partially put inside the bucket and marked
as 'INCOMPLETE'. This means that some of the command bytes are transferred and
the rest are due in the next frames. We call these commands 'INCOMPLETE' commands.
These commands are not fully issued, hence the controller does not calculate a
finish time for them. By allowing partially transmitted commands, the controller
avoids internal fragmentation and makes use of all the bandwidth available. This
increases complexity at the DIMM level, as the DIMM has to maintain state per
command. We keep this state in what we call the inbound and outbound buffers.
Although we call them buffers, no extra space is needed, as the OMC already has
buffers to convert between the optical and electrical domains.
   Inbound buffer: It listens to all bytes sent to a DIMM and extracts commands
from the byte stream in a frame. If a command is not completely sent, it will not
forward any other commands until the 'INCOMPLETE' command is received completely.
   Outbound buffer: This buffer is used as intermediate storage for data read
from DRAM. When the inbound buffer detects a DRIVE command, the outbound buffer
triggers sending data on the wavelengths assigned to carry data from the DIMM to
the controller. The outbound buffer can start sending data as soon as it receives
it from DRAM and wavelengths are assigned for it. Thus it maintains 2 bits of
state per byte to be sent and a pointer 'head' from where data will be sent back.
If 'critical word first' is configured, the DRIVE command can carry the position
of the critical word and 'head' can be positioned at the critical word. Thus a
DRIVE command
is typically 1-2 bytes long, carrying the command identifier and the critical word
position. Since a DIMM holds on to data or a command until it is finished, C-DWA
gives the second-highest priority (i.e. after REFRESH) to such commands to avoid
the DIMM being locked out.
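
   Bucket filling with partially transmitted commands might look like the sketch
below. The function name, the tuple layout and the COMPLETE/INCOMPLETE status
strings are illustrative assumptions; the sketch only models the controller
side and does not merge commands into a single packet.

```python
def pack_bucket(commands, bucket_bytes):
    """Fill one DIMM's bucket for a frame.

    commands: list of (name, payload_bytes) in issue order.
    Returns (sent, carry_over): `sent` holds (name, chunk, status) entries,
    `carry_over` holds what must go out in later frames.
    """
    sent, carry_over = [], []
    room = bucket_bytes
    for name, payload in commands:
        if room == 0:
            carry_over.append((name, payload))
            continue
        chunk, rest = payload[:room], payload[room:]
        room -= len(chunk)
        if rest:
            # Only part of the command fits: ship it and mark it INCOMPLETE.
            sent.append((name, chunk, "INCOMPLETE"))
            carry_over.append((name, rest))
        else:
            sent.append((name, chunk, "COMPLETE"))
    return sent, carry_over


commands = [("RAS", bytes([1, 2, 3, 4])),
            ("CAS", bytes([5, 6, 7, 8])),
            ("DRIVE", bytes([9]))]
sent, carry_over = pack_bucket(commands, bucket_bytes=6)
print(sent)
print(carry_over)
```

   With a 6-byte bucket the RAS goes out whole, the CAS is split after two
bytes and marked INCOMPLETE, and the rest of the CAS plus the DRIVE wait for
the next frame, so no bucket space is left unused.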


Wavelength State Management for DWA

In FBDIMM there are two modes of operation: fixed latency mode and variable
latency mode. Fixed latency mode assumes worst-case delays so as to avoid
more than one DIMM sending data at the same time, while variable latency
mode tries to schedule commands such that more than one DIMM can transfer
data, but not on the same segment of the bus (in FBDIMM the bus is
point-to-point). In the earlier OCDIMM variations we assumed fixed latency
operation, where the controller waits for the worst-case delay (the delay of the
last DIMM) and schedules accordingly.
   In DWA all wavelengths are shared and they can change direction at any frame
cycle. When a wavelength is assigned in one direction, data can be pipelined such
that no stalls are required. But when a wavelength changes direction, it needs to
be managed differently. When a wavelength is assigned to read data from a DIMM,
in the first few cycles of the frame it carries the DIMM-ID, which reaches the
DIMM after T_ws (cycles in the preamble) + T_prop (optical propagation delay to
the DIMM). After T_ws cycles the wavelength changes direction and the DIMM starts
sending data back. If F_l is the frame length in cycles, then all data bits will
be received by the controller after a further (F_l - T_ws) + T_prop cycles. Thus
the wavelength remains busy for F_l + 2*T_prop cycles, which means it cannot be
used for the next frame starting at (current time + F_l). The memory controller
keeps track of each wavelength with a single-bit status (IDLE/BUSY) and start and
finish times. Hence before any wavelength
assignment is done, available idle wavelength vector is recalculated.
   This scheme allows controller to use as much optical bandwidth as possible,
improving utilization. One important point here is that REFRESH commands
are scheduled at certain intervals and they are critical. If the controller
allocates all wavelengths for reading data from DIMMs just before a refresh, it
will destabilize the entire system. To avoid this, at every refresh cycle the
controller reserves wavelengths by marking them busy from T_refresh (start of
next refresh command) + F_l.
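
   The bookkeeping described above can be sketched as a table of per-wavelength
finish times. The class and method names are ours, and the numbers in the
example (a 20-cycle frame, 3 cycles of propagation delay) are made up for
illustration; refresh reservation is not modelled.

```python
def read_busy_cycles(frame_len, t_prop):
    """Cycles a wavelength stays busy when assigned to a read.

    Preamble out: T_ws + T_prop; data back: (F_l - T_ws) + T_prop.
    The T_ws terms cancel, leaving F_l + 2*T_prop.
    """
    return frame_len + 2 * t_prop


class WavelengthTable:
    """Per-wavelength finish times kept by the controller (illustrative)."""

    def __init__(self, n_wavelengths):
        self.busy_until = [0] * n_wavelengths

    def idle(self, now):
        """Indices of wavelengths that are IDLE at cycle `now`."""
        return [w for w, t in enumerate(self.busy_until) if t <= now]

    def assign_read(self, wavelength, now, frame_len, t_prop):
        self.busy_until[wavelength] = now + read_busy_cycles(frame_len, t_prop)


table = WavelengthTable(4)
table.assign_read(0, now=0, frame_len=20, t_prop=3)
print(table.idle(20))  # wavelength 0 is still busy at the next frame boundary
print(table.idle(26))  # free again after F_l + 2*T_prop = 26 cycles
```

   Calling idle() before each assignment corresponds to recalculating the idle
wavelength vector mentioned in the text.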




Chapter 4

Experimental Setup


4.1     Experimental Methodology

We realized our design on a simulator and subjected it to a variety of workloads
and different sets of configurations. The experiment is broken into 2 major parts:
i) generating traces, and ii) simulating the DRAM system. The following
subsections describe each part of our experiment.


4.1.1    DRAM system simulator

We modified the DRAM simulator (DRAMsim) [25] from University of Maryland to
model the OCDIMM and its different variations. The simulator is capable of reading
memory access traces or using random traces, and models a cycle-accurate DRAM
system. It also simulates various scheduling policies and address mapping
algorithms. We implemented both DWA algorithms in DRAMsim along with the new
command scheduling and wavelength management. We assumed a single fiber
configuration as in figure 3.3(b). DRAMsim's design is based on an external
memory controller.
It models the front side bus (FSB) using a Bus Interface Unit (BIU). The BIU is
modeled as a queue with a fixed delay per transaction. OCDIMM uses an integrated
memory controller. To model an integrated memory controller we made the BIU queue
infinite and removed the fixed delay added to each transaction. By doing this an
incoming transaction goes directly into the memory controller's queue. If the
queue is full, it will be retried once space is available.


4.1.2     DRAM memories

All DIMMs are identical: single rank, 8 banks, 2GB capacity, 64 bit wide data
bus and auto refresh enabled. To keep the design simple we assumed single rank
DIMMs. A multirank DIMM can be seen as many small DIMMs added to the same
channel. We modeled DDR2-667MHz and faster DDR3 DIMMs running at frequencies
of 1066, 1333 and 2000. Latency numbers are taken from the Micron datasheet.

                       No of Cores        1 @ 2.5Ghz
                     Processor type            x86
                    L1 icache/dcache      32K, 2 way
                        L2 Cache      Shared 2MB, 16 way
                        Cache line          64 bytes
                  Total DRAM memory          0.5 GB
                         System      Knoppix 5.1.1, live CD

                              Table 4.1: Bochs Setup




4.1.3     Workloads

We evaluated our designs against various benchmarks. We simulated these
benchmarks on a full system simulator, Simics (2.0.31). Simics provides a generic
cache module (g-cache) to model the memory system hierarchy. We modified the g-cache

                  No of Cores                 4 @ 2.5Ghz
                 Processor type      x86 64(440BX-AMD Hammer)
           L1 icache/dcache per core          64K, 4 way
                   L2 Cache               Shared 2MB, 16 way
                   Cache line                   64 bytes
            Total DRAM memory                     8GB
                    System             Fedora core 5,Linux 2.6.15

                             Table 4.2: Simics Setup

module to filter out DRAM accesses after the L2 cache. Table 4.2 shows the Simics
configuration used to generate memory access traces. We used 3 different
benchmark suites: PARSEC, SPEC CPU2006, and Splash-2. We chose the benchmarks with
the highest miss rates from the analyses in [26, 27, 28]. To model multicore
traffic we also mixed 4 different SPEC CPU2006 benchmarks and ran them on
different cores. All benchmarks are compiled with gcc-4.1 using -O3 optimization.
All SPEC'06 traces are taken from a single iteration of the benchmark with the
reference dataset size. PARSEC benchmarks are configured to run with 16 threads
and the large dataset. Characteristics of the chosen benchmarks are listed in
table 4.3. We also used a single CPU trace from a modified Bochs (table 4.1), with
the session consisting of a) OpenOffice converting a 100 page OpenOffice document
to postscript while an mp3 was being decoded in the background, and b) the same
postscript file being opened using the Konqueror web browser upon the successful
conversion of the OpenOffice document.
   Along with these workload traces we also used synthetic, randomly generated
traces. These traces are typically used to measure system response at a particular
arrival rate in order to measure the sustained (maximum) bandwidth of a system.
Arrivals follow a Poisson process, and addresses are uniformly distributed
so that all DIMMs receive an equal number of requests.
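
   A synthetic trace of this kind could be generated as sketched below. This is
not the generator built into DRAMsim: the function and its parameters
(read_fraction, address_space, the fixed seed) are hypothetical, and the output
is a list of (address, type, cycle) tuples. Drawing exponential inter-arrival
gaps is the standard way to obtain Poisson arrivals.

```python
import random


def synthetic_trace(n_requests, rate, address_space, read_fraction=0.5,
                    line_bytes=64, seed=0):
    """Generate (address, type, cycle) tuples with Poisson arrivals.

    rate is the mean number of requests per CPU cycle; exponential
    inter-arrival gaps give a Poisson arrival process.
    """
    rng = random.Random(seed)
    n_lines = address_space // line_bytes
    now = 0.0
    trace = []
    for _ in range(n_requests):
        now += rng.expovariate(rate)
        address = rng.randrange(n_lines) * line_bytes  # uniform, line aligned
        kind = "READ" if rng.random() < read_fraction else "WRITE"
        trace.append((address, kind, int(now)))
    return trace


for record in synthetic_trace(5, rate=0.05, address_space=1 << 20):
    print(record)
```

   Sweeping rate upward until the measured bandwidth stops growing is one way
to locate the sustained bandwidth of a configuration.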

 Trace                                                 Length   READs:WRITEs
 SPEC'06 403.gcc                                       1m       4:1
 SPEC'06-MIX (403.gcc, 429.mcf, 401.bzip2, 470.lbm)    53m      2.2:1
 Splash-2 FFT (16M)                                    112m     1.1:1
 Splash-2 OCEAN (2050x2050 grid)                       80m      2.2:1
 Splash-2 RADIX (64M integers)                         52m      1.4:1
 PARSEC Streamcluster                                  133m     1:0 (all reads)
 PARSEC Canneal                                        31m      1.7:1
 PARSEC Fluidanimate                                   19m      4.3:1
 Applications OpenOffice/mp3/web-browser               81m      2.2:1

                        Table 4.3: Benchmark Summary

   All traces are in a 3-tuple format (address, TYPE, CPU cycle). DRAMsim reads
traces and adds them into the Bus Interface Unit (or integrated memory controller).
Transactions are then added into the memory controller queue (typically small,
16-32 transactions). Once a transaction finishes reading or writing its last
byte, it is marked as finished and retired. Latency is measured as the time
difference (transaction completion time) - (L2 miss time). Queuing delay is
measured as (time when the first command of the transaction is scheduled) - (L2
miss time).
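
   As a small illustration of the two metrics, the sketch below averages them
over a list of retired transactions. The record layout (L2 miss cycle, cycle of
first scheduled command, completion cycle) and the function name are
assumptions made for the example.

```python
def transaction_stats(records):
    """Return (mean latency, mean queuing delay) in cycles.

    records: iterable of (l2_miss, first_cmd_scheduled, completion) cycles.
    """
    records = list(records)
    latencies = [done - miss for miss, _, done in records]
    queuing = [first - miss for miss, first, _ in records]
    n = len(records)
    return sum(latencies) / n, sum(queuing) / n


mean_latency, mean_queuing = transaction_stats([(100, 110, 180),
                                                (200, 205, 260)])
print(mean_latency, mean_queuing)  # 70.0 7.5
```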




Chapter 5

Results and Discussions

We evaluate OCDIMM on the basis of two goals (1) Sustained bandwidth behavior
of different designs and configurations (with random synthetic traces), and (2)
Latency impact on real workloads with DWA.



5.1     Theoretical Limitation on Maximum Memory Bandwidth

In this section we use an empirical model to find how much of the optical
bandwidth is actually used for transferring data from the CPU to memory or vice
versa. For modeling purposes we assume 16 DDR3 1066Mtps DIMMs on a channel and 64
wavelengths with an optical data rate of 10Gbps per wavelength.
   For each transaction there are generally a minimum of 2-3 commands along with
one or more DATA/DRIVE commands. DATA/DRIVE commands carry the actual transaction
data. For FBDIMM based protocols (OCDIMM-BASE and OCDIMM-SWA),
each command is 5 bytes (1 byte for DIMM ID) and hence for every 64 bytes we
5.1 Theoretical Limitation on Maximum Memory Bandwidth                           33


have 10 bytes of command overhead which is ≈ 15%. Half of the wavelengths go in
southbound direction and half in the northbound direction. Command overhead
affects only southbound direction therefore 15% of southbound bandwidth is used
by command overhead. Out of 40GB/s only 33.75GB/s is used for carrying actual
data. In the case of northbound channel only one DIMM transmits data at a time.
In fixed latency mode of operation worst case propagation delay is added before
starting any transfer on northbound channel. To send 64 bytes of data using 32
wavelengths takes one frame cycle. But unless and until all of its bits are received
by the controller no other DIMM is allowed to send the data. Worst case propaga-
tion delay for 16 DIMMs on a OCDIMM channel is DIMM-to-DIMM delay (300ps)
* 16 = 4.8ns which is almost equal to two frame cycles. Thus we can utilize only
33% of northboud channel bandwidth for data transfer. Hence theoretical limit
on maximum bandwidth that is possible with these protocols is 33.75 + 13.2 =
46.95GB/s.
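
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation using the figures quoted in the text; the constant names are illustrative, and the 33% northbound utilization is taken as stated rather than derived.

```python
# All inputs are the figures quoted in the text; names are illustrative.
WAVELENGTHS = 64
GBPS_PER_WAVELENGTH = 10
CACHELINE_BYTES = 64
COMMAND_BYTES = 5               # FBDIMM-style command, 1 byte of it is the DIMM ID
COMMANDS_PER_TRANSACTION = 2    # gives the 10 bytes of overhead per 64 bytes
NORTHBOUND_UTILIZATION = 0.33   # one frame cycle of data, about two of waiting

total_gbytes = WAVELENGTHS * GBPS_PER_WAVELENGTH / 8   # 80 GB/s of optical bandwidth
per_direction = total_gbytes / 2                        # 40 GB/s each way

# 10/64 is 15.6%, which the text rounds to 15%.
command_overhead = COMMANDS_PER_TRANSACTION * COMMAND_BYTES / CACHELINE_BYTES
southbound_data = per_direction * (1 - command_overhead)
northbound_data = per_direction * NORTHBOUND_UTILIZATION
limit = southbound_data + northbound_data

print(f"southbound {southbound_data:.2f} GB/s, "
      f"northbound {northbound_data:.2f} GB/s, limit {limit:.2f} GB/s")
```

Note that the 33.75GB/s southbound figure corresponds to the unrounded overhead of 10/64 rather than to exactly 15%.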
   For DWA, the overhead of wavelength assignment is 5 cycles (4 cycles for the
DIMM ID and 1 for direction) per frame. With 1066Mtps DIMMs and 10Gbps
wavelengths, a frame is 20 cycles long. Hence DWA takes 25% (20GB/s) of the total
optical bandwidth, and the remaining bandwidth (60GB/s) is used for commands and
data. In DWA, commands do not carry a DIMM ID, hence the size of a command is
4 bytes and the total command overhead is 12.5%. Thus with 64 wavelengths at
10Gbps the upper bound on the maximum possible bandwidth is 52GB/s. DWA overhead
increases as more devices are added to the channel. DWA uses a certain number of
initial cycles of every frame; hence, if the frame length becomes too short,
additional wavelengths must be added to compensate for the bandwidth lost to
DWA overhead.
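
One reading of the DWA bound, written out as a small function. The function and parameter names are hypothetical, and treating the command overhead as a fraction of the bandwidth left after wavelength assignment is an assumption about how the 52GB/s figure was obtained.

```python
def dwa_upper_bound(wavelengths, gbps, overhead_cycles, frame_cycles,
                    command_bytes, commands_per_txn, cacheline_bytes=64):
    """Upper bound on data bandwidth in GB/s (illustrative model only)."""
    total = wavelengths * gbps / 8                             # raw optical bandwidth
    after_dwa = total * (1 - overhead_cycles / frame_cycles)   # minus assignment cycles
    command_overhead = commands_per_txn * command_bytes / cacheline_bytes
    return after_dwa * (1 - command_overhead)                  # minus command bytes


# 64 wavelengths at 10 Gbps, 5 of 20 frame cycles for assignment,
# two 4-byte commands per 64-byte transaction (12.5% command overhead).
bound = dwa_upper_bound(64, 10, overhead_cycles=5, frame_cycles=20,
                        command_bytes=4, commands_per_txn=2)
print(f"{bound:.1f} GB/s")
```

Under these assumptions the result is 52.5GB/s, which matches the 52GB/s quoted above after truncation.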

5.2     Sustained Bandwidth Study

In this study our goal is to find the bandwidth that the memory system can sustain
at a reasonable latency. Figure 5.1 shows our initial results for DDR2 and DDR3
memories. It compares FBDIMM (with 8 and 16 DIMMs per channel), OCDIMM-BASE,
OCDIMM-SWA with 4C4D (4 optical subchannels with 4 DIMMs per subchannel)
and 8C2D (8 optical subchannels with 2 DIMMs per subchannel), OCDIMM T-DWA
and OCDIMM C-DWA.
   For DDR2-667Mtps (Figure 5.1(a)), FBDIMM sustains 5-6GB/s while OCDIMM-BASE,
which is equivalent to FBDIMM running an electrical bus at 10GHz, sustains
12GB/s. The gain is limited by the protocol and topology: even though we can
run electrical buses at very high speed, memory bandwidth will not improve
because of poor utilization. OCDIMM-SWA has two configurations:
i) 32 wavelengths in each direction are divided into 4 subchannels and each
subchannel is loaded with 4 DIMMs, and ii) 32 wavelengths in each direction are
divided into 8 subchannels and each subchannel is loaded with 2 DIMMs. Both
OCDIMM-SWA configurations sustain 20GB/s. OCDIMM-DWA performs much better than
all other configurations: T-DWA sustains 23GB/s and C-DWA sustains 38GB/s. At
higher arrival rates T-DWA behaves like SWA, because the transaction queue
becomes full and T-DWA assigns very few (total available λs / transaction queue
size) wavelengths per transaction. As discussed earlier, C-DWA tries
to reach the DRAM system quickly, reducing queuing delays, whereas T-DWA focuses
on throughput (transactions per frame cycle). In T-DWA at higher arrival rates, in
addition to waiting for the DRAM response, transactions also have to wait for the
bus, which affects overall performance. OCDIMM DWA significantly reduces average

[Figure 5.1, panels (a) DDR2-667MHz and (b) DDR3-1066MHz. X axis: Observed
Bandwidth (GB/s), 0 to 80. Y axis: Latency (ns), log scale. Series:
FBDIMM-8DIMMs, FBDIMM-16DIMMs, OCDIMM-BASE, OCDIMM-SWA 8C2D, OCDIMM-SWA 4C4D,
OCDIMM T-DWA, OCDIMM C-DWA.]

                   Figure 5.1: Sustained Bandwidth Comparison
The transaction arrival rate is varied from 1 to 80GB/s; the observed bandwidth is
plotted on the X axis and the corresponding latency at that bandwidth on the Y axis
(log scale). Random (stress) traces are used with a Poisson arrival rate and uniform
address distribution. The DRAM system was configured for high performance memory
mapping (interleaved cachelines), a 2:1 READ:WRITE ratio, greedy transaction
scheduling policy, closed page mode and a transaction queue depth of 32. All
resources are kept the same to make a fair, unbiased comparison. The number of
DIMMs is 16 in all but the FBDIMM-8DIMMs configuration. The total number of
wavelengths is kept at 64 in all OCDIMM variations.
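
A random stress trace of the kind described in the caption could be generated along the following lines. This is a minimal sketch, not the generator used for the experiments: the function name, parameters and seed are hypothetical, and the output uses the (address, TYPE, CPU cycle) tuple format described in Chapter 4.

```python
import random


def synthetic_trace(n, mean_gap_cycles, address_space,
                    read_fraction=2 / 3, line_bytes=64, seed=0):
    """Return n (address, TYPE, cpu_cycle) tuples (illustrative only)."""
    rng = random.Random(seed)
    cycle = 0.0
    trace = []
    for _ in range(n):
        # Exponentially distributed gaps between requests give Poisson arrivals.
        cycle += rng.expovariate(1.0 / mean_gap_cycles)
        # Uniform address distribution over cacheline-aligned addresses.
        address = rng.randrange(0, address_space, line_bytes)
        # A read fraction of 2/3 corresponds to a 2:1 READ:WRITE ratio.
        kind = "READ" if rng.random() < read_fraction else "WRITE"
        trace.append((address, kind, int(cycle)))
    return trace


trace = synthetic_trace(n=1000, mean_gap_cycles=20, address_space=1 << 30)
reads = sum(1 for _, kind, _ in trace if kind == "READ")
print(len(trace), reads)
```

Lowering `mean_gap_cycles` raises the offered arrival rate, which is the knob varied along the X axis of the figure.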

latency. Compared with FBDIMM, latency has been brought down from 80ns to 50ns
(a 37% reduction).
    For DDR3-1066Mtps (Figure 5.1(b)) we see similar behavior, with C-DWA
outperforming the other protocols. C-DWA sustains 46GB/s with an average latency
of 43ns. It provides 8 times and 1.8 times improvement in bandwidth over FBDIMM
and OCDIMM-SWA respectively. Given 80GB/s of total optical bandwidth, C-DWA is
able to provide 46GB/s, which is close to its theoretical limit of 52GB/s as
discussed in Section 5.1. T-DWA sustains 34GB/s with an average latency of 60ns.
T-DWA does not improve latency as much as C-DWA. T-DWA tries to do a fair
allocation irrespective of the state of transactions, while C-DWA prefers to get
data from the DIMMs as soon as possible. For example, consider a case where 3 read
transactions are given to 3 different DIMMs. Just when all DIMMs are ready
to send back the data, a new transaction arrives. T-DWA will divide the bandwidth
amongst the four transactions and will allow the new transaction to start early. In
the next cycle it will make the same allocations. This approach is good when there
are few transactions in the queue, but as the queue fills up, new incoming requests
are turned down, which degrades performance. C-DWA will give a few wavelengths to
the new transaction in the current scheduling cycle, because starting a transaction
earlier keeps the DRAM devices busier (RAS has higher priority than DRIVE in
C-DWA). But in the next scheduling cycle, it will give higher priority to incomplete
DRIVE commands from previous cycles. Thus C-DWA tries to improve throughput
without losing out on average latency.
    The key takeaway from these graphs is that DWA can push the memory wall
further by intelligently adapting to arrival rates and utilizing system resources
to the fullest.
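
The contrast between the two policies can be sketched as follows. This is a simplified illustration based only on the description in this section, not the algorithms as specified earlier in the thesis: the function names, the dictionary fields, the integer floor in the T-DWA share and the exact C-DWA ranking are all assumptions.

```python
def tdwa_share(total_wavelengths: int, queued_transactions: int) -> int:
    """T-DWA style equal split across queued transactions (floor is assumed)."""
    if queued_transactions <= 0:
        return 0
    return total_wavelengths // queued_transactions


def cdwa_rank(command: dict) -> int:
    """C-DWA style ranking; a lower rank is served first (one reading of the text)."""
    if command["kind"] == "DRIVE" and command["carried_over"]:
        return 0   # unfinished DRIVE from an earlier scheduling cycle
    if command["kind"] == "RAS":
        return 1   # start new transactions to keep the DRAM devices busy
    return 2       # everything else, including newly issued DRIVE commands


pending = [
    {"id": "new-read", "kind": "RAS", "carried_over": False},
    {"id": "old-read", "kind": "DRIVE", "carried_over": True},
    {"id": "fresh-drive", "kind": "DRIVE", "carried_over": False},
]
order = [c["id"] for c in sorted(pending, key=cdwa_rank)]
print(tdwa_share(64, 4), tdwa_share(64, 32), order)
```

With 64 wavelengths, the equal split drops from 16 wavelengths per transaction with four queued transactions to 2 with a full 32-entry queue, which is the effect described for T-DWA at high arrival rates.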

5.3         Sensitivity Analysis

In Section 5.2 we saw how different designs perform with respect to sustained
memory bandwidth. We found that OCDIMM C-DWA nearly doubles the sustained
bandwidth of OCDIMM-SWA (1.8 to 1.9 times), while T-DWA does not perform
as well.
   In this section we try to answer the following questions with respect to OCDIMM
DWA:
   1) How many wavelengths are necessary for a given capacity on a channel for
a particular algorithm? What should the optical data rate be? Or, how much
capacity can a particular algorithm support with a given number of wavelengths?
   2) How many wavelengths are necessary for different high speed DDR3 memories
for a particular algorithm? What should the optical data rate be?
   3) Which algorithm is good for a particular configuration?
   4) If more wavelengths cannot be added, is it possible to scale bandwidth by
increasing the optical data rate? In other words, which approach is better: adding
more wavelengths or increasing the optical data rate?
   We vary all possible parameters and explore the design space to find optimal
configurations, common trends and patterns. We call this sensitivity analysis, and
divide it into two parts: a) capacity sensitivity analysis, and b) latency
sensitivity analysis.


5.3.1        Capacity Sensitivity Analysis

In this study we address the problem of simultaneous scaling of bandwidth and
capacity. We vary the optical resources (number of wavelengths and optical data
rate) for different capacities. Figure 5.2 shows one such configuration, where we
fixed the number of wavelengths at 16. Subfigures 5.2(a), 5.2(b) and 5.2(c) show
16 wavelengths running at 10Gbps, 20Gbps and 40Gbps respectively. We perform an
analysis similar to that of Section 5.2, using random (stress) traces with a Poisson
arrival rate, uniform address distribution, high performance memory mapping
(interleaved cachelines), a 2:1 READ:WRITE ratio, greedy transaction scheduling
policy, closed page mode, transaction queue depth = 2 * number of DIMMs on a
channel, and DDR3-1066Mtps DIMMs. For each capacity we vary the request arrival
rate and note the sustained bandwidth and corresponding latency. The number of
wavelengths and the optical data rate are kept constant. The analysis is done for
both T-DWA and C-DWA. In all these graphs, the X axis shows channel capacity. The
number of DIMMs on a channel takes the values 4, 8, 16, 32, 64 and 128 (each DIMM
with 2GB capacity). For each capacity, the primary Y axis plots the observed
sustained bandwidth (GB/s) and the secondary Y axis plots the corresponding average
latency in nanoseconds.
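
The design space explored in this section can be written down as a parameter grid. The sketch below is illustrative: the names are hypothetical, the wavelength counts are those of Figures 5.2 to 5.6, and running the full cross product of all three parameters is an assumption.

```python
from itertools import product

WAVELENGTH_COUNTS = [16, 32, 64, 96, 128]   # one per figure, 5.2 to 5.6
DATA_RATES_GBPS = [10, 20, 40]              # subfigures (a), (b), (c)
DIMM_COUNTS = [4, 8, 16, 32, 64, 128]       # X axis of each plot
DIMM_CAPACITY_GB = 2


def sweep():
    """Yield one configuration per (wavelengths, data rate, DIMM count) point."""
    for wavelengths, gbps, dimms in product(WAVELENGTH_COUNTS,
                                            DATA_RATES_GBPS, DIMM_COUNTS):
        yield {
            "wavelengths": wavelengths,
            "gbps": gbps,
            "dimms": dimms,
            "capacity_gb": dimms * DIMM_CAPACITY_GB,
            "queue_depth": 2 * dimms,                      # as stated in the text
            "optical_bandwidth_gbytes": wavelengths * gbps / 8,
        }


configs = list(sweep())
print(len(configs), configs[0])
```

Each configuration is then driven with a range of request arrival rates to find its sustained bandwidth and the corresponding latency.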
   With 16 wavelengths running at 10Gbps (Figure 5.2(a)), both T-DWA and C-DWA
fail to scale as capacity increases. For 4 DIMMs C-DWA sustains 8GB/s and T-DWA
sustains 6GB/s. DWA overhead limits the use of the entire optical bandwidth
when more than 4 DIMMs are added to the channel.
   In Figure 5.2(b) we increased the optical data rate to 20Gbps. Now the controller
gets enough bandwidth to allow parallel transactions. As we can see, both DWA
algorithms perform better than in the earlier 10Gbps configuration. T-DWA scales
from 10GB/s to 14GB/s up to 16 DIMMs, but after that point both bandwidth and
latency degrade. Both DWA algorithms suffer from DWA overhead, which increases
as more DIMMs are added. But in the case of T-DWA, more devices on the channel
means more transactions in the queue waiting for the bus. This situation is equivalent

[Figure 5.2, panels (a) 16@10gbps, (b) 16@20gbps and (c) 16@40gbps. X axis:
Capacity (DIMMs): 4D, 8D, 16D, 32D, 64D, 128D. Primary Y axis: Sustained
Bandwidth (GB/s). Secondary Y axis: Latency (ns), 0 to 250. Series: T-DWA B/W,
C-DWA B/W, T-DWA Latency, C-DWA Latency.]

           Figure 5.2: Capacity Sensitivity Analysis : 16 wavelengths

to the one in which T-DWA fails to scale at high arrival rates, but this time it is
the larger number of devices per channel, with a uniform address distribution, that
causes the queue to fill up quickly. C-DWA is able to scale linearly up to 32 DIMMs,
but after that bandwidth flattens out. A high number of waiting transactions does
not affect C-DWA, since it prioritizes commands and does not change its allocation
based on the total number of transactions in the queue. As more devices are added,
more REFRESH commands need to be issued, for which reservation is done in advance.
Propagation delay plays an important role in deeper channels. As discussed in
Section 3.2.3, we cannot change the direction of wavelengths allocated for reading
data until the last bit has been received. The longer the propagation delay, the
greater the number of such wavelengths (especially when we have more read
transactions). From a latency perspective, even with 128 DIMMs C-DWA is able to
keep latency down to 110ns.
   From Figure 5.2(c) we can see that T-DWA now scales up to 32 DIMMs, sustaining
35GB/s, and degrades afterwards. C-DWA gives a maximum of 58GB/s with the
128-DIMM configuration, which is about twice the 28GB/s obtained with the 20Gbps
configuration. The scaling factor of 2 remains the same for every capacity. Latency
remains almost the same as with the 20Gbps configuration. The only thing that has
changed across these three configurations is the controller's capacity to send or
receive more data; the rest of the parameters and dependencies remain the same.
C-DWA uses the additional bandwidth to improve throughput (i.e. bandwidth) rather
than latency. With 16 wavelengths running at 40Gbps and 128 DIMMs on the channel,
frame capacity is 512 bits = 64 bytes. DWA can read all the data at once or split
the bus to read from multiple devices. Time multiplexing, as done by the FBDIMM
protocol, adds additional delays (in fixed or variable modes of operation): the
propagation delay of the first transaction is added to the waiting time of the
second transaction. By doing wavelength multiplexing, DWA increases the time
required to transfer the data, but it avoids these additional delays, which
increase with the total number of devices per channel.
   Figure 5.3 shows a similar analysis with 32 wavelengths. With 10Gbps, neither
C-DWA nor T-DWA scales beyond 16 DIMMs. As discussed in Section 5.1, we
are limited by command and DWA overheads. DWA overhead is particularly
significant when the frame cycle is too short. As we add more DIMMs to a channel,
DWA overhead increases and hence both DWA algorithms fail to scale with capacity.
   In Figure 5.3(b) we see that both C-DWA and T-DWA scale upwards. At a
20Gbps data rate, the number of cycles in a frame is 40, so DWA overhead is now
limited to 12.5% in the case of 16 DIMMs. This gives the controller more room
for optimization. T-DWA is now able to sustain bandwidth at 64 DIMMs and 128
DIMMs. C-DWA sustains 10GB/s more than T-DWA, with 40ns lower latency. The sudden
change in T-DWA's behavior is due to the additional wavelengths it now has, which
can be distributed across the transaction queue. From Figure 5.3(c) we can see that
C-DWA gives 86GB/s with 70ns latency. Compared with the 16 wavelengths at 40Gbps
configuration, the bandwidth scaling factor is about 1.5 (86/58) and latency is
brought down from 110ns to 70ns (about a 36% reduction). This supports our goal of
adding more optical resources to support more capacity and also obtain more
bandwidth.
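
How the DWA overhead fraction depends on channel depth and data rate can be sketched as below. The text gives two data points (5 overhead cycles for 16 DIMMs, and frames of 20 and 40 cycles at 10Gbps and 20Gbps); generalizing the DIMM ID to ceil(log2(DIMMs)) cycles and scaling the frame length linearly with data rate are assumptions, and the function name is hypothetical.

```python
def dwa_overhead_fraction(dimms: int, gbps: int) -> float:
    """Fraction of each frame spent on wavelength assignment (illustrative)."""
    id_cycles = (dimms - 1).bit_length()   # ceil(log2(dimms)): 4 cycles for 16 DIMMs
    overhead_cycles = id_cycles + 1        # plus 1 cycle for the direction
    frame_cycles = 20 * gbps // 10         # 20 cycles at 10 Gbps, 40 at 20 Gbps
    return overhead_cycles / frame_cycles


for dimms in (4, 16, 128):
    row = [f"{dwa_overhead_fraction(dimms, gbps):.3f}" for gbps in (10, 20, 40)]
    print(dimms, row)
```

Under these assumptions the overhead for 16 DIMMs falls from 25% at 10Gbps to 12.5% at 20Gbps, matching the figures in the text, while at 10Gbps it grows to 40% for 128 DIMMs.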
   Figures 5.4, 5.5 and 5.6 show results for 64, 96 and 128 wavelengths
respectively. Figure 5.4(a) shows that C-DWA and T-DWA supported 32 DIMMs, but as
seen earlier in Figure 5.3(a), adding more devices increases DWA overhead.
Increasing the optical data rate removes the overhead problem. In Figure 5.5(a) we
can see that both C-DWA and T-DWA supported 128 DIMMs. The bandwidth consumed by
DWA overhead is compensated for by additional wavelengths. With 96 wavelengths
C-DWA is able to sustain 90GB/s and 96GB/s with 20Gbps and 40Gbps datarate

[Figure 5.3, panels 32 Wavelengths @ 10Gbps (a) and 32 Wavelengths @ 20Gbps.
X axis: Capacity (DIMMs): 4D, 8D, 16D, 32D, 64D, 128D. Primary Y axis: Sustained
Bandwidth (GB/s). Secondary Y axis: Latency (ns), 0 to 250. Series: T-DWA B/W,
C-DWA B/W, T-DWA Latency, C-DWA Latency.]

                                         20                                                         60
                                                                                                    50
                                                                                                    40
                                         10                                                         30
                                                                                                    20
                                                                                                    10
                                         0                                                          0
                                          4D              8D       16D              32D    64D   128D

                                                                      Capacity( DIMMs)




                                                                   (b) 32@20gbps

                                                                 32 Wavelengths@ 40Gbps

                                         100                                                        250

                                                T-DWA B/W                                           240
                                                C-DWA B/W                                           230
                                         90
                                                T-DWA Latency                                       220
                                                C-DWA Latency                                       210
                                         80                                                         200
                                                                                                    190
                                                                                                    180
                                         70
                                                                                                    170
            Sustained Bandwidth (GB/s)




                                                                                                    160
                                         60                                                         150
                                                                                                    140
                                                                                                          Latency (ns)




                                                                                                    130
                                         50
                                                                                                    120
                                                                                                    110
                                         40                                                         100
                                                                                                    90
                                                                                                    80
                                         30
                                                                                                    70
                                                                                                    60
                                         20                                                         50
                                                                                                    40
                                                                                                    30
                                         10
                                                                                                    20
                                                                                                    10
                                          0                                                         0
                                           4D              8D       16D              32D   64D   128D

                                                                      Capacity( DIMMs)




                                                                   (c) 32@40gbps

           Figure 5.3: Capacity Sensitivity Analysis : 32 wavelengths

           Figure 5.4: Capacity Sensitivity Analysis: 64 wavelengths
           [Panels: (a) 64 wavelengths @ 10 Gbps, (b) 64 wavelengths @ 20 Gbps, (c) 64 wavelengths @ 40 Gbps.
            X-axis: Capacity (DIMMs): 4D, 8D, 16D, 32D, 64D, 128D. Left Y-axis: Sustained Bandwidth (GB/s).
            Right Y-axis: Latency (ns). Series: T-DWA B/W, C-DWA B/W, T-DWA Latency, C-DWA Latency.]

           Figure 5.5: Capacity Sensitivity Analysis: 96 wavelengths
           [Panels: (a) 96 wavelengths @ 10 Gbps, (b) 96 wavelengths @ 20 Gbps, (c) 96 wavelengths @ 40 Gbps.
            X-axis: Capacity (DIMMs): 4D, 8D, 16D, 32D, 64D, 128D. Left Y-axis: Sustained Bandwidth (GB/s).
            Right Y-axis: Latency (ns). Series: T-DWA B/W, C-DWA B/W, T-DWA Latency, C-DWA Latency.]

           [Panel (a) 128 wavelengths @ 10 Gbps. X-axis: Capacity (DIMMs): 4D, 8D, 16D, 32D, 64D, 128D.
            Left Y-axis: Sustained Bandwidth (GB/s). Right Y-axis: Latency (ns).
            Series: T-DWA B/W, C-DWA B/W, T-DWA Latency, C-DWA Latency.]

           [Panel: 128 wavelengths @ 20 Gbps. Left Y-axis: Sustained Bandwidth (GB/s). Right Y-axis: Latency (ns).
            Series: T-DWA B/W, C-DWA B/W, T-DWA Latency, C-DWA Latency.]
                                                                                                   30
                                         10
                                                                                                   20
                                                                                                   10
                                          0                                                        0
                                           4D            8D        16D              32D   64D   128D

                                                                     Capacity( DIMMs)




                                                                  (b) 128@20gbps

                                                                128 Wavelengths@ 40Gbps

                                         120                                                       250
                                                T-DWA B/W                                          240
                                                C-DWA B/W
                                                                                                   230
                                                T-DWA Latency
                                                                                                   220
                                                C-DWA Latency
                                         100                                                       210
                                                                                                   200
                                                                                                   190
                                                                                                   180
                                                                                                   170
                                         80
            Sustained Bandwidth (GB/s)




                                                                                                   160
                                                                                                   150
                                                                                                   140
                                                                                                         Latency (ns)




                                                                                                   130
                                         60
                                                                                                   120
                                                                                                   110
                                                                                                   100
                                                                                                   90
                                         40
                                                                                                   80
                                                                                                   70
                                                                                                   60
                                                                                                   50
                                         20                                                        40
                                                                                                   30
                                                                                                   20
                                                                                                   10
                                          0                                                        0
                                           4D            8D        16D              32D   64D   128D

                                                                     Capacity( DIMMs)




                                                                  (c) 128@40gbps

           Figure 5.6: Capacity Sensitivity Analysis : 128 wavelengths

respectively (figures 5.5(b), 5.5(c)).
   Figure 5.6(a) can be compared to figure 5.4(b) since the total optical
bandwidth is the same. C-DWA behaves similarly in both configurations, while
T-DWA performs slightly better with 128 wavelengths than with 64. Both DWA
algorithms allocate wavelengths to either transactions or commands, so the more
wavelengths there are, the more scope there is for parallelism. In some cases
increasing the optical speed does not help. For example, suppose the memory
controller wants to send a one-byte DRIVE command to a DIMM. With either
64w@20gbps (a 40-bit frame cycle, neglecting overheads) or 128w@10gbps (a
20-bit frame cycle), a single wavelength is more than enough to carry one byte.
A longer frame therefore suffers more internal fragmentation. To minimize
fragmentation, most DRAM commands are 4 bytes long and data packets are sized
in powers of 2 (usually 8/16 bytes per DATA/DRIVE command).
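
   The fragmentation argument above can be sketched numerically. The snippet
below is only an illustration under stated assumptions: a wavelength is
allocated whole for one frame, overheads are neglected, and the 2ns frame
duration is inferred from the 40-bit and 20-bit frame cycles quoted in the text
rather than taken from the protocol definition. The function names are
hypothetical.

```python
def bits_per_wavelength(rate_gbps, frame_ns):
    """Bits one wavelength carries in one frame (Gbps x ns = bits)."""
    return rate_gbps * frame_ns


def wavelengths_needed(payload_bits, slot_bits):
    """Whole wavelengths needed for a payload (integer ceiling division)."""
    return (payload_bits + slot_bits - 1) // slot_bits


def wasted_bits(payload_bits, slot_bits):
    """Allocated but unused bits, i.e. internal fragmentation."""
    return wavelengths_needed(payload_bits, slot_bits) * slot_bits - payload_bits


# Assumption: 2 ns frame, implied by 40 bits at 20gbps and 20 bits at 10gbps.
FRAME_NS = 2

for label, rate_gbps in (("64w@20gbps", 20), ("128w@10gbps", 10)):
    slot = bits_per_wavelength(rate_gbps, FRAME_NS)
    for payload_bytes in (1, 4):
        bits = payload_bytes * 8
        print(label, f"{payload_bytes}-byte payload:",
              wavelengths_needed(bits, slot), "wavelength(s),",
              wasted_bits(bits, slot), "bits wasted")
```

   Under these assumptions a one-byte command wastes 32 bits of a 40-bit slot
but only 12 bits of a 20-bit slot, while a 4-byte command wastes 8 bits in
either configuration.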
   An important point to note is that by adding more optical resources we reach
the minimum possible latency of the DRAM system but not its peak throughput.
With 64 wavelengths@40gbps we get 90GB/s of memory bandwidth, and with 128
wavelengths@40gbps we get close to 100GB/s. The scaling factor thus decreases
even as we add more optical resources. To get the maximum bandwidth from a
particular configuration we need more wavelengths, since the wavelength is the
basic unit of allocation. Supporting 128 DIMMs would require more wavelengths
still. In the 16-DIMM configuration we needed 160GB/s of optical bandwidth (64
wavelengths@20gbps or 32 wavelengths@40gbps) to reach 65% of the DIMMs'
combined peak throughput (83GB/s = 65% of 16*8 GB/s). As discussed earlier,
several factors such as internal fragmentation, propagation delay, command
overhead and DWA overhead affect the resulting memory bandwidth.
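
   The arithmetic in this paragraph can be checked with a few lines of code.
This is a sketch that assumes raw optical bandwidth is simply wavelengths times
per-wavelength rate divided by 8 bits per byte, with no encoding or protocol
overhead; the function names are hypothetical.

```python
def optical_bandwidth_gbytes(wavelengths, rate_gbps):
    """Raw optical bandwidth in GB/s (8 bits per byte, overheads ignored)."""
    return wavelengths * rate_gbps / 8


def fraction_of_dram_peak(sustained_gbytes, dimms, per_dimm_peak_gbytes):
    """Sustained bandwidth as a fraction of the DIMMs' combined peak."""
    return sustained_gbytes / (dimms * per_dimm_peak_gbytes)


# Both configurations named in the text offer the same raw optical bandwidth.
print(optical_bandwidth_gbytes(64, 20))  # 160.0
print(optical_bandwidth_gbytes(32, 40))  # 160.0

# 83 GB/s sustained against 16 DIMMs at 8 GB/s each.
print(round(fraction_of_dram_peak(83, 16, 8), 2))  # 0.65

# Doubling 64 -> 128 wavelengths at 40gbps moves 90 GB/s to about 100 GB/s.
print(round(100 / 90, 2))  # 1.11
```

   Doubling the number of wavelengths at 40gbps therefore buys roughly an 11%
gain in sustained bandwidth, which is the decreasing scaling factor noted
above.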
   To conclude this section:
   1) T-DWA scales only with more wavelengths, and its performance degrades with
deeper channels.
   2) C-DWA scales upwards most of the time and outperforms T-DWA in every
configuration.
   3) DWA overhead dominates the use of optical bandwidth when the frame length
is short. With 16, 32 or 64 wavelengths at a 10gbps datarate, both DWA
algorithms fail to perform when more devices are added to the channel.
   4) Table 5.1 summarizes the maximum number of devices that each optical
configuration can support with an average latency of at most 100ns. Naturally,
these configurations can support fewer devices at a lower average latency, but
the bandwidth numbers in table 5.1 are upper bounds.


5.3.2    Latency Sensitivity Analysis

In this section we study how DWA reacts to faster DRAM memories.
   By using faster DRAM (with lower DRAM latencies) we want to evaluate whether
our algorithms scale with respect to bandwidth and, importantly, latency. This
analysis gives us an idea of how many wavelengths are enough for a particular
DRAM speed. We use DDR3 memories at 1066, 1333, 1600 and 2000 Mtps, and keep the
channel capacity constant at 16 DIMMs (32GB). For each DDR3 memory we vary the
optical resources and record the sustained bandwidth (GB/s) and average latency
(ns). The graphs are similar to the ones in section 5.3.1.
   Figure 5.7 shows the results obtained for 16 wavelengths running at 10gbps
(5.7(a)), 20gbps (5.7(b)) and 40gbps (5.7(c)). For DDR3 2000Mtps the frame
cycle at 10gbps is 5, which can carry only the DWA overhead for 16 DIMMs (5
cycles). Hence for DDR3

      [Figure 5.7, panels (a) to (c): sustained bandwidth (GB/s, left axis) and
      latency (ns, right axis) versus DDR3 frequency (axis labels: 1033Mhz,
      1333Mhz, 1600Mhz, 2000Mhz). Series: T-DWA B/W, C-DWA B/W, T-DWA Latency,
      C-DWA Latency. Panels: (a) 16@10gbps, (b) 16@20gbps, (c) 16@40gbps.]

                Figure 5.7: Latency Sensitivity Analysis : 16 wavelengths

 Optical        T-DWA                              C-DWA
 Resources      DIMMs   b/w (GB/s)   lat (ns)      DIMMs   b/w (GB/s)   lat (ns)
 16w@40gbps     16      32           60            64      58           68
 32w@20gbps     64      58           90            128     67           98
 32w@40gbps     64      68           82            128     86           70
 64w@10gbps     32      37           90            32      46           51
 64w@20gbps     128     70           100           128     84           80
 64w@40gbps     128     78           80            128     90           66
 96w@10gbps     128     67           100           128     73           84
 96w@20gbps     128     72           90            128     92           80
 96w@40gbps     128     80           70            128     95           60
 128w@10gbps    128     74           92            128     85           80
 128w@20gbps    128     76           90            128     95           64
 128w@40gbps    128     84           70            128     98           57

              Table 5.1: Summary of Capacity Sensitivity Analysis
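
   One way to read the table is to relate each sustained bandwidth to the raw
optical bandwidth of its configuration. The sketch below does this for the
C-DWA column only. It assumes raw bandwidth equals wavelengths times rate
divided by 8, ignoring all overheads, so the resulting percentages are rough
indicators and not measured link utilization; the names are hypothetical.

```python
# C-DWA rows of Table 5.1: (wavelengths, rate in Gbps, sustained GB/s).
C_DWA_ROWS = [
    (16, 40, 58), (32, 20, 67), (32, 40, 86), (64, 10, 46),
    (64, 20, 84), (64, 40, 90), (96, 10, 73), (96, 20, 92),
    (96, 40, 95), (128, 10, 85), (128, 20, 95), (128, 40, 98),
]


def raw_optical_gbytes(wavelengths, rate_gbps):
    """Raw optical bandwidth in GB/s (assumption: no overhead of any kind)."""
    return wavelengths * rate_gbps / 8


def utilization(wavelengths, rate_gbps, sustained_gbytes):
    """Sustained memory bandwidth divided by raw optical bandwidth."""
    return sustained_gbytes / raw_optical_gbytes(wavelengths, rate_gbps)


for w, rate, sustained in C_DWA_ROWS:
    raw = raw_optical_gbytes(w, rate)
    share = utilization(w, rate, sustained)
    print(f"{w}w@{rate}gbps: raw {raw:.0f} GB/s, "
          f"sustained {sustained} GB/s, ratio {share:.1%}")
```

   The ratio falls steadily as optical resources grow, which matches the
observation that added wavelengths mainly buy latency and not throughput.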

2000Mtps the analysis is done only with more than 48 wavelengths. With 16
wavelengths running at 10gbps (figure 5.7(a)), both C-DWA and T-DWA fail to
scale with DRAM frequency. As the DRAM frequency increases, the frame length
goes down, which means the DWA overhead uses most of the optical bandwidth
(more than 50%). Thus, to support these devices, either more wavelengths must
be added or the optical datarate must be increased. At 20gbps (figure 5.7(b))
C-DWA and T-DWA are able to support 1600Mtps DIMMs. For 1600Mtps DIMMs,
C-DWA provides 27GB/s at 43ns, while T-DWA sustains 24GB/s with 70ns average
latency.
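
   The effect of a shrinking frame on usable bandwidth can be illustrated with
a simple model. It assumes the DWA overhead is a fixed number of cycles per
frame for a given number of DIMMs and is counted in the same units as the frame
length; only the 5-cycle frame and 5-cycle overhead come from the text, the
20-cycle frame is a hypothetical value for contrast, and the function names are
made up for this sketch.

```python
def payload_share(frame_cycles, overhead_cycles):
    """Fraction of a frame left for commands and data after DWA overhead."""
    if frame_cycles <= 0:
        raise ValueError("frame_cycles must be positive")
    usable = max(0, frame_cycles - overhead_cycles)
    return usable / frame_cycles


def overhead_dominates(frame_cycles, overhead_cycles):
    """True when the overhead consumes more than half of the frame."""
    return payload_share(frame_cycles, overhead_cycles) < 0.5


# From the text: frame cycle of 5 and 5 cycles of DWA overhead for 16 DIMMs.
print(payload_share(5, 5))        # 0.0
print(overhead_dominates(5, 5))   # True

# Hypothetical longer frame with the same overhead, for contrast.
print(payload_share(20, 5))       # 0.75
print(overhead_dominates(20, 5))  # False
```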
   In figure 5.7(c) C-DWA and T-DWA perform similarly (their curves run in
parallel). Both algorithms have almost the same latency but not the same
bandwidth, which means T-DWA is not able to fully utilize the DRAM bandwidth.
T-DWA strictly follows wavelength multiplexing, i.e. it always splits the bus
equally. C-DWA splits the bus according to the state of the transactions. If
most of the transactions want to send RAS/CAS (initiating a transaction at the
DRAM level), C-DWA delays data transfers from the DIMMs. But once a transfer
from a DIMM is initiated, it assigns more wavelengths to it. In both cases the
goal of C-DWA is to start DRAM activity as soon as possible and, once it is
done, to get the data quickly so that a new transaction can be started. Once
data is received from DRAM, the OMC can keep it in its buffers, but it cannot
buffer the data of two different transactions at a time. The OMC lacks
knowledge of transactions. The memory controller can still start a write
transaction, because a write does not use the outbound buffers. Another read
transaction can be issued as long as the data of the previous read transaction
is taken out of the outbound buffers before DRAM prepares the data of the new
transaction.
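
   The contrast between the two allocation styles can be sketched as follows.
This is not the algorithm defined earlier in the thesis: the two-state
transaction model, the rule of one wavelength per pending command, and all
names are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    name: str
    state: str  # "COMMAND" (RAS/CAS pending) or "DATA" (transfer initiated)


def split_evenly(total, names):
    """Divide `total` wavelengths as evenly as possible among `names`."""
    if not names:
        return {}
    base, extra = divmod(total, len(names))
    return {name: base + (1 if i < extra else 0)
            for i, name in enumerate(names)}


def t_dwa_like(total, txns):
    """Equal split across all transactions, regardless of their state."""
    return split_evenly(total, [t.name for t in txns])


def c_dwa_like(total, txns):
    """Assumed policy: serve pending commands first with one wavelength each,
    then give everything that is left to transfers already in progress."""
    commands = [t.name for t in txns if t.state == "COMMAND"]
    data = [t.name for t in txns if t.state == "DATA"]
    if not data:
        return split_evenly(total, commands)
    alloc = {name: (1 if i < total else 0) for i, name in enumerate(commands)}
    remaining = total - sum(alloc.values())
    alloc.update(split_evenly(remaining, data))
    return alloc


txns = [Transaction("A", "COMMAND"), Transaction("B", "COMMAND"),
        Transaction("C", "DATA"), Transaction("D", "DATA")]
print(t_dwa_like(16, txns))  # {'A': 4, 'B': 4, 'C': 4, 'D': 4}
print(c_dwa_like(16, txns))  # {'A': 1, 'B': 1, 'C': 7, 'D': 7}
```

   In this toy setting the state-aware split starts both commands at once and
still leaves the in-flight transfers with nearly twice the wavelengths they
would get under an equal split.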
   At higher DDR3 frequencies the frame cycle shrinks, which means more
wavelengths are needed to send commands or receive data. The DRAM devices are
faster, so whenever they are ready to accept or send data C-DWA adjusts
priorities so that a DIMM does not have to wait too long for bus access. For
example, when the transaction queue is full T-DWA divides the wavelengths
equally amongst all transactions. Each DIMM is then sending data to or
receiving data from the controller, which means that there is no activity at
the DRAM level. This increases DRAM idle time. Keeping faster DRAM devices idle
is counterproductive to the benefits offered by these devices. C-DWA
can adjust itself so that bandwidth is sometimes used for better throughput (as
in T-DWA) and sometimes for latency optimization. In any system design,
performance is limited by the bottleneck. When using high-speed devices, DWA
should always make sure that the bus is not the bottleneck.
   Figure 5.8 shows the results for 32 wavelengths. As seen earlier, due to DWA
overhead both DWA algorithms fail to obtain higher bandwidth from faster DIMMs.
By increasing the optical speed to 20gbps (figure 5.8(b)) and 40gbps (figure
5.8(c)), C-DWA can now sustain 72GB/s and 80GB/s of memory bandwidth,
respectively. As the bus speed increases, the gap between T-DWA and C-DWA
narrows.
   Figures 5.9, 5.10 and 5.11 present the results for 64, 96 and 128
wavelengths respectively. In these graphs we also measure bandwidth and latency
for DDR3 2000Mtps memories. In figure 5.9(a) the DWA algorithms do not scale
with 64 wavelengths@10gbps. At optical datarates of 20gbps (figure 5.9(b)) and
40gbps (figure 5.9(c)), C-DWA is able to support 82GB/s and 96GB/s of bandwidth
for 2000Mtps DIMMs. Average latency is 30ns for 2000Mtps DIMMs. For DDR3
2000Mtps, C-DWA reaches as much as 99GB/s and 107GB/s with 96 and 128
wavelengths running at 40gbps. The minimum latency observed for DDR3 2000Mtps
is 26ns, with C-DWA. As in the previous section, we compare the 64
wavelengths@20gbps configuration (figure 5.9(b)) with the 128
wavelengths@10gbps configuration (figure 5.11(a)). The configuration with 128
wavelengths gives slightly better results than the one with 64 wavelengths for
both DWA algorithms.
   In conclusion,
   1) To support faster DDR3 memories with 16, 32 or 64 wavelengths, DWA needs
an optical rate of 20gbps or more.
   2) C-DWA is able to extract more bandwidth from faster DRAM chips than

      [Figure 5.8, panels (a) 32@10gbps and (b) 32 Wavelengths@ 20Gbps:
      sustained bandwidth (GB/s, left axis) and latency (ns, right axis) versus
      DDR3 frequency (axis labels: 1033Mhz, 1333Mhz, 1600Mhz, 2000Mhz). Series:
      T-DWA B/W, C-DWA B/W, T-DWA Latency, C-DWA Latency.]
                                                                                                                70
                                          20                                                                    60
                                                                                                                50
                                                                                                                40
                                          10                                                                    30
                                                                                                                20
                                                                                                                10
                                            0                                                                    0
                                          1033Mhz                    1333Mhz                      1600Mhz   2000Mhz

                                                                                 DDR3 Frequency




                                                                               (b) 32@20gbps

                                                                         32 Wavelengths@ 40Gbps

                                          90                                                                    250
                                                    T-DWA B/W                                                   240
                                                    C-DWA B/W
                                                                                                                230
                                          80        T-DWA Latency
                                                                                                                220
                                                    C-DWA Latency
                                                                                                                210
                                                                                                                200
                                          70
                                                                                                                190
                                                                                                                180
                                                                                                                170
                                          60
            Sustained Bandwidth (GB/s)




                                                                                                                160
                                                                                                                150
                                          50                                                                    140
                                                                                                                      Latency (ns)




                                                                                                                130
                                                                                                                120
                                          40                                                                    110
                                                                                                                100
                                                                                                                90
                                          30
                                                                                                                80
                                                                                                                70
                                                                                                                60
                                          20
                                                                                                                50
                                                                                                                40

                                          10                                                                    30
                                                                                                                20
                                                                                                                10
                                            0                                                                    0
                                          1033Mhz                    1333Mhz                      1600Mhz   2000Mhz

                                                                                 DDR3 Frequency




                                                                               (c) 32@40gbps

                  Figure 5.8: Latency Sensitivity Analysis : 32 wavelengths


[Figure 5.9: Latency Sensitivity Analysis: 64 wavelengths. Panels: (a) 64 wavelengths @ 10 Gbps, (b) 64 wavelengths @ 20 Gbps, (c) 64 wavelengths @ 40 Gbps. Each panel plots sustained bandwidth (GB/s, left axis) and latency (ns, right axis) for T-DWA and C-DWA against DDR3 frequency (1033 MHz, 1333 MHz, 1600 MHz, 2000 MHz).]


[Figure 5.10: Latency Sensitivity Analysis: 96 wavelengths. Panels: (a) 96 wavelengths @ 10 Gbps, (b) 96 wavelengths @ 20 Gbps, (c) 96 wavelengths @ 40 Gbps. Each panel plots sustained bandwidth (GB/s, left axis) and latency (ns, right axis) for T-DWA and C-DWA against DDR3 frequency (1033 MHz, 1333 MHz, 1600 MHz, 2000 MHz).]


           [Figure 5.11 panels: (a) 128@10gbps, (b) 128@20gbps, (c) 128@40gbps.
           Each panel plots Sustained Bandwidth (GB/s) on the left axis and
           Latency (ns) on the right axis against DDR3 Frequency, with the series
           T-DWA B/W, C-DWA B/W, T-DWA Latency and C-DWA Latency.]

           Figure 5.11: Latency Sensitivity Analysis : 128 wavelengths
T-DWA even with fewer optical resources.
   3) In both DWA algorithms, latency decreases as DDR3 speed increases. T-DWA
can match C-DWA latency only with more optical resources.



5.4     Latency Impact with DWA

In this section we see how different OCDIMM designs perform with real-world
applications. We choose FBDIMM, OCDIMM-Base with 32 wavelengths in each
direction, OCDIMM-SWA with 4C2D (4 subchannels of 16 wavelengths each and
2 DIMMs per channel), and OCDIMM C-DWA with a total of 64 wavelengths. The
total number of DIMMs per channel was 8 and we used DDR2 667MHz DIMMs. We
chose C-DWA over T-DWA since it is more responsive to variations in the
resources available. Figures 5.12, 5.13 and 5.14 show the latency analysis for
the SPEC’06, PARSEC and SPLASH-2 benchmarks respectively. We present results
in two different ways:
   a) Continuous Latency Variations: The X axis plots the number of transactions
while the Y axis plots the average latency of every 10000 transactions. This
averaging leaves out minute details and variations but gives an overall view of
how latency is affected by each of the above designs at different stages of the
application. It also gives an idea of the memory traffic at the controller.
   b) Average Latency Breakup: We divide the transaction latency into 4 major
components and plot the average over all transactions. Each transaction goes
through different finite states before it is marked as completed. When a
transaction arrives at the memory controller, it is put in a queue. A transaction
may have to wait in the queue for two major reasons: i) the transaction
scheduling policy, and ii) resources (DIMM/DRAM) being busy processing requests
from earlier transactions. We measure the time spent in the queue as the
difference between the time when the first command of a transaction is selected
for scheduling and the time when the transaction arrived. When commands are
selected for scheduling they are sent to their respective DIMMs with a
frame-based protocol. In the case of OCDIMM-DWA, commands may reach the
destination DIMM in more than one frame, depending upon the wavelengths assigned
to the transaction. In the other designs all scheduled commands are packed in a
single frame and sent across the southbound channel. We measure the time spent
by commands in the southbound channel as the 'Southbound Latency' component. It
includes the time of flight of all commands in a transaction. Similarly, when a
DIMM is ready to send data back to the controller, we measure the 'Northbound
Latency' component. FBDIMM, OCDIMM-BASE and OCDIMM-SWA add a fixed delay in
order to avoid conflicts on the northbound channel. We add it to the 'Northbound
Latency' component. In the case of OCDIMM-DWA, the OMC on a DIMM might send the
data in more than one frame according to the wavelength assignment. We include
these frame cycles, along with any idle time between these frames, in the
'Northbound Latency' component. The DRAM+OMC or DRAM+AMB component includes the
time spent by a command in the AMB/OMC (decoder/serializer/queue) and the DRAM
processing time.
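
The two presentations described above can be sketched in a few lines of Python.
This is only an illustration of the bookkeeping, not the simulator's code: the
function names, the dictionary keys and the sample numbers are assumptions made
for the example. A trailing partial window is averaged as well, which is also an
assumption.

```python
from statistics import mean

# Hypothetical keys for the four latency components named in the text.
COMPONENTS = ("transaction_queue", "southbound", "dram_omc", "northbound")


def windowed_average(latencies_ns, window=10000):
    """Average latency of each consecutive block of `window` transactions."""
    return [
        mean(latencies_ns[start:start + window])
        for start in range(0, len(latencies_ns), window)
    ]


def average_breakup(transactions):
    """Mean of each latency component (ns) over all transactions.

    `transactions` is a list of dicts holding one value per component.
    """
    return {name: mean(t[name] for t in transactions) for name in COMPONENTS}


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the inputs.
    sample = [
        {"transaction_queue": 10, "southbound": 2, "dram_omc": 45, "northbound": 8},
        {"transaction_queue": 30, "southbound": 4, "dram_omc": 45, "northbound": 12},
    ]
    totals = [sum(t.values()) for t in sample]
    print(windowed_average(totals, window=2))
    print(average_breakup(sample))
```

The first function corresponds to the graphs of type a), the second to the
stacked bars of type b).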
   In figures 5.12, 5.13 and 5.14, the graphs on the left show continuous latency
variations while the graphs on the right show the average latency and its
breakup for each design. Results for the SPEC benchmarks and the OpenOffice trace
mentioned in table 4.3 are shown in figure 5.12.
   GCC is the smallest benchmark and does not demand high memory bandwidth.
From figures 5.12(a) and 5.12(b) we can clearly see that latency is reduced in
successive versions of OCDIMMs. C-DWA was able to achieve a latency of 50ns
(the minimum latency for DDR2 667MHz devices is 45ns). Most of the reduction in
latency comes from the transaction queue and northbound latency components.

 [Figure 5.12 panels: (a) Gcc, (b) Gcc Latency Breakup, (c) SPEC-Mixed,
 (d) SPEC-Mixed Latency Breakup, (e) OpenOffice, (f) OpenOffice Latency Breakup.
 The breakup panels plot Average Latency(ns) for FBDIMM, OCDIMM-Base,
 OCDIMM-Static and OCDIMM C-DWA, split into Northbound, Southbound,
 Transaction Queue and DRAM+OMC components.]

 Figure 5.12: Latency impact on SPEC’06 benchmarks and OpenOffice session
   Red:FBDIMM, Green:OCDIMM-BASE, Blue:OCDIMM-SWA, Aqua:OCDIMM-DWA

 [Figure 5.13 panels: (a) Canneal, (b) Canneal Latency Breakup,
 (c) Fluidanimate, (d) Fluidanimate Latency Breakup, (e) Streamcluster,
 (f) Streamcluster Latency Breakup. The breakup panels plot Average Latency(ns)
 for FBDIMM, OCDIMM-Base, OCDIMM-Static and OCDIMM C-DWA, split into
 Northbound, Southbound, Transaction Queue and DRAM+OMC components.]

 Figure 5.13: Latency impact on PARSEC benchmarks
   Red:FBDIMM, Green:OCDIMM-BASE, Blue:OCDIMM-SWA, Aqua:OCDIMM-DWA

 [Figure 5.14 panels: (a) FFT, (b) FFT Latency Breakup, (c) Radix,
 (d) Radix Latency Breakup, (e) Ocean, (f) Ocean Latency Breakup. The breakup
 panels plot Average Latency(ns) for FBDIMM, OCDIMM-Base, OCDIMM-Static and
 OCDIMM C-DWA, split into Northbound, Southbound, Transaction Queue and
 DRAM+OMC components.]

 Figure 5.14: Latency impact on SPLASH-2 benchmarks
   Red:FBDIMM, Green:OCDIMM-BASE, Blue:OCDIMM-SWA, Aqua:OCDIMM-DWA

   For the SPEC-MIX trace (figure 5.12(d)) we see that the FBDIMM, OCDIMM-Base
and OCDIMM-SWA latencies merge together at the boundaries, while C-DWA clearly
stands out, reducing latency by 30%. C-DWA works better with reads, since a READ
transaction requires the additional work of moving data from the DIMM to the
controller. The more READ transactions there are, the greater the contention
for bus access among the DIMMs. The SPEC-MIX trace has 4 benchmarks running at
the same time and hence the request arrival rate is high. Therefore, we don’t
see much difference between FBDIMM and OCDIMM-Base latency, as both follow the
same topology and protocol. Delays in the transaction queue are brought down to
50% in C-DWA when compared with FBDIMM.
   For the real-world application OpenOffice with mp3 decoding we see that
OCDIMM-BASE and OCDIMM-SWA have almost the same latencies, slightly better than
FBDIMM. This trace is a mix of 2 traces: one trace requests memory at a
consistent rate while the other requests memory in intervals. OCDIMM-SWA is
unable to respond to the varying request rate. Even though it allows independent
channels, not all channels were utilized at the same time. OCDIMM-DWA was able
to reduce latency by 47% (from a thick band in the graph at a latency of 170ns
for the other designs vs 90ns for OCDIMM-DWA).
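
As a quick check of the percentage quoted above, the reduction can be computed
directly from the two band latencies read off the graph. The helper name is
ours and is used only for this illustration.

```python
def percent_reduction(baseline_ns, improved_ns):
    """Relative latency reduction, in percent of the baseline latency."""
    return 100.0 * (baseline_ns - improved_ns) / baseline_ns


# 170ns band for the other designs vs 90ns for OCDIMM-DWA (values from the text).
print(round(percent_reduction(170, 90)))  # prints 47
```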
   Figure 5.13 shows the latency impact for canneal, fluidanimate and
streamcluster. These programs have repeating patterns and a high memory request
rate. Latency improves significantly with each successive design, in the
following order: FBDIMM > OCDIMM-Base > OCDIMM-SWA > OCDIMM-DWA.
For streamcluster we see that at higher latencies the FBDIMM and OCDIMM-Base
latencies hardly differ. This is again because of the READ:WRITE ratio.
Streamcluster has no WRITE transactions; its memory requests grow exponentially
and then suddenly drop. Streamcluster does not reuse any of the computations and
fits all intermediate results in cache. This is observed from the PARSEC
analysis [26]. DWA is observed to be efficient for READ traffic, hence we see
almost 50% reduction for fluidanimate and streamcluster, which have a high
READ:WRITE ratio (table 4.3).
   Figure 5.14 shows the latency impact for the SPLASH-2 benchmarks. FFT and
Radix are kernels and require high memory bandwidth. Both have almost the same
number of READs as WRITEs. Here DWA no longer offers a 40-50% reduction in
latency; the reduction drops to 20% on average. We are also limited by the DDR2
latency (min 45ns). The Ocean benchmark has more reads than writes and hence we
see a 40% latency reduction for DWA. In conclusion, we found that DWA performs
better for READ-oriented traffic. It reduces latency by 20-40% compared with
FBDIMM. OCDIMM-SWA performs well when the arrival rate is not high. OCDIMM-Base
performs almost the same as FBDIMM except at low arrival rates.


5.4.1    Why DWA improves latency?

OCDIMM can transfer data faster than FBDIMM because of the raw speed of optical
communication. Even so, we hardly see a major reduction in latency with
OCDIMM-BASE, or sometimes with OCDIMM-SWA. Next we discuss why DWA was able to
achieve lower latencies.
   From the latency breakup graphs we can say that the majority of the reduction
in latency by DWA comes from the time spent in the transaction queue and the
time spent in the northbound channel. The time spent in DRAM and OMC/AMB remains
the same in all designs. In FBDIMM and OCDIMM-BASE only one DIMM uses the
northbound channel at a time. In OCDIMM-SWA, when a DIMM uses a subchannel, other
DIMMs have to wait before using that subchannel. Hence each bus transaction on
the northbound channel delays the next bus transaction, due to the fixed latency
mode of operation. This also indirectly affects the time spent by transactions
in the queue, since transactions at the front are waiting to receive their data.
OCDIMM-DWA allows more than one DIMM to send data, hence a DIMM can now receive
commands from other transactions without having to buffer data for too long.
C-DWA also gives priority to northbound bus transactions, giving them the
maximum possible share of optical bandwidth. By doing this it tries to complete
more tasks rather than starting new transactions. The southbound channel is free
of conflicts, hence the 'Southbound Latency' component is also almost the same
in all designs except OCDIMM-DWA. DWA can allocate all of its wavelengths to the
southbound channel when there are no more requests on the northbound channel,
and hence we can see up to 50% reduction for 'Southbound Latency'.
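
The "up to 50%" bound for the southbound component can be illustrated with a
back-of-the-envelope calculation. The sketch below is a simplification, not the
simulator model: it considers only the time to serialize one frame and ignores
time of flight, and the frame size and per-wavelength rate are hypothetical
values chosen for the example. What it takes from the configuration in this
section is that the fixed designs have 32 wavelengths in each direction, while
C-DWA has 64 wavelengths in total that it may give entirely to one direction.

```python
def serialization_ns(frame_bits, wavelengths, gbps_per_wavelength):
    """Time to push one frame onto the channel; 1 Gb/s equals 1 bit per ns."""
    return frame_bits / (wavelengths * gbps_per_wavelength)


FRAME_BITS = 1024   # hypothetical frame size
RATE_GBPS = 10      # hypothetical data rate per wavelength

fixed_split = serialization_ns(FRAME_BITS, 32, RATE_GBPS)  # 32 southbound wavelengths
dwa_idle_nb = serialization_ns(FRAME_BITS, 64, RATE_GBPS)  # all 64 given to southbound

reduction = 100.0 * (fixed_split - dwa_idle_nb) / fixed_split
print(fixed_split, dwa_idle_nb, reduction)
```

Doubling the wavelengths available to the southbound channel halves the
serialization time, which is the best case; whenever northbound traffic is
pending, fewer wavelengths are available southbound and the gain is smaller.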



5.5     C-DWA and RIFF

As discussed before, C-DWA tries to follow the order set by the transaction
ordering policy used by the controller. Internally, C-DWA gives priority to read
transactions' DRIVE commands over write transactions' DATA commands.
Intuitively this is similar to the Read or Instruction Fetch First (RIFF)
policy, which queues read transactions ahead of write transactions. In all our
simulations we used the Greedy policy, which tries to schedule as many
transactions as possible in the order they
arrived in. If RIFF is used, C-DWA does not have to prioritize DRIVE commands
over DATA commands, because RIFF ensures that whenever a read is pending it is
at the front of the queue.
   For example, let's say there are 4 read transactions and 2 write transactions
in the queue. Initially C-DWA gives bandwidth to RAS/CAS/PRECHARGE
operations irrespective of transaction type or order. This helps to prepare the
DRAM devices as early as possible. But when not all RAS commands can be sent
with the currently available wavelengths, it uses the transaction order to
assign wavelengths. When read transactions are ready to issue DRIVE commands
and write transactions are ready to issue DATA commands, DRIVEs get first
priority. This improves the read latency but not the write latency.
   C-DWA with equal priority for DRIVE and DATA (we call it unbiased C-DWA)
gives an equal share to DATA and DRIVE. But when RIFF is enabled, read
transactions are at the front of the queue, hence their demands are satisfied
before those of the write transactions. In other words, when priorities are
equal, C-DWA falls back to first-fit allocation following the transaction order
in the queue.
   The two approaches, Greedy with biased C-DWA and RIFF with unbiased C-DWA,
are similar in nature, since the number of wavelengths assigned to DATA
commands depends on the number of DRIVE commands. They differ only when
C-DWA uses 'transaction ordering' to assign wavelengths. Biased C-DWA
with Greedy uses the order in which transactions arrived, while any
configuration with RIFF enabled gives an advantage to read transactions. Thus
RIFF + C-DWA strictly biases bandwidth towards reads.
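
The interaction between the queue ordering policy and the C-DWA bias can be
sketched as follows. This is a simplified model under stated assumptions, not
the simulator's implementation: each transaction contributes one pending
command per round, "PREP" stands for the RAS/CAS/PRECHARGE group, and a command
is granted only if all of its wavelengths fit (first fit). The names Command,
allocate and QUEUE are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    txn_id: int       # arrival order of the owning transaction
    is_read: bool     # True for a read transaction
    kind: str         # "PREP" (RAS/CAS/PRECHARGE), "DRIVE" or "DATA"
    wavelengths: int  # wavelengths the command needs


def allocate(commands, available, riff, biased):
    """Return {txn_id: wavelengths granted} for one allocation round."""

    def priority(cmd):
        if cmd.kind == "PREP":
            return 0          # DRAM preparation always goes first
        if cmd.kind == "DRIVE" or not biased:
            return 1          # unbiased: DRIVE and DATA share one class
        return 2              # biased: DATA waits behind DRIVE

    def queue_position(cmd):
        # RIFF queues reads ahead of writes; Greedy keeps arrival order.
        return (riff and not cmd.is_read, cmd.txn_id)

    grants = {}
    for cmd in sorted(commands, key=lambda c: (priority(c), queue_position(c))):
        if cmd.wavelengths <= available:   # first fit in priority order
            grants[cmd.txn_id] = cmd.wavelengths
            available -= cmd.wavelengths
    return grants


# A toy queue (illustrative values only).
QUEUE = [
    Command(0, False, "DATA", 4),
    Command(1, True, "DRIVE", 4),
    Command(2, False, "PREP", 1),
    Command(3, True, "PREP", 1),
]

print(allocate(QUEUE, 6, riff=True, biased=False))
print(allocate(QUEUE, 6, riff=True, biased=True))
print(allocate(QUEUE, 6, riff=False, biased=True))
```

In this toy queue all three configurations grant the DRIVE of transaction 1
ahead of the DATA of transaction 0. Greedy and RIFF diverge only when the
"PREP" commands themselves compete for wavelengths, for example with a single
wavelength available.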
   To verify this we set up three configurations: a) RIFF with unbiased C-DWA,
b) RIFF with biased C-DWA and c) biased C-DWA with Greedy. We keep 8 DIMMs
on a channel with 64 wavelengths and use synthetic random traces. The arrival
rate is increased from 1 to 80 GB/s in steps of 2 GB/s. We split the total
observed bandwidth into two parts, i) read bandwidth and ii) write bandwidth,
along with their respective average latencies. Figure 5.15 shows the results
for all three configurations.

[Figure 5.15 plot: Latency (ns), logarithmic axis from 10 to 1000, versus
Observed Bandwidth (GB/s), 0 to 60. Curves: READ and WRITE for each of
RIFF + Biased C-DWA, RIFF + Unbiased C-DWA and GREEDY + Biased C-DWA.]

                     Figure 5.15: RIFF and C-DWA
No of DIMMs = 8, Transaction queue size = 16, READ:WRITE ratio 2:1

   As we can see from figure 5.15, read bandwidth for RIFF with unbiased or
biased C-DWA is slightly better than for biased C-DWA with Greedy, but write
bandwidth is lower than for biased C-DWA with Greedy. RIFF + biased C-DWA is
equivalent to RIFF + unbiased C-DWA. Biased C-DWA always satisfies DRIVEs
first, and because of the implicit ordering imposed by RIFF, even unbiased
C-DWA makes the same assignments. Figure 5.15 confirms that the results for
both approaches are the same. Hence when RIFF is used, C-DWA need not
prioritize DRIVEs over DATAs.

5.6     Power Analysis of OCDIMM

We focus on the power consumption of the memory subsystem shown in figure
3.1(a), which includes the optical losses and the power consumption of the
electrical and optical components. We target a 65nm node and assume the memory
controller operates at 4 GHz. Within this model, there are two sources of power
consumption: the DRAM chips together with the advanced memory buffer
(AMB) for FBDIMM, and the DRAM chips together with the optical memory
controller (OMC) for OCDIMM. The OMC consists of the serializer, modulator,
deserializer, detector and optical clocking source, as shown in
figure 3.1(b). We use [29] to model the power consumed by the DDR2/3 DRAM
chips, based on the activity statistics obtained from DRAMSim for the various
memory traces. The AMB power consumption was obtained from DRAMSim's modeled
FBDIMM selection. The OCDIMM power model consists of DRAM data obtained from
the DDR power calculator combined with the estimated energy of the photonic
components used. The difference in DIMM IDLE time between the two memory
interfaces has also been considered. For FBDIMM, the IDLE time represents the
time that the DRAM is in a known CKE powerdown state. This shuts down the
output drivers and forces the memory controller to activate periodic refresh
intervals approximately every 7.6 µs. Since the simulation does not model a
CKE powerdown sequence, we assume that whenever the DRAMs are in the IDLE
state they are also put into a CKE powerdown state. Our assumption here is
that the DRAM is in this state for 5% of the total system run time. When
analyzing the IDLE time for the OMC we assume that the modulators,
serializers, etc. are also in a known powerdown state, thus reducing system
power at IDLE time. The IDLE state behaves differently in the OCDIMM and
FBDIMM configurations. For FBDIMM, the AMB of an IDLE DIMM is still passing
data to other ACTIVE DIMMs in the system. This hop for FBDIMM increases the
energy per route, because the AMB cannot be put into a power-saving state.
For OCDIMM, when a DIMM is in IDLE mode, the OMC belonging to that DIMM does
not pass any data and can itself go into a powerdown mode.
   The FBDIMM power analysis was derived from DRAMSim, including power
numbers for both the DDR2-667 and DDR3-1333 components and the AMB. For
OCDIMM, we use the same power numbers for the DDR2/3 DRAM employed
and model the power consumed in the modulators, detectors and other structures
shown in figure 3.1(a) based on the research and data reported in [30, 17]. When
simulating with the DDR3-1333 components, we decided to use low power DDR3,
which runs at a lower VDD of 1.35V. Table 5.2 lists the optical components and
their power consumption or losses. It also shows the equations we used to
derive the power model.
   Figures 5.16(a) and 5.16(b) compare OCDIMM with FBDIMM with respect to
power, using randomly generated memory traffic, for a DDR2 and a DDR3 based
system respectively. As expected, OCDIMM has a power advantage because the
OMC has no store-and-forward logic. However, if only 1 DIMM is populated,
OCDIMM could have a high power consumption because of the additional
overhead of the electrical-to-optical conversion and back. But as the capacity
increases, OCDIMM delivers good power savings (on average about 15% to 19%)
compared to FBDIMM. One could argue that in absolute terms the power
consumption of OCDIMMs, especially at larger capacities (64GB), is quite high,
but given the higher bandwidth we believe it is justifiable in high performance
computing applications. However, opportunities for power optimization can be
seen by analyzing the sources of power consumption shown in figures 5.16(a) and
5.16(b). It is interesting to note that the power consumption due to the optical
overhead (modulators, laser source and detectors) is negligible compared to the
power consumed by the OMC and the DRAM chips. So, clearly, the OMC has to
be optimized in conjunction with the memory controller to reduce the power
consumption. With the modulators and lasers taking up a good portion of the
power envelope, effort should also be put into better modulation materials
and external laser generation to reduce the overall system power requirements.

 Power Component                                  Losses / Consumed   Per Node (DIMM)
 On-chip coupling loss (twice per DIMM)           3 dB                P_coupling = 2 * 3 dB
 Silicon optical fiber                            0.5 * e-5 dB/cm     P_fiber = 2 cm * 0.5 * e-5 dB/cm
 Waveguide                                        1 dB/cm             P_waveguide = 1 cm * 1 dB/cm
 Modulation insertion loss                        1 dB                P_M = 1 dB
 Splitter                                         0.2 dB              P_splitter = 0.2 dB
 Photo-diode min detector power (per wavelength)  48 µW               P_det = 48 µW * λ
 Miscellaneous (Serializer + De-Serializer)       320 µW              P_misc = 320 µW
 Modulator driver (per wavelength)                280 µW              P_mod = 280 µW * λ

                        Table 5.2: OCDIMM Power Model
Equations:
P_OMC = P_IN-ELEC + P_IN-OPT      (total power consumed in the OMC)
P_IN-OPT = N * P_det * 10^[(P_M + P_coupling + P_splitter + P_fiber + P_waveguide)/10]
P_IN-ELEC = N * (P_misc + P_mod)
where N is the total number of DIMMs on a channel.

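
As a worked illustration of the equations listed with Table 5.2, the sketch
below evaluates the OMC power for one configuration. It is only a sketch under
stated assumptions: the constant and function names are hypothetical, λ is
taken to be the number of wavelengths, and the fiber loss entry "0.5 * e-5
dB/cm" is read as 0.5e-5 dB/cm, which is an interpretation of the table rather
than a confirmed value.

```python
# Values copied from Table 5.2; the names are illustrative only.
COUPLING_LOSS_DB = 2 * 3.0     # on-chip coupling loss, twice per DIMM
FIBER_LOSS_DB = 2 * 0.5e-5     # 2 cm of fiber; 0.5e-5 dB/cm is an assumed reading
WAVEGUIDE_LOSS_DB = 1 * 1.0    # 1 cm of waveguide at 1 dB/cm
MODULATION_LOSS_DB = 1.0       # modulation insertion loss
SPLITTER_LOSS_DB = 0.2
DETECTOR_W = 48e-6             # minimum detector power per wavelength
MISC_W = 320e-6                # serializer + deserializer
MODULATOR_DRIVER_W = 280e-6    # modulator driver per wavelength


def omc_power(num_dimms, num_wavelengths):
    """Return (P_IN-ELEC, P_IN-OPT, P_OMC) in watts."""
    loss_db = (MODULATION_LOSS_DB + COUPLING_LOSS_DB + SPLITTER_LOSS_DB
               + FIBER_LOSS_DB + WAVEGUIDE_LOSS_DB)
    p_det = DETECTOR_W * num_wavelengths
    p_mod = MODULATOR_DRIVER_W * num_wavelengths
    # Optical input power must overcome the path losses (dB to linear factor).
    p_in_opt = num_dimms * p_det * 10 ** (loss_db / 10)
    p_in_elec = num_dimms * (MISC_W + p_mod)
    return p_in_elec, p_in_opt, p_in_elec + p_in_opt


elec, opt, total = omc_power(num_dimms=8, num_wavelengths=64)
print(f"P_IN-ELEC = {elec * 1e3:.1f} mW, P_IN-OPT = {opt * 1e3:.1f} mW, "
      f"P_OMC = {total * 1e3:.1f} mW")
```

The configuration of 8 DIMMs and 64 wavelengths is chosen only because it
matches the channel used elsewhere in this chapter. The result covers the OMC
terms of the equations alone and excludes DRAM power.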

[Figure 5.16 plots: stacked bars of Power (W) versus Capacity (GB) for 2, 4, 8,
16, 32 and 64 GB, with FBDIMM (FBD) and OCDIMM (OCD) side by side. Stack
components: AMB PASSTHROUGH/OMC MOD_DEMOD, AMB/OMC ACTIVE, AMB/OMC IDLE, DRAM.
Panels: (a) DDR2-667MHz, (b) DDR3-1333MHz.]

Figure 5.16: Comparison of OCDIMM(OCD) and FBDIMM(FBD) Power Consumption
Random traces with a fixed latency, single channel, open page mode with
greedy scheduling and queue depth of 16 were used. Capacity (number
of DIMMs on channel) is varied on the X axis and the Y axis plots average
power. Even positions on the X axis represent OCDIMM. Components
P_DRAM, P_AMB_IDLE, P_AMB_ACTIVE are the same for FBDIMM and OCDIMM.




Chapter 6

Related Work

Miller presents a strong case for why optical interconnects make sense in [31],
and [32] compares electrical and optical interconnects from a power perspective.
More recently, HP researchers quantified the advantage of nanoscale photonic de-
vices and identified opportunities for nanophotonics in computing applications [8].
Corona [19] is a manycore 3D architecture recently proposed by HP that uses an
on-chip WDM-based nanophotonic crossbar to connect 64 clusters of processors
and external memory. MIT researchers [17, 33] have presented a new monolithic
silicon technology suited for integration with standard bulk CMOS processes and
have developed a processor memory network architecture for future manycore sys-
tems based on an optoelectronic global crossbar. Cornell researchers [30] have also
described a methodology to leverage optical technology to reduce power and latency
in on-chip networks suitable for bus-based multiprocessors. Columbia University
researchers proposed microresonator-based mesh/torus on-chip networks [34, 35]
for chip-level multiprocessors.
   The MIT and Columbia groups use electrical networks for arbitration, while the
HP group uses an all-optical approach. Though the MIT and HP groups indicate
that their on-chip networks can also be used to interface with the memory, they
do not discuss the details. We believe that the first step in exploiting the
benefits of nanophotonics in computing systems should be the processor-memory
interface.
IBM researchers point out the benefits of optical interconnects in a server
architecture [36], while the Multi-Wavelength Assemblies for Ubiquitous
Interconnects (MAUI) project has been examining the idea of interfacing fiber to
the processor (FTTP) and how the cost will affect the overall penetration in the
market [37]. An older research project, the High Speed Opto-Electric Memory
System (HOLMS) interface, proposes an optical connection between the memory
system and the processor, but acknowledges the need for a controller-to-DIMM
interface to take advantage of the known optical networking success in the
telecommunications industry [38].
   The problem of the memory wall has spawned research on novel interconnect
technologies. Researchers at Sun Microsystems, Drost et al. [39], point out the
technological challenges to building a flat-bandwidth memory hierarchy in CMOS-
based systems, and offer proximity communication as a way to provide both high
bandwidth and high capacity simultaneously. Stacking die or chips to realize a
3D topology has been advocated by Intel researchers [40], which can be used to
increase the memory bandwidth without increasing the number of pins and possibly
even lower I/O power. DRAM caches [41, 42] have been studied in this context to
increase memory bandwidth and possibly reduce latency. We propose to address
some of the same challenges using a WDM-based optical interconnect. Haas and
Vogt introduced FBDIMM technology in [7], and [23] evaluated the potential of
FBDIMM on real workloads.




Chapter 7

Conclusion

With decreasing feature sizes we can easily pack a large number of cores on a given
chip. The problem shifts to how to feed the cores, especially how to increase the
pin bandwidth and reduce the memory latency. In addition, it is important to build
balanced computing systems that scale with performance, which means capacity
and bandwidth both have to increase simultaneously, and this is a challenging
problem.
   Chapter 3 concentrated on the detailed design of OCDIMM. We first introduced
a simple extension to FBDIMM, OCDIMM-BASE. OCDIMM-BASE eliminates the
electrical problems of high speed electrical buses and gives an opportunity to
increase the data transfer rate between the memory controller and the modules.
The design is simple and doesn't require any changes to the FBDIMM frame or
controller.
   In chapter 5 we analyzed OCDIMM-BASE and found that its optical bandwidth
utilization is very low. By changing the topology and protocol to improve
optical bandwidth utilization we derived OCDIMM-SWA and OCDIMM-DWA.
   In section 5.2 we showed that the memory wall is pushed back as we
move from FBDIMM to OCDIMM-DWA. Compared to FBDIMM, OCDIMM
was able to provide as much as 8 times more bandwidth.
   In section 5.3 we showed that OCDIMM-DWA does not need 64 wavelengths to
match the performance of FBDIMM. We varied the optical resources (wavelengths
and frequency of operation) against capacity and demonstrated that 8GB, 16GB
and 128GB capacities can be supported with 16, 32 and 64 wavelengths,
respectively, without degrading bandwidth and latency. Adding more wavelengths
or increasing the optical speed improved bandwidth for these configurations,
which supports our major goal of scaling capacity and bandwidth simultaneously.
We also analyzed performance from a latency perspective. We used faster DRAM
chips running at higher clocks (1.3GHz and 2GHz) and saw reduced latency and
improved bandwidth. OCDIMM-DWA performed much better with faster DRAM
devices and adapted to deliver bandwidth as high as 107GB/s and latency as low
as 25ns. We didn't change the protocol or the design, apart from some initially
configured parameters, to achieve this performance. Sensitivity analysis
also helped us determine the better of the two DWA algorithms, T-DWA and
C-DWA. While T-DWA is a simple and strictly fair algorithm, it doesn't respond
to changes in request arrival rate or to congestion at the DRAM level. T-DWA can
keep a DRAM device waiting for commands or data in order to enforce fairness
across all transactions, and hence it doesn't utilize DRAM resources fully.
C-DWA looks at memory operations at a fine granularity and does priority-based
scheduling, allowing critical operations to complete before non-critical ones.
Thus C-DWA is responsive to a variable request arrival rate and tries to get
work done by the DRAM system as soon as possible. An ideal algorithm would be a
parameterized algorithm that can take inputs from the external system and shape
traffic accordingly. For example, if CPU-1
is running a mission-critical load, it may be possible to reserve a certain number
of wavelengths for all transactions from CPU-1. Adding such parameters to C-DWA
is simple, as it already accommodates reserving wavelengths for refresh cycles
in advance. Controlling bandwidth sharing externally would offload some of the
complexity from the memory controller. External entities are in the best position
to give feedback to the memory controller. The memory controller, in its current
design, cannot prioritize bandwidth requirements. Letting applications send
feedback to the memory controller would help solve application-specific memory
scaling problems.
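
The reservation idea can be illustrated with a small sketch. It is
hypothetical: the function allocate_with_reservation, the "CPU-0"/"CPU-1"
labels and the request sizes are illustrative and do not describe an
implemented extension of C-DWA. It assumes requests are served in queue order,
first from a source's reserved wavelengths and then from a shared pool.

```python
def allocate_with_reservation(requests, total, reserved):
    """Grant wavelengths to requests in queue order.

    requests: list of (source, wavelengths_wanted) tuples.
    reserved: {source: wavelengths set aside for that source}.
    Returns the list of granted wavelength counts, one per request.
    """
    shared = total - sum(reserved.values())
    if shared < 0:
        raise ValueError("reservations exceed the available wavelengths")
    private = dict(reserved)          # remaining reserved wavelengths
    grants = []
    for source, wanted in requests:
        from_private = min(wanted, private.get(source, 0))
        from_shared = min(wanted - from_private, shared)
        if source in private:
            private[source] -= from_private
        shared -= from_shared
        grants.append(from_private + from_shared)
    return grants


requests = [("CPU-0", 40), ("CPU-0", 30), ("CPU-1", 16)]
print(allocate_with_reservation(requests, 64, {}))             # [40, 24, 0]
print(allocate_with_reservation(requests, 64, {"CPU-1": 16}))  # [40, 8, 16]
```

Without a reservation the late request from CPU-1 gets nothing in this round.
With 16 wavelengths reserved it is served in full, at the cost of the second
CPU-0 request.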
   Section 5.4 looked at the latency reduction offered by OCDIMM. Since optical
components run at a higher frequency, it is natural to expect a latency
reduction. But we showed that in some cases the latencies of FBDIMM and OCDIMM
do not differ by a large margin. This is one of the key motivations behind the
design of OCDIMM-SWA and OCDIMM-DWA. To utilize optical resources to the fullest
we kept more DIMMs active in parallel. The high performance address mapping
plays a key role in distributing addresses among the available DIMMs. Using the
parallelism it offers, OCDIMM was able to reduce bus conflicts, giving more
independence to the DIMMs. We saw that for real-world applications and
benchmarks, latency is reduced by 10-50%. The simulation methods could be
changed to do full system analysis, giving detailed information about each
instruction and its memory latency. This would help to determine the total
running time of an application exactly and to fine tune the parameterized
C-DWA mentioned above.
   In section 5.6 we presented a basic power model for OCDIMM based on FBDIMM's
power model. We showed that OCDIMM is able to reduce power consumption
by 15-17% for an average load. A more detailed analysis could determine its
effect on the entire system, but we leave that as future work.




References

[1] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James
    Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester
    Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The
    landscape of parallel computing research: A view from berkeley. Technical
    Report UCB/EECS-2006-183, EECS Department, University of California,
    Berkeley, Dec 2006.

[2] Tilera. Tile64 processor. http://www.tilera.com/products/tile64.php.

[3] Pradeep Dubey. Recognition, mining and synthesis moves computers to the
    era of tera. In Intel Technology Magazine, February 2005.

[4] Gordon Bell, Jim Gray, and Alex Szalay. Petascale computational systems.
    Computer, 39(1):110–112, 2006.

[5] Daniel Atkins, K. Droegmeier, S. I. Feldman, H. Garcia-Molina, Paul Messina,
    and J. Ostriker. Revolutionizing science and engineering through cyberinfras-
    tructure - report of blue ribbon panel on cyberinfrastructure. Technical report,
    National Science Foundation, 2003.

[6] Gabriel H. Loh. 3d-stacked memory architectures for multi-core processors.
    SIGARCH Comput. Archit. News, 36(3):453–464, 2008.

[7] Jon Haas and Pete Vogt. Fully-buffered dimm technology moves enterprise
    platforms to the next level. Intel Technology Magazine, 2005.

[8] R.G. Beausoleil, P.J. Kuekes, G.S. Snider, Shih-Yuan Wang, and R.S.
    Williams. Nanoelectronic and nanophotonic interconnect. Proceedings of the
    IEEE, 96(2):230–247, Feb. 2008.

[9] M. Lipson. High performance photonics on silicon. Optical Fiber communica-
    tion/National Fiber Optic Engineers Conference, 2008. OFC/NFOEC 2008.
    Conference on, pages 1–3, Feb. 2008.

[10] R. Soref and B. Bennett. Electrooptical effects in silicon. Quantum Electron-
     ics, IEEE Journal of, 23(1):123–129, Jan 1987.
[11] R.A. Soref.    Silicon-based optoelectronics.     Proceedings of the IEEE,
     81(12):1687–1706, Dec 1993.
[12] R. Soref. The past, present, and future of silicon photonics. Selected Topics
     in Quantum Electronics, IEEE Journal of, 12(6):1678–1687, Nov.-dec. 2006.
[13] Q. Xu, B. Schmidt, S. Pradhan, and M. Lipson. Micrometre-scale silicon
     electro-optic modulator. Nature, 2005.
[14] M. Lipson. Compact electro-optic modulators on a silicon chip. Selected Topics
     in Quantum Electronics, IEEE Journal of, 12(6):1520–1526, Nov.-dec. 2006.
[15] Linjie Zhou, Ken Kashiwagi, Katsunari Okamoto, R. P. Scott, N. K. Fontaine,
     Dan Ding, S. J. Ben Yoo, and Venkatesh Akella. Towards athermal optically-
     interconnected computing system using slotted silicon microring resonators
     and rf-photonic comb generation. Submitted to Applied Physics A - Special
     Issue on Photonics Interconnects, 2008.
[16] Brian R. Koch, Alexander W. Fang, Oded Cohen, and John E. Bowers. Mode-
     locked silicon evanescent lasers. Opt. Express, 15(18):11225–11233, 2007.
[17] Christopher Batten, Ajay Joshi, Jason Orcutt, Anatoly Khilo, Benjamin Moss,
     Charles Holzwarth, Milos Popovic, Hanqing Li, Henry Smith, Judy Hoyt, Franz
     Kartner, Rajeev Ram, Vladimir Stojanovic, and Krste Asanovic. Building
     manycore processor to dram networks with monolithic silicon photonics. In
     Proceedings of the Sixteenth Symposium on High Performance Interconnects
     (HOTI-16), August 2008.
[18] Michael Tan, Paul Rosenberg, Jong Souk Yeo, Moray McLaren, Sagi Mathai,
     Terry Morris, Joseph Straznicky, Norman P. Jouppi, Huei Pei Kuo, Shih-
     Yuan Wang, Scott Lerner, Pavel Kornilovich, Neal Meyer, Robert Bicknell,
     Charles Otis, and Len Seals. A high-speed optical multi-drop bus for computer
     interconnections. In HOTI ’08: Proceedings of the 2008 16th IEEE Symposium
     on High Performance Interconnects, pages 3–10, Washington, DC, USA, 2008.
     IEEE Computer Society.
[19] Dana Vantrease, Robert Schreiber, Matteo Monchiero, Moray McLaren, Nor-
     man P. Jouppi, Marco Fiorentino, Al Davis, Nathan Binkert, Raymond G.
     Beausoleil, and Jung Ho Ahn. Corona: System implications of emerging
     nanophotonic technology. In Proceedings of the 35th International Sympo-
     sium on Computer Architecture (ISCA-35), pages 153–164. IEEE Computer
     Society, 2008.

[20] Amit Hadke, Tony Benavides, S. J. Ben Yoo, Rajeevan Amirtharajah, and
     Venkatesh Akella. OCDIMM: Scaling the DRAM memory wall using WDM
     based optical interconnects. In Proceedings of 16th IEEE Symposium on High
     Performance Interconnects (HOTI 2008), Palo Alto, CA, august 2008.

[21] Amit Hadke, Tony Benavides, Mathew Farrens, Rajeevan Amirtharajah, and
     Venkatesh Akella. Design and evaluation of an optical cpu/dram interconnect.
     In Proceedings of 26th IEEE International Conference on Computer Design,
     Squaw Creek, Lake Tahoe, CA, October 2008.

[22] Bruce Jacob, Spencer Ng, and David Wang. Memory Systems - Cache,
     DRAM, Disk. Morgan Kaufmann Publishers, 2007.

[23] Brinda Ganesh, Aamer Jaleel, David Wang, and Bruce Jacob. Fully-buffered
     dimm memory architectures: Understanding mechanisms, overheads and scal-
     ing. hpca, 0:109–120, 2007.

[24] David A. Patterson. Latency lags bandwidth. Communications of the ACM,
     47(10):71–75, 2004.

[25] David Wang, Brinda Ganesh, Nuengwong Tuaycharoen, Kathleen Baynes,
     Aamer Jaleel, and Bruce Jacob. Dramsim: a memory system simulator.
     SIGARCH Computer Architecture News, 33(4):100–107, 2005.

[26] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The
     PARSEC benchmark suite: Characteristics and architectural implications.
     In Parallel Architecture and Compilation Techniques (PACT 2008), Toronto,
     October 2008.

[27] Christian Bienia, Sanjeev Kumar, and Kai Li. Parsec vs. splash-2: A
     quantitative comparison of two multithreaded benchmark suites on chip-
     multiprocessors. Workload Characterization, 2008. IISWC 2008. IEEE In-
     ternational Symposium on, pages 47–56, Sept. 2008.

[28] Aamer Jaleel. Memory characterization of workloads using instrumentation-
     driven simulation. Web Copy: http://www.glue.umd.edu/ ajaleel/workload/.

[29] Micron. Technical note: Calculating memory system power for DDR3.
     www.micron.com, 2006.

[30] Nevin Kirman, Meyrem Kirman, Rajeev K. Dokania, Jose F. Martinez,
     Alyssa B. Apsel, Matthew A. Watkins, and David H. Albonesi. Leveraging
     optical technology in future bus-based chip multiprocessors. In MICRO 39:
     Proceedings of the 39th Annual IEEE/ACM International Symposium on
     Microarchitecture, pages 492–503, Washington, DC, USA, 2006. IEEE Computer
     Society.

[31] D.A.B. Miller. Rationale and challenges for optical interconnects to electronic
     chips. Proceedings of the IEEE, 88(6):728–749, Jun 2000.

[32] Hoyeol Cho, Pawan Kapur, and Krishna C. Saraswat. Power comparison
     between high-speed electrical and optical interconnects for interchip commu-
     nication. Journal of Lightwave Technology, 22(9):2021, 2004.

[33] J.S. Orcutt, A. Khilo, M.A. Popovic, C.W. Holzwarth, B. Moss, Hanqing Li,
     M.S. Dahlem, T.D. Bonifield, F.X. Kartner, E.P. Ippen, J.L. Hoyt, R.J. Ram,
     and V. Stojanovic. Demonstration of an electronic photonic integrated circuit
     in a commercial scaled bulk cmos process. Lasers and Electro-Optics, 2008 and
     2008 Conference on Quantum Electronics and Laser Science. CLEO/QELS
     2008. Conference on, pages 1–2, May 2008.

[34] Benjamin A. Small, Benjamin G. Lee, Keren Bergman, Qianfan Xu, and
     Michal Lipson. Multiple-wavelength integrated photonic networks based on
     microring resonator devices. Journal of Optical Networking, 2006.

[35] Assaf Shacham, Keren Bergman, and Luca Carloni. On the design of a
     photonic network-on-chip. In First International Symposium on Networks-
     on-Chip, May 2007.

[36] A. F. Benner, M. Ignatowski, J. A. Kash, D. M. Kuchta, and M. B. Ritter. Ex-
     ploitation of optical interconnects in future server architectures. IBM Journal
     of Research and Development, 49(4/5):755–775, 2005.

[37] B.E. Lemoff, M.E. Ali, G. Panotopoulos, G.M. Flower, B. Madhavan, A.F.J.
     Levi, and D.W. Dolfi. MAUI: enabling fiber-to-the-processor with parallel
     multiwavelength optical interconnects. Journal of Lightwave Technology,
     22(9):2043–2054, Sept. 2004.

[38] Paul Lukowicz, J Jahns, R Barbieri, P Benabes, T Bierhoff, A Gauthier,
     M Jarczynski, G Russel, J Schrage, J Snowdon, M Wirz, and G Troster.
     Optoelectronic interconnection technology in the HOLMS system. IEEE Journal
     of selected topics in quantum electronics, 1995.

[39] Robert Drost, Craig Forrest, Bruce Guenin, Ron Ho, Ashok V. Krishnamoor-
     thy, Danny Cohen, John E. Cunningham, Bernard Tourancheau, Arthur
     Zingher, Alex Chow, Gary Lauterbach, and Ivan Sutherland. Challenges
     in building a flat-bandwidth memory hierarchy for a large-scale computer
    with proximity communication. In HOTI ’05: Proceedings of the 13th Sym-
    posium on High Performance Interconnects, pages 13–22, Washington, DC,
    USA, 2005. IEEE Computer Society.

[40] Bryan Black, Murali Annavaram, Ned Brekelbaum, John DeVale, Lei Jiang,
     Gabriel H. Loh, Don McCauley, Pat Morrow, Donald W. Nelson, Daniel
     Pantuso, Paul Reed, Jeff Rupley, Sadasivan Shankar, John Shen, and Clair
     Webb. Die stacking (3D) microarchitecture. Microarchitecture, 2006.
     MICRO-39. 39th Annual IEEE/ACM International Symposium on, pages 469–479,
     Dec. 2006.

[41] Li Zhao, R. Iyer, R. Illikkal, and D. Newell. Exploring DRAM cache
     architectures for CMP server platforms. Computer Design, 2007. ICCD 2007.
     25th International Conference on, pages 55–62, Oct. 2007.

[42] Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang. Design and optimization of
     large size and low overhead off-chip caches. IEEE Transactions on Computers,
     53(7):843–855, 2004.

				