Virtually Pipelined Network Memory


                                         Banit Agrawal       Timothy Sherwood
                                           Department of Computer Science
                                         University of California, Santa Barbara

Abstract

We introduce virtually-pipelined memory, an architectural technique that efficiently supports high-bandwidth, uniform latency memory accesses, and high-confidence throughput even under adversarial conditions. We apply this technique to the network processing domain where memory hierarchy design is an increasingly challenging problem as network bandwidth increases. Virtual pipelining provides a simple to analyze programming model of a deep pipeline (deterministic latencies) with a completely different physical implementation (a memory system with banks and probabilistic mapping). This allows designers to effectively decouple the analysis of their algorithms and data structures from the analysis of the memory buses and banks. Unlike specialized hardware customized for a specific data-plane algorithm, our system makes no assumption about the memory access patterns. In the domain of network processors this will be of growing importance as the size of the routing tables, the complexity of the packet classification rules, and the amount of packet buffering required all continue to grow at a staggering rate. We present a mathematical argument for our system’s ability to provably provide bandwidth with high confidence and demonstrate its functionality and area overhead through a synthesizable design. We further show that, even though our scheme is general purpose to support new applications such as packet reassembly, it outperforms the state of the art in specialized packet buffering architectures.

1   Introduction

While consumers reap the benefits of ever increasing network functionality, the technology underlying these advances requires armies of engineers and years of design and validation time. The demands placed on a high throughput network device are significantly different from those encountered in the desktop domain. Network components need to reliably service traffic even under the worst conditions [22, 17, 12, 27, 14], yet the underlying memory components on which they are built are often optimized for common case performance. The problem is that network processing, at the highest throughputs, requires massive amounts of memory bandwidth with worst-case throughput guarantees. A new packet may arrive every three nanoseconds for OC-3072, and each packet needs to be buffered, classified into a service class, looked up in the forwarding table, queued for switching, rate controlled, and potentially even scanned for content. Each of these steps may require multiple dependent accesses to large irregular data structures such as trees, sparse bit-vectors, or directed graphs, usually from the same memory hierarchy. To make things worse, the size of the data structures grows significantly with the line rate (40 Gbps to 160 Gbps). Routing tables have grown from 100K to 360K prefixes and classification rules have grown from 2000 to 5000 rules. Network devices will become increasingly reliant on high density commodity DRAM to remain competitive in both pricing and performance.

In this paper, we present virtually pipelined network memory (VPNM), an idea that shields algorithm and processor designers from the complexity inherent to commodity DRAM devices, which are optimized for common case performance. The pipeline provides a programming model and timing abstraction which greatly eases analysis. A novel memory controller upholds that abstraction and handles all the complexity of the memory system, including bank conflicts, bus scheduling, and worst case caching. This frees the programmer from having to worry about any of these issues, and the memory can be treated as a flat deeply pipelined memory with fully deterministic latency no matter what the memory access pattern is. Building a memory controller that can create such an illusion requires that we solve several major problems:

• Multiple Conflicting Requests: Two memory requests that access the same bank in memory will be in conflict, and we will need to stall at least one request. To hide these conflicts, our memory controller uses per-bank queues along with a randomized mapping to ensure that independent memory accesses have a statistically bounded number of bank conflicts. (Section 3.2)
• Reordering of Requests: To resolve bank conflicts requests need to be reordered, but a virtual pipeline requires deterministic (in-order) behavior. Since the latencies of all memory accesses are normalized through novel specialized queues, distributed reordering of accesses can be done, which creates the appearance of fully pipelined memory. (Sections 3.3 and 4.1)
• Redundant Requests: As repeated requests for the same data cannot be randomized to different banks, normalizing the latency for these requests could create the need for gigantic queues. Instead we have built a novel form of merging queues that combines redundant requests and acts as a cache, but provides data playback at the right times to maintain the illusion of a pipeline. (Section 3.4)
• Worst Case Analysis: In addition to the architectural challenges listed above, reasoning about the worst case behavior of our system requires careful mathematical analysis. We show that it is provably hard for even a perfect adversary to create stalls in our virtual pipeline with greater effectiveness than random chance. (Sections 5.1 and 5.2)

To quantify the effectiveness of our system, we have performed a rigorous mathematical analysis, executed detailed simulation, created a synthesizable version, and estimated hardware overheads. In order to show that our approach will actually provide both high performance and ease of programming, we have implemented a packet buffering scheme as a demonstration of performance, and a packet reassembler as a demonstration of usefulness, both using our memory system. We show that despite the generality of our approach (it does not assume the head-read tail-write property) it compares favorably with several special-purpose packet buffering architectures in both performance and area (Section 5.4).

2   Related Work

Dealing with timing variance in the memory system has certainly been addressed in several different ways in the past. Broadly the related work can be broken up into two groups: scheduling and bank conflict reduction techniques that work in the common case, and special purpose techniques that aim to put bounds on worst case performance on particular classes of access patterns.

Common-case DRAM Optimizations – Memory bank conflicts not only affect the total throughput available from a memory system, they can also significantly increase the latency of any given access. In the traditional processing domain, memory latency can often be the most critical factor in terms of determining performance and several researchers have proposed hiding this latency with bank-aware memory layout, prefetching, and other architectural techniques [11, 21]. While latency is critical, traditional machines are far more tolerant of non-uniform latencies and reordering because many other mechanisms are in place to ensure the proper execution order is preserved. For example, in the streaming memory controller (SMC) architecture, memory conflicts are reduced by servicing a set of requests from one stream before switching to a different stream [16]. A second example is the memory scheduling algorithm where memory bandwidth is maximized by reordering various command requests [25]. In the vector processing domain [10], a long stream requires conflict free access for a larger number of strides. Rau et al. [24] use randomization to spread the accesses around the memory system, and through the use of Galois fields show that it is possible to have a pseudo-random function that will work as well on any possible stride. While address mapping such as skewing or linear transformations can be used for constant stride, out of order accesses can efficiently handle a larger number of strides [20]. Corbal et al. [5] present a command vector memory system (CVMS) where a full vector request is sent to the memory system instead of individual addresses to provide higher performance.

While these optimizations are incredibly useful, industrial developers working on devices for the core will not adopt them due to the fact that certain deterministic traffic patterns could cause performance to sink drastically. Dropping a single packet can have an enormous impact on network throughput (as this causes a cascade of events up the network stack) and the customer needs to be confident that the device will uphold its line rate. In this domain it would be ideal if there were a general purpose way to control banked access such that conflicts never occur. Sadly this is not possible in the general case [4]. However, if the memory access patterns can be carefully constrained, algorithms can be developed which solve certain special cases.

Removing Bank Conflicts in Special Cases – One of the most important special cases that has been studied is packet buffering. Packet buffering is one of the most memory intensive operations that networks need to deal with [17, 12], and high speed DRAM is the only way to provide both the density and performance required by modern routers. However, in recent years researchers have shown special methods for mapping these queues onto banks of memory in such a way that conflicts are either unlikely [14, 2, 22, 29] or impossible [12, 17]. These techniques rely on the ability to carefully monitor the number of places in memory where a read or write may occur without a bank conflict, and to schedule memory requests around these conflicts in various ways. For example, in [12], a specialized structure similar to a reorder buffer is used to schedule accesses to the heads and tails of the different packet buffer queues. The technique combines clever algorithms with careful microarchitectural design to ensure worst case bounds on performance are always met in the case of packet buffering. Randomization has also been considered in the packet buffering space. For example, Kumar et al. present a technique for buffering large packets by randomly distributing parts of the packet over many different memory channels [19]. However, this technique can handle neither small packets nor bank conflicts. Another important special case is data-plane algorithms which may also suffer from memory bank conflict concerns. Whether these banks of memory are on or off chip, supporting multiple non-conflicting banked memory accesses requires a significant amount of analysis and planning. For example, a conflict reduced tree-lookup engine was proposed by Baboescu et al. [1]. A tree is broken into many sub-trees, each of which is then mapped to parts
of a rotating pipeline. They prove that optimally allocating the sub-trees is NP-complete and present a heuristic mapping instead. Similarly in [15], a conflict-free hashing technique is proposed for longest prefix match (LPM) where conflicts are taken care of at an algorithmic level. While the above methods are very powerful, they all require careful layout (by the programmer or hardware designer) of each data structure into the particular bank structure of the system and allow neither changes to the data structures nor sharing of the memory hierarchy.

While our approach may have one stall on average once every 10^13 cycles (on the order of hours), the benefit is that no time has to be spent considering the effect of banking on already complex data structures. As we will describe in Section 3, a virtually pipelined network memory system uses cryptographically strong randomization, several new types of queues, and careful probabilistic analysis to ensure that deterministic latency is efficiently provided with provably strong confidence.

3   Virtually Pipelined Memory

Vendors need to have confidence that their devices will operate at the advertised line rates regardless of operating conditions, including when under denial of service attack by adversaries or in the face of unfortunate traffic patterns. For this reason, most high-end network ASICs do not use any DRAM for data-plane processing because the banks make worst-case analysis difficult or impossible. The major exception to this rule is packet buffering; even today it requires an amount of memory that can only be satisfied through DRAM and a great deal of effort has been expended to map packet buffering algorithms into banks with worst-case bounds. Later in Section 5.4.1 we compare our implementation against several special purpose architectures for packet buffering.

3.1 DRAM Banks

Modern DRAM designs try to expose the internal bank structure so accesses can be interleaved and the effective bandwidth can be increased [6, 13]. The various types of DRAM differ primarily in their interfaces at the chip and bus level [8, 7, 23], but the idea of banking is always there. Experimental evidence [23] indicates that on average PC133 SDRAM works at 60% efficiency and DDR266 SDRAM works at 37% efficiency, where 80 to 85% of the lost efficiency is due to the bank conflicts. To help address this problem RDRAMs expose many more banks [23]. For example, in Samsung Rambus MR18R162GDF0-CM8 each RDRAM device can contain up to 32 banks and each RIMM module can contain up to 16 such devices, so the module can have up to 32 × 16 = 512 independent banks [26].

A bank conflict occurs when two accesses require different rows in the same bank. Only one can be serviced at a time, and hence one will be delayed by L time steps. L is the ratio of bank access time to data transfer time – in other words it is the number of accesses that will have to be skipped before a bank conflict can be resolved. Throughout this paper we conservatively assume that there is one transfer per cycle and we select the value of L=20 [26, 30]. If L is smaller then our approach will be even more efficient.

3.2 Building a Provably Strong Approach

To prove that our approach will deliver throughput with high confidence, we consider the best possible adversary and show that adversary will never be able to construct a sequence of accesses that perform poorly. First, we map the data to banks in permutations that are provably as good as random. Universal hashes [3], an idea that has been extended by the cryptography community, provide a way to ensure that an adversary cannot figure out the hash function without direct observation of conflicts. The virtual pipeline then prevents an adversary from even seeing those conflicts through specialized queues. This latency normalization not only allows us to formally reason about our system, it also shields the processor from the problem of reordering, and greatly simplifies data structure analysis. While the latency of any given memory access will be increased significantly over the best possible case, the memory bandwidth delivered by the entire scheme is almost equal to the case where there are no bank conflicts. While this may make little sense in the latency-intolerant world of desktop computing, in the network space this can be a huge benefit.

3.3 Distributed Scheduling around Conflicts

While Universal Hashing provides the means to prevent our theoretical adversary from constructing sets of conflicting accesses with greater than random probability, even in a random assignment of data to banks a relatively large number of bank conflicts can occur due to the Birthday Paradox [18]. In fact if there was no queuing used, then it would take only O(√B) accesses before the first stall would occur if there are B banks. Clearly we will need to schedule around these conflicts in order to keep the virtual pipeline timing abstraction. In our implementation, a controller for each bank is used, and each bank handles requests in-order, but each bank is handled independently so the requests to different banks may be handled out-of-order. Each bank controller is then in charge of ensuring that for every access at time t, it returns the result at time t + D for some fixed value of D. As long as this holds, there is no need for the programmer to worry about the fact that there is even such a thing as banks.

One major benefit of our design is that the memory scheduling and reordering can be done in a fully parallel and independent manner. If each memory bank has its own controller, there is exactly one request per cycle, and each controller ensures that the result of a request is returned exactly D cycles later, then there is no need to coordinate between the controllers. When it comes to return the result at time t + D, a bank controller will know that it is always safe to send the data to the interface because by definition it was the only one to get a request at time t.

[Figure 1 panels, left to right: "typical operating mode", "short-cut accesses", "bank overload stall". The x-axis is cycles; the rows show requests and the times at which their data is ready.]

Figure 1: An example of how each bank controller will normalize the latency of memory accesses to a fixed delay (D = 30). In all three graphs the x-axis is cycles and each memory access is shown as a row. The light white boxes are the times during which the request is “in the pipeline”, while the dark grey box is the actual time that it takes to access the bank (L=15). In this way a certain number of bank conflicts can be hidden as long as there are not too many requests in a short amount of time. The graph on the left shows normal operation, while the middle graph shows what happens when there are redundant requests for a single bank which therefore don’t require bank access. The graph on the right shows what happens when there are too many requests to one bank (A-E) in a short period of time thus causing a stall. Later in the analysis section we will also refer to Q which is the maximum number of overlapping requests that can be handled, in this case Q is 30/15 = 2.

3.4 What Can Go Wrong

If there are B banks in the system then any one bank will only have a 1 in B chance of getting a new request on any given cycle.¹ The biggest thing that can go wrong is that we get so many requests to one bank that one of the queues fills up and we need to stall. Reducing the probability of this happening even for the worst-case access pattern requires careful architectural design and mathematical analysis. In fact there are two ways in which a bank can end up getting more than 1/B of the requests.

The first way is that it could be unlucky, and just due to randomness more than 1/B of the requests go to a single bank (because we map them randomly). By keeping access queues, we can ensure that the latency is normalized to D to handle simultaneously occurring bank conflicts. How large that number is, and how long it will take to happen in practice are discussed extensively in Section 5. In practice we find that normalizing D to 1000 nanoseconds is more than enough, and is several orders of magnitude less than a typical router latency of 2 milliseconds. While this is a typical example, the actual value of D is dependent on L and the size of the bank access queue as described in Section 4. While there is a constant added delay to D due to universal hashing, the hash function can be fully pipelined and then it will not have a big impact on the normalized delay D.

The second way we could get many accesses to one bank is that there could be repeated requests for the same memory line. The invariant that a request at time t is satisfied at time t + D must hold for this case as well, and in Section 5.2 we describe how to design a special merging queue to address this second problem. The idea behind our merging queue is that redundant memory accesses are combined into a single access and a single queue entry internally. If an access A comes at t1 and a redundant access A comes at t2, a reply still needs to be made at t1 + D and at t2 + D even though internally only one queue entry for A is maintained. In addition to handling the repeating pattern “A,A,A,A,...” we need to handle “A,B,A,B,...” with only two queue entries. In fact if we need to handle Q bank conflicts without a stall, then we will need to handle up to Q different sets of redundant accesses. In Figure 1 we show how the virtually pipelined network memory works altogether for different type of ac-

¹ This is not to say that each bank will be responsible for exactly 1/B of the requests as in round robin. Round robin will not work here because requests must be satisfied by one bank that contains their memory. Although we get to pick the mapping between memory lines and banks, the memory access pattern will determine which actual memory lines are requested.

4    Implementing the Interface

At a high level, the memory controller implementing our virtual pipeline interface is essentially a collection of decoupled memory bank controllers. Each bank controller handles one memory bank, or one group of banks that act together as a single logical bank. Figure 2 shows one possible implementation where a memory controller contains all of the bank controllers on-chip, and they all share one bus. This would require no modification to the bus or DRAM architecture.

The performance of our controller is limited by the single bus to the memory banks. If we have to service one memory request per cycle, then we need to have one outgoing access on each cycle to the memory bus and the bus will become a bottleneck. Hence, to keep up with the incoming address per cycle and to prevent any accumulation of requests in the bank controller due to mismatched throughputs, we need to support slightly more memory bus transactions per second than
                                                                   more multiple collisions). With randomization due to uni-
                          Controller                     Bank0     versal mapping, and a very high value of Mean-time-to-stall
                                                                   (around 1013 as described in Sections 5.1 and 5.2), the ability

                                         Bus Scheduler
                          Bank                                     to do this will be practically impossible. If such attacks are
                          Controller                     Bank1
        HU                                                         believed to be a threat, a further (and sightly more costly)
                                                                   option is to change the universal mapping function and re-
     address                                             Bank2     ordering the data on the occurrence of multiple stalls (an ex-
                                                                   pensive operation, but certainly possible with frequency on
     data                 Bank
                                                         Bank3     the order of once a day).
                                                                   4.1 Bank Controller Architecture
                                                                      Solving the challenges described in the introduction re-
Figure 2: Memory controller block diagram. After an ac-            quires a carefully designed bank controller. In particular, it
cess is randomized from universal hash engine (HU ), it is         must be able to queue the bank requests, store the data to a
directed to the corresponding bank controller for further          constant delay, and handle multiple redundant requests.
processing.                                                           The architecture block diagram of our bank controller is
                                                                   shown in Figure 3. From the figure we can see that the bank
allowed on the interface bus. We call the ratio of the re-
                                                                   controller consists of five main components described with
quest rate on the interface bus and request rate of memory
                                                                   the text next to each block. The primary tasks of these com-
bus as bus scaling ratio (R). The value of ’R’ is chosen
                                                                   ponents include queuing input requests, initiating a memory
slightly higher than 1 to provide slightly higher access rate
                                                                   request, sending data to the interface at a pre-specified time
on the memory side compared to the interface side. This
                                                                   slot to ensure the deterministic latency and each of these
mismatch ensures that idle slots in the schedule do not accu-
                                                                   components is designed to address one or more of challenges
mulate slowly over time. A round-robin scheduler arbitrates
                                                                   mentioned earlier.
the bus by granting access to each bus controller every B cycles, where B is the number of banks. Some of the round-robin slots may go unused when there is no access for the particular bank or the memory bank is busy, although with further analysis or a split-bus architecture this inefficiency can be eliminated.
    Once it has been determined which bank a particular memory request needs to access, the request is handed off to the appropriate bank controller, which takes care of everything for that bank. Almost all of the latencies in the system are fully deterministic, so there is no need to employ a complicated scheduling mechanism. The only time the latencies are not fully deterministic is when enough memory accesses hit a single bank in a sufficiently small amount of time to cause the latency normalizing technique to stall. However, as we will show in Section 5, the parameters of the architecture can be chosen such that this happens extremely infrequently (on the order of once every trillion requests in the worst case).
    Since stalls happen so infrequently and because the stall time is also very low (in the worst case a full memory access latency of about 100 nanoseconds), stalls can be handled in one of two ways. The first is to simply stall the controller, where the slowdown would not even be a fraction of a percent; the alternative is to simply drop the packet (which would be noise compared to packet loss due to many other factors). In either case, an attacker cannot leverage information about a stall unless they can a) observe the exact instant of the stall, b) remember the exact sequence of accesses that caused the stall, and c) replay the stall-causing events with minor changes (to look for

4.2 Controller Operations
    At a high level each memory request goes through 4 states: pending, accessing, waiting, and completed. New requests start out as pending, and when the request is actually sent out to the DRAM, the request is accessing. When the result returns from DRAM the request is waiting (until D total cycles have elapsed), and finally the request is completed and results are returned to the rest of the system.
    When a new read request comes in, all the valid addresses of the address CAM in the delay storage buffer are searched. On a match (a redundant access), the counter of the matched row is incremented and the id of the matched row is written to the circular delay buffer (along with its valid bit). On a mismatch, a free row is determined using the first-zero circuit; it is updated with the new address and its counter is initialized to one. The id of this free row is written to the circular delay buffer. In the mismatch case, we also add the row id combined with a '0' bit (read) to the bank access queue (where it waits to become accessing). On an incoming write request, the write address and data are added to the write buffer FIFO and a '1' bit (write) is written to the bank access queue. The row id is unused in this case, as we access the write buffer in FIFO order. The write address is also searched in the address CAM and, on a match, the address valid flag is unset. The matched row cannot be used for a new read request, but it continues to service previous read requests until its counter reaches zero, since the data up to the current cycle is still valid. When the counter reaches zero, there are no pending requests for that row and it can serve as a free row for new requests.
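The request handling described in Section 4.2 can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the authors' implementation: the class and method names (`BankController`, `read`, `write`, `complete_oldest_read`) are hypothetical, the circular delay buffer is approximated by a plain FIFO of row ids, clearing the valid flag when a counter reaches zero is an assumption, and timing (the D-cycle delay and the actual DRAM access) is not modeled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    address: Optional[int] = None
    addr_valid: bool = False  # address may be matched by new reads
    counter: int = 0          # outstanding reads served by this row


class BankController:
    """Hypothetical sketch of one bank controller's bookkeeping."""

    def __init__(self, k_rows: int, q_entries: int):
        self.rows = [Row() for _ in range(k_rows)]  # delay storage buffer
        self.q_entries = q_entries
        self.access_queue = deque()  # ("R", row_id) or ("W", None)
        self.write_buffer = deque()  # FIFO of (address, data)
        self.delay_buffer = deque()  # row ids in arrival order

    def _match(self, address):
        # Stand-in for the address CAM: search valid addresses only.
        for i, row in enumerate(self.rows):
            if row.addr_valid and row.address == address:
                return i
        return None

    def _first_free(self):
        # Stand-in for the first-zero circuit: lowest row with counter 0.
        for i, row in enumerate(self.rows):
            if row.counter == 0:
                return i
        return None

    def read(self, address) -> bool:
        """Accept a read request; returns False if it would stall."""
        row_id = self._match(address)
        if row_id is not None:  # redundant access: reuse the row
            self.rows[row_id].counter += 1
        else:
            row_id = self._first_free()
            if row_id is None or len(self.access_queue) >= self.q_entries:
                return False  # delay storage buffer or queue stall
            self.rows[row_id] = Row(address, True, 1)
            self.access_queue.append(("R", row_id))
        self.delay_buffer.append(row_id)
        return True

    def write(self, address, data) -> bool:
        """Accept a write request; returns False if the queue is full."""
        if len(self.access_queue) >= self.q_entries:
            return False
        self.write_buffer.append((address, data))
        self.access_queue.append(("W", None))
        row_id = self._match(address)
        if row_id is not None:
            # Old data still serves earlier reads, but not new ones.
            self.rows[row_id].addr_valid = False
        return True

    def complete_oldest_read(self):
        """Retire the oldest read (what happens D cycles after arrival)."""
        row_id = self.delay_buffer.popleft()
        row = self.rows[row_id]
        row.counter -= 1
        if row.counter == 0:
            row.addr_valid = False  # assumption: a freed row no longer matches
        return row_id
```

In this sketch, two reads to the same address share one row and one bank access, and a later write to that address invalidates the row for new reads while the earlier reads still drain from it.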
Delay Storage Buffer -- The delay storage buffer stores the address of each pending and accessing request, and stores the address and data of waiting requests. Each non-redundant request will have an entry allocated for it in the delay buffer for a total of D cycles. To account for repeated requests to the same address, a counter is associated with each address and data. The buffer contains K rows, where each row contains an address of A bits, a one-bit address valid flag (v), a counter of C bits (with increment/decrement), and data words of W bits. The data words are buffered in these rows whenever the read access to the memory bank completes, and one row is needed for each unique access. A first-zero circuit is attached to the rows.

Bank Access Queue -- The bank access queue keeps track of all pending read and write requests that require access to the memory bank. It can store up to Q interleaved read or write requests in FIFO order. To avoid keeping Q copies of the address and data, each entry is just an r/w bit and the index (log2 K bits) of a target row in the delay storage buffer.

Write Buffer (FIFO) -- The write buffer is organized as a FIFO structure, which stores the address (A bits) and data (W bits) of all incoming write requests in b rows. Unlike read requests, we need not wait for the write requests to complete. We only need to buffer the write request until it gets scheduled to access the memory bank.

Circular Delay Buffer -- The circular delay buffer stores the request identifier of every incoming read request and triggers the final result to be written to the output interface after a deterministic latency (D). This circular delay buffer is the only component which is accessed every cycle irrespective of the input requests. Note that if we just stored the full data here, instead of a pointer to the delay storage buffer, then we would need to have a huge number of bytes to buffer all the data (2 to 3 orders of magnitude more). It is drawn as two sets (Set 0 and Set 1) of d/2 rows each, with entries of log2 K bits and with in and out pointers.

Control Logic -- The control logic handles the necessary communication between components (while the interconnect inside the bank controller is drawn as a bus for simplicity, in fact it is a collection of direct point-to-point connections). It connects the interface address and data buses to the scheduled-access address and data lines that go to memory.

Figure 3: Architecture block diagram of the VPNM bank request controller.
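To make the storage trade-off in Figure 3 concrete, the following sketch adds up the bits held by each structure from the field widths named in the figure, and compares the circular delay buffer with the alternative of storing full data words in it. The function name and the example values are hypothetical, and the assumption that the circular delay buffer holds D entries (each a row id plus a valid bit) is an interpretation of the figure rather than a stated design fact.

```python
def controller_storage_bits(k_rows, q_entries, d_cycles,
                            addr_bits, counter_bits, data_bits):
    """Rough per-bank storage estimate, in bits, for the Figure 3 structures."""
    # Bits needed to index one of K rows (ceil(log2 K) for K > 1).
    row_id_bits = (k_rows - 1).bit_length()
    return {
        # K rows of: valid flag + address + counter + data words
        "delay_storage_buffer":
            k_rows * (1 + addr_bits + counter_bits + data_bits),
        # Q entries of: r/w bit + row id
        "bank_access_queue": q_entries * (1 + row_id_bits),
        # Assumed D entries of: valid bit + row id
        "circular_delay_buffer": d_cycles * (1 + row_id_bits),
        # Alternative the paper argues against: full data in every slot
        "circular_delay_buffer_if_full_data": d_cycles * data_bits,
    }


# Illustrative values only; none of these numbers come from the paper.
sizes = controller_storage_bits(k_rows=32, q_entries=8, d_cycles=100,
                                addr_bits=32, counter_bits=8, data_bits=512)
```

The gap between the last two entries grows with the ratio of the data word width W to the row id width, which is the effect the Circular Delay Buffer description points to.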

    During each cycle, the controller scans the bank access queue and reads from the circular delay buffer. If the bank controller is granted permission to schedule a memory bank request, then the first request in the bank access queue is dequeued for access. In the case of a read access, the address is read from the delay storage buffer and put on the memory bank address bus. In the case of a write access, the address and the data words are dequeued from the write buffer and the write command is issued to the memory bank. If there is no incoming read request in the current cycle, the control logic invalidates the current entry of the circular delay buffer. On every cycle, it also reads the D-cycle delayed request id from the circular delay buffer. If it is valid, then the data is read from the data words present in the delay storage buffer and put on the interface bus. Since we do one read and one write operation to the circular delay buffer every cycle, it is designed as a 2-set (single-ported) architecture with in and out pointers to save power. While there are some additional aspects of the bank controller design, we cannot fully describe all of the low level implementation aspects in this paper due to space limitations.

4.3 Stall Conditions
    The aim of the VPNM bus controller architecture is to provide a provably small stall rate in the system through randomization, but the actual stall rate is a function of the parameters of the system. There are three different cases which require a stall to resolve, each of which is influenced by a different subset of the parameters.
• Delay storage buffer stall - The number of rows (K) in the delay storage buffer is limited and a row has to be reserved for D cycles for one data output. Hence, if there are no free rows and the controller cannot reserve a row for a new read request, the result is a delay storage buffer stall. This stall is mainly dependent on the following parameters: 1) Number of rows (K) in the delay storage buffer 2) Deterministic delay (D) 3) Number of banks (B). The deterministic delay is
  determined using the access latency (L) and the bank request queue size (Q), and this stall analysis is presented in Section 5.1.
• Bank request queue stall - When a new non-repeating read/write request comes to a bank and the size of the bank access queue is already Q, then the new request cannot be accommodated in the queue. This condition results in a bank request queue stall. There are three main parameters which control this stall: 1) the average input rate, which is equal to 1/B, where B is the number of banks 2) the queue size (Q) 3) the output rate, which is decided by the ratio (R) of the frequency on the memory side to the frequency on the interface side. In Section 5.2 we discuss exactly how to perform the confidence analysis for this stall.
• Write buffer stall - A write buffer (WB) stall happens when a write request cannot be added to the write buffer. As we keep the write buffer equal to half of the bank request queue size, the stall rate of the write buffer is much lower than the stall rate of the bank request queue. The analysis of the write buffer stall is similar to the analysis of the bank request queue and does not dominate the overall stall, so we will only discuss the bank request queue stall and the delay storage buffer stall in our mathematical analysis in Section 5.

5   Analysis of Design
    The Virtually Pipelined Network Memory can stall in the three ways described in Section 4.3. In any of these cases, the buffer will have to stall, and it will not be able to take a new request that cycle. Because we randomize the mapping we can formally analyze the probability of this happening, and because we use the cryptographic idea of universal hashing we know that there is no deterministic way for an adversary to generate conflicts with greater than random probability unless they can directly see them. We ensure that the conflicts are not visible through latency normalization (queuing both before and after a request) unless a very large number of different combinations are tried. We quantify this number, and the confidence we place in our throughput, as the Mean Time to Stall (MTS). It is important to maximize the MTS, a job we can perform through optimization of the parameters described in Section 4 and summarized in Table 1.
    To evaluate the effect of these parameters on MTS, we performed three types of analysis: Simulation (for functionality), Mathematical (for MTS), and Design (to quantify the hardware overhead). To get an understanding of the execution behavior of our design, and to verify our mathematical models, we have built functional models in both C and Verilog and we have synthesized our design using Synopsys Design Compiler. However, in this paper we concentrate on the mathematical analysis of the delay storage buffer stall and the bank access queue stall, the calculation of the mean time to stall (MTS) for both these cases, and a high level analysis of the hardware required.

Table 1: Parameters for the analysis of our controller
    Q -- number of entries in the bank access queue
    K -- number of rows in the delay storage buffer
    B -- number of banks in the system
    L -- latency of accessing one bank
    D -- delay to which all memory accesses are normalized
    R -- frequency scaling ratio

5.1 Delay Storage Buffer Stall
    A delay buffer entry is needed to store the data associated with an access for the duration of D. A buffer will overflow if there are more requests assigned to it over a period of D cycles than there are places to store those requests. To calculate the Mean Time to Stall (MTS) we need to determine the expected amount of time we will have to wait until one of the B banks gets K or more requests over D cycles. The mapping of requests to banks is random, so we can treat the bank assignments as a random sequence of integers $(a_1, a_2, \ldots, a_T)$, where each $a_i$ is drawn from the uniform distribution on $\{1, 2, \ldots, B\}$.
    If we want to know the probability of a stall after T cycles, then for any $i \le T - D + 1$, we can detect a stall happening when at least $K - 1$ of the symbols $a_{i+1}, \ldots, a_{i+D-1}$ are equal to $a_i$; the probability of this is at most $\binom{D-1}{K-1} \cdot \left(\frac{1}{B}\right)^{K-1}$, so the probability of not having a delay buffer overfill over the given interval is at least $1 - \binom{D-1}{K-1} \cdot \left(\frac{1}{B}\right)^{K-1}$. Since we are only concerned with the probability that at least one stall occurs and not how many, we can conservatively estimate the probability of no stall occurring over the entire sequence as $\left(1 - \binom{D-1}{K-1} \cdot \left(\frac{1}{B}\right)^{K-1}\right)^{T-D+1}$. This method assumes that stalls are independent, when in fact they are positively correlated, and it actually counts some stalls multiple times. Setting the probability of no stall to 50% and solving for the number of windows gives the Mean Time to Stall:

$$
\mathrm{MTS} = \frac{\log\left(\frac{1}{2}\right)}{\log\left(1 - \binom{D-1}{K-1} \cdot \left(\frac{1}{B}\right)^{K-1}\right)}
$$

    Figure 4 shows the impact of the number of entries in the delay storage buffer (K) on this stall. We take the value of R = 1.3 in this case. Since B and Q are interrelated for this analysis, we select the optimal combination of B and Q. We cap the MTS value at $10^{16}$ in all of our analysis (see footnote 2). Figure 4 shows that for B = 32, the curve rises sharply with K and we can get an MTS of $10^{12}$ for K = 32. The curve for B = 64 follows the curve for B = 32 very closely. Hence, having B = 32 is optimal in our case. For a lower number of banks (B < 32), we need much higher values of K to even reach an MTS value of $10^{8}$.

Footnote 2: An MTS of $10^{12}$ is around one stall every 15 minutes with a very aggressive bus transaction speed of 1 GHz.

5.2 Bank Access Queue Stall
    Performing an analysis similar to that presented in Section 5.1 will not work for the bank access queue because
there is no fixed window of time over which we can analyze the system combinatorially. There is state involved because the queue may cause a stall or not depending on the amount of work left to be done by the memory bank. To analyze the stall rate of the bank access queue we determined that the queue essentially acts as a probabilistic state machine.
    To do the analysis, we need to combine this abstract state machine with the probabilities that any transition will occur. Each cycle a new request will come to a given bank controller with probability $\frac{1}{B}$, and the probability that there will be no new request is $1 - \frac{1}{B}$. The probabilistic state machine that we are left with is a Markov Model. In Figure 5 we can see the probabilistic model stored both as a directed graph, and in adjacency matrix form labeled M.

Figure 4: MTS variation with number of entries in delay storage buffer (K) for memory controller with R = 1.3. (Plot of MTS in number of cycles against the number of entries in the delay storage buffer, 0 to 128, with curves for B=4, Q=12; B=8, Q=12; B=16, Q=12; B=32, Q=8; and B=64, Q=8.)

Figure 5: Markov Model that captures the fail probability of a Bank Access Queue with L = 3 and Q = 2. With probability 1/B a new request will arrive at any given bank, causing there to be L more cycles worth of work. (The directed graph has the states idle, 1, 2, 3, 4, 5, 6, and fail, with edges labeled $\frac{1}{B}$ and $1 - \frac{1}{B}$.) With rows and columns ordered from the idle state (first) to the fail state (last), the initial vector and adjacency matrix are:

$$
I = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}
$$

$$
M = \begin{pmatrix}
1-\frac{1}{B} & 0 & 0 & \frac{1}{B} & 0 & 0 & 0 & 0 \\
1-\frac{1}{B} & 0 & 0 & 0 & \frac{1}{B} & 0 & 0 & 0 \\
0 & 1-\frac{1}{B} & 0 & 0 & 0 & \frac{1}{B} & 0 & 0 \\
0 & 0 & 1-\frac{1}{B} & 0 & 0 & 0 & \frac{1}{B} & 0 \\
0 & 0 & 0 & 1-\frac{1}{B} & 0 & 0 & 0 & \frac{1}{B} \\
0 & 0 & 0 & 0 & 1-\frac{1}{B} & 0 & 0 & \frac{1}{B} \\
0 & 0 & 0 & 0 & 0 & 1-\frac{1}{B} & 0 & \frac{1}{B} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$
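The Markov model of Figure 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' analysis code: the function names are hypothetical, states are taken to be the number of cycles of work outstanding (0 for idle, up to L times Q, plus one absorbing fail state), a new request is assumed to add L cycles of work, and exact fractions are used only to keep small examples free of rounding error. The sketch also shows how the state distribution evolves when the initial vector I is multiplied by M once per cycle.

```python
from fractions import Fraction


def build_transition_matrix(num_banks, latency, queue_size):
    """Transition matrix; state 0 is idle and the last state is fail."""
    max_work = latency * queue_size
    fail = max_work + 1
    p = Fraction(1, num_banks)  # probability of a new request per cycle
    m = [[Fraction(0)] * (fail + 1) for _ in range(fail + 1)]
    for work in range(max_work + 1):
        # No new request: one cycle of work drains (idle stays idle).
        m[work][max(work - 1, 0)] += 1 - p
        # New request: L more cycles of work, or fail on overflow.
        target = work + latency
        m[work][target if target <= max_work else fail] += p
    m[fail][fail] = Fraction(1)  # fail is an absorbing state
    return m


def step(dist, m):
    """One cycle: multiply the row vector dist by the matrix m."""
    n = len(m)
    return [sum(dist[i] * m[i][j] for i in range(n)) for j in range(n)]


def stall_probability(m, cycles):
    """Probability of having reached the fail state within `cycles` cycles."""
    dist = [Fraction(1)] + [Fraction(0)] * (len(m) - 1)  # start idle
    for _ in range(cycles):
        dist = step(dist, m)
    return dist[-1]
```

For L = 3 and Q = 2, `build_transition_matrix` yields an 8 by 8 matrix with the same shape as M in Figure 5. For realistic parameters a floating-point matrix library and repeated squaring would be the more practical choice, since the exact fractions here grow quickly with the number of cycles.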

    The adjacency matrix form has a very nice property: given an initial starting state at cycle zero, stored as the vector I, to calculate the probability of landing in any state at cycle one, we simply multiply I by M. In the example given, there is probability P of being in state 2, 1-P of still being in the idle state. This process can then be repeated, and to get the distribution of states after t time steps we simply multiply I by M t times, which is of course $I M^t$. Note that the stall state is an absorbing state, so the probability of being in that state tells us the probability of there ever having been a bank overflow on any of the t cycles. To calculate that probability, we simply need to calculate $M^t$.
    We use this analysis to figure out the impact of the bank request queue size (Q) on MTS. The effect of the normalized delay D can also be seen directly, as D is directly proportional to Q. If we decrease/increase the value of D, then we have to decrease/increase the value of Q accordingly. For our memory controller with a value of R = 1.3, the MTS graph is shown in Figure 6. We find that for B = 32 and B = 64, the curve for MTS is almost the same. We can clearly see from the figure that a lower number of banks (B < 32) can only provide a maximum MTS value of $10^{2}$ even for larger values of Q. Hence, an SDRAM with its small number of banks cannot achieve a reasonable MTS. However, for B = 32 and B = 64, we see an exponential increase in MTS with the increasing value of Q. We can get an MTS of $10^{14}$ for Q = 64 using 32 or 64 banks. If an application does not demand a high value of MTS but requires a lower normalized delay, then we can use the system with a lower value of Q and with 32 or 64 banks. We did not calculate the MTS values for B >= 128 because the large matrix size makes our analysis very difficult (the matrix requires more than 2 GB of main memory).

Figure 6: MTS variation with number of entries in bank access queue (Q) for our controller with R = 1.3. (Plot of MTS in number of cycles against the number of entries in the bank access queue, 0 to 64, with curves for 4, 8, 16, 32, and 64 banks.)

5.3 Hardware Estimation
    The structures presented in Section 4 ensure that only probabilistically well formed modes of failure are possible and that exponential improvements in MTS can be achieved for linear increases in resources. While the analysis above allows us to formally analyze the stall rate of our system, without a careful estimate of the area and power overhead it is hard to understand the tradeoffs fully. To explore this de-
                               1E+15                                        R=1.5 (33%)
                               1E+14       day                                                 R=1.4 (28%)    Table 2: Optimal design parameters for best MTS and
                                                                                               R=1.3 (23%)
                               1E+13       hour                                                               area overhead combination
 Mean Time to Stall (cycles)

                                                                                                                  Frequency      Area        MTS     Optimal
                                                                                                R=1.2 (16%)         Scaling    overhead       in     design              Energy
                                           second                                                                  ratio (R)   in mm2       cycles   parameters          in nJ
                               1E+08                                                                                  1.3         13.6    5.12e+05   B=32, Q=24, K=48     11.09
                                                                                                                      1.3         19.4    2.34e+07   B=32, Q=32, K=64     13.26
                                                                                                                      1.3         34.1    4.57e+10   B=32, Q=48, K=96     17.05
                                                                                                R=1.1 (9%)            1.3         53.2    6.50e+13   B=32, Q=64, K=8      21.51
                               1E+05                                                                                  1.4         13.6    1.14e+07   B=32, Q=24, K=48     10.79
                               1E+04                                                             R=1.0 (0%)           1.4         19.3    1.69e+09   B=32, Q=32, K=64     12.83
                               1E+03                                                                                  1.4         34.0    3.62e+13   B=32, Q=48, K=96     16.38
                               1E+02                                                                                  1.4         53.0    9.75e+13   B=32, Q=64, K=128    20.54

                                       0            10   20       30
                                                              Area (mm^2)
                                                                             40           50                     We calculate the optimal parameters from Figure 7 and
                                                                                                              we find the energy consumption for these optimal param-
Figure 7: MTS with area overhead for our memory con-                                                          eters. The optimal parameters along with all design con-
troller for different frequency ratios (R)                                                                    straints are shown in Table 2. The table shows that for R=1.3
                                                                                                              and R=1.4, we need around 32 banks, 32 to 48 bank access
sign space, we developed a hardware overhead analysis tool                                                    queue entries, and 64 to 96 storage delay buffer entries with
for our bank controller architecture that takes these design                                                  10 to 20 nJ energy consumption.
parameters (B,L,K,Q,R,tech) as inputs and provides area
and energy consumption for the set of all bank controllers.
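As an illustrative aside, the kind of lookup that Table 2 supports can be sketched in a few lines of Python. The rows below are transcribed from Table 2 (the K column is omitted); the names DesignPoint, TABLE_2, and cheapest_point are hypothetical and are not part of the authors' overhead tool, and the 1 GHz clock is the assumption the text itself uses when converting time to cycles. Under those assumptions, the sketch picks the smallest-area row that meets a required MTS for a given R.

```python
import math
from dataclasses import dataclass
from typing import Optional

CLOCK_HZ = 1e9  # 1 GHz clock, the text's assumption for converting seconds to cycles


@dataclass(frozen=True)
class DesignPoint:
    r: float            # frequency scaling ratio R
    area_mm2: float     # total area overhead of all bank controllers
    mts_cycles: float   # mean time to stall, in cycles
    banks: int          # B
    queue: int          # Q
    energy_nj: float    # energy in nJ


# Rows transcribed from Table 2 (K column omitted).
TABLE_2 = [
    DesignPoint(1.3, 13.6, 5.12e05, 32, 24, 11.09),
    DesignPoint(1.3, 19.4, 2.34e07, 32, 32, 13.26),
    DesignPoint(1.3, 34.1, 4.57e10, 32, 48, 17.05),
    DesignPoint(1.3, 53.2, 6.50e13, 32, 64, 21.51),
    DesignPoint(1.4, 13.6, 1.14e07, 32, 24, 10.79),
    DesignPoint(1.4, 19.3, 1.69e09, 32, 32, 12.83),
    DesignPoint(1.4, 34.0, 3.62e13, 32, 48, 16.38),
    DesignPoint(1.4, 53.0, 9.75e13, 32, 64, 20.54),
]


def cheapest_point(r: float, min_mts_seconds: float,
                   table=TABLE_2) -> Optional[DesignPoint]:
    """Return the smallest-area row for ratio r whose MTS covers
    min_mts_seconds at CLOCK_HZ, or None if no row qualifies."""
    target_cycles = min_mts_seconds * CLOCK_HZ
    candidates = [p for p in table
                  if math.isclose(p.r, r) and p.mts_cycles >= target_cycles]
    return min(candidates, key=lambda p: p.area_mm2) if candidates else None


if __name__ == "__main__":
    for label, seconds in [("1 second", 1), ("1 hour", 3600), ("1 day", 86400)]:
        for r in (1.3, 1.4):
            print(label, r, cheapest_point(r, seconds))
```

With these rows, a one-second target at R = 1.3 selects the 34.1 mm^2 configuration and a one-hour target at R = 1.4 selects the 34.0 mm^2 configuration, which is in line with the roughly 30 mm^2 figures quoted in Section 5.3.1.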
We use a validated version of the Cacti 3.0 tool [28] and our synthesizable Verilog model to design our overhead tool and use 0.13 µm CMOS technology to evaluate the hardware overhead.

5.3.1 Optimal Parameters

Since area overhead is one of the most critical concerns as it directly affects the cost of the system, we take the total area overhead of all the bank controllers as our key design parameter to decide the value of MTS. As a point of reference, one bank controller (which then needs to be replicated per bank) with L = 20, K = 24, and Q = 12 occupies 0.15 mm^2. We run the hardware overhead tool for several thousand configurations with varying architectural parameters and consider the Pareto optimal design points in terms of area, MTS, and bandwidth utilization (R). We also set some baseline required values of MTS, which are 1 second (10^9 cycles), 1 hour (3.6 × 10^12 cycles), and 1 day (8.64 × 10^13 cycles) for an aggressive 1 GHz clock frequency. While this is not small, our example parameter set describes a design that targets a very aggressive bandwidth system and compares favorably with past special purpose designs (see Section 5.4).

The Pareto-optimal curve for our memory controller is shown in Figure 7. This figure shows an interesting tradeoff between the MTS and the utilization of effective bandwidth on the memory bus side. If we increase the value of R, then we get better values of MTS with lower effective utilization of the memory bus. For R = 1.3, we need 23% extra memory bus bandwidth, but with a much better stall rate compared to R = 1.2 (16% extra bandwidth). We find that we can choose either R = 1.3 (one second MTS = 10^9 for about 30 mm^2) or R = 1.4 (one hour MTS = 3.6 × 10^12 for about 30 mm^2) to get the best values of MTS without compromising much of the memory bus speed utilization.

5.4 Applications Mapping

To demonstrate the usefulness and generality of our approach, in this section we show how our Virtually Pipelined Network Memory can be easily used to implement two different high-speed memory intensive data-plane algorithms. By implementing Packet Buffering on top of VPNM we can directly compare against special purpose hardware designs in terms of performance. While our approach hides the complexity of banking from the programmer, it can match and even exceed the performance of past work that requires specialized bank-aware algorithms. To further show the usefulness of our system, we have also mapped a Packet Reassembler (used in content inspection) to our design, a memory bound problem for which there is no current bank-safe algorithm known.

5.4.1 Packet Buffering

Packets need to be temporarily buffered from the transmission line until the scheduler issues a request to forward the packet to the output port. According to current industry practice, the amount of buffering required is 2RT [17], where R is the line rate and T is the round trip time over the Internet. For 160 gbps line rate and a typical round trip time of 0.2 second [12], the buffer size will be 4 GB. The main challenge in packet buffering is to deal with constantly increasing line rate (10 gbps to 40 gbps and from 40 gbps to 160 gbps) and the number of interfaces (order of hundreds to order of thousands).

Using DRAM as intermediate memory for buffering does not provide full efficiency due to DRAM bank conflicts [22, 12]. In [22], an out-of-order technique has been proposed to reduce bank conflicts and meet the packet buffering requirement for 10 gbps. Iyer et al. [17] have used a combination of SRAM and DRAM, where SRAMs are used for storing some head and tail packets for each queue. This combination allows them to buffer packets at 40 gbps using
some clever memory management algorithms (for example: earliest critical queue first (ECQF)). But they do not consider the effect of bank conflicts. Garcia et al. [12] take their approach further by providing a DRAM subsystem (CFDS) that can handle bank conflicts (through a long reorder buffer like structure) and schedule a request to DRAM every b cycles, where b can be less than the random access time of DRAM. A comparison of their approach and RADS [17] reveals that CFDS requires less head and tail SRAM and it can provide packet buffering at 160 gbps. The data granularity for DRAM used in [12] is b cells, where the size of one cell is 64 bytes.

Since our architecture can handle arbitrary access patterns (they don't have to be structured requests directed by a queue management algorithm), packet buffering is just a special case of our system, requiring one write access and one read access. Instead of keeping large head and tail SRAMs to store packets, we just need to store the head and tail pointers of each queue in SRAM. On a read from a particular queue, the head pointer will be incremented by the packet size, whereas a write to a particular queue will increment the tail pointer by the packet size. Our universal hash hardware unit randomizes the address from these pointers uniformly across different banks. In our approach, a request can be issued per cycle, whereas in [12] a request can be issued every b cycles. Their architecture is very difficult to design for b = 1, as they have also said in their paper: "The implementation of RR scheduling logic for OC-3072 and b = 1 is certainly of difficult viability."

As we just need to store the head and tail pointers for each queue (rather than actual entries in the queue), we can provide support for a large number of queues (up to 4096 with an SRAM size of 32KB, which can be further increased to support even more queues). We use the same data granularity used in [12] and compare our results with [22], RADS [17], and CFDS [12] by taking into account the throughput, area overhead, normalized delay and maximum number of supported interfaces. The comparison results are provided in Table 3 for 0.13 µm technology. Table 3 shows that our scheme and the CFDS scheme [12] can provide data throughput of 160 gbps because memory requests can be scheduled every cycle in our case and every b cycles in the CFDS scheme. But our scheme requires about 35% less area, introduces ten times less latency, and can support about five times the number of interfaces compared to the CFDS scheme.

Table 3: Comparison of packet buffering schemes with our generalized architecture

Packet buffering           Max. line rate   SRAM      Area      Total delay   No. of supported
scheme                     (gbps)           size      (mm^2)    (ns)          interfaces
Nikologiannis et al. [22]  10               520 KB    27.4      -             64000
RADS [17]                  40               64 KB     10        53            130
CFDS [12]                  160              -         60        10000         850
Our approach               160              320 KB    41.9      960           4096

5.4.2 Packet Reassembly

In an intrusion detection/prevention processing node, the content inspection techniques scan each incoming packet for any malicious content. Since most of these techniques examine each packet irrespective of the ordering/sequence of packets, they are less effective for intrusion detection because a clever attacker can craft out-of-sequence TCP packets such that the worm/virus signature is intentionally divided on the boundary of two reordered packets. By doing TCP packet reassembly as a preprocessing step, we can ensure that packets are always scanned in-order. In essence, packet reassembly provides a strong front end to effective content inspection.

While Dharmapurikar et al. [9] have proposed a packet reassembly mechanism which is robust even in the presence of adversaries, unlike the state of the art in packet buffering techniques, their algorithm does not consider the presence of memory banks (and thus the bounds on performance are not tight). Of course, algorithm designers would rather deal with network problems than map their data structures to banks by hand. VPNM provides exactly that ability, and we have mapped their technique [9] to a virtually pipelined memory system. Using the same data granularity for DRAM as in [12] and processing 64 bytes or less each cycle, we find the need to perform one DRAM read access for accessing the connection record, one DRAM access for accessing the corresponding hole-buffer data structure, one DRAM access to update this data structure, one DRAM access to write the packet, and one DRAM access to finally read the packet in the future. Hence, for each 64-byte packet chunk, five DRAM accesses are required. Since our memory system can process requests every cycle, with a 400 MHz RDRAM [23] we can get an effective throughput of (400 MHz / 5) × 64 bytes = 5.12 GB/s, or about 40 Gbps, which is more than enough to feed the current generation of content inspection engines. We do require some amount of extra storage space compared to [9], as we need to store each packet in a FIFO for the duration of three DRAM accesses (3 ∗ D), which requires 72 Kbytes of SRAM.

6 Conclusion

Network systems are increasingly asked to perform a variety of memory intensive tasks at very high throughput. In order to reliably service traffic with guarantees on throughput, even under worst case conditions, specialized techniques are required to handle the variations in latency caused by memory banking. Certain algorithms can be carefully mapped to memory banks in a way that ensures worst case performance goals are met, but this is not always possible and requires careful planning at the algorithm, system, and hardware levels. Instead we present a general purpose technique for separating these two concerns, virtually pipelined
network memory, and show that with provably high confidence it can simultaneously solve the issues of bank conflicts and bus scheduling for throughput oriented applications. To achieve this deep virtual pipeline, we had to solve the challenges of multiple conflicting requests, reordering of requests, repeated requests, and timing analysis of the system. We have performed rigorous mathematical analysis to show that there is on the order of one stall in every 10^13 memory accesses. Furthermore, we have provided a detailed simulation, created a synthesizable version to validate implementability, and estimated hardware overheads to better understand the tradeoffs. To demonstrate the performance and generality of our virtually pipelined network memory, we have considered the problems of packet buffering and packet reassembly. For the packet buffering application, we find that our scheme requires about 35% less area, has about ten times less latency, and can support about five times as many interfaces compared to the best existing scheme for OC-3072 line rate. While we have presented the packet buffering and reassembly implementation using our architecture, in the future we will explore the potential of mapping other data plane algorithms into DRAM, including packet classification, packet inspection, application-oriented networking, and potentially even a broader class of irregular streaming applications.

Acknowledgments

We would like to thank the anonymous reviewers for providing useful comments. We also thank John Brevik from UCSB for providing help and useful insights to the mathematical analysis. This work was funded in part by NSF Career Grant CCF-0448654.

References

[1] F. Baboescu, D. M. Tullsen, G. Rosu, and S. Singh. A tree based router search engine architecture with single port memories. In ISCA ’05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 123–133, 2005.
[2] G. A. Bouchard, M. Calle, and R. Ramaswami. Dynamic random access memory system with bank conflict avoidance feature. United States Patents, US 6,944,731, September 2005.
[3] J. L. Carter and M. N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18:143–154, 1979.
[4] F. Chung, R. Graham, and G. Varghese. Parallelism versus memory allocation in pipelined router forwarding engines. In P. B. Gibbons and M. Adler, editors, SPAA, pages 103–111. ACM, 2004.
[5] J. Corbal, R. Espasa, and M. Valero. Command vector memory systems: High performance at low cost. In PACT ’98: Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, page 68, 1998.
[6] R. Crisp. Direct Rambus technology: The new main memory standard. IEEE Micro, 17(6):18–28, November/December 1997.
[7] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A performance comparison of contemporary DRAM architectures. In ISCA ’99: Proceedings of the 26th annual international symposium on Computer architecture, pages 222–233, 1999.
[8] B. Davis, B. L. Jacob, and T. N. Mudge. The new DRAM interfaces: SDRAM, RDRAM and variants. In ISHPC ’00: Proceedings of the Third International Symposium on High Performance Computing, pages 26–31, 2000.
[9] S. Dharmapurikar and V. Paxson. Robust TCP reassembly in the presence of adversaries. In Proceedings of 14th USENIX Security Symposium, pages 65–80, Baltimore, MD, August 2005.
[10] R. Espasa, M. Valero, and J. E. Smith. Out-of-order vector architectures. In MICRO 30: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, pages 160–170, 1997.
[11] W. fen Lin, S. K. Reinhardt, and D. Burger. Reducing DRAM latencies with an integrated memory hierarchy design. In HPCA ’01: Proceedings of the 7th International Symposium on High-Performance Computer Architecture, page 301, 2001.
[12] J. Garcia, J. Corbal, L. Cerda, and M. Valero. Design and implementation of high-performance memory systems for future packet buffers. In MICRO 36: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, page 373, 2003.
[13] M. Gries. A survey of synchronous RAM architectures. Technical Report 71, Computer Engineering and Networks Laboratory (TIK), ETH Zurich, Gloriastrasse 35, CH-8092 Zurich, Apr. 1999.
[14] J. Hasan, S. Chandra, and T. N. Vijaykumar. Efficient use of memory bandwidth to improve network processor throughput. In ISCA ’03: Proceedings of the 30th annual international symposium on Computer architecture, pages 300–313, 2003.
[15] J. Hasan, V. Jakkula, S. Cadambi, and S. Chakradhar. Chisel: A storage-efficient, collision-free hash-based packet processing architecture. In Proceedings of The 33rd Annual International Symposium on Computer Architecture (ISCA 33), Boston, MA, June 2006.
[16] S. I. Hong, S. A. McKee, M. H. Salinas, R. H. Klenke, J. H. Aylor, and W. A. Wulf. Access order and effective bandwidth for streams on a direct Rambus memory. In HPCA ’99: Proceedings of the 5th International Symposium on High Performance Computer Architecture, page 80, 1999.
[17] S. Iyer, R. R. Kompella, and N. McKeown. Designing packet buffers for router linecards. Technical Report TR02-HPNG-031001, Stanford University, Nov. 2002.
[18] E. Jaulmes, A. Joux, and F. Valette. On the security of randomized cbc-mac beyond the birthday paradox limit: A new construction. In FSE ’02: Revised Papers from the 9th International Workshop on Fast Software Encryption, pages 237–251, 2002.
[19] S. Kumar, P. Crowley, and J. Turner. Design of randomized multichannel packet storage for high performance routers. In 13th Annual Symposium on High Performance Interconnects (Hot Interconnects), Palo Alto, CA, August 2005.
[20] T. Lang, M. Valero, M. Peiron, and E. Ayguade. Conflict-free access for streams in multimodule memories. IEEE Transactions on Computers, 44(5):634–646, 1995.
[21] B. K. Mathew, S. A. McKee, J. B. Carter, and A. Davis. Design of a parallel vector access unit for SDRAM memory systems. In Proceedings of the Sixth International Symposium on High-Performance Computer Architecture (HPCA-6), pages 39–48, 2000.
[22] A. Nikologiannis and M. Katevenis. Efficient per-flow queueing in DRAM at OC-192 line rate using out-of-order execution techniques. In Proceedings of the IEEE International Conference on Communications (ICC’2001), pages 2048–2052, Helsinki, Finland, June 2001.
[23] RamBus. RDRAM Memory: Leading Performance and Value over SDRAM and DDR, 2001.
[24] B. R. Rau. Pseudo-randomly interleaved memory. In ISCA ’91: Proceedings of the 18th annual international symposium on Computer architecture, pages 74–83, 1991.
[25] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory access scheduling. In ISCA ’00: Proceedings of the 27th annual international symposium on Computer architecture, pages 128–138, 2000.
[26] Samsung. Samsung Rambus MR18R162GDF0-CM8 512MB 16bit 800MHz datasheet, 2005.
[27] T. Sherwood, G. Varghese, and B. Calder. A pipelined memory architecture for high throughput network processors. In ISCA ’03: Proceedings of the 30th annual international symposium on Computer architecture, pages 288–299, 2003.
[28] P. Shivakumar and N. P. Jouppi. Cacti 3.0: An integrated cache timing, power and area model. Technical Report Western Research Lab (WRL) Research Report, 2001/2.
[29] G. Shrimali and N. McKeown. Building packet buffers with interleaved memories. In Proceedings of Workshop on High Performance Switching and Routing, Hong Kong, May 2005.
[30] J. Truong. Evolution of network memory. Samsung Semiconductor, Inc., March 2005.