A Cluster-based, Scalable Edge Router Architecture

Prashant Pradhan        Tzi-Cker Chiueh
Computer Science Department
State University of New York at Stony Brook

Abstract

One of the major challenges in designing computationally versatile routers, especially routers at the network edge, is to simultaneously provide both high packet forwarding performance and versatile packet processing capabilities. The diverse nature of packet processing dictates that router architectures be based upon general-purpose processors. However, the performance limitations of general-purpose I/O architectures, accruing from limited I/O bus bandwidth, interrupt overheads, and synchronization overheads, limit the level of achievable packet forwarding performance. This paper describes a scalable edge router architecture called Suez that utilizes clustering and the programmability of network interfaces to systematically eliminate these overheads. The first Suez prototype, based upon a cluster of Pentium-II PCs, Lanai network processors, and a …-Gbits/sec Myrinet interconnect, shows that for a single path through a Suez router, this prototype can achieve a byte throughput of … Mbits/sec for large best-effort packets (… bytes), a packet throughput of … Kpackets/sec for small best-effort packets (… bytes), a real-time byte throughput of … Mbits/sec, and a packet throughput of … Kpackets/sec for small real-time packets (… bytes). The resulting system provides a high-performance substrate over which a general computation framework can be efficiently placed.

1 Introduction

The traditional model of data networks is based upon a simple routing and forwarding service implemented in the network, whereas all complexity is pushed to end-to-end algorithms implemented in end hosts. While this model has been very successful in enabling a rich suite of network-based applications, and in maintaining network stability through end-to-end congestion control algorithms, it is being realized [1] that placing functionality in the interior of the network can provide significant performance benefits [4, 2] as well as enable new classes of useful applications [3]. In the hierarchy of network routers, the design principle is to keep the backbone routers simple and fast, and to push complexity to lower levels in the hierarchy. At the leaves of this hierarchy, since the traffic load is less intensive, computation can be placed without sacrificing forwarding performance. However, edge routers must simultaneously provide high packet forwarding performance as well as rich packet processing functionality. Consequently, edge routers become I/O-intensive as well as compute-intensive systems. The following considerations guide the design of an ideal edge router architecture:

1. Because the functionality to be placed in an edge router may be fairly diverse, its high-level packet processing hardware should be based upon a general-purpose instruction set and should expose high-level programming abstractions.

2. It should be possible to scale the performance of an edge router architecture linearly with the addition of extra processing hardware and switching capacity.

3. New network functions should be allowed to be added to an edge router efficiently as well as safely, i.e. without compromising system integrity.

The last item, which calls for the provision of composable and safely extensible computation frameworks
in edge routers, is not the focus of this paper and is discussed elsewhere (see [10, 8]). In this paper, we address the design of a system architecture and efficient low-level system software on top of which such a computational framework can be efficiently placed.

General-purpose hardware and software architectures, as exemplified by PCs and general-purpose operating systems, typically treat I/O as an exception condition and are not optimized for I/O-intensive systems like routers. This limitation manifests as per-packet interrupt and synchronization overheads, and bus arbitration overheads due to shared-bus I/O architectures. General-purpose PCs also suffer from relatively limited I/O bandwidth, which limits their applicability to high-end edge router designs. In this paper, our goal is to show that with efficient datapath algorithms and streamlined system software, PC clustering hardware can be used as an effective platform for building scalable and extensible edge routers. Towards this end, we describe the design, implementation, and evaluation of a scalable edge router architecture called Suez. The basic building block of the architecture is a router node, which consists of a single general-purpose processor and a set of programmable network interfaces. The requirement on these interfaces is that they provide a low-end processor (henceforth called the network processor), a small amount of memory, and a DMA engine over which the network processor has direct control. Such router nodes are interconnected through a high-speed, switch-based system area interconnect that serves as the router's system backplane.

The main system design principle of Suez is to decouple packet forwarding from packet computation whenever possible, but provide an asynchronous linkage between them. The system software is split between the network processors and the general-purpose processors, providing the key mechanism for this decoupling. Packets that do not require high-level processing are handled almost entirely by the system software running in the network processors. These processors perform the appropriate route lookup and packet classification, and forward packets to their respective output interfaces in a pipelined fashion, the pipeline being co-ordinated by the network processors lying in a packet's datapath. Redundant copying is avoided by using peer-to-peer DMA operations whenever possible. Packets requiring additional computation are asynchronously posted to the general-purpose CPU using lock-free queues, thus avoiding synchronization and interrupt overheads. The scalability of this architecture results from both the number of router nodes and the total switching capacity of the router backplane being incrementally expandable. The current Suez prototype is based upon Pentium-II PCs acting as the router nodes, a 10-Gbps Myrinet switch as the system backplane, and Myrinet network interfaces with Lanai network processors.

The main contribution of this work is a set of architectural and implementation techniques that we developed to construct scalable and extensible edge routers based on PC clusters. Although PC clusters have been used in the context of parallel computing, web proxies, firewalls, and media gateways, the architectural trade-offs of applying PC clustering hardware to network packet forwarding and computation remain largely uninvestigated.

The rest of this paper is organized as follows. Section 2 first presents the high-level system architecture. Section 3 then briefly describes the datapath primitives: the routing table lookup algorithm, the link scheduling algorithm, and the asynchronous data forwarding mechanism. The cluster-specific implementation details of our prototype are described in Section 4. Section 5 presents the results of a performance evaluation study of this prototype. Finally, we conclude with a summary of the main research results and an outline of on-going work.

2 System Architecture

Figure 1 shows the architecture of a 4-node, 12-port router. On each router node there is one CPU and several network interfaces, with one network processor per interface. Since the overall system consists of a set of router nodes, one interface (henceforth called the internal interface) on every node connects the node to the system interconnect. All the other interfaces on a node act as external interfaces, and connect the router to the rest of the network. Hence, the fan-out of the router is the total number of external interfaces on the router nodes. The typical path that a packet takes is from an input interface of an ingress node, through the internal interface of the ingress node, over the router backplane, into the internal interface of the

egress node, and eventually out through an output interface of the egress node.

Figure 1. An instance of the Suez architecture consisting of 4 nodes and 12 ports. Internal interfaces connect router nodes to a high-speed interconnect, whereas the external interfaces connect Suez to the rest of the network.

The first step in a packet's datapath is to identify how to process the packet, which is done by packet classification (routing table lookup and real-time flow identification are special cases of packet classification). If the packet is a real-time or best-effort packet requiring no further computation, it must be handed directly to the link scheduler on the appropriate output interface. However, if it needs to be processed by a function in the CPU, it must be handed over to the appropriate CPU function. In our architecture, all packet handling requests are processed asynchronously; that is, packets are posted to queues and picked up later for processing by an appropriate scheduling mechanism. In the case of packet output, the link scheduler decides the packet's departure time; in the case of high-level packet processing, a CPU scheduler determines when to invoke a packet processing function (see [8]). Hence, in our architecture, packet forwarding essentially reduces to posting data to an appropriate queue. This operation is thus the next step in the datapath. We shall describe how it can be implemented in an interrupt-free and lock-free manner. Finally, the last step in a packet's datapath is link scheduling, where data is sent from flows in accordance with their delay and bandwidth requirements. We describe these datapath operations in some detail in the following sections.

3 Datapath Primitives

3.1 Routing Table Lookup and Packet Classification

A classification rule is a condition expressed in terms of the values of packet header fields. If a packet's header fields satisfy the condition in the rule, the packet belongs to the rule's class. A class, in turn, implies some processing that the packet is bound to. For example, in a router acting as a firewall, a classification rule may have the condition "destination address = X AND source address = Y", and packets matching this rule may be processed by forwarding or dropping them.

In general, classification involves matching several packet header fields against values or ranges of values specified in the rules. However, classification on several fields can be reduced to multiple instances of classification on a single field [11]. We shall discuss only the single-field case here, using routing table lookup as an example. Routing table lookup can be mapped to the following problem [6]: we are given several IP address ranges, each corresponding to an output interface, and we intend to find which address range a packet's destination address lies in. For this purpose, instead of using full-blown search data structures within the network processor, we use an efficient caching algorithm [6] to reuse the results of lookup computation. There are two key ideas in the caching algorithm. First, instead of caching individual addresses, we cache address ranges, thus improving the cache's coverage of the IP address space. Second, since we are interested only in finding the output interface, we should merge any adjacent address ranges that have the same output interface, thus yielding a smaller number of larger ranges. To this end, we choose the set of bits used to index into the cache in such a way that even non-contiguous address ranges can be merged into larger ranges. The resulting caching scheme gives high hit rates, as observed for a packet trace collected from a real-world edge router [6]. The network processors maintain a classifier cache in their memory and use this caching algorithm to perform classification for incoming packets.
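The range-caching idea can be made concrete with a short sketch. The cache geometry, data layout, and merging loop below are our own illustrative assumptions, not Suez's actual data structures; the point is simply that a miss consults the full range table, widens the matched range by merging adjacent ranges with the same output interface, and caches the widened range.

```c
/* Sketch of a range-based route cache (Section 3.1). Illustrative only. */
#include <stdint.h>

#define CACHE_BITS 8
#define CACHE_SIZE (1 << CACHE_BITS)

struct range_entry {
    uint32_t start, end;   /* inclusive address range           */
    int      out_if;       /* output interface; -1 = invalid    */
};

/* Cache indexed by the high-order bits of the destination address. */
static struct range_entry cache[CACHE_SIZE];

void route_cache_init(void)
{
    for (int i = 0; i < CACHE_SIZE; i++)
        cache[i].out_if = -1;
}

/* Full routing table: sorted, non-overlapping address ranges. */
struct route { uint32_t start, end; int out_if; };

/* Full lookup; widens the matched range by merging neighbors that
   map to the same output interface, so the cached entry covers as
   many addresses as possible. */
static int slow_lookup(const struct route *tbl, int n, uint32_t dst,
                       uint32_t *start, uint32_t *end)
{
    for (int i = 0; i < n; i++) {
        if (dst >= tbl[i].start && dst <= tbl[i].end) {
            uint32_t s = tbl[i].start, e = tbl[i].end;
            int j = i;
            while (j > 0 && tbl[j-1].out_if == tbl[i].out_if &&
                   tbl[j-1].end + 1 == tbl[j].start) { j--; s = tbl[j].start; }
            j = i;
            while (j < n-1 && tbl[j+1].out_if == tbl[i].out_if &&
                   tbl[j].end + 1 == tbl[j+1].start) { j++; e = tbl[j].end; }
            *start = s; *end = e;
            return tbl[i].out_if;
        }
    }
    return -1;
}

int route_lookup(const struct route *tbl, int n, uint32_t dst)
{
    struct range_entry *e = &cache[dst >> (32 - CACHE_BITS)];
    if (e->out_if >= 0 && dst >= e->start && dst <= e->end)
        return e->out_if;                       /* cache hit  */
    uint32_t s, en;
    int oif = slow_lookup(tbl, n, dst, &s, &en); /* cache miss */
    if (oif >= 0) { e->start = s; e->end = en; e->out_if = oif; }
    return oif;
}
```

Because the cached entry stores the merged range rather than a single address, one miss can convert an entire neighborhood of the IP address space into future hits.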

Classification for packets that miss in the network processor's cache is treated as additional computation. Such packets are posted to a classifier function implemented on the general-purpose CPU. The classifier function performs the classification and posts the packet back to the datapath to be forwarded.

Classification of real-time flows is simpler and requires only a lookup into a flat mapping table to translate flow ids to output queue ids.

Figure 2. An example of DFQ operation. The scheduler multiplexes quanta from three eligible flows and sends them to a downstream router, which demultiplexes them using a run-length encoded bitmap. The instantaneous bandwidth reservation for flow f4 is also depicted.
                                                                  flows and sends them to a downstream router
3.2 Output Link Scheduling

Scheduling an output link to ensure that the performance guarantees of flows contending for the link are satisfied is a well-studied problem in the networking literature [5]. A well-accepted scheduling model is based on the notion of weighted fluid fairness [9]. In our system, we use a discretized variant of a fluid fair scheduler [7]. A discretized fair queueing (DFQ) scheduler maintains fairness at a fixed-granularity time scale, rather than at an infinitesimally small time scale as in an ideal fluid scheduler. Given a chosen time granularity T, it can be proved [7] that the differences between FFQ and DFQ, in terms of per-hop and end-to-end delay bounds, are proportional to T, but are independent of the number of real-time connections at each hop. This is an important result because the deviation from the ideal fluid bounds is constant, even though the implementation of DFQ is much simpler than schemes that emulate an ideal fluid scheduler.

In our architecture, the link scheduler runs on the main CPU of each node, because the network interfaces do not have sufficient memory to maintain a large number of per-connection output queues, and memory accesses between the main CPU and the network processors are expensive.

Details of the link scheduling algorithm, and of a reservation model that allows decoupling of bandwidth and delay guarantees, can be found in [7]. We shall only describe some operational details here. The scheduler is cycle-based. For a flow reserving an instantaneous bandwidth of r, the scheduler serves an amount rT of data from the flow in every cycle. We call rT the quantum of the flow. In every cycle, the scheduler retrieves a quantum's worth of data from every eligible flow to form a transmission batch that is forwarded over the link, as shown in Figure 2. Accompanying each transmission batch is a run-length encoded bitmap that indicates the flows that contributed a quantum to the batch. Operationally, neighboring routers need to agree on a simple multiplexing/demultiplexing protocol for the connecting link. Given the bitmap, a downstream router can derive the set of flow ids that have a quantum in the batch; a maptable lookup then yields the output flow ids (Figure 2).

3.3 Asynchronous Data Forwarding

All data forwarding operations in the system are enqueuing operations on some queue. For example, when a network processor classifies a packet and finds that it is a real-time packet belonging to some flow f, it must enqueue the packet to flow f's link scheduler queue on some output interface. If classification says that it must be processed by a function in the CPU, the network processor posts the packet to a CPU scheduler queue. Each queue represents a producer-consumer interaction. However, the producer and consumer are independent entities, executing on physically distinct processors. Explicit synchronization between them through interrupts or through a locking mechanism incurs overheads that limit packet throughput. In this section, we describe how the enqueuing operation may be performed without these overheads, using a clever data structure and the programmability of the network processors.
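The per-cycle operation of the DFQ scheduler described in Section 3.2 can be sketched as follows. The flow structure, field names, and eligibility test are our own illustrative assumptions; the sketch only tracks byte counts and the contribution bitmap, not actual packet data or the run-length encoding.

```c
/* Sketch of one DFQ cycle (Section 3.2): each eligible flow with
   reserved rate r contributes up to a quantum of r*T bytes to the
   transmission batch; a bitmap records which flows contributed. */
#include <stdint.h>

#define MAX_FLOWS 32

struct flow {
    double   rate;      /* reserved bandwidth, bytes/sec */
    uint32_t backlog;   /* bytes currently queued        */
};

/* Runs one scheduler cycle of length T seconds and returns the
   bitmap of flows that contributed a quantum to the batch. */
uint32_t dfq_cycle(struct flow *f, int nflows, double T,
                   uint32_t *batch_bytes)
{
    uint32_t bitmap = 0;
    *batch_bytes = 0;
    for (int i = 0; i < nflows; i++) {
        uint32_t quantum = (uint32_t)(f[i].rate * T);
        if (quantum == 0 || f[i].backlog == 0)
            continue;                      /* flow not eligible this cycle */
        uint32_t send = f[i].backlog < quantum ? f[i].backlog : quantum;
        f[i].backlog  -= send;
        *batch_bytes  += send;
        bitmap        |= 1u << i;          /* flow i is in the batch */
    }
    return bitmap;
}
```

A downstream router holding the same flow numbering can walk the bitmap to demultiplex the batch, which is the agreement between neighbors that the text describes.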

Besides data forwarding, instances of producer-consumer interactions arise throughout the system in various contexts. All the producer-consumer interactions in the system are first systematically broken down into one-to-one producer-consumer interactions. For instance, in the case of the interaction between the network processors and the classifier function at a given node, the classifier is a single consumer for multiple producers. The classifier breaks this interaction down by providing one request queue per input interface.

Given that all queues represent a one-to-one producer-consumer interaction, we introduce a simple data structure called a typed queue. The only difference between a typed queue and a normal queue is that each element has a type, which dictates how the element should be processed. Note that a synchronization problem between the producer and the consumer arises when they are concurrently acting on the same queue element. A particular type, called void, is used for synchronization in this case. The consumer makes a non-atomic check for queue emptiness when it consumes the last element of the queue. If this non-atomic check finds the queue empty, the consumer consumes but does not remove the last element from the queue, and changes its type to void. The consumer's action ensures that the producer will always be enqueuing to a non-empty queue. The consumer's check for emptiness needs to be modified as follows: the queue is considered non-empty if and only if it has a non-void element at the head, or it has a void element at the head with a non-void successor. In the latter case, the void element can be safely discarded by the consumer before consuming its successor. It can be shown that modifying the producer and consumer in this way ensures proper synchronization as long as (a) the 'linking in' operation for a produced element is the last step of the enqueuing operation, and (b) the write operation used for linking is atomic. This is true for simple, singly-linked queues.

Although the above mechanism ensures that there is no synchronization problem, there remains the subtler issue of triggering the relevant computation on a synchronization event: e.g., when a queue changes from empty to non-empty, its consumer should be scheduled to run. This problem is solved because all consumers are guaranteed to be invoked within a bounded interval of the event. This is enforced by the CPU scheduler. Details can be found in [8].

4 Prototype Implementation

The current Suez prototype is based on 400-MHz Pentium-II nodes with programmable Myrinet [14] interfaces, connected by a Myrinet switch that has a 10-Gbit/sec backplane. We use Linux as the underlying operating system, though the prototype implementation's dependency on Linux is rather minimal. The Linux driver, which maps Myrinet interface memory and loads the control program, has been modified from the driver provided by the BIP group [15] at INRIA. In this section, we describe critical datapath operations that we have encoded in the control program run by the Lanai network processors on each Myrinet interface.

4.1 Intra-Cluster Packet Movement

4.1.1 Packet Traversal Path

As shown in Figure 3, in the most general case a packet traverses a path from the external interface of the ingress router node, through the ingress node's internal interface and the egress node's internal interface, and eventually to the egress node's external interface. The external interface of the ingress node performs a packet classification operation to decide the queue that the packet should be enqueued to. This queue could be local to the ingress node, or may reside on a different node. We shall consider the general case in which the ingress and egress nodes are different. In this case, the packet is first moved from the ingress node's external interface to its internal interface through a peer-to-peer DMA transaction, and then to the egress node's internal interface through a transfer over the switch. Peer DMA transactions are possible because PCI allows DMA transfers between devices' memory if, to each device, the memory of the other device appears as a block of physical memory in the host's address space. This means that to a Myrinet card, the memory of a peer Myrinet card appears as the host's physical memory, thus allowing DMA to be done between peer cards directly over the I/O bus. The peer-to-peer DMA transaction allows the packet to appear on the PCI bus exactly once, thus improving I/O bandwidth.
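The typed-queue protocol of Section 3.3 can be sketched in a few lines of C. The layout and function names are ours; a real single-producer/single-consumer deployment would additionally need the linking store to be a genuinely atomic word write (e.g. a C11 atomic pointer), as the paper's correctness condition requires.

```c
/* Sketch of the typed queue of Section 3.3: a singly-linked queue whose
   elements carry a type. A 'void' element left at the head lets one
   producer and one consumer cooperate without locks or interrupts. */
#include <stddef.h>

enum elem_type { ELEM_DATA, ELEM_VOID };

struct qelem {
    struct qelem  *next;
    enum elem_type type;
    int            payload;
};

struct typed_queue { struct qelem *head, *tail; };

/* The queue always holds at least one element; start with a void dummy. */
void tq_init(struct typed_queue *q, struct qelem *dummy)
{
    dummy->next = NULL;
    dummy->type = ELEM_VOID;
    q->head = q->tail = dummy;
}

/* Producer: linking in the element is the LAST step, and that single
   pointer write is assumed atomic, so no lock is needed. */
void tq_enqueue(struct typed_queue *q, struct qelem *e)
{
    e->next = NULL;
    e->type = ELEM_DATA;
    q->tail->next = e;    /* atomic word write links the element in */
    q->tail = e;
}

/* Consumer: returns the next data element, or NULL if empty. The last
   element is consumed in place and marked void, never unlinked, so the
   producer always appends to a non-empty list. */
struct qelem *tq_dequeue(struct typed_queue *q)
{
    struct qelem *h = q->head;
    if (h->type == ELEM_VOID) {
        if (h->next == NULL)
            return NULL;              /* truly empty                 */
        q->head = h->next;            /* discard the void placeholder */
        h = q->head;
    }
    if (h->next != NULL) {
        q->head = h->next;            /* normal case: unlink, return */
        return h;
    }
    h->type = ELEM_VOID;              /* last element: consume in place */
    return h;
}
```

Note how the two rules from the text appear directly: the head is "non-empty" if it is non-void, or void with a non-void successor, and the void element is discarded before its successor is consumed.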

Figure 3. A generic packet forwarding path through Suez includes two router nodes and four interfaces. The ingress node's external interface classifies the packet and sends the packet to the egress node by performing a request-response protocol for peer DMA to its internal interface. Misses in the classifier cache at the external interface trigger a request to be posted to the CPU, which services the request and posts it back to the datapath. The egress node's internal interface receives and enqueues the packet for the link scheduler to transmit.

The peer-to-peer transfer proceeds using a request-response protocol. The source interface first writes a request (using peer DMA) to a designated request area in the memory of the internal interface, and then blocks waiting for a notification. The request specifies the location of the data to be transferred and the target queue. The internal interface, which may get multiple such requests from multiple source interfaces, services these requests in a round-robin fashion. Requests are serviced by pulling the requested data from the source interface's memory through peer DMA, and then writing a completion notification, again through peer DMA, to a designated response area in the source interface's memory. The internal interface then initiates a transfer of this data, over the switch, to the desired egress node.

Upon receipt of a packet, the egress node's internal interface enqueues it to an appropriate queue in the node's main memory, based upon the target-queue information that the packet carries with it. Finally, the link scheduler schedules data for transmission by setting up the output interface to perform a gather DMA from node memory out onto the output link.

In the event that the external interface of the ingress node encounters a classifier cache miss, it posts a classification request to the ingress node's classifier function. The classifier function, after serving the request, writes a peer transfer request to the node's internal interface, exactly as above (like all source interfaces, the classifier has its own designated request area in the internal interface's memory). This allows the packet to "flow" back to the forwarding path from the CPU.

4.1.2 Pipelining

The movement of a packet through the forwarding path is heavily pipelined among the network processors and the CPUs lying in the path. Figure 4 shows the nine steps involved in moving a packet from the input link to its corresponding output link. Each step takes a different amount of time. Here again, we utilize the programmability of the network interfaces to elegantly execute the pipeline without any hardcoding of pipeline stage lengths. Pipeline boundaries are enforced by the network processors purely in software, by performing appropriate checks for the completion of the various stages. Note that this also means that for different hardware, with different pipeline stage lengths, the network processor software need not change to correctly set up the pipeline.

In Myrinet interface cards, it is possible to overlap the DMA between host memory and card memory with the transfer between card memory and the link. In our architecture, we extend the scope of pipelining to cover the entire path of a packet through the router, which may involve up to four Myrinet interfaces.

[Figure 4 diagram: nine pipeline stages per packet, labeled 1 wire recv, 2 peer request, 3 peer pull, 4 peer response, 5 wire send and recv, 6 enq to node, 8 gather from node, 9 wire send, with successive packets offset by one pipeline cycle and the critical-path stages marked T.]

  Figure 4. Pipeline boundaries for the four-interface forwarding path in Figure 3. The pipeline’s critical path
  in this example is the combination of stages 6, 7 and 8, and determines the pipeline’s cycle time T. The arrows
  between pipeline stages represent dependencies among individual pipeline stages due to resource contention.
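Since pipeline boundaries are enforced purely in software by checking stage-completion state, the mechanism can be illustrated with a tiny single-threaded simulation. This is a hypothetical sketch (STAGES, PKTS and the driver loop are ours, not the Suez firmware); it only shows that per-packet completion flags are enough to keep stages in order while letting different packets proceed in parallel.

```c
#include <assert.h>

/* Minimal sketch of software-enforced pipeline boundaries: stage s may
 * start on packet p only after stage s-1 has completed on p.  done[s][p]
 * plays the role of the completion flags the network processors poll. */
#define STAGES 3
#define PKTS   4

static int done[STAGES][PKTS];

/* The software boundary check: input ready, and not already processed. */
static int may_run(int s, int p)
{
    if (done[s][p])
        return 0;
    return s == 0 || done[s - 1][p];
}

/* Drive the pipeline to completion and return the number of "time steps".
 * In each step every stage works on at most one packet, mimicking one
 * network processor per stage; decisions are made against the flags from
 * the previous step, then all stages advance together. */
static int run_pipeline(void)
{
    int steps = 0, remaining = STAGES * PKTS;
    while (remaining > 0) {
        int todo[STAGES];
        for (int s = 0; s < STAGES; s++) {   /* decide using current flags */
            todo[s] = -1;
            for (int p = 0; p < PKTS; p++)
                if (may_run(s, p)) { todo[s] = p; break; }
        }
        for (int s = 0; s < STAGES; s++)     /* then advance all stages */
            if (todo[s] >= 0) { done[s][todo[s]] = 1; remaining--; }
        steps++;
    }
    return steps;
}
```

With 3 stages and 4 packets the simulation completes in 3 + 4 - 1 = 6 steps, the usual pipeline fill-plus-drain time, even though no stage ever consults a clock or a hardware handshake.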

which may involve up to four Myrinet interfaces. Consider a flow of packets in which all packets take the same path through the router, traversing a sequence of 4 independent interfaces. The stages of the pipeline in figure 4 execute on different interfaces. Stage 1 involves a wire transfer of the (i+1)-th packet from the link to the interface memory, and can be overlapped with the peerDMA request and the peerDMA pull of the i-th packet by the internal interface. Depending upon which of these operations takes less time, the input interface either waits for the peer notification from the internal interface, or waits for the wire receive of the (i+1)-th packet to finish, before sending the peerDMA request for the (i+1)-th packet and issuing the wire receive of the (i+2)-th packet.
   At the internal interface of the ingress node, the pull of the i-th packet's data is overlapped with the wire transfer of the (i+1)-th packet. Again, depending upon the time taken by each, the internal interface waits for the wire transfer or the pull to complete before pulling the (i+1)-th packet.
   At the egress node, the pipeline synchronization between the internal interface and the output interface is implicit, and is enforced through the finiteness of the host's buffer pool. The internal interface at the egress node posts data to the link scheduler's queues using a free buffer pool allocated by the node. Continuing our pipeline from the internal interface of the ingress node, we see that the wire receive of the (i+2)-th packet can be overlapped with the host DMA receive of the i-th packet. At the output interface, the send DMA of the i-th packet from the scheduler's queue will be overlapped with the wire send of the (i-1)-th packet. Since the internal interface and the output interface do not explicitly synchronize on individual packets, it might seem that there is no synchronization enforced between them. However, two factors implicitly lead to a synchronization. First, the scheduler does not free the packets chosen in a cycle until all data for that cycle has been sent out. Second, the internal interface cannot enqueue data to the host if the host buffer pool is empty. This means that if data arrives at a rate higher than the throughput of the egress node, the buffer pool will deplete and cause the internal interface to refrain from posting more data. Thus, any mismatch between the interface throughputs at the egress node is only transient, and even without explicit coordination between network processors at the egress node, pipeline stages are implicitly synchronized.
   Since some adjacent pipeline stages in figure 4 use the same hardware resources, the pipeline is effectively a 5-stage pipeline rather than a 9-stage one. Specifically, steps 2, 3 and 4 effectively form one pipeline stage because they all need to use the ingress node's PCI bus, and steps 6, 7 and 8 form one pipeline stage because steps 6 and 8 need the egress node's PCI bus. In the case of our hardware platform, the "cycle time" of this pipeline, i.e., the critical path delay, is dictated by the combined delay of the 6-th, 7-th and 8-th steps. However, we again emphasize that for different hardware, the pipeline boundaries will dynamically change, since pipeline boundaries are enforced

by network processor software. Also note that if sufficient memory is available on the Myrinet cards, per-flow queues could be maintained in card memory, and hence the link scheduler, because of its simplicity, could easily be implemented in the network processors. This would balance the pipeline further, reducing its critical path delay.

4.2 Memory and Queue Management

   Buffer pools and all system queues are maintained as typed queues, mentioned in section 3.3. A typed-queue element is a small descriptor that may point to a variable-sized chunk of memory. The descriptor has a type, which specifies how the queue element should be processed. For instance, section 3.3 described how the void type can be used for synchronization. Another example is an autoincrement descriptor, which carries state indicating how much of the data that it points to has been processed, and is used when the quantum sizes for a flow on the input and output links are different. Descriptors are also maintained to perform usage accounting for every chunk of memory allocated in the system. Descriptor types are understood both by network processors and node CPUs.
   In this section, we highlight the use of typed queues by a network processor to perform asynchronous, interrupt-free enqueuing operations. Assume that a network processor receives a packet that must be enqueued to a queue in host memory. A free buffer pool is pre-allocated by the host CPUs for every network interface. This pool is a simple, singly-linked typed queue, with descriptors pointing to free buffers. Assume that the network processor is informed about the location of the "head" of this queue at startup. When data is received, the network processor uses DMA to read those fields in the head descriptor that hold information about the location of the free buffer (say freebuf) and the location of the next queue element (say nxt).3 Using this information, the network processor first updates the location of the new "head" of the free buffer pool to nxt. The received data is then DMAed into freebuf. To enqueue this data to some queue, a descriptor must be created to guard the data. This descriptor is created in card memory and is set up to point to freebuf. The descriptor itself is then also DMAed to the free buffer, after the received data. Now, this descriptor must be "linked in" to some queue. For this purpose, the card must know the location of the tail descriptor of the target queue. The newly created descriptor can then be "linked in" by writing the location of the new descriptor into the target queue's tail descriptor using a short DMA. The newly enqueued descriptor now becomes the tail descriptor of the target queue, so the network processor locally updates this information for future enqueuing operations. The consumer of the queue eventually consumes this element and returns the used buffer to the free buffer pool. The entire operation involves no locks and no interrupts.

   3 These fields are contiguous in the descriptor data structure, to avoid multiple DMA reads.

4.3 Integrated Packet Queuing and Scheduling

   Recall from section 3.2 that in every cycle, the DFQ scheduler picks a quantum from various flows and sends them out as a batch. This means that in a path of routers with DFQ schedulers, batches of quanta will be received by the routers. However, enqueuing each quantum to its corresponding per-flow queue would incur the overhead of a short DMA for every quantum, which in turn would degrade packet throughput.
   This per-quantum enqueuing cost can, however, be eliminated by a more efficient queuing and scheduling algorithm that incurs only the overhead of a single DMA per batch, but maintains the constant scheduling overhead of DFQ. This is done by using a consume-and-thread algorithm, whose working is illustrated in figure 5. Batches of quanta are enqueued to a single queue, which we call the "horizontal" queue. In every cycle, the DFQ scheduler picks quanta from those flows which are eligible for sending according to the scheduling algorithm. For example, in figure 5, the scheduler picks flows f1 and f20 in the first cycle. The quanta that are not picked in a cycle must be "carried over" to the next cycle. To do this, a quantum may need to be horizontally as well as vertically enqueued. In the example of figure 5, after the first cycle, the quantum for flow f10 needs to be carried over. Since there is also a quantum for flow f10 in the next batch, this quantum needs to be horizontally threaded with its successor. Moreover, in the second cycle the scheduler must visit flow f10 between visits to flow f1 and flow

f25. Hence, this quantum should also be "vertically" threaded with the next batch, between the quanta for flows f1 and f25. The amortized cost of these threading operations is O(1) per flow per cycle, thus maintaining the constant per-flow scheduling overhead of DFQ, while the overhead of multiple short DMAs by the network processor is completely eliminated.

[Figure 5 diagram: a horizontal queue of batches (f1, f10, f25) and (f1, f10, f20, f30); the scheduler selects f1 and f20, leaving (f1, f10, f25) with the carried-over f10 threaded in.]
Figure 5. An example 2-cycle run of the consume-and-thread algorithm. Unsent quanta of a batch are threaded both vertically and horizontally into the next batch, which is to be serviced by the link scheduler subsequently.

5 Performance Evaluation

   Performance measurements have been made on the current Suez prototype, which consists of 400-MHz Pentium-II PCs as router nodes, each of which hosts two Myrinet interfaces, one as the internal and the other as the external interface. The network interface from Myrinet [14] has a Lanai 4.X network processor. The system interconnect that links router nodes together is a Myrinet switch providing full-duplex 1.28-Gbps port-to-port bandwidth. All the reported results are based on measurements from a single packet forwarding path between two router nodes, and thus four network interfaces. Because the router backplane supports significantly higher bandwidth than what can be saturated by individual paths, the aggregate performance of a Suez router is N times the measurements reported below, where N is the number of disjoint pairs of router nodes. Two other PCs are used as the source and sink hosts of the packet path, to drive traffic into and receive packets from the Suez prototype. Byte and packet throughputs are measured by sending packets back to back from the source to the sink. The inter-packet gap is measured at the sink, and if it is within a small percentage of the inter-packet gap at the source, the source further reduces the inter-packet gap. This adjustment continues until Suez's throughput saturates and no further reduction in the sender-side inter-packet gap is possible. All measurements have been taken using the cycle-count register on the Pentium-II processors of the source and sink hosts; in this setup each cycle is worth 2.5 nsec.

5.1 Throughput Results

   Non-FIFO link scheduling such as DFQ ensures that the output link bandwidth is shared among competing flows based on their performance requirements, but may incur additional scheduling overhead. The scheduling overhead of DFQ is due to the multiplexing and demultiplexing operations required for quantum-size rather than packet-size transmission. Figures 6 and 7 show the differences in byte throughputs and packet throughputs between FIFO link scheduling and DFQ link scheduling as the packet size varies. For DFQ link scheduling, we also vary the quantum size, which is the unit of "discretization" as compared to Fluid Fair Queuing. For a given quantum size, we measure the router throughputs only for packet sizes that are larger than the quantum size.
   For FIFO scheduling, the byte throughput increases with the packet size, because each DMA transaction at the sender side amortizes its per-transaction overhead over a larger packet and is therefore more efficient. For DFQ scheduling, the byte throughput is independent of the packet size but depends on the quantum size, because the size, and thus the efficiency, of each DMA transaction in DFQ is determined by the quantum size, regardless of the packet size. The larger the quantum size, the more efficient DFQ's DMA transactions are. Since, while receiving, DFQ only needs to perform a single DMA transaction for a batch of quanta, whereas FIFO requires one DMA transaction for each independent packet, the FIFO scheduler shows a lower byte throughput than DFQ even when DFQ's quantum size is the same as the packet size. However, FIFO continues to exploit the increasing packet size to improve the DMA transactions' efficiency at the sender end, and eventually out-performs all DFQ instances

[Figure 6 plot: Throughput (Mbits/s) vs. packet size (log scale, bytes); curves NRT and RT with Qsize = 16, 32, 64, 128.]
Figure 6. Throughputs in bytes/sec for FIFO and DFQ schedulers with varying packet size and quantum size.

[Figure 7 plot: Pkt Rate (in 10 Kpkts/s) vs. packet size (log scale, bytes); curves RT with Qsize = 16, 32, 64, 128.]
Figure 7. Throughputs in packets/sec for FIFO and DFQ schedulers with varying packet size and quantum size.
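The amortization argument behind these curves can be made concrete with a toy cost model. The constants below are assumptions for illustration only, not measured Suez numbers: each DMA transaction pays a fixed setup cost plus a per-byte cost, so the effective throughput of the transfer unit (the packet for FIFO, the quantum for DFQ) grows with its size.

```c
#include <assert.h>

/* Back-of-the-envelope DMA amortization model (our own illustration).
 * Assumed costs: a fixed per-transaction setup time, plus a per-byte
 * transfer time. */
static const double SETUP_US  = 2.0;   /* assumed setup cost, microseconds */
static const double US_PER_KB = 8.0;   /* assumed transfer cost per KB     */

/* Effective throughput in KB/s when data moves in `unit_bytes`-sized
 * transactions: unit size divided by the time one transaction takes. */
static double throughput_kbps(double unit_bytes)
{
    double us_per_unit = SETUP_US + (unit_bytes / 1024.0) * US_PER_KB;
    return (unit_bytes / 1024.0) / (us_per_unit / 1e6);
}
```

Under this model FIFO's byte throughput climbs with packet size toward the per-byte limit, while DFQ's is pinned at throughput_kbps(quantum size) whatever the packet size, which is exactly the qualitative shape of Figure 6.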

with a fixed quantum size, in terms of byte throughput.
   For the FIFO scheduler, as the packet size increases, the byte throughput increases but the number of packets transmitted in a unit of time decreases; the overall net effect is that the packet throughput decreases. For the DFQ scheduler, only the number of packets transmitted in a unit of time decreases with increasing packet size, while the byte throughput remains unchanged. Therefore, the slope of the decrease in packet throughput for DFQ is steeper than that for FIFO.
   The byte and packet throughput differences between DFQ and FIFO represent the cost of real-time link scheduling. As shown in Figures 6 and 7, this cost is relatively small for packet sizes smaller than or equal to 1000 bytes. For even smaller packet sizes, this cost is actually negative, because batched receives in DFQ improve the DMA efficiency. In addition to its low scheduling overhead compared to FIFO scheduling, DFQ is also more scalable, in that its per-flow scheduling overhead does not depend on the number of real-time connections that share the same output link, as is the case for other real-time link schedulers based on packetized weighted fair queuing. This is shown in figure 8, where we see, for 128-byte packets and varying quantum sizes, that the packet rate for DFQ stays almost constant with the number of real-time connections.

5.2 Latency Results

   Latency is an orthogonal dimension, besides throughput, along which to evaluate system performance. Latency measurements also show how effective the pipelined datapath implementation is, since overlap between pipeline stages reduces the critical pipeline cycle time. In figure 4, the combined latency of stages 1 through 5 is the latency at the ingress node, whereas the combined latency of stages 5 through 9 is the latency at the egress node. Table 1 shows the effectiveness of the pipelined implementation: the pipeline cycle time in each case is less than the total latency, as well as less than the latency of the bottleneck node (viz., the egress node in this case). These numbers correspond to batched transfers of 32-quanta batches. The amount by which the pipeline cycle time is shorter reflects the amount of overlap that has been gained by the pipelined operation of the datapath.
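One rough way to quantify this overlap (a gauge we introduce here, not a metric from the measurements themselves) is the ratio of the sum of the per-node latencies to the measured cycle time; for the 16-byte row of Table 1 this is (141.45 + 198.24) / 142.48, about 2.38, i.e., the pipeline keeps well over two stages' worth of work in flight.

```c
#include <assert.h>

/* Overlap gauge from Table 1 (our own reading of the data): without any
 * overlap a batch would take ingress + egress latency back to back; the
 * measured cycle time is much smaller, and their ratio estimates how much
 * work the pipeline keeps in flight concurrently. */
static double overlap_gain(double ingress_ms, double egress_ms,
                           double cycle_ms)
{
    return (ingress_ms + egress_ms) / cycle_ms;  /* > 1 means stages overlap */
}
```

Applying it across all four rows of Table 1 gives gains between roughly 2.2 and 2.5, consistent with the claim that the cycle time is always below even the bottleneck node's latency.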

[Figure 8 plot: Pkt Rate (in 10 Kpkts/s) vs. number of RT connections (0 to 4000); curves Qsize = 16 and Qsize = 32.]
Figure 8. The constant scheduling overhead of DFQ allows Suez's packet rate to remain unaffected by increasing numbers of real-time connections.

[Figure 9 plot: Pkt Rate (in 10 Kpkts/s) vs. quantum size (0 to 150 bytes); curves for consume-and-thread and per-connection queueing.]
Figure 9. The performance improvement in packet rate from the consume-and-thread algorithm, which avoids unnecessary DMAs, decreases with increasing quantum size.
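The consume-and-thread step that figure 9 evaluates can be sketched in a few lines. This is a simplified illustration of the section 4.3 description, not the actual scheduler: it shows only the "vertical" threading of unsent quanta into the next batch (the real algorithm also threads a carried-over quantum "horizontally" to its same-flow successor), and all identifiers are our own.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified consume-and-thread sketch (illustrative, not the Suez code). */
struct quantum {
    int flow;
    struct quantum *next;        /* vertical (scheduling-order) link */
};

/* One scheduling cycle: consume the quanta of eligible flows from `batch`,
 * and thread the rest, in order, onto the front of `*carry` (the next
 * batch).  Each carried quantum costs O(1) pointer writes, instead of one
 * short DMA per quantum. */
static int consume_and_thread(struct quantum *batch,
                              int (*eligible)(int flow),
                              struct quantum **carry)
{
    int sent = 0;
    struct quantum *kept_head = NULL, *kept_tail = NULL;
    for (struct quantum *q = batch; q != NULL; ) {
        struct quantum *next = q->next;
        if (eligible(q->flow)) {
            sent++;                       /* goes out in this cycle's batch DMA */
        } else {                          /* carry over: vertical threading */
            q->next = NULL;
            if (kept_tail) kept_tail->next = q; else kept_head = q;
            kept_tail = q;
        }
        q = next;
    }
    if (kept_tail) { kept_tail->next = *carry; *carry = kept_head; }
    return sent;
}
```

Run on the figure 5 example, where the first batch is (f1, f10, f25) and f1 and f20 are eligible, the unsent quanta f10 and f25 end up threaded, in order, ahead of the next batch.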

   To reduce the delay of the enqueue-to-node pipeline stage (stage 6) in Figure 4, the consume-and-thread algorithm was developed to avoid the per-quantum enqueuing overhead. Figure 9 shows that this optimization improves the overall packet throughput significantly, especially for small packets. When the packet size is large, the overheads of short DMAs are overshadowed by the per-byte data transfer cost.

  Quantum       Ingress Node   Egress Node   Pipeline Cycle
  Size (bytes)  Latency (ms)   Latency (ms)  Time (ms)
      16           141.45         198.24        142.48
      32           152.66         213.07        148.19
      64           176.61         237.99        163.25
     128           222.13         310.49        246.98

Table 1. Latency measurements (in msec) on the ingress and egress nodes for real-time data. The ingress node latency is the combined latency of pipeline stages 1 through 5 in figure 4, and the egress node latency is the combined latency of stages 5 through 9. The last column shows the pipeline cycle time for the entire operation, which is always less than the total latency, due to overlap between pipeline stages.

5.3 Routing Lookup Overhead

   In this section, we evaluate the performance of routing lookup in our prototype. When a network processor on a Suez node's external interface fails to classify an incoming packet due to a cache miss, it posts a request to its associated CPU, which then services the request and posts the packet back to the forwarding path. We measure the hit access time and the miss penalty of a lookup operation. In our case, a lookup resulting in a hit takes about 380 Pentium cycles. The Lanai processor's cycle time is worth approximately 12 cycles of the Pentium, which implies that the lookup operation takes about 32 Lanai cycles. The miss penalty is the extra overhead that has to be paid for posting data to the

classifier, rather than directly sending it to the internal interface via peer DMA, and was measured directly on the prototype.
   Using a 1024-entry cache in the network processor, and a real-world packet trace used in [6], we measured the cache hit ratio to be 92.3%, which gives a small average routing-table lookup overhead. If the routing-table lookup is the bottleneck, a packet rate of over 120 Kpackets/sec can be sustained for this trace. If we do a macro throughput measurement for the trace over the complete datapath through the router, data transfer turns out to be the bottleneck, and the resulting throughput is correspondingly lower for small packets.

6 Conclusion

   The contribution of this work is the development of a scalable edge router architecture, and its realization and performance evaluation on a functional prototype. We have developed a decoupled system architecture that cleanly separates the packet forwarding and packet computation paths. Using the programmability of the router's network interfaces, the datapath functionality has been cleanly divided between general-purpose CPUs and network processors, and efficient techniques have been used to systematically eliminate interrupt, synchronization and redundant-I/O overheads. The architecture uses clustering to achieve scalability, allowing it to overcome the limitations of general-purpose PC hardware. Unique algorithmic features, like an efficient route caching scheme and an efficient discretized fair queuing link scheduler, provide efficient datapath primitives for the system. We are currently developing a safely extensible computational framework on top of this architecture, to produce a complete system working as a versatile, high-performance edge router.

References

[1] J. Smith, K. Calvert, S. Murphy, H. Orman, L. Peterson; "Activating Networks"; IEEE Computer, April 1999.
[2] G. Apostolopoulos, D. Aubespin, V. Peris, P. Pradhan, D. Saha; "Design, Implementation and Performance of a Content-Based Switch"; Proc. IEEE Infocom 2000.
[3] E. Amir, S. McCanne, R. H. Katz; "An Active Service Framework and its Application to Real Time Multimedia Transcoding"; Proc. ACM SIGCOMM 1998.
[4] P. Pradhan, T. Chiueh, A. Neogi; "Aggregate TCP Congestion Control Using Multiple Network Probing"; to appear, Proc. ICDCS 2000.
[5] H. Zhang; "Service Disciplines For Guaranteed Performance Service in Packet-Switching Networks"; Proceedings of the IEEE, 83(10), Oct 1995.
[6] P. Pradhan, T. Chiueh; "Cache Memory Design for Network Processors"; Proc. IEEE HPCA 2000.
[7] P. Pradhan, T. Chiueh; "Discretization in Fluid Fairness: Formulation and Implications"; ECSL Technical Report (http://www.cs.sunysb.edu/~prashant/docs/fq.ps.gz).
[8] P. Pradhan, T. Chiueh; "A Computation Framework for an Extensible Network Router"; ECSL Technical Report (http://www.cs.sunysb.edu/~prashant/docs/suezos.ps.gz).
[9] A. K. Parekh, R. G. Gallagher; "A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Multiple Node Case"; IEEE/ACM Transactions on Networking, 2(2), April 1994, pp. 137-150.
[10] T. Chiueh, G. Venkitachalam, P. Pradhan; "Integrating Segmentation and Paging Protection for Safe, Efficient and Transparent Software Extensions"; Proc. ACM SOSP 1999.
[11] T.V. Lakshman, D. Stiliadis; "High Speed Policy-based Packet Forwarding Using Efficient Multi-dimensional Range Matching"; Proc. ACM SIGCOMM 1998.
[12] L. Peterson, S. Karlin, K. Li; "OS Support for General-Purpose Routers"; Proc. IEEE HotOS 1999.
[13] P. Pradhan, T. Chiueh; "Operating Systems Support for Programmable Cluster-Based Internet Routers"; Proc. IEEE HotOS 1999.
[14] Myricom Inc.; "LANai4.X Specification"; development/LANai4.X.doc.txt.
[15] "The BIP Project, RESAM Laboratory"; (http://resam.univ-lyon1.fr/index_bip.html).

