A Cluster-based, Scalable Edge Router Architecture

Prashant Pradhan    Tzi-Cker Chiueh
Computer Science Department
State University of New York at Stony Brook

Abstract

One of the major challenges in designing computationally versatile routers, especially routers at the network edge, is to simultaneously provide both high packet forwarding performance and versatile packet processing capabilities. The diverse nature of packet processing dictates that router architectures be based upon general-purpose processors. However, the performance limitations of general-purpose I/O architectures, accruing from limited I/O bus bandwidth, interrupt overheads and synchronization overheads, limit the level of achievable packet forwarding performance. This paper describes a scalable edge router architecture called Suez that utilizes clustering and the programmability of network interfaces to systematically eliminate these overheads. The first Suez prototype, based upon a cluster of Pentium-II PCs, Lanai network processors, and a 10-Gbits/sec Myrinet interconnect, shows that for a single path through a Suez router, this prototype can achieve a byte throughput of … Mbits/sec for large best-effort packets (… bytes), a packet throughput of … Kpackets/sec for small best-effort packets (… bytes), a real-time byte throughput of 132 Mbits/sec, and a packet throughput of … Kpackets/sec for small real-time packets (… bytes). The resulting system provides a high-performance substrate over which a general computation framework can be efficiently placed.

1 Introduction

The traditional model of data networks is based upon a simple routing and forwarding service implemented in the network, whereas all complexity is pushed to end-to-end algorithms implemented in end hosts. While this model has been very successful in enabling a rich suite of network-based applications, and in maintaining network stability through end-to-end congestion control algorithms, it is being realized that placing functionality in the interior of the network can provide significant performance benefits [4, 2] as well as enable new classes of useful applications.

In the hierarchy of network routers, the design principle is to keep the backbone routers simple and fast, and to push complexity to lower levels in the hierarchy. At the leaves of this hierarchy, since the traffic load is less intensive, computation can be placed without sacrificing forwarding performance. However, edge routers must simultaneously provide high packet forwarding performance as well as rich packet processing functionality. Consequently, edge routers become I/O-intensive as well as compute-intensive systems. The following considerations guide the design of an ideal edge router architecture:

1. Because the functionality to be placed in an edge router may be fairly diverse, its high-level packet processing hardware should be based upon a general-purpose instruction set and should expose high-level programming abstractions.

2. It should be possible to scale the performance of an edge router architecture linearly with the addition of extra processing hardware and switching capacity.

3. New network functions should be allowed to be added to an edge router efficiently as well as safely, i.e., without compromising system integrity.

The last item, which calls for the provision of composable and safely extensible computation frameworks in edge routers, is not the focus of this paper and is discussed elsewhere (see [10, 8]). In this paper, we address the design of a system architecture and efficient low-level system software on top of which such a computational framework can be efficiently placed.

General-purpose hardware and software architectures, as exemplified by PCs and general-purpose operating systems, typically treat I/O as an exception condition and are not optimized for I/O-intensive systems like routers. This limitation manifests as per-packet interrupt and synchronization overheads, and bus arbitration overheads due to shared-bus I/O architectures. General-purpose PCs also suffer from relatively limited I/O bandwidth, which limits their applicability to high-end edge router designs. In this paper, our goal is to show that, with efficient datapath algorithms and streamlined system software, PC clustering hardware can be used as an effective platform for building scalable and extensible edge routers. Towards this end, we describe the design, implementation, and evaluation of a scalable edge router architecture called Suez. The basic building block of the architecture is a router node, which consists of a single general-purpose processor and a set of programmable network interfaces. The requirement from these interfaces is that they provide a low-end processor (henceforth called the network processor), a small amount of memory, and a DMA engine over which the network processor has direct control. Such router nodes are interconnected through a high-speed, switch-based system area interconnect that serves as the router's system backplane.

The main system design principle of Suez is to decouple packet forwarding from packet computation whenever possible, but provide an asynchronous linkage between them. The system software is split between the network processors and the general-purpose processors, providing the key mechanism for this decoupling. Packets that do not require high-level processing are handled almost entirely by the system software running in the network processors. These processors perform appropriate route lookup and packet classification, and forward packets to their respective output interfaces in a pipelined fashion, the pipeline being coordinated by the network processors lying in a packet's datapath. Redundant copying is avoided by using peer-to-peer DMA operations whenever possible. Packets requiring additional computation are asynchronously posted to the general-purpose CPU using lock-free queues, thus avoiding synchronization and interrupt overheads. The scalability of this architecture results from both the number of router nodes and the total switching capacity of the router backplane being incrementally expandable. The current Suez prototype is based upon Pentium-II PCs acting as the router nodes, a 10-Gbps Myrinet switch as the system backplane, and Myrinet network interfaces with Lanai network processors.

The main contribution of this work is a set of architectural and implementation techniques that we developed to construct scalable and extensible edge routers based on PC clusters. Although PC clusters have been used in the context of parallel computing, web proxies, firewalls, and media gateways, the architectural tradeoffs of applying PC clustering hardware to network packet forwarding and computation remain largely uninvestigated.

The rest of this paper is organized as follows. Section 2 first presents the high-level system architecture. Section 3 then briefly describes the datapath primitives; these include the routing table lookup algorithm, the link scheduling algorithm, and the asynchronous data forwarding mechanism. The cluster-specific implementation details of our prototype are described in Section 4. Section 5 presents the results of a performance evaluation study of this prototype. Finally, we conclude with a summary of the main research results and an outline of on-going work.
2 System Architecture

Figure 1 shows the architecture of a 4-node and 12-port router. On each router node, there is one CPU and several network interfaces, with one network processor per interface. Since the overall system consists of a set of router nodes, one interface (henceforth called the internal interface) on every node connects the node to the system interconnect. All the other interfaces on a node act as external interfaces, and connect the router to the rest of the network. Hence, the fan-out of the router is the total number of external interfaces on the router nodes. The typical path that a packet takes is from an input interface of an ingress node, through the internal interface of the ingress node, over the router backplane, into the internal interface of the egress node, and eventually out through an output interface of the egress node.

Figure 1. An instance of the Suez architecture consisting of 4 nodes and 12 ports. Internal interfaces connect router nodes to a high-speed interconnect, whereas the external interfaces connect Suez to the rest of the network.

The first step in a packet's datapath is to identify how to process the packet, which is done by packet classification (routing table lookup and real-time flow identification are special cases of packet classification). If the packet is a real-time or best-effort packet requiring no further computation, it must directly be handed over to the link scheduler on the appropriate output interface. However, if it needs to be processed by a function in the CPU, it must be handed over to the appropriate CPU function. In our architecture, all packet handling requests are processed asynchronously; that is, packets are posted to queues, and picked up later for processing by an appropriate scheduling mechanism: in the case of packet output, the link scheduler decides the packet's departure time; in the case of high-level packet processing, a CPU scheduler determines when to invoke a packet processing function. Hence, in our architecture, packet forwarding essentially reduces to posting data to an appropriate queue. This operation is thus the next step in the datapath. We shall describe how it can be implemented in an interrupt-free and lock-free manner. Finally, the last step in a packet's datapath is link scheduling, where data is sent from flows in accordance with their delay and bandwidth requirements. We describe these datapath operations in some detail in the following sections.

3 Datapath Primitives

3.1 Routing Table Lookup and Packet Classification

A classification rule is a condition expressed in terms of values of packet header fields. If a packet's header fields satisfy the condition in the rule, the packet belongs to the rule's class. A class, in turn, implies some processing that the packet is bound to. For example, in a router acting as a firewall, a classification rule may have a condition "destination address = X AND source address = Y", and packets matching this rule may be processed by forwarding or dropping them. In general, classification involves matching several packet header fields with values or ranges of values specified in the rules. However, classification on several fields can be reduced to multiple instances of classification on a single field, as the sketch below illustrates. We shall discuss only the single-field case here, using routing table lookup as an example.
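The multi-field-to-single-field reduction can be made concrete with a small sketch. The rule layout and field names below are our own illustrative assumptions, not Suez's data structures: each rule stores one range per header field (an exact value X is the range [X, X]), and a packet belongs to a rule's class only when every single-field check succeeds.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative classification rule: one [lo, hi] range per
     * header field of interest. */
    struct rule {
        uint32_t dst_lo, dst_hi;   /* destination address range */
        uint32_t src_lo, src_hi;   /* source address range      */
        uint16_t port_lo, port_hi; /* destination port range    */
        int      class_id;         /* processing implied by the rule */
    };

    static bool in_range32(uint32_t v, uint32_t lo, uint32_t hi) {
        return lo <= v && v <= hi;
    }

    /* Multi-field classification as a conjunction of single-field
     * range checks. First-match-wins is an assumed policy. */
    int classify(const struct rule *rules, int n,
                 uint32_t dst, uint32_t src, uint16_t port) {
        for (int i = 0; i < n; i++) {
            const struct rule *r = &rules[i];
            if (in_range32(dst, r->dst_lo, r->dst_hi) &&
                in_range32(src, r->src_lo, r->src_hi) &&
                port >= r->port_lo && port <= r->port_hi)
                return r->class_id;
        }
        return -1;                 /* no rule matched */
    }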
Routing table lookup can be mapped to the following problem: we are given several IP address ranges, each corresponding to an output interface, and we intend to find which address range a packet's destination address lies in. For this purpose, instead of using full-blown search data structures within the network processor, we use an efficient caching algorithm to reuse the results of lookup computation. There are two key ideas in the caching algorithm. First, instead of caching individual addresses, we cache address ranges, thus improving the cache's coverage of the IP address space. Second, since we are interested only in finding the output interface, we should merge any adjacent address ranges that have the same output interface, thus yielding a smaller number of larger-sized ranges. To this end, we choose a set of bits to index into the cache in such a way that even non-contiguous address ranges can be merged into larger ranges. The resulting caching scheme gives high hit rates, as observed for a packet trace collected from a real-world edge router.

The network processors maintain a classifier cache in their memory and use this caching algorithm to perform classification for incoming packets. Classification for packets that miss in the network processor's cache is treated as additional computation: such packets are posted to a classifier function implemented on the general-purpose CPU. The classifier function performs the classification and posts the packet back to the datapath to be forwarded.

Classification of real-time flows is simpler and requires only a lookup into a flat mapping table to translate flow ids to output queue ids.
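A minimal sketch of a range-based lookup cache follows. The cache size, index-bit choice, and structure layout are our own illustrative assumptions, not the exact scheme used by Suez's network processors; the point is that one entry stores a whole address range, so it answers lookups for every address in that range.

    #include <stdint.h>

    #define CACHE_BITS 10                 /* 1024 sets: assumed size */
    #define CACHE_SETS (1u << CACHE_BITS)

    /* One cached result: an address range [lo, hi] and the output
     * interface that the whole range maps to. */
    struct range_entry {
        uint32_t lo, hi;
        int      out_if;                  /* -1 marks an empty entry */
    };

    static struct range_entry cache[CACHE_SETS];

    void route_cache_init(void) {
        for (uint32_t i = 0; i < CACHE_SETS; i++)
            cache[i].out_if = -1;
    }

    /* Illustrative index function. Suez chooses the index bits so
     * that even non-contiguous ranges with the same output interface
     * can be merged; the bit choice here is only a placeholder. */
    static uint32_t set_index(uint32_t addr) {
        return (addr >> 12) & (CACHE_SETS - 1);
    }

    /* Returns the output interface on a hit, or -1 on a miss (the
     * packet is then posted to the CPU-resident classifier). */
    int route_cache_lookup(uint32_t dst) {
        const struct range_entry *e = &cache[set_index(dst)];
        if (e->out_if >= 0 && e->lo <= dst && dst <= e->hi)
            return e->out_if;    /* any address in the range hits */
        return -1;
    }

    /* After the CPU's full lookup resolves a miss on 'dst', install
     * the enclosing (possibly merged) range for future reuse. */
    void route_cache_fill(uint32_t dst, uint32_t lo, uint32_t hi,
                          int out_if) {
        struct range_entry *e = &cache[set_index(dst)];
        e->lo = lo; e->hi = hi; e->out_if = out_if;
    }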
3.2 Output Link Scheduling

Scheduling an output link to ensure that the performance guarantees of flows contending for the link are satisfied is a well-studied problem in the networking literature. A well-accepted scheduling model is based on the notion of weighted fluid fairness. In our system, we use a discretized variant of a fluid fair scheduler. A discretized fair queueing (DFQ) scheduler maintains fairness at a fixed-granularity time scale, rather than at an infinitesimally small time scale as in an ideal fluid scheduler. Given a chosen time granularity T, it can be proved that the differences between FFQ and DFQ, in terms of per-hop and end-to-end delay bounds, are proportional to T, but are independent of the number of real-time connections at each hop. This is an important result because the deviation from ideal fluid bounds is constant, even though the implementation of DFQ is much simpler than schemes that emulate an ideal fluid scheduler.

In our architecture, the link scheduler runs on the main CPU of each node, because the network interfaces do not have sufficient memory to maintain a large number of per-connection output queues, and memory accesses between the main CPU and the network processors are expensive.

Details on the link scheduling algorithm, and on a reservation model that allows decoupling of bandwidth and delay guarantees, can be found in our technical report on discretized fluid fairness. We shall only describe some operational details here. The scheduler is a cycle-based scheduler. For a flow reserving an instantaneous bandwidth of ρ, the scheduler serves an amount ρT of data from the flow in every cycle. We call ρT the quantum of the flow. In every cycle, the scheduler retrieves a quantum worth of data from every eligible flow to form a transmission batch that is forwarded over the link, as shown in Figure 2. Together with each transmission batch is a run-length encoded bitmap that indicates the flows that contribute a quantum to the batch. Operationally, neighboring routers need to agree on a simple multiplexing/demultiplexing protocol on the connecting link. Given the bitmap, a downstream router can derive the set of flow ids that have a quantum in the batch, followed by a map-table lookup to yield the output flow ids (Figure 2).

Figure 2. An example of DFQ operation. The scheduler multiplexes quanta from three eligible flows and sends them to a downstream router, which demultiplexes them using a run-length encoded bitmap. The instantaneous bandwidth reservation for flow f4 is depicted as ρ.
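The cycle structure lends itself to a compact sketch. The flow bookkeeping, eligibility test, and cycle granularity below are assumed for illustration; this is not Suez's scheduler code. Each cycle serves ρ·T bytes from every eligible flow, appends the quanta to one transmission batch, and run-length encodes the per-flow membership bitmap.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define T_US 1000          /* assumed cycle granularity T (us) */

    struct flow {
        double rho;            /* reserved bandwidth (bytes/us)    */
        size_t backlog;        /* bytes queued for this flow       */
        const uint8_t *data;   /* head of the flow's queued data   */
    };

    /* Stubs standing in for batch assembly and header packing. */
    static void rle_emit(int bit, int run) {
        printf("bitmap run: bit=%d len=%d\n", bit, run);
    }
    static void batch_append(const uint8_t *data, size_t len) {
        (void)data;
        printf("append quantum of %zu bytes\n", len);
    }

    /* One DFQ cycle: serve a quantum rho*T from every eligible flow.
     * Eligibility is simplified here to "a full quantum is queued";
     * advancing flows[f].data past the sent bytes is elided. */
    void dfq_cycle(struct flow *flows, int nflows) {
        int run = 0, run_bit = 0;
        for (int f = 0; f < nflows; f++) {
            size_t quantum = (size_t)(flows[f].rho * T_US);
            int eligible = quantum > 0 && flows[f].backlog >= quantum;
            if (eligible) {
                batch_append(flows[f].data, quantum);
                flows[f].backlog -= quantum;
            }
            /* run-length encode the membership bitmap */
            if (f == 0 || eligible == run_bit) {
                run_bit = eligible; run++;
            } else {
                rle_emit(run_bit, run);
                run_bit = eligible; run = 1;
            }
        }
        if (run > 0) rle_emit(run_bit, run);
    }

The downstream router reverses the process: it walks the run-length encoded bitmap to recover the contributing flow ids, then maps them through its flow table to output flow ids.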
3.3 Asynchronous Data Forwarding

All data forwarding operations in the system are enqueuing operations on some queue. For example, when a network processor classifies a packet and finds that it is a real-time packet belonging to some flow f, it must enqueue the packet to flow f's link scheduler queue on some output interface. In case classification says that it must be processed by a function in the CPU, the network processor posts the packet to a CPU scheduler queue. Each queue represents a producer-consumer interaction. However, the producer and consumer are independent entities, executing in physically distinct processors. Explicit synchronization between them through interrupts or through a locking mechanism incurs overheads that limit packet throughput. In this section, we describe how the enqueuing operation may be performed without these overheads, using a clever data structure and utilizing the programmability of the network processors.

Besides data forwarding, instances of producer-consumer interactions arise throughout the system in various contexts. All the producer-consumer interactions in the system are first systematically broken down into one-to-one producer-consumer interactions. For instance, in the case of the interaction between network processors and the classifier function at a given node, the classifier is a single consumer for multiple producers. The classifier breaks this interaction down by providing one request queue per input interface.

Given that all queues represent a one-to-one producer-consumer interaction, we introduce a simple data structure called a typed queue. The only difference between a typed queue and a normal queue is that each element has a type, which dictates how the element should be processed. Note that a synchronization problem between the producer and consumer arises when they are concurrently acting on the same queue element. A particular type, called void, is used for synchronization in this case. The consumer makes a non-atomic check for queue emptiness when it consumes the last element of the queue. If this non-atomic check finds the queue empty, the consumer consumes but does not remove the last element from the queue, and changes its type to void. The consumer's action ensures that the producer will always be enqueueing to a non-empty queue. The consumer's check for emptiness needs to be modified as follows: the queue is considered non-empty if and only if it has a non-void element at the head, OR it has a void element at the head with a non-void successor. In the latter case, the void element can be safely discarded by the consumer before consuming its successor. It can be shown that modifying the producer and consumer in this way ensures proper synchronization as long as (a) the "linking in" operation for a produced element is the last step of the enqueuing operation, and (b) the write operation used for linking is atomic. This is true for simple, singly-linked queues.

Although the above mechanism ensures that there is no synchronization problem, there remains a subtler issue of triggering relevant computation on a synchronization event; e.g., when a queue changes from empty to non-empty, its consumer should be scheduled to run. This problem is solved since all consumers are guaranteed to be invoked within a bounded interval from the event. This is enforced by the CPU scheduler. Details can be found in the accompanying technical report.
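The void-type protocol can be written down concretely. The sketch below is a single-producer, single-consumer rendering in C11 atomics under our own naming; in Suez the producer is a network processor whose "link-in" write arrives over DMA rather than as a local store, but the same two conditions apply: the link-in is the last step, and it is a single atomic write.

    #include <stdatomic.h>
    #include <stdlib.h>

    enum elem_type { ELEM_VOID = 0, ELEM_PACKET, ELEM_REQUEST };

    struct elem {
        _Atomic(struct elem *) next;  /* the link write is atomic   */
        enum elem_type type;
        void *payload;
    };

    /* 'head' is touched only by the consumer, 'tail' only by the
     * producer, so neither side ever takes a lock. */
    struct typed_queue {
        struct elem *head;
        struct elem *tail;
    };

    void tq_init(struct typed_queue *q) {
        struct elem *s = calloc(1, sizeof *s);  /* void sentinel */
        s->type = ELEM_VOID;
        q->head = q->tail = s;
    }

    /* Producer: linking the element in is the LAST step. */
    void tq_enqueue(struct typed_queue *q, struct elem *e) {
        atomic_store(&e->next, NULL);
        atomic_store(&q->tail->next, e);        /* link in */
        q->tail = e;                            /* producer-local */
    }

    /* Consumer: returns the payload of the next element, or NULL if
     * the queue is logically empty. The last element is consumed in
     * place: it stays linked, retyped to void, so the producer
     * always appends to a non-empty list. */
    void *tq_consume(struct typed_queue *q, enum elem_type *type_out) {
        struct elem *h = q->head;
        if (h->type == ELEM_VOID) {             /* void head */
            struct elem *n = atomic_load(&h->next);
            if (n == NULL) return NULL;         /* truly empty */
            q->head = n;                        /* discard void head */
            free(h);
            h = n;
        }
        *type_out = h->type;
        void *p = h->payload;
        if (atomic_load(&h->next) == NULL) {    /* non-atomic check: */
            h->type = ELEM_VOID;                /* last element,     */
            return p;                           /* consume in place  */
        }
        q->head = atomic_load(&h->next);        /* normal removal */
        free(h);
        return p;
    }

A queue must be created with a single void sentinel (tq_init above), and a consumer that sees NULL relies on the bounded-interval invocation guarantee described above to be re-run soon after the producer's link-in.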
4 Prototype Implementation

The current Suez prototype is based on 400-MHz Pentium-II nodes with programmable Myrinet interfaces, connected by a Myrinet switch that has a 10-Gbit/sec backplane. We use Linux as the underlying operating system, though the prototype implementation's dependency on Linux is rather minimal. The Linux driver, used for mapping Myrinet interface memory and loading the control program, has been modified from the driver provided by the BIP group at INRIA. In this section, we describe critical datapath operations that we have encoded in the control program run by the Lanai network processors on each Myrinet interface.

4.1 Intra-Cluster Packet Movement

4.1.1 Packet Traversal Path

As shown in Figure 3, in the most general case a packet traverses a path from the external interface of the ingress router node, through the ingress node's internal interface and the egress node's internal interface, and eventually to the egress node's external interface. The external interface of the ingress node performs a packet classification operation to decide the queue that the packet should be enqueued to. This queue could be local to the ingress node, or may reside on a different node. We shall consider the general case when the ingress and egress nodes are different. In this case, the packet is first moved from the ingress node's external interface to its internal interface through a peer-to-peer DMA transaction, and then to the egress node's internal interface through a transfer over the switch. Peer DMA transactions are possible since PCI allows DMA transfers between devices' memory if, to each device, the memory of the other device appears as a block of physical memory in the host's address space. This means that to a Myrinet card, the memory of a peer Myrinet card appears as the host's physical memory, thus allowing DMA to be done between peer cards directly over the I/O bus. The peer-to-peer DMA transaction allows the packet to appear on the PCI bus exactly once, thus improving I/O bandwidth.

Figure 3. A generic packet forwarding path through Suez includes two router nodes and four interfaces. The ingress node's external interface classifies the packet and sends the packet to the egress node by performing a request-response protocol for peer DMA to its internal interface. Misses in the classifier cache at the external interface trigger a request to be posted to the CPU, which services the request and posts it back to the datapath. The egress node's internal interface receives and enqueues the packet for the link scheduler to send.

The peer-to-peer transfer proceeds using a request-response protocol. The source interface first writes a request (using peer DMA) to a designated request area in the memory of the internal interface, and then blocks waiting for a notification. The request specifies the location of the data to be transferred, and the target queue. The internal interface, which may get multiple such requests from multiple source interfaces, services these requests in a round-robin fashion. Requests are serviced by pulling the requested data from the source interface's memory through peer DMA, and then writing a completion notification through peer DMA to a designated response area in the source interface's memory. The internal interface then initiates a transfer of this data, over the switch, to the desired egress node.

Upon receipt of a packet, the egress node's internal interface enqueues it to an appropriate queue in the node's main memory, based upon the target queue information that the packet carries with it. Finally, the link scheduler schedules data for transmission, by setting up the output interface to perform a gather DMA from node memory out onto the output link.

In the event that the external interface of the ingress node encounters a classifier cache miss, it posts a classification request to the ingress node's classifier function. The classifier function, after serving the request, writes a peer transfer request to the node's internal interface, exactly as above (just like all source interfaces, the classifier has its own designated request area in the internal interface's memory). This allows the packet to "flow" back to the forwarding path from the CPU.
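The request-response protocol amounts to a data layout that the two network processors agree on. The field names, widths, and the single request slot below are illustrative assumptions, and a memcpy stands in for the Lanai DMA engine so the sketch runs as an ordinary program; the real control program drives PCI-mapped device memory.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* One slot in the internal interface's designated request area;
     * every source interface (and the CPU classifier) owns a slot. */
    struct peer_request {
        uint32_t data_addr;    /* where the packet sits in source memory */
        uint32_t data_len;     /* bytes to pull                          */
        uint32_t target_queue; /* queue id at the egress node            */
        uint32_t valid;        /* written last; the internal NP polls it */
    };

    struct peer_response {
        uint32_t done;         /* completion notification */
    };

    /* Simulated card memories; on real hardware these are PCI-mapped
     * device memories and peer_dma() is a DMA over the bus. */
    static uint8_t src_mem[2048];
    static struct peer_request  req_slot;   /* on the internal interface */
    static struct peer_response resp_area;  /* on the source interface   */

    static void peer_dma(void *dst, const void *src, size_t len) {
        memcpy(dst, src, len);   /* stand-in so the sketch runs */
    }

    /* Source interface: write the request, marking it valid last,
     * then block waiting for the completion notification. */
    static void source_post(uint32_t addr, uint32_t len, uint32_t queue) {
        struct peer_request r = { addr, len, queue, 0 };
        peer_dma(&req_slot, &r, sizeof r);
        req_slot.valid = 1;              /* ordering: valid goes last */
    }

    /* Internal interface: service request slots (round-robin over
     * all sources in the real system), pull the data via peer DMA,
     * and write the completion notification back. */
    static void internal_service(uint8_t *pulled) {
        if (!req_slot.valid) return;
        peer_dma(pulled, src_mem + req_slot.data_addr, req_slot.data_len);
        req_slot.valid = 0;
        resp_area.done = 1;              /* peer-DMA'd notification */
        /* next: send 'pulled' over the switch to the egress node,
           tagged with req_slot.target_queue */
    }

    int main(void) {
        uint8_t pkt[64] = {0};
        memcpy(src_mem + 100, "payload", 8);
        source_post(100, 8, 3);
        internal_service(pkt);
        printf("done=%u data=%s\n", resp_area.done, (char *)pkt);
        return 0;
    }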
4.1.2 Pipelining

The movement of a packet through the forwarding path is heavily pipelined among the network processors and the CPUs lying in the path. Figure 4 shows the nine steps involved in moving a packet from the input link to its corresponding output link. Each step takes a different amount of time. Here again, we utilize the programmability of the network interfaces to elegantly execute the pipeline without any hardcoding of pipeline stage lengths. Pipeline boundaries are enforced by the network processors purely in software, by performing appropriate checks for the completion of the various stages. Note that this also means that for different hardware, with different pipeline stage lengths, the network processor software need not change to correctly set up the pipeline.

Figure 4. Pipeline boundaries for the four-interface forwarding path in Figure 3. The nine steps are: (1) wire recv, (2) peer request, (3) peer pull, (4) peer response, (5) wire send and recv over the switch, (6) enq to node, (7) schedule, (8) gather from node, and (9) wire send. The pipeline's critical path in this example is the combination of stages 6, 7 and 8, and determines the pipeline's cycle time T. The arrows between pipeline stages represent dependencies among individual pipeline stages due to resource contention.

In Myrinet interface cards, it is possible to pipeline the DMA between host memory and card memory and the transfer between card memory and the link. In our architecture, we extend the scope of pipelining to cover the entire path of a packet through the router, which may involve up to four Myrinet interfaces. Consider a flow of packets in which all packets take the same path through the router, traversing a sequence of 4 independent interfaces. The stages of the pipeline in Figure 4 execute on different interfaces. Stage 1 involves a wire transfer of the i-th packet from the link to the interface memory, and can be overlapped with the peer-DMA request and the peer-DMA pull of the (i-1)-th packet by the internal interface. Depending upon which of these operations takes less time, either the input interface waits for the peer notification from the internal interface, or it waits for the wire receive of the i-th packet to finish, before sending the peer-DMA request for the i-th packet and issuing the wire receive of the (i+1)-th packet.

At the internal interface of the ingress node, the pull of the (i-1)-th packet's data is overlapped with the wire transfer of the (i-2)-th packet. Again, depending upon the time taken by each, the internal interface waits for the wire transfer or the pull to complete before pulling the i-th packet.

At the egress node, the pipeline synchronization between the internal interface and the output interface is implicit, and is enforced through the finiteness of the host's buffer pool. The internal interface at the egress node posts data to the link scheduler's queues using a free buffer pool allocated by the node. Continuing our pipeline from the internal interface of the ingress node, we see that the wire receive of the (i-2)-th packet can be overlapped with the host DMA receive of the (i-3)-th packet. At the output interface, the gather DMA of the (i-3)-th packet from the scheduler's queue will be overlapped with the wire send of the (i-4)-th packet. Since the internal interface and the output interface do not explicitly synchronize on the (i-3)-th packet, it might seem that there is no synchronization enforced between them. However, two factors implicitly lead to a synchronization. First, the scheduler does not free the packets chosen in a cycle until all data for that cycle has been sent out. Second, the internal interface cannot enqueue data to the host if the host buffer pool is empty. This means that if the data arrives at a rate higher than the throughput of the egress node, the buffer pool will deplete and cause the internal interface to refrain from posting more data. Thus, any mismatch between the interface throughputs at the egress node is only transient, and even without explicit coordination between network processors at the egress node, pipeline stages are implicitly synchronized.

Since some adjacent pipeline stages in Figure 4 use the same hardware resources, the pipeline is effectively a 5-stage pipeline rather than a 9-stage one. Specifically, steps 2, 3 and 4 effectively form one pipeline stage because they all need to use the ingress node's PCI bus, and steps 6, 7 and 8 form one pipeline stage because steps 6 and 8 need the egress node's PCI bus. In the case of our hardware platform, the "cycle time" of this pipeline, i.e., the critical path delay, is dictated by the combined delay of the 6th, 7th and 8th steps. However, we again emphasize that for different hardware the pipeline boundaries will dynamically change, since they are enforced by network processor software. Also note that if sufficient memory is available on the Myrinet cards, per-flow queues could be maintained in card memory, and hence the link scheduler could be implemented easily in the network processors because of its simplicity. This would balance the pipeline further, reducing its critical path delay.
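A small numeric sketch makes the stage grouping concrete. The per-step delays below are made-up placeholders, not measurements; the point is the computation: steps sharing a PCI bus serialize into one effective stage, and the cycle time is the largest effective-stage delay.

    #include <stdio.h>

    int main(void) {
        /* Illustrative per-step delays (microseconds); not measured. */
        double d[10] = { 0,     /* steps are 1-indexed */
            5,  /* 1: wire recv                        */
            2,  /* 2: peer request     \               */
            6,  /* 3: peer pull         > ingress PCI  */
            2,  /* 4: peer response    /               */
            5,  /* 5: wire send+recv over the switch   */
            6,  /* 6: enq to node      \               */
            3,  /* 7: schedule          > egress PCI   */
            6,  /* 8: gather from node /               */
            5   /* 9: wire send                        */
        };

        /* Effective stages: {1} {2,3,4} {5} {6,7,8} {9}. */
        double stage[5] = {
            d[1],
            d[2] + d[3] + d[4],
            d[5],
            d[6] + d[7] + d[8],
            d[9]
        };

        double T = 0;                    /* pipeline cycle time */
        for (int s = 0; s < 5; s++)
            if (stage[s] > T) T = stage[s];

        /* With these placeholders the critical path is {6,7,8},
         * matching the situation described for the prototype. */
        printf("cycle time T = %.1f us\n", T);
        return 0;
    }

At steady state, while effective stage 1 works on packet i, the stages behind it work on packets i-1 through i-4, which is exactly the overlap pattern described above.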
4.2 Memory and Queue Management

Buffer pools and all system queues are maintained as typed queues, introduced in Section 3.3. A typed queue element is a small descriptor that may point to a variable-sized chunk of memory. The descriptor has a type, which specifies how the queue element should be processed. For instance, Section 3.3 described how the void type can be used for synchronization. Another example is an autoincrement descriptor, which carries state indicating how much of the data that it points to has been processed, and is used when the quantum sizes for a flow on the input and output links are different. Descriptors are also maintained to perform usage accounting for every chunk of memory allocated in the system. Descriptor types are understood both by network processors and node CPUs.

In this section, we highlight the use of typed queues by a network processor to perform asynchronous, interrupt-free enqueuing operations. Assume that a network processor receives a packet that must be enqueued to a queue in host memory. A free buffer pool is pre-allocated by the host CPUs for every network interface. This pool is a simple, singly-linked typed queue, with descriptors pointing to free buffers. Assume that the network processor is informed about the location of the "head" of this queue at startup. When data is received, the network processor uses DMA to read those fields in the head descriptor that hold information about the location of the free buffer (say freebuf) and the location of the next queue element (say nxt); these fields are contiguous in the descriptor data structure, to avoid multiple DMA reads. Using this information, the network processor first updates the location of the new "head" of the free buffer pool to nxt. The received data is then DMAed into freebuf. To enqueue this data to some queue, a descriptor must be created to guard the data. This descriptor is created in card memory and is set up to point to freebuf. The descriptor itself is then also DMAed to the free buffer, after the received data. Now, this descriptor must be "linked in" to some queue. For this purpose, the card must know the location of the tail descriptor of the target queue. The newly created descriptor can then be "linked in" by writing the location of the new descriptor into the target queue's tail descriptor using a short DMA. The newly enqueued descriptor now becomes the tail descriptor of the target queue, so the network processor locally updates this information for future enqueuing operations. The consumer of the queue eventually consumes this element and returns the used buffer to the free buffer pool. The entire operation involves no locks and no interrupts.
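The sequence of DMA operations can be lined up as code. The sketch below simulates the card side with a memcpy-backed dma_read/dma_write pair standing in for the Lanai DMA engine; the descriptor layout, field offsets, and names are our own assumptions.

    #include <stdint.h>
    #include <string.h>

    /* Host-memory descriptor. 'buf' and 'nxt' are contiguous so one
     * short DMA read fetches both, as in Suez. */
    struct desc {
        uint32_t buf;    /* host address of the guarded buffer */
        uint32_t nxt;    /* host address of the next descriptor */
        uint32_t type;   /* typed-queue element type */
        uint32_t len;    /* bytes of valid data in 'buf' */
    };

    /* Simulated host memory; stand-ins for the card's DMA engine. */
    static uint8_t host_mem[1 << 16];
    static void dma_read(void *dst, uint32_t src, uint32_t len) {
        memcpy(dst, host_mem + src, len);
    }
    static void dma_write(uint32_t dst, const void *src, uint32_t len) {
        memcpy(host_mem + dst, src, len);
    }

    /* Card-local state, learned at startup / updated locally. */
    static uint32_t free_head;   /* host addr of free pool's head desc */
    static uint32_t target_tail; /* host addr of target queue's tail   */

    void np_enqueue(const void *pkt, uint32_t len, uint32_t pkt_type) {
        /* 1. One short DMA read gets freebuf and nxt together. */
        struct desc head;
        dma_read(&head, free_head, 8);
        uint32_t freebuf = head.buf, nxt = head.nxt;

        /* 2. Pop the free pool locally: the new head is nxt. */
        free_head = nxt;

        /* 3. DMA the packet data into the free buffer. */
        dma_write(freebuf, pkt, len);

        /* 4. Build the guarding descriptor in card memory and DMA it
         *    into the same buffer, right after the data. */
        struct desc d = { freebuf, 0, pkt_type, len };
        uint32_t new_desc = freebuf + len;
        dma_write(new_desc, &d, sizeof d);

        /* 5. Link in: one short DMA writes the new descriptor's
         *    address into the target queue's tail descriptor. This
         *    write is the last step, which is what makes the whole
         *    operation safe without locks or interrupts. */
        dma_write(target_tail + 4 /* offset of nxt */, &new_desc, 4);

        /* 6. The new descriptor is now the tail; remember it locally
         *    for future enqueues. */
        target_tail = new_desc;
    }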
4.3 Integrated Packet Queuing and Scheduling

Recall from Section 3.2 that in every cycle, the DFQ scheduler picks a quantum from various flows and sends them out as a batch. This means that in a path of routers with DFQ schedulers, batches of quanta will be received by the routers. However, enqueuing each quantum to its corresponding per-flow queue would incur the overhead of a short DMA for every quantum, which in turn would degrade packet throughput.

This per-quantum enqueuing cost can be eliminated by a more efficient queuing and scheduling algorithm that incurs only the overhead of a single DMA per batch, but maintains the constant scheduling overhead of DFQ. This is done by using a consume-and-thread algorithm, whose working is illustrated in Figure 5. Batches of quanta are enqueued to a single queue, which we call the "horizontal" queue. In every cycle, the DFQ scheduler picks quanta from those flows which are eligible for sending according to the scheduling algorithm. For example, in Figure 5, the scheduler picks flows f1 and f20 in the first cycle. The quanta that are not picked in a cycle must be "carried over" to the next cycle. To do this, a quantum may need to be horizontally as well as vertically enqueued. In the example of Figure 5, after the first cycle, the quantum for flow f10 needs to be carried over. Since there is also a quantum for flow f10 in the next batch, this quantum needs to be horizontally threaded with its successor. Moreover, in the second cycle the scheduler must visit flow f10 between visits to flow f1 and flow f25; hence, this quantum should also be "vertically" threaded with the next batch, between the quanta for flows f1 and f25. The amortized cost of these threading operations is O(1) per flow per cycle, thus maintaining the constant per-flow scheduling overhead of DFQ, while the overhead of multiple short DMAs by the network processor is completely eliminated.

Figure 5. An example 2-cycle run of the consume-and-thread algorithm. Unsent quanta of a batch are threaded both vertically and horizontally into the next batch, which is to be serviced by the link scheduler subsequently.
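The threading can be illustrated with a toy in-memory version. This sketch is our own simplification of the idea, not Suez's implementation: quanta form a vertical service list per batch, kept sorted by flow id, and a carried-over quantum is spliced into the next batch's vertical list with a horizontal link to its same-flow successor. Because the insert cursor only moves forward, the work is amortized O(1) per flow per cycle.

    #include <stdio.h>
    #include <stdlib.h>

    /* One quantum. 'v' threads the service order within a batch
     * (vertical); 'h' threads quanta of the same flow across batches
     * (horizontal). Field names are ours. */
    struct quantum {
        int flow;
        struct quantum *v;
        struct quantum *h;
    };

    /* Serve the current batch: send eligible quanta, thread unsent
     * ones into the next batch. Returns the next batch's head. */
    struct quantum *serve(struct quantum *cur, struct quantum *next,
                          int (*eligible)(int flow)) {
        struct quantum head = { -1, next, NULL };
        struct quantum *ins = &head;          /* forward-only cursor */
        while (cur) {
            struct quantum *q = cur;
            cur = cur->v;
            if (eligible(q->flow)) {
                printf("send quantum of flow f%d\n", q->flow);
                free(q);
                continue;
            }
            /* Carry over: splice before the first next-batch quantum
             * whose flow id is >= q->flow. */
            while (ins->v && ins->v->flow < q->flow)
                ins = ins->v;
            q->h = (ins->v && ins->v->flow == q->flow) ? ins->v : NULL;
            q->v = ins->v;                    /* vertical thread   */
            ins->v = q;
            ins = q;
        }
        return head.v;
    }

    static int elig(int flow) { return flow == 1 || flow == 20; }

    static struct quantum *mk(int flow, struct quantum *v) {
        struct quantum *q = malloc(sizeof *q);
        q->flow = flow; q->v = v; q->h = NULL;
        return q;
    }

    int main(void) {
        /* Cycle 1 batch: f1, f10, f20; next batch: f1, f10, f25. */
        struct quantum *b1 = mk(1, mk(10, mk(20, NULL)));
        struct quantum *b2 = mk(1, mk(10, mk(25, NULL)));
        b2 = serve(b1, b2, elig);
        for (struct quantum *q = b2; q; q = q->v)
            printf("next batch visits f%d%s\n", q->flow,
                   q->h ? " (threaded)" : "");
        return 0;
    }

Running this sends f1 and f20, and the next batch's visit order becomes f1, f10 (the carried, threaded quantum), f10, f25, matching the carry-over behavior described for Figure 5.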
The amortized cost of these threading and demultiplexing operations required for quantum- operations is Ç ´½µ per ﬂow per cycle, thus maintain- size rather than packet-size transmission. Figures 6 and 7 show the differences in byte throughputs and ing the constant per-ﬂow scheduling overhead of DFQ. packet throughputs between FIFO link scheduling and However, the overhead of multiple short DMAs by the DFQ link scheduling, as the quantum size used in DFQ network processor is completely eliminated. and the packet size vary. For DFQ link scheduling, we also vary the quantum size, which is the unit of “dis- 5 Performance Evaluation cretization” as compared to Fluid Fair Queuing. For a given quantum size, we measure the router through- Performance measurements have been made on the puts only for packet sizes that are larger than the quan- current Suez prototype, which consists of Pentium-II tum size. ¼¼ MHz PC’s as router nodes, each of which hosts For FIFO scheduling, the byte throughput increases two Myrinet interfaces, one as the internal and the with the packet size, because each DMA transaction at other as the external interface. The network interface the sender side amortizes its per-transaction overhead from Myrinet  has a Lanai 4.X network proces- over a larger packet and is therefore more efﬁcient. For sor. The system interconnect that links router nodes DFQ scheduling, the byte throughput is independent of ½¼ together is a -Gbps, -port Myrinet switch, provid- the packet size but depends on the quantum size, be- ing full-duplex ½¾ Gbps port-to-port bandwidth. All cause the size and thus efﬁciency of each DMA trans- the reported results are based on measurements from action in DFQ is dependent on the quantum size, re- a single packet forwarding path between two router gardless of the packet size. The larger the quantum nodes and thus four network interfaces. Because the size is, the more efﬁcient DFQ’s DMA transactions router backplane supports signiﬁcantly higher band- are. Since while receiving, DFQ only needs to per- width than what can be saturated by individual paths, form a single DMA transaction for a batch of quanta, the aggregate performance of a Suez router is Æ times whereas FIFO requires one DMA transaction for each the measurements reported below, where Æ is the independent packet, the FIFO scheduler shows a lower number of disjoint pairs of router nodes. Two other byte throughput than DFQ even when DFQ’s quantum PC’s are used as the source and sink hosts of the packet size is the same as the packet size. However, FIFO path to drive trafﬁc into and receive packets from the continues to exploit the increasing packet size to im- Suez prototype. Byte and packet throughputs are mea- prove the DMA transactions’ efﬁciency at the sender sured by sending packets back to back from the source end and eventually out-performs all DFQ instances 9 500.0 30.0 RT (Qsize=16) RT (Qsize=16) RT (Qsize=32) RT (Qsize=64) RT (Qsize=32) RT (Qsize=128) RT (Qsize=64) 400.0 NRT RT (Qsize=128) NRT Pkt Rate (in 10 Kpkts/s) Throughput (Mbits/s) 20.0 300.0 200.0 10.0 100.0 0.0 0.0 10 100 1000 10000 10 100 1000 10000 Pkt Size (Log Scale, bytes) Pkt Size (Log Scale, bytes) Figure 6. Throughputs in bytes/sec for FIFO and Figure 7. Throughputs in packets/sec for FIFO and DFQ schedulers with varying packet size and quan- DFQ schedulers with varying packet size and quan- tum size. tum size. with a ﬁxed quantum size, in terms of byte through- shown in ﬁgure 8, where we see, for -byte pack- ½¾ put. 
ets and varying quantum sizes, that the packet rate for For the FIFO scheduler, as the packet size increases, DFQ stays almost constant with the number of real- the byte throughput increases but the number of pack- time connections. ets transmitted with a unit time decreases, and the overall net effect is that the packet throughput de- 5.2 Latency Results creases. For the DFQ scheduler, only the number of packets transmitted with a unit time decreases with increase in packet size, but the byte throughput re- Latency is an orthogonal dimension, besides mains unchanged. Therefore, the slope of the decrease throughput, to evaluate the system performance. La- in packet throughput for DFQ is steeper than that for tency measurements also show how effective the FIFO. pipelined datapath implementation is, since overlap The byte and packet throughput differences be- between pipeline stages reduces the critical pipeline tween DFQ and FIFO represent the cost of real-time cycle time. In ﬁgure 4, the combined latency of stages link scheduling. As shown in Figure 6 and 7, this ½ through is the latency at the ingress node, whereas cost is relatively small for packet sizes smaller than the combined latency of stages though is the la- or equal to 1000 bytes. For even smaller packet sizes, tency at the egress node. Table 1 shows the effective- this cost is actually negative, because batched receives ness of the pipelined implementation by showing that in DFQ improve the DMA efﬁciency. In addition to the pipeline cycle time in each case is less than the low scheduling overhead compared to FIFO schedul- total latency as well as the latency of the bottleneck ing, DFQ is also more scalable in that its per-ﬂow node (viz the egress node in this case). These numbers scheduling overhead does not depend on the number of correspond to a batched transfer of -quanta batches. ¿¾ real-time connections that share the same output link, The amount by which the pipeline cycle time is shorter which is the case for other real-time link schedulers reﬂects the amount of overlap that has been gained by based on packetized weighted fair queuing. This is the pipelined operation of the datapath. 10 20.0 25.0 Qsize=16 Consume−and−thread Qsize=32 Per−connection queueing Qsize=64 Qsize=128 15.0 Pkt Rate (in 10 Kpkts/s) Pkt Rate (in 10Kpkts/s) 20.0 10.0 15.0 5.0 0.0 10.0 0.0 1000.0 2000.0 3000.0 4000.0 0.0 50.0 100.0 150.0 No. of RT Connections Quantum Size (bytes) Figure 8. The constant scheduling overhead of DFQ Figure 9. Performance improvement in packet allows Suez ’s packet rate to remain unaffected by rate from the consume-and-thread algorithm, which increasing numbers of real-time connections. avoids unnecessary DMA’s, decreases with increases in quantum size. To reduce the delay of the -th pipeline stage in Fig- ure 4, the consume-and-thread algorithm was devel- oped to avoid per-quantum enqueuing overhead. Fig- Quantum Ingress Node Egress Node Pipeline ure 9 shows that the optimization improves the over- Size Latency Latency Cycle all packet throughput signiﬁcantly, especially for small (bytes) (ms) (ms) Time(ms) packets. When the packet size is large, the overheads 16 141.45 198.24 142.48 of short DMA’s are overshadowed by the per-byte data 32 152.66 213.07 148.19 transfer cost. 64 176.61 237.99 163.25 128 222.13 310.49 246.98 5.3 Routing Lookup Overhead Table 1. Latency measurements (in msec) on In this section, we evaluate the performance of rout- the ingress and egress nodes for real-time data. ing lookup in our prototype. 
When a network process- The ingress node latency is the combined latency sor on a Suez node’s external interface fails to classify of pipeline stages 1 through 5 in ﬁgure 4 and an incoming packet due to a cache miss, it posts a re- the egress node latency is the combined latency quest to its associated CPU, which then services the re- of stages 5 through 9. The last column shows quest and posts the packet back to the forwarding path. the pipeline cycle time for the entire operation, We measure the hit access time and the miss penalty of which is always less than the total latency, due to a lookup operation. In our case, a lookup resulting in overlap between pipeline stages. a hit takes ¼ Pentium cycles. The Lanai proces- sor’s cycle time is worth approximately cycles of ½¾ the Pentium. This implies that the lookup operation takes ¿¾ Lanai cycles. The miss penalty is the ex- tra overhead that has to be paid for posting data to the 11 classiﬁer, rather than directly sending it to the internal  E. Amir, S. McCanne, R. H. Katz; ”An Active Service interface via peer DMA, and is measured to be ¼ Framework and its Application to Real Time Multi- Pentium cycles for -byte packets. media Transcoding”; Proc. ACM SIGCOMM 1998. Using a ½¼¾ -entry cache in the network processor,  P. Pradhan, T. Chiueh, A. Neogi; ”Aggregate TCP and a real-world packet trace used in , we measured Congestion Control Using Multiple Network Prob- the cache hit ratio to be ¾¿ ± . This gives an average ing”; to appear, Proc. ICDCS 2000. routing table lookup overhead of ¾¿ cycles. If the  H. Zhang; ”Service Disciplines For Guaranteed Per- routing table lookup is the bottleneck, then a packet formance Service in Packet-Switching Networks”, rate of ½¾ Kpackets/sec can be sustained for this Proceedings of the IEEE, 83(10), Oct 1995. trace. If we do a macro throughput measurement for  P. Pradhan, T. Chiueh; ”Cache Memory Design for the trace over the complete datapath through the router, Network Processors”; Proc. IEEE HPCA 2000. data transfer turns out to be the bottleneck and the re- sulting throughput is ¾ Kpackets/sec for -byte  P. Pradhan, Fluid Fairness : T.Chiueh; ”Discretization in Formulation and Im- packets. plications”; ECSL Technical Report (http://www.cs.sunysb.edu/ prashant/docs/fq.ps.gz). 6 Conclusion  P. Pradhan, T.Chiueh; ”A Computa- tion Framework for an Extensible Net- The contribution of this work is the development of work Router”; ECSL Technical Report a scalable edge router architecture, and its realization (http://www.cs.sunysb.edu/ prashant/docs/suezos.ps.gz). and performance evaluation on a functional prototype.  A. K. Parekh, R. G. Gallagher; ”A generalized proces- We have developed a decoupled system architecture sor sharing approach to ﬂow control in integrated ser- that cleanly separates packet forwarding and packet vices networks: the multiple node case”; IEEE/ACM computation paths. Using the programmability of the Transactions on Networking, April 1994, Volume 2, router’s network interfaces, the datapath functionality Number 2, pp. 137 - 150. has been cleanly separated between general-purpose  T. Chiueh, G. Venkitachalam, P. Pradhan; ”Integrat- CPUs and network processors, and efﬁcient techniques ing Segmentation and Paging Protection for Safe, Ef- have been used to sytematically eliminate interrupt, ﬁcient and Transparent Software Extensions”; Proc. synchronization and redundant I/O overheads. The ar- ACM SOSP 1999. chitecture uses clustering to achieve scalability, allow-  T.V. Lakshman, D. 
6 Conclusion

The contribution of this work is the development of a scalable edge router architecture, and its realization and performance evaluation on a functional prototype. We have developed a decoupled system architecture that cleanly separates the packet forwarding and packet computation paths. Using the programmability of the router's network interfaces, the datapath functionality has been cleanly divided between general-purpose CPUs and network processors, and efficient techniques have been used to systematically eliminate interrupt, synchronization, and redundant I/O overheads. The architecture uses clustering to achieve scalability, allowing it to overcome the limitations of general-purpose PC hardware. Unique algorithmic features, such as an efficient route caching scheme and an efficient discretized fair queuing link scheduler, provide efficient datapath primitives for the system. We are currently developing a safely extensible computational framework on top of this architecture, to produce a complete system working as a versatile, high-performance edge router.

References

J. Smith, K. Calvert, S. Murphy, H. Orman, L. Peterson; "Activating Networks"; IEEE Computer, April 1999.

E. Amir, S. McCanne, R. H. Katz; "An Active Service Framework and its Application to Real Time Multimedia Transcoding"; Proc. ACM SIGCOMM 1998.

P. Pradhan, T. Chiueh, A. Neogi; "Aggregate TCP Congestion Control Using Multiple Network Probing"; to appear, Proc. ICDCS 2000.

H. Zhang; "Service Disciplines For Guaranteed Performance Service in Packet-Switching Networks"; Proceedings of the IEEE, 83(10), Oct 1995.

P. Pradhan, T. Chiueh; "Cache Memory Design for Network Processors"; Proc. IEEE HPCA 2000.

P. Pradhan, T. Chiueh; "Discretization in Fluid Fairness: Formulation and Implications"; ECSL Technical Report (http://www.cs.sunysb.edu/~prashant/docs/fq.ps.gz).

P. Pradhan, T. Chiueh; "A Computation Framework for an Extensible Network Router"; ECSL Technical Report (http://www.cs.sunysb.edu/~prashant/docs/suezos.ps.gz).

A. K. Parekh, R. G. Gallager; "A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Multiple Node Case"; IEEE/ACM Transactions on Networking, 2(2), April 1994, pp. 137-150.

T. Chiueh, G. Venkitachalam, P. Pradhan; "Integrating Segmentation and Paging Protection for Safe, Efficient and Transparent Software Extensions"; Proc. ACM SOSP 1999.

T. V. Lakshman, D. Stiliadis; "High Speed Policy-based Packet Forwarding Using Efficient Multi-dimensional Range Matching"; Proc. ACM SIGCOMM 1998.

L. Peterson, S. Karlin, K. Li; "OS Support for General-Purpose Routers"; Proc. IEEE HotOS 1999.

P. Pradhan, T. Chiueh; "Operating Systems Support for Programmable Cluster-Based Internet Routers"; Proc. IEEE HotOS 1999.

Myricom Inc.; "LANai4.X Specification"; (http://www.myri.com/scs/documentation/mug/development/LANai4.X.doc.txt).

"The BIP Project, RESAM Laboratory"; (http://resam.univ-lyon1.fr/index_bip.html).

G. Apostolopoulos, D. Aubespin, V. Peris, P. Pradhan, D. Saha; "Design, Implementation and Performance of a Content-Based Switch"; Proc. IEEE Infocom 2000.