Address Processor and Classifier Co-Processors from Silicon Access Networks:
A Family of Search Co-Processors for Terabit Routers with OC192 Blades

                             Mike O’Connor & Syed Mahmud
                                 Principal Architect
                               Silicon Access Networks
                           mike.oconnor@siliconaccess.com

Introduction

Advanced IP networks are being designed with improved Quality of Service so
that operators can offer differentiated service levels and carry voice and
video traffic over connectionless datagram networks. Routing at 10 Gbps line
speeds requires numerous lookups into large tables to meet the demands of
such networks. Typically, these lookups include both longest-prefix-match
lookups into a forwarding table and large multi-dimensional lookups into a
flow classification table for access control, billing, or QoS purposes. The
flow classification lookups may include source and destination IP addresses
and TCP ports, among other fields. Advanced products such as server load
balancers require lookups using session and application data such as SAP
session, URLs, and cookies.

As shown in Figure 1, the Address Processor and Classifier can be configured
as Co-Processors that target different search applications. The Address
Processor is used for longest-prefix and exact match searches, and its
architecture is based upon a tree search algorithm using dense DRAM
technology. The Classifier is used for large multi-dimensional lookups and is
based upon dynamic ternary CAM technology. The use of fast and dense embedded
DRAM enables a high-performance solution at significantly lower cost and
lower power consumption than SRAM-based solutions.




[Figure 1: block diagram showing the Classification and Forwarding ASIC
(data path, 10 Gb/s), the 133 MHz ZBT SRAM bus, the Address Processor, the
Classifier, and the Routing and Management CPU.]

  • Use Address Processor for exact match, longest prefix match, unique flows
  • Use Classifier for aggregated flows and ACLs

Figure 1. System Configuration with Address Processor(s) and Classifier(s)
on ZBT SRAM bus




In addition to lookups, the Address Processor and Classifier collect
statistics in associated memory, modify associated data on the fly, and
provide an automated table maintenance capability. Each product is optimized
for its task in terms of memory density, power, and functionality. At the
same time, both chips are designed to reside simultaneously on a standard
133-MHz ZBT SRAM bus with up to 128 data pins, allowing the routing system
designer maximum flexibility. Multiple devices may be placed upon the bus in
a mix-and-match fashion. The two chips share a common application-programming
interface.

The Address Processor and Classifier are designed to allow a large number of
concurrent pipelined lookup requests, one for each key, to access and update
associated data. Write transactions on the ZBT bus specify a command and
supply data to the chips, such as the key to look up and the operation to
perform on the associated data. These operations are flexible as to which
fields they affect and include several modifications, such as increment by a
constant and add. Several write transactions may be required to transfer all
the data for a given command. Once all the data has been transferred, the
request is scheduled to begin execution through the pipeline.

A read request is used to access some or all of the results of a given
command from the result buffer. These read requests appear as standard SSRAM
read transactions.
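
The request flow described above (submit a key and an operation, let the
device update the associated data, then read the result back) can be modeled
in software. The Python sketch below is illustrative only: the class name,
the operation names "increment" and "add", the increment constant, and the
field layout are all assumptions made for this example and do not reflect the
chips' actual command encoding or bus protocol.

```python
from dataclasses import dataclass, field

INCREMENT_CONSTANT = 1  # ASSUMPTION: the source only says "a constant"


@dataclass
class LookupModel:
    """Toy model of a lookup with an on-the-fly associated-data update.

    Hypothetical: names and operations are illustrative, not the device API.
    """
    table: dict = field(default_factory=dict)    # key -> list of data fields
    results: list = field(default_factory=list)  # stands in for the result buffer

    def submit(self, key, op, field_index, operand=0):
        """Model one command: look up key, modify one field, buffer the result."""
        entry = self.table.get(key)
        if entry is None:
            self.results.append(None)            # miss: nothing to update
        else:
            if op == "increment":                # increment by a constant
                entry[field_index] += INCREMENT_CONSTANT
            elif op == "add":                    # add a caller-supplied value
                entry[field_index] += operand
            else:
                raise ValueError(f"unknown operation: {op}")
            self.results.append(tuple(entry))
        return len(self.results) - 1             # handle used by the later read

    def read_result(self, handle):
        """Model the read transaction that fetches a buffered result."""
        return self.results[handle]


model = LookupModel(table={"route-a": [0, 0]})
model.submit("route-a", "increment", 0)              # count one packet
handle = model.submit("route-a", "add", 1, operand=40)  # add 40 bytes
print(model.read_result(handle))
```

In this model the host never reads, modifies, and writes the counters itself;
it only issues the command and later collects the result, which is the
division of labor the text describes.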



Address Processor

Extensive use of purpose-built embedded DRAM arrays enables a single Address
Processor chip to store up to 256K 48-bit longest-prefix-match ranges while
sustaining its full lookup rate of 66 million lookups per second. The Address
Processor supports key sizes of 48 bits (up to 256K entries), 96 bits (up to
128K entries), and 144 bits (up to 80K entries). Arbitrary IPv4 keys are
supported since the Address Processor can have up to 33 nested levels of
prefixes. The Address Processor can update table entries at an average of 1M
updates/sec. Each lookup result indexes one of 256K associated 96-bit user
data words. Each of these 96-bit words, in turn, contains a 13-bit field
that refers to one of 8K additional 256-bit user data words.
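
The two-step indirection described above (lookup result to 96-bit word, then
a 13-bit field to one of 8K 256-bit words) can be sketched as a simple bit
extraction. The source does not say where the 13-bit field sits inside the
96-bit word, so placing it in the least-significant bits (FIELD_SHIFT = 0) is
purely an assumption for illustration, as are the function and constant
names.

```python
L5_INDEX_BITS = 13                  # from the text: selects one of 8K words
L5_ENTRIES = 1 << L5_INDEX_BITS     # 8192 = 8K
FIELD_SHIFT = 0                     # ASSUMPTION: field position is not documented


def l5_index_from_l4_word(word96: int) -> int:
    """Extract the 13-bit index of the 256-bit word from a 96-bit data word."""
    if not 0 <= word96 < (1 << 96):
        raise ValueError("associated data word must fit in 96 bits")
    return (word96 >> FIELD_SHIFT) & (L5_ENTRIES - 1)


# A 96-bit word whose (assumed) low 13 bits hold index 5.
example_word = (0xABCDEF << L5_INDEX_BITS) | 5
print(l5_index_from_l4_word(example_word))
```

A 13-bit field is exactly wide enough to address 8K entries, which is why
many of the 256K per-route words can share a smaller set of 256-bit words.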

The Address Processor supports on-the-fly read-modify-write operations on the
96-bit associated words and on half (128 bits) of each 256-bit user data
word. Each per-route entry can perform a read, add, and write on two fields
of a 96-bit data word. Similarly, the next-hop information indexed by the
per-route information can perform a read, add, and write on two fields in a
128-bit data word. Performing these four sets of read-modify-write operations
and storing the results would require approximately 15 instructions per
lookup on another processor in the system. Since the Address Processor
supports 66 million lookups per second, this capability enables the user to
offload roughly one billion statistics maintenance operations per second from
other processors in the system for each included Address Processor. Line
cards built with such chips will require less off-chip processing power and
fewer or zero external SRAM memory components for associated memory tables.
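
The offload estimate can be checked with the rounded figures quoted in the
text (66 million lookups per second, approximately 15 instructions per
lookup). The short Python calculation below is only a back-of-the-envelope
check; the function name is made up for this example.

```python
LOOKUPS_PER_SECOND = 66_000_000   # lookup rate quoted in the text
INSTRUCTIONS_PER_LOOKUP = 15      # approximate figure quoted in the text


def offloaded_ops_per_second(lookups_per_second: int, per_lookup: int) -> int:
    """Operations per second another processor would otherwise execute."""
    return lookups_per_second * per_lookup


ops = offloaded_ops_per_second(LOOKUPS_PER_SECOND, INSTRUCTIONS_PER_LOOKUP)
print(f"{ops / 1e9:.2f} billion operations per second")
```

With these rounded inputs the product is 990 million, so the offload is best
read as on the order of one billion operations per second.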

Each Address Processor performs up to 66 million searches per second along
with the statistics updates associated with each search. This allows two
lookups and statistics updates per 40-byte packet at 10 Gbps for WAN router
applications. Alternatively, three lookups and statistics updates are
supported in a 64-byte packet environment.
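
The per-packet lookup budget follows from dividing the lookup rate by the
packet rate at line speed. The Python sketch below reproduces that
arithmetic under a simplifying assumption that is not stated in the source:
packets arrive back to back with no framing or inter-packet overhead, so the
packet rate is simply the line rate divided by the packet size in bits.

```python
import math

LOOKUP_RATE = 66_000_000          # lookups per second, from the text
LINE_RATE_BPS = 10_000_000_000    # 10 Gbps


def lookups_per_packet(packet_bytes: int) -> float:
    """Lookups available per packet, ignoring framing overhead (assumption)."""
    packets_per_second = LINE_RATE_BPS / (packet_bytes * 8)
    return LOOKUP_RATE / packets_per_second


for size in (40, 64):
    budget = lookups_per_packet(size)
    print(f"{size}-byte packets: {budget:.2f} lookups, "
          f"{math.floor(budget)} whole lookups per packet")
```

At 40 bytes the packet rate is 31.25 million packets per second, which leaves
just over two lookups per packet; at 64 bytes it leaves just over three.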

The storage configuration of the Address Processor is as follows. The Address
Processor includes a multi-level search tree in L0-L3 as well as associated
statistics data in L4 and L5. As shown in Table 1, the routing table is
stored in a 25 Mb block of embedded DRAM (L3), organized as 8,192 rows of
3200 bits. Three levels of indexing tables (L0-L2), very similar to a B-Tree,
are used to select a row of the L3 memory in a pipelined manner. These three
memories total ~1.2 Mb and are implemented using embedded SRAM. The correct
entry within the selected L3 row is then chosen from its 3200 bits using a
proprietary, patent-pending algorithm.
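
The device's search algorithm is proprietary, so the following Python sketch
is not the actual method. It only illustrates the generic B-Tree-style idea
of using three small index levels to pick one L3 row. The fan-outs (32, 16,
16) are inferred from the row counts in Table 1 (1, 32, 512, and 8K rows) and
are an assumption, as are the function names and the use of plain integers as
keys.

```python
from bisect import bisect_right

# ASSUMPTION: fan-out per level, inferred from Table 1 row counts
# (L0: 1 row -> L1: 32 rows -> L2: 512 rows -> L3: 8192 rows).
FANOUT = (32, 16, 16)


def build_index(l3_first_keys):
    """Build [L0, L1, L2] separator rows from the sorted first key of each L3 row."""
    levels = []
    keys = list(l3_first_keys)
    for fanout in reversed(FANOUT):              # build bottom-up: L2, L1, L0
        rows = [keys[i:i + fanout] for i in range(0, len(keys), fanout)]
        levels.append(rows)
        keys = [row[0] for row in rows]          # first key of each row moves up
    levels.reverse()
    return levels


def select_l3_row(levels, key):
    """Walk L0 -> L1 -> L2 and return the index of the L3 row covering key."""
    row_index = 0
    for rows, fanout in zip(levels, FANOUT):
        row = rows[row_index]
        # Last separator that is <= key; keys below the first separator use child 0.
        child = max(bisect_right(row, key) - 1, 0)
        row_index = row_index * fanout + child
    return row_index


# 8192 L3 rows whose first keys are 0, 1000, 2000, ...
index = build_index(range(0, 8192 * 1000, 1000))
print([len(level) for level in index])
print(select_l3_row(index, 5_000_500))
```

Each level needs only one row read and one in-row comparison step, which is
what makes this kind of structure easy to pipeline: three lookups can be in
flight at once, one per index level.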

                       Address Processor Memory Organizations

             Level      Width        Height    Size        Memory Type
             L0         4712 bits    1         ~4.6 Kb     SRAM
             L1         2310 bits    32        ~72.2 Kb    SRAM
             L2         2310 bits    512       ~1.1 Mb     SRAM
             L3         3200 bits    8K        25 Mb       DRAM
             L4 Data    100 bits     256K      25 Mb       DRAM
             L5 Data    256 bits     8K        2 Mb        DRAM
             Total                             ~1.2 Mb     SRAM
                                               52 Mb       DRAM

Table 1.     Address Processor storage levels and memory organizations
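
The sizes in Table 1 follow directly from width times height. The Python
check below recomputes the SRAM and DRAM totals; it assumes binary units
(1 Kb = 1024 bits, 1 Mb = 1024 x 1024 bits), which is the interpretation
that reproduces the table's figures. The dictionary and function names are
made up for this example.

```python
MBIT = 1024 * 1024  # ASSUMPTION: binary megabit, which matches the table

# level: (width in bits, number of rows, memory type), copied from Table 1
LEVELS = {
    "L0": (4712, 1, "SRAM"),
    "L1": (2310, 32, "SRAM"),
    "L2": (2310, 512, "SRAM"),
    "L3": (3200, 8 * 1024, "DRAM"),
    "L4": (100, 256 * 1024, "DRAM"),
    "L5": (256, 8 * 1024, "DRAM"),
}


def totals_in_mbit(levels):
    """Sum width x height per memory type and convert to megabits."""
    bits = {}
    for width, height, kind in levels.values():
        bits[kind] = bits.get(kind, 0) + width * height
    return {kind: total / MBIT for kind, total in bits.items()}


for kind, size in totals_in_mbit(LEVELS).items():
    print(f"{kind}: {size:.2f} Mb")
```

The three SRAM index levels together come to about 1.2 Mb, and the three DRAM
arrays come to exactly 52 Mb, matching the totals row of the table.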


Classifier

The Classifier product is based upon embedded Ternary CAM memory arrays as
well as DRAM arrays. The Classifier stores up to 192K 48-bit key entries
while sustaining its full lookup rate of 66 million lookups per second. The
Classifier supports variable key sizes of 48 bits (up to 192K entries),
96 bits (up to 96K entries), 144 bits (up to 64K entries), 192 bits (up to
48K entries), 288 bits (up to 32K entries), and 576 bits (up to 16K entries).
The product can support 288-bit operation at 66 million lookup requests per
second. Each lookup result indexes an associated variable-width (multiples
of 32 bits) user data word. The number of variable-width associated data
words is equal to the number of keys (e.g. 192K for 48-bit keys).
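
A ternary CAM compares a key against entries whose bits may be 0, 1, or
"don't care". The Python sketch below shows that matching rule in software.
It is a generic illustration, not the Classifier's implementation: the
value/mask representation, the lowest-index-wins priority rule, and the
48-bit key layout (32-bit destination address followed by a 16-bit port) are
assumptions chosen for the example.

```python
from typing import NamedTuple, Optional, Sequence


class TernaryEntry(NamedTuple):
    value: int
    mask: int   # 1 bits are compared, 0 bits are "don't care"


def make_key(ip: int, port: int) -> int:
    """ASSUMED 48-bit layout: 32-bit IPv4 address, then 16-bit port."""
    return (ip << 16) | port


def tcam_lookup(entries: Sequence[TernaryEntry], key: int) -> Optional[int]:
    """Return the index of the first matching entry, or None on a miss.

    ASSUMPTION: the lowest index has the highest priority.
    """
    for index, entry in enumerate(entries):
        if (key & entry.mask) == (entry.value & entry.mask):
            return index
    return None


entries = [
    # 10.1.0.0/16, port 80 exactly
    TernaryEntry(make_key(0x0A010000, 80), make_key(0xFFFF0000, 0xFFFF)),
    # 10.0.0.0/8, any port
    TernaryEntry(make_key(0x0A000000, 0), make_key(0xFF000000, 0x0000)),
]

print(tcam_lookup(entries, make_key(0x0A010203, 80)))    # 10.1.2.3:80
print(tcam_lookup(entries, make_key(0x0A010203, 443)))   # 10.1.2.3:443
print(tcam_lookup(entries, make_key(0xC0A80001, 80)))    # 192.168.0.1:80
```

The returned index plays the role of the lookup result that selects the
associated data word. A hardware TCAM evaluates all entries in parallel; the
loop here only models the result, not the timing.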

The Classifier supports on-the-fly read-modify-write operations on the
associated data. Each data entry can perform a read, add, and write on two
fields of the associated data word. Performing these two sets of
read-modify-write operations and storing the results would require
approximately 8 instructions per lookup on another processor in the system.
Since the Classifier supports 66 million lookups per second, this capability
enables the user to offload over half a billion statistics maintenance
operations per second from other processors in the system for each included
Classifier.




Each Classifier performs up to 66 million searches per second along with
statistics updates associated with each successful search. This allows up to
two lookups and statistics updates per 40-Byte sized packet at 10 Gbps for
WAN router applications.

The storage configuration of the Classifier is as follows. The Classifier
includes a single-level TCAM array as well as associated data in L4. Total
user-accessible memory in the Classifier is 15 Mb (9 Mb of TCAM and 6 Mb of
DRAM).

                             Classifier Storage Levels

Level             # Entries     Content           L4 data
TCAM              192K          48-bit Key        192K x 32 bits
TCAM alternate    96K           96-bit Key        96K x 64 bits
TCAM alternate    64K           144-bit Key       64K x 96 bits
TCAM alternate    48K           192-bit Key       48K x 128 bits
TCAM alternate    32K           288-bit Key       32K x 192 bits
TCAM alternate    16K           576-bit Key       16K x 384 bits

                         Classifier Memory Organizations

Level       Width       Height         Size/Memory Type
TCAM        48 bits    192K(max)           9 Mb TCAM
L4 Data     32 bits    192K(max)           6 Mb DRAM
Total                                      9 Mb TCAM
                                           6 Mb DRAM


Table 2.    Classifier storage levels and memory organizations
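
Every configuration in Table 2 uses the same 9 Mb of TCAM and 6 Mb of
associated DRAM; only the key width changes. The Python check below derives
the entry count and associated word width for each key size. It assumes
binary units (1 Mb = 1024 x 1024 bits) and that the associated word scales as
32 bits per 48 bits of key, which is what the table's rows imply; the
function name is made up for this example.

```python
MBIT = 1024 * 1024            # ASSUMPTION: binary megabit, matching the table
TCAM_BITS = 9 * MBIT
L4_BITS = 6 * MBIT
KEY_WIDTHS = (48, 96, 144, 192, 288, 576)


def classifier_capacity(key_bits: int):
    """Return (entries, associated word width in bits) for one key width."""
    entries = TCAM_BITS // key_bits
    l4_word_bits = L4_BITS // entries
    return entries, l4_word_bits


for width in KEY_WIDTHS:
    entries, l4_word_bits = classifier_capacity(width)
    print(f"{width:3d}-bit key: {entries // 1024}K entries, "
          f"{l4_word_bits}-bit associated word")
```

Doubling the key width halves the number of entries and doubles the
associated word, so both memories are fully used in every configuration.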

Conclusions

Deep-packet processing at OC192 wire speeds is highly memory-intensive.
Traditional solutions using SRAM require a large number of chips, resulting
in high power dissipation. Silicon Access Networks’ Address Processor and
Classifier Co-Processor chips make extensive use of fast, embedded Smart
Memory to offer a cost-effective yet high-performance solution. OC192 line
cards built with such chips will require less off-chip processing power and
fewer or zero external SRAM memory components for associated memory tables.



The conference presentation will focus on the lookup requirements for packet
classification, implementation details of the Address Processor and
Classifier chips, and how a typical high-end router system might use them.



