WHITE PAPER

Investigating Design Criteria for Searching Databases
By Michael Miller, VP and CTO, Systems Technology, IDT, Inc., and Bertan Tezcan, Design Engineer, Systems Technology Group, IDT, Inc.

Introduction

Whether it is at the core, edge or access platform, network routers have one basic function: packet forwarding. Edge routers, however, must perform additional processing and deeper examination of incoming packets, and must do so in the most efficient manner. To accomplish this, designers need to know how to analyze their design in order to choose the most appropriate searching technology. With the emergence of reduced latency DRAM (RLDRAM) and network search engines enabled by high-density, fast ternary content addressable memory (TCAM), the designer has many new options to consider. As designs and applications become more complex, it becomes readily apparent that there are many design trade-offs; one size does not fit all. Many factors must be taken into account, including performance requirements and types of databases, as well as application-specific factors that must be evaluated when applying the chosen technology.

This two-part tutorial describes in detail the design complexities of selecting a particular search algorithm or technology. Part I investigates the key design considerations, including development issues and attributes that help determine which of the diverse memory architectures to incorporate into your design. Part II provides examples of applications in which these technologies can be applied.

Router Functions and Searching Databases

The job of the intelligent edge router, also called an ingress router, is to take incoming packets and send them to the network switch fabric. The edge router typically resides in the network service provider's central office, where it is implemented on a line card in a network-processing unit (NPU). These line cards do not simply forward packets; they must act intelligently in deciding when and how to forward them.
For instance, the edge router must implement quality of service (QoS) protocols to ensure that customers get the right share of the data channel’s bandwidth. Implementing these QoS protocols and other intelligent packet forwarding requires the use of several specialized search databases. The general structure of the line card’s logic flow for packet forwarding appears in Figure 1, which shows the various databases and their connection to the line card’s tasks. Three databases are of special interest: the flow cache, the Layer-2 (L2) and Layer-3 (L3) tables, and the access control lists (ACL). Each has to be searched repeatedly as part of the packet forwarding, and each database has different search requirements.
[Figure 1 block diagram. Blocks shown: control plane processor (RSVP, policy, ARP, database maintenance); flow cache, ACL, L2 table, and L3 table; classification and policy administration; forwarding engine; traffic manager; header and content processing (encrypt, decrypt, sign, authenticate); header update; QoS queues; packet buffer; ingress data channel; switch input.]

Figure 1 – Line card logic flow

Investigating Design Criteria for Searching Databases – Feb 2005 Page 1 of 13

The flow cache is used to monitor flows and may be used to speed router operation, taking advantage of packet traffic’s “flow” nature. A given application is likely to send a series of packets with the same header characteristics. By keeping a cache of recently processed headers together with their associated forwarding, classification, and policy data, the line card can avoid processing the header each time and simply forward the packet using the flow data generated the first time. The flow cache can also store packet header characteristics that the Resource Reservation Protocol (RSVP) gives to the router to specify how a specific flow should be processed. A search of the flow cache must use an exact-match operation. For traditional IPv4/6 packets, the line card examines the L2 link and L3 network address databases in order to determine the next hop in the packet’s route. In this operation, the routing decision may not need to use the full packet header. To speed operation, the search of this database uses a longest-prefix matching (LPM) algorithm. Access control lists (ACLs) are the result of implementing rules for each packet header combination. These rules allow the line card to verify that the services associated with routing the packet match the network status as well as the terms of the user’s contract. Users may have restricted access to some services, have contract-related bandwidth limitations, or not have authorization to access some network resources. In order to implement these rules, the line card needs to classify the packet according to its access rights. Because many users are subject to the same rules, but with different outcomes, the search on the ACL database uses a best-match algorithm with wildcards to mask arbitrary sections of the header from the search. 
For example, an access rule may say "decline all packets coming from any source IP address destined for Port 80 on any IP destination address" by masking all but the port address in the header.

Line Card Performance

All these database searches must be performed at speeds sufficient to keep pace with the incoming traffic. If these searches are not performed at line rate, packets will be dropped. There are two components to the line card's search performance requirements. The first is the traffic rate. The second is the number of database searches a packet requires. Simple forwarding may only require one or two searches, but access control and billing may require many more. Figure 2 shows the data rates that line cards must handle for a variety of function and speed combinations, as well as the types of equipment that provide those functions.
[Figure 2 chart. Horizontal axis: switch/router or line card capacity (155M, 622M, 2.5G, 10G, 40G), with the corresponding number of 40-byte packets per second in millions (0.5, 2, 8, 32, 128). Vertical axis: searches per packet (1 to 16). Functions plotted: L2 forwarding, L3 forwarding, QoS classification, access control, IPSec, billing/counting, content aware. Equipment classes: broadband access, enterprise switch, public access CPE, mobile access and corporate CPE, metro edge routers, core edge routers.]

Figure 2 – Search rate requirements for network search engines
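The relationship between line rate, packet size, and searches per packet can be sketched in a few lines of Python. This is only an illustration of the arithmetic behind the axes of Figure 2: the function names are hypothetical, the line rates are nominal values, and framing overhead is ignored.

```python
MIN_PACKET_BYTES = 40  # minimum packet size quoted in the text


def packets_per_second(line_rate_bps: float, packet_bytes: int = MIN_PACKET_BYTES) -> float:
    """Worst-case packet rate: every packet on the link is minimum size."""
    return line_rate_bps / (packet_bytes * 8)


def required_search_rate(line_rate_bps: float, searches_per_packet: int) -> float:
    """Database searches per second needed to keep up with the line."""
    return packets_per_second(line_rate_bps) * searches_per_packet


# Nominal line rates (assumption: no framing or inter-packet overhead).
for label, rate in [("2.5G", 2.5e9), ("10G", 10e9), ("40G", 40e9)]:
    mpps = packets_per_second(rate) / 1e6
    msps = required_search_rate(rate, searches_per_packet=4) / 1e6
    print(f"{label}: {mpps:.1f} Mpps, {msps:.1f} M searches/s at 4 searches per packet")
```

At a nominal 10G the worst case is roughly 31 million packets per second, which is consistent with the 32 million tick on the figure's packet-rate axis.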


In addition to performing the search function, the line card must maintain the search databases. The frequency of updates depends on how quickly the network is changing. In relatively static networks using a 16-4-4-4-4 Trie algorithm for classification, adding or deleting an entry in the L2 or flow cache tables requires about 256 memory accesses per second (assuming the addition of an 8-bit prefix). The Border Gateway Protocol (BGP) maintains the L3 table and typically requires updates of about 3K entries per second during extreme route flap conditions. For algorithm-based search tables, this translates to 6 million memory accesses per second, about 2 percent of the capacity of the 250 MHz quad data rate (QDR) memory bus used on many line cards. The memory accesses become much more frequent, however, if the network is updating the flow table or allocating VLAN traffic to switched virtual connections. In this case, table updates can reach more than 100K entries per second, or 25 percent of bus capacity, creating the potential for a memory bottleneck.

Unlike the L2/L3 tables, current ACL tables are fairly static. This may change in next-generation networks. Internet security is becoming more important and networks may need to adjust quickly and adaptively in order to foil cyber attacks, which will place an additional burden on ACL table updates.

In addition to performance, there are other hardware issues that must be kept in mind when choosing the searching implementation for a line card. One is power. The central office where line cards reside is typically small and closed, creating a heat trap. Line cards can easily dissipate as much as 300W each, so cooling becomes a major problem for central offices. Design of the classification database memory allows a trade-off among speed, latency, board space, and power dissipation. Selecting the right database architecture, then, is an important factor in setting overall system power.
Other factors to consider in line card design include the density and availability of memory. Memory density will determine the size of databases that are practical within the board's physical constraints, thus determining its performance. The availability of memory will affect both product cost and production schedule. The wrong memory choice could result in a product that cannot be produced on time, which negatively impacts a designer's ability to meet customer requirements.

Although the search requirements for line cards can be high, there are many search algorithms that can satisfy them. For line card design, however, the algorithm also dictates the architecture. For example, the hashing and Trie algorithms use standard SRAM. The same algorithms can be used with RLDRAM or other memories by replicating the database over several devices so that searches can run in parallel to attain speeds comparable to SRAM. In one example, four RLDRAM devices with 20 ns random access time can store one database replicated four times to reach up to 200 MHz of memory performance. This solution may be less costly than an SRAM solution when only memory cost is compared, but the system costs go up when the memory bus is replicated four times. Additionally, the replication of databases further complicates the updating of tables. With either architecture, however, the algorithms force a tradeoff among search speed, update speed, and memory size.

Direct Index is Simplest

One of the simplest search methods is to use the search parameter, or key, as the address to an SRAM table that contains an index into the relevant database. This direct indexing provides an exact match, as used in flow caching, using the packet header field as the search key. The approach is fast, delivering the index in a single memory access, but uses memory inefficiently. A search key containing W bits needs an index memory of 2^W locations. As a result, keys much larger than 25 bits become impractical.
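A minimal Python sketch of direct indexing follows. It is meant only to show why the table size grows as 2^W; the class and function names are hypothetical, and a Python list stands in for the SRAM array.

```python
from typing import List, Optional


def direct_index_entries(key_bits: int) -> int:
    """Number of table locations needed for a W-bit key: 2^W."""
    return 1 << key_bits


class DirectIndexTable:
    """The search key is used directly as the memory address."""

    def __init__(self, key_bits: int) -> None:
        self.key_bits = key_bits
        # One slot per possible key value, whether or not it is ever used.
        self.slots: List[Optional[int]] = [None] * direct_index_entries(key_bits)

    def insert(self, key: int, db_index: int) -> None:
        self.slots[key] = db_index

    def lookup(self, key: int) -> Optional[int]:
        # A single memory access returns the database index (or None).
        return self.slots[key]


table = DirectIndexTable(key_bits=16)
table.insert(0xBEEF, 7)
print(table.lookup(0xBEEF), table.lookup(0x0001))
print(direct_index_entries(25))  # locations needed for a 25-bit key
```

The lookup is a single array access, but the array is allocated for every possible key value, which is the inefficiency the text describes.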
The hashing algorithm provides a fast method of indexing large tables with much wider keys, up to a few hundred bits. Hashing compresses the wide search key to a narrow key of approximately 20 bits, after which the narrow key is used in the direct indexing method. Various methods are available for hashing the search key; one approach is to use an encryption algorithm such as the Data Encryption Standard (DES) with a fixed key. It is possible that hashing a wide key will result in two search keys equating to the same bit sequence in the first 20 bits. The keys thus clash, generating the same index, and the algorithm must resolve the ambiguity. Typically, the resolution requires the algorithm to directly compare the search key to the clashing table entries. This added comparison slows down the hashing algorithm. The probability of a clash is often quite low, but the performance reduction may be significant in worst-case situations.
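The hashing scheme above can be sketched as follows. This is an illustrative model only: it substitutes BLAKE2b from Python's standard `hashlib` for the DES-based randomization mentioned in the text, resolves clashes by chaining entries in a bucket, and uses a hypothetical class name and a sparse dictionary in place of a real memory array.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


class HashedIndex:
    """Compress a wide key to index_bits, then compare full keys on a clash."""

    def __init__(self, index_bits: int = 20) -> None:
        self.index_bits = index_bits
        # Sparse stand-in for a 2^index_bits-entry table.
        self.buckets: Dict[int, List[Tuple[bytes, int]]] = {}

    def _narrow(self, key: bytes) -> int:
        # Randomize the key, then keep the first index_bits bits.
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return int.from_bytes(digest, "big") >> (64 - self.index_bits)

    def insert(self, key: bytes, db_index: int) -> None:
        bucket = self.buckets.setdefault(self._narrow(key), [])
        for i, (stored_key, _) in enumerate(bucket):
            if stored_key == key:
                bucket[i] = (key, db_index)  # overwrite an existing entry
                return
        bucket.append((key, db_index))

    def lookup(self, key: bytes) -> Tuple[Optional[int], int]:
        """Return (database index or None, number of key comparisons made)."""
        probes = 0
        for stored_key, db_index in self.buckets.get(self._narrow(key), []):
            probes += 1
            if stored_key == key:  # the full key must match, not just the hash
                return db_index, probes
        return None, probes


# A deliberately tiny index (4 bits) forces clashes so the extra probes show up.
index = HashedIndex(index_bits=4)
for n in range(100):
    index.insert(f"flow-{n}".encode(), n)
print(index.lookup(b"flow-42"))
```

The probe count returned by `lookup` is the variable part of the search latency: one comparison when there is no clash, more when several keys share the same narrow index.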

The hashing algorithm, however, is limited to an exact match, so every bit is significant. While this is suitable to flow caching, the L3 tables need to support longest prefix matching, and ACL needs to use wildcards. These databases need a different algorithm. The binary search retrieval algorithm (Trie) allows wildcards in the lower-order bits of a search key. The algorithm performs a bit-by-bit examination of an incoming search key, altering left and right pointers with each bit examined, as shown in Figure 3. When the algorithm reaches the first wildcard, or “don’t care” condition, the search stops. The pointer’s value now provides the table index. This result is a “best match” search. Pointers with shorter prefixes are identified during the search, like P4 in the figure, but the algorithm keeps track of the best match information (in this case P3) at every level.
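The bit-by-bit walk can be sketched in Python using the five prefixes listed with Figure 3 (P1 = 0110*, P2 = 01*, P3 = 101*, P4 = 1*, P5 = 1110*). The class names are hypothetical and bit strings are used instead of packed integers for readability; this is a sketch of the technique, not a line-rate implementation.

```python
from typing import List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: List[Optional["TrieNode"]] = [None, None]  # branch on 0 / 1
        self.prefix_id: Optional[str] = None  # set if a prefix ends at this node


class BinaryTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, prefix_bits: str, prefix_id: str) -> None:
        node = self.root
        for bit in prefix_bits:
            b = int(bit)
            if node.children[b] is None:
                node.children[b] = TrieNode()
            node = node.children[b]
        node.prefix_id = prefix_id

    def lookup(self, key_bits: str) -> Optional[str]:
        """Walk one bit per step, remembering the longest prefix seen so far."""
        node = self.root
        best = node.prefix_id
        for bit in key_bits:
            node = node.children[int(bit)]
            if node is None:
                break  # no deeper entry exists; the best match so far stands
            if node.prefix_id is not None:
                best = node.prefix_id
        return best


trie = BinaryTrie()
for bits, name in [("0110", "P1"), ("01", "P2"), ("101", "P3"), ("1", "P4"), ("1110", "P5")]:
    trie.insert(bits, name)

print(trie.lookup("10111"))  # P4 is seen first, then the longer match P3 replaces it
```

Each loop iteration corresponds to one memory access in a hardware or SRAM-based implementation, so the search cost grows with the number of key bits examined.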

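[Figure 3 diagram: a binary trie built from the prefixes P1 = 0110*, P2 = 01*, P3 = 101*, P4 = 1*, and P5 = 1110*. Each trie node holds a prefix pointer, a right pointer, and a left pointer, with branches labeled 0 and 1. Best-match lookup of 10111 returns P3.]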
Figure 3 – Binary retrieval (Trie) algorithm

These binary search tries are memory efficient, but slow. Each bit in the key triggers a memory access. Because most tables are too large for a CPU's on-board cache, these accesses have the additional drawback of being across an external bus. Parallel examination of multiple search keys can help speed the search, but this requires additional memory interfaces, which adds pin count and design complexity because of the need to interleave memory accesses. As a result, the algorithm is best suited to a small database with wide keys, such as the ACL.

Multiple-Level Trie

A variation of this algorithm, the multiple-level Trie, can increase performance and is popular for large databases with narrow keys, such as forwarding. This algorithm examines several bits of the search key (the "stride factor") at each level. The results of the first level point to one of several tables, or nodes, in the second level. The results of the second level point to one of the nodes for the third level, and so on. This approach requires a new node for each possible pointer in the previous level. If the stride factor is K0 bits at the first level, the algorithm wants 2^K0 nodes at the second level. Each pointer from each node at the second level wants a node at the third level, and so on. Theoretically, the number of nodes the algorithm calls for quickly becomes unmanageable. In practice, however, the number of nodes actually needed at each level will be no greater than the number of entries (N) in the database, so the memory requirements are significantly less than a pure geometric progression predicts. A calculation showing the memory requirements for a 3-level Trie structure with stride values of 16, 8, and 8 and 4096 table entries appears in Table 1. This is a common configuration for IPv4 forwarding.

Maintaining multi-level Trie nodes depends on the type of entry being made in the database.
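A prefix that ends partway through a stride has to be written into every location it covers in that level's node. The following Python sketch counts the locations touched per level for a given prefix length and stride structure. The function name is hypothetical, and the sketch assumes one pointer write in each level above the one where the prefix ends.

```python
from typing import List, Sequence


def locations_touched(prefix_len: int, strides: Sequence[int] = (16, 8, 8)) -> List[int]:
    """Locations written at each level when a prefix of prefix_len bits is added."""
    changes: List[int] = []
    boundary = 0  # number of key bits consumed through the current level
    for stride in strides:
        boundary += stride
        if prefix_len <= boundary:
            # The prefix ends in this level: expand it over the unspecified bits.
            changes.append(1 << (boundary - prefix_len))
            return changes
        # The prefix continues deeper: one pointer entry leads to the next level.
        changes.append(1)
    raise ValueError("prefix is longer than the total key width")


print(locations_touched(12))                     # 16-8-8 structure
print(locations_touched(28))                     # 16-8-8 structure
print(locations_touched(8, (16, 4, 4, 4, 4)))    # five-level structure
```

Under these assumptions an 8-bit prefix in a 16-4-4-4-4 structure expands to 256 first-level locations, which lines up with the 256-access figure quoted for L2 and flow cache updates.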
An IPv4 (32-bit) address masked to the first 12 bits, for instance, requires a change to only 16 locations in the first node and no changes to the other levels for the 16-8-8 structure of the example. Masked to 28 bits, the entry would require one change in the root node, one change in the relevant node at the next level, and 16 changes at the third level. Masking to fewer than 8 bits never occurs.

Other structures are possible for the multi-level Trie. A five-level 16-4-4-4-4 structure would need a total of 10.5 Mbits of memory for 4096 IPv4 entries, as opposed to the 69.2 Mbits needed for 16-8-8. The five-level structure also requires fewer maintenance entries. The tradeoff for the smaller memory requirement is a longer search path.

Multiple Field Searches

The Trie algorithms become complex when applied to the multiple field searches demanded by access control. In these searches, a wildcard may be located at several points in the search key, as opposed to being restricted to the lower-order bits as in the forwarding searches. The combination of wildcards and multiple search fields means the algorithm must perform a hierarchical search, so it searches for matches in the first field. Where wildcards exist, the search must then jump to start searching in the next field as well as continue searching the next bit in the first field. This branching continues through as many fields as needed. An example of a hierarchical search through a set of rules composed of two fields is shown in Figure 4.
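The branching described above can be sketched with the five two-field rules listed with Figure 4. In this Python illustration, every first-field (F1) node where a rule prefix ends carries its own second-field (F2) trie, and the search visits each such F2 trie along the F1 path. The names are hypothetical, and the sketch simply returns every matching rule in the order found, because the source does not state how the best match among them is chosen.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    def __init__(self) -> None:
        self.child: Dict[str, "Node"] = {}       # '0' / '1' -> next node
        self.rules: List[str] = []               # rules ending here (F2 tries only)
        self.next_field: Optional["Node"] = None  # root of the F2 trie (F1 tries only)


def _descend(root: Node, bits: str) -> Node:
    node = root
    for b in bits:
        node = node.child.setdefault(b, Node())
    return node


def build(rules: List[Tuple[str, str, str]]) -> Node:
    root = Node()
    for name, f1_prefix, f2_prefix in rules:
        f1_node = _descend(root, f1_prefix)
        if f1_node.next_field is None:
            f1_node.next_field = Node()
        _descend(f1_node.next_field, f2_prefix).rules.append(name)
    return root


def _search_f2(f2_root: Node, key2: str) -> List[str]:
    hits = list(f2_root.rules)
    node: Optional[Node] = f2_root
    for b in key2:
        node = node.child.get(b)
        if node is None:
            break
        hits.extend(node.rules)
    return hits


def search(root: Node, key1: str, key2: str) -> List[str]:
    """Walk F1 bit by bit; at every F1 prefix that ends, also search its F2 trie."""
    hits: List[str] = []
    node: Optional[Node] = root
    if node.next_field is not None:
        hits += _search_f2(node.next_field, key2)
    for b in key1:
        node = node.child.get(b)
        if node is None:
            break
        if node.next_field is not None:
            hits += _search_f2(node.next_field, key2)
    return hits


RULES = [("R1", "11", "10"), ("R2", "0", "01"), ("R3", "1", "0"),
         ("R4", "0", "011"), ("R5", "11", "00")]
root = build(RULES)
print(search(root, "110", "000"))
```

Every F2 trie visited adds its own chain of memory accesses, which is why the worst-case cost multiplies across fields rather than adding.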
Rule   F1    F2
R1     11*   10*
R2     0*    01*
R3     1*    0*
R4     0*    011*
R5     11*   00*

[Figure 4 diagram: a hierarchical trie that branches first on field F1 and then on field F2 for the rules above, with the lookup of the key (110, 000) traced through both fields.]
Figure 4 - Hierarchical search using multiple fields

Because a wildcard may exist at any point, if the first field has W1 bits, the second field has W2 bits, and so on, the maximum number of searches needed can be as great as W1 x W2 x W3 ..., which can easily run into millions of memory accesses per search. Alternative algorithms are available to reduce search time, but they require proportionally larger memory.

There are more efficient multiple-field search algorithms in the research stage. These heuristic techniques typically reduce the problem by limiting search key configurations, which requires a thorough understanding of the rules being implemented, and they often use pre-calculated tables. This limits their usefulness to static tables. Next-generation equipment will typically require dynamic table management.

All these direct and Trie search algorithms depend on random-access memory for their implementation. An alternative is to use an architecture that employs ternary content-addressable memory (TCAM), such as a network search engine (NSE). TCAM-based NSEs are specialty memory devices that compare the search key, with wildcards, against all entries in the memory simultaneously. The NSE returns the address of the first matching entry, which serves as the index into the database. These TCAM-based searches can be used for all three search types used in line cards: exact match, longest prefix (best) match and multiple field match.

Single-cycle Search with TCAM-based NSEs

The TCAM-based NSE is able to search its entire memory array in a single clock cycle by incorporating a comparison function into a standard RAM cell. The bit lines of the RAM cell work normally when writing data into memory. When making a comparison, however, they hold the pattern to be matched. Additional transistors in the TCAM form an exclusive-NOR of the RAM cell value and the bit lines, and the output of this logic ties to a "match" line for that address. Additional transistors allow a wildcard value to force a match on the corresponding bit. All bits at a given address tie to the same match line, so any mismatch discharges the line. Only if the stored value at an address matches the comparison value will the match line remain high.

Each comparison in the NSE requires the pre-charging of the match line, so the devices can exhibit higher power consumption than a conventional SRAM. Advanced TCAM-based NSE designs, however, include segmented power management features, such as the IDT Dynamic Database Management feature. If an NSE holds several databases, the power management can pre-charge only those lines associated with an active database.

The NSE devices not only perform a search in a single clock cycle, they can also be updated in a single clock cycle. This makes them much faster to update than SRAM or RLDRAM implementing a Trie algorithm. TCAM-based NSEs that can perform 250 million searches per second (MSPS) are available from vendors such as IDT, making searches and updates at network line rates possible. If performance were the only issue, TCAM-based devices would be the memory of choice for all search operations. Cost, power, and density are all important design factors, however, and must be traded off against speed.
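The match-line behavior can be modeled in software as a value and a care mask per entry. The Python sketch below is a behavioral model only, with hypothetical names: the hardware compares all entries in parallel in one cycle, whereas this loop visits them in address order to reproduce the "first matching entry" result. The entries reuse the prefixes from Figure 3, padded to five bits and written longest prefix first, which is an assumption about table ordering.

```python
from typing import List, Optional, Tuple


class TcamModel:
    """Behavioral model: each entry is (value, care mask); '*' bits are don't-care."""

    def __init__(self, width: int) -> None:
        self.width = width
        self.entries: List[Tuple[int, int]] = []

    def write(self, pattern: str) -> int:
        """Store a pattern of '0', '1', '*' characters; return its address."""
        if len(pattern) != self.width or set(pattern) - set("01*"):
            raise ValueError("pattern must be exactly width characters of 0, 1 or *")
        value = int(pattern.replace("*", "0"), 2)
        care = int("".join("0" if c == "*" else "1" for c in pattern), 2)
        self.entries.append((value, care))
        return len(self.entries) - 1

    def search(self, key: int) -> Optional[int]:
        """Return the lowest address whose cared-for bits all equal the key."""
        for address, (value, care) in enumerate(self.entries):
            # XOR marks differing bits; masked-out (wildcard) bits cannot mismatch.
            if ((key ^ value) & care) == 0:
                return address
        return None


tcam = TcamModel(width=5)
# Longest prefixes at the lowest addresses so the first hit is the longest match.
for pattern in ["0110*", "1110*", "101**", "01***", "1****"]:
    tcam.write(pattern)

print(tcam.search(0b10111))  # address of 101**, the longest matching prefix
```

With a care mask of all ones the same structure performs an exact match, and with wildcards placed anywhere in the pattern it covers the multiple-field case.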
The exact tradeoff depends on the size and performance requirements of the search, which requires a system-level comparison of the algorithms and architectures.

To perform their primary function, packet forwarding, edge routers need to efficiently search a variety of databases. The performance, services, and features of these applications will determine the structure of those databases, and how they must be searched to provide the performance required by the end user. These characteristics will ultimately determine the type of search algorithm and memory architecture required to implement the database search function. Finding the right match requires careful deliberation of the size, speed, and type of search in order to implement the appropriate architecture.

We have reviewed the various development issues and decisions that need to be considered when incorporating diverse memory architectures into your design. Now we will consider examples of the types of applications where these technologies can be applied.

Comparing Raw Performance

Each of the types of tables explored below can be implemented using SRAM, RLDRAM or a network search engine (NSE). The search performance will be directly impacted by the raw performance of the technology. SRAM is becoming available that can operate at 333 MHz with 6 ns latency. RLDRAM will have comparable cycle times, but a latency of 20 ns. Thus, any algorithm that requires iterative random accesses will suffer from the latency of RLDRAM. SRAM-like speeds can be achieved by using four or eight banks of RLDRAM, at the expense of four to eight times the memory because the data is duplicated. With this technique, RLDRAM and SRAM can be used interchangeably. TCAM-based technology used in an NSE can cycle at 250 MHz for match rates of 125 million searches per second (MSPS) without the need to duplicate data.

Memory Design for Searching Databases


Three types of databases are of special interest in edge router design: the flow cache, the L2 and L3 tables, and the access control lists (ACL). Each has to be searched repeatedly as part of the packet forwarding, and each must be searched differently. In the case of the flow cache, the edge router searches to see if it has already processed a header. If it finds an exact match, it can reuse the previous results and save itself some effort. The router uses the L2/L3 tables to determine a packet's next hop. Since the router may not need to match the entire header, the database search needs to provide a longest-prefix match (LPM). The access control lists (ACLs) help the router implement rules to govern packet handling. The rules in these lists can broadly or precisely describe the packets to which they apply, so the router needs a best-match algorithm with wildcards in order to match a packet with all its governing rules.

There are several algorithms that can be implemented in random access memory, either SRAM or RLDRAM, which perform these different types of matching. Direct-index and hashing algorithms provide an exact match. Binary and multi-level Trie algorithms offer LPM, and hierarchical Trie and heuristic algorithms offer best match with wildcards. The NSE, a specialty device based on ternary content addressable memory (TCAM) technology, has matching logic built in. Binary CAMs perform straight comparisons between the key and the memory array, whereas ternary CAMs (TCAMs) allow the use of wildcards for each entry to mask bits during the comparison.

With so many possibilities, selecting a database's optimum algorithm and architecture requires a system-level analysis of performance, cost and power for each option. The simplest case to consider is the flow-caching database for Layer-2 Media Access Control (MAC), which requires an exact match.
Direct-index, hashing, and binary Trie algorithms are all suitable for direct matching operations, as are binary and ternary CAMs.

Exact Matches for Flow Caches and MAC Addresses

Direct-index searching simply takes the W-bit search key and uses it to address a memory array that contains an index to the proper entry in the database. This requires only one memory access cycle per search, but requires a memory array of 2^W entries. If the array has a bit-width of Z (typically 32-36 bits), then direct-index searching requires 6*Z*2^W transistors, assuming a 6-transistor SRAM memory cell. This may be feasible for smaller search keys, but becomes infeasible for wider keys that reach 104-125 bits.

The hashing algorithm compresses a larger key to a suitable size for indexing a table. This is often accomplished by randomizing the search key and then selecting the first M bits to use as the lookup index into a table of N entries. Because the randomization and selection process can cause two different keys to have the same hashing value, resulting in a search collision, the array must store the original key so that the collisions can be resolved. Each time an index is made into the table, the key is compared to the stored value. If a mismatch occurs, then a secondary entry is explored for a possible match. Sometimes a third and fourth potential key must be searched. Thus, the hashing approach has variable latency.

The minimum memory size for a hashing database is thus 6*N*(W+M) transistors, much less than for direct indexing. In order to be able to resolve collisions, however, the database needs more memory to track all the keys that have the same hashing result. As a rule of thumb, the memory array should be twice the minimum size. The number of memory cycles needed for a hashing search is at least equal to the time required to calculate the hashing value, plus one (for the direct-index search).
Any need to resolve collisions, however, will add memory cycles.

The CAM-based approach combines the memory efficiency of hashing with the simplicity of direct indexing. In the CAM-based approach, all memory contents are compared to the search key simultaneously, so the number of memory accesses per search is the same as for direct indexing: one. In addition, performance is high and latency is fixed, unlike the hashing algorithm. On the downside, there are more transistors per cell in a CAM-based device than in standard SRAM, so the final multiplication factor is greater than 6. Binary CAMs have 10 transistors per cell and ternary CAMs have 16 transistors per cell. Because these databases need an exact match, the binary CAM is the better choice and the value 10 is used. In a cost comparison, SRAM and DRAM benefit from larger scales of production.

The practical result of this analysis is the ability to estimate the board density and performance of the implementations. The board density for each option can be inferred directly from the transistor count. The memory performance needed for the flow cache depends on the incoming data rate. Packets are a minimum of 40 bytes in length and only the header information is used in the flow cache search, so the rate at which searches must be conducted is much less than the line rate. Based on the number of memory cycles needed for each search, these options are all able to support data rates as great as OC-192 using available memory technology, although excessive collision activity will slow the hashing approach below this rate. So CAM is only used for exact match when the highest levels of performance in searching and updates are required.

Longest Prefix Matching

The analysis of the worst-case performance for exact match algorithms, along with other search algorithms, is summarized in Table 1. The derived values for arbitrary database configurations are valid for most network conditions. Memory requirements are given for SRAM, and the results should be multiplied by the replication factor (for example, 4 for RLDRAM solutions) for DRAM solutions running at the same speed. The algorithmic LPM matching needed for Layer-3 IPv4 and IPv6 forwarding, however, will vary considerably depending on the nature of the traffic being forwarded. The performance of the Trie and Heuristic algorithms used for LPM needs to be analyzed for a real database design.
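The exact-match relationships stated in the text can be collected into a small Python sketch. The direct-index and hashing formulas follow the text (6*Z*2^W and 6*N*(W+M), doubled by the rule of thumb). The binary CAM estimate of 10 transistors per stored key bit, with no allowance for associated data, is an assumption made here for illustration and is not a formula given in the source. All function names and the sample parameter values are hypothetical.

```python
SRAM_TRANSISTORS_PER_CELL = 6         # 6-transistor SRAM cell, per the text
BINARY_CAM_TRANSISTORS_PER_CELL = 10  # binary CAM cell, per the text


def direct_index_transistors(key_bits: int, entry_bits: int) -> int:
    """6 * Z * 2^W for a W-bit key and Z-bit table entries."""
    return SRAM_TRANSISTORS_PER_CELL * entry_bits * 2 ** key_bits


def hashing_transistors(entries: int, key_bits: int, index_bits: int,
                        overprovision: int = 2) -> int:
    """Minimum of 6 * N * (W + M), doubled by the rule of thumb for collisions."""
    minimum = SRAM_TRANSISTORS_PER_CELL * entries * (key_bits + index_bits)
    return overprovision * minimum


def binary_cam_transistors(entries: int, key_bits: int) -> int:
    """ASSUMPTION: only the W key bits of each of the N entries sit in CAM cells."""
    return BINARY_CAM_TRANSISTORS_PER_CELL * entries * key_bits


# Hypothetical sizing: 4096 entries, 104-bit key, 20-bit hash index, 32-bit entries.
print(f"direct index (20-bit key): {direct_index_transistors(20, 32):,}")
print(f"hashing (104-bit key):     {hashing_transistors(4096, 104, 20):,}")
print(f"binary CAM (104-bit key):  {binary_cam_transistors(4096, 104):,}")
```

Even a 20-bit direct index costs far more transistors than hashing a 104-bit key into the same number of entries, which is the point the text makes about wide keys.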

Table 1: Worst-Case Analysis for Search Algorithms

The LPM analysis uses a one million-entry Border Gateway Protocol (BGP) table from large routers such as Mae West and Telestra. The number of memory locations actually needed as a function of the number of IP addresses handled appears in Figure 1. Data appears for three configurations: a 16-4-4-4-4 stride multi-level Trie, an 8-8-8-8 multi-level Trie, and a TCAM-based approach. The Trie configurations use an SRAM-based design. An RLDRAM-based design would need four times as many memory locations due to the need to replicate the database in order to achieve speeds that match SRAM performance.



Figure 1: BGP table memory requirements

The example shows that the TCAM technology uses far fewer memory locations than the Trie implementations. To put the comparison on a transistor-level basis, we need to recall that a TCAM cell has the equivalent of two SRAM cells. Even adjusted for transistor count, however, the SRAM implementation uses more than ten times the transistor count of TCAM for smaller databases. TCAM technology also establishes a deterministic memory requirement for an LPM database, while Trie algorithms may require different memory sizes for different prefix distributions. System engineers usually allocate double the SRAM/RLDRAM memory size of the realistic case in Figure 1 to accommodate other distributions. This doubles the memory-efficiency gap between TCAM and SRAM/RLDRAM solutions.

An RLDRAM implementation uses one transistor per memory location instead of the six for SRAM. Factoring in the 4x replication needed for performance matching, the RLDRAM implementation appears to be comparable to the TCAM-based implementation in transistor count for large databases. But the 4x replication brings with it the need for four memory buses, which will consume the available memory bandwidth of network processors or ASICs handling the forwarding operation. Because of replication, RLDRAM solutions will be four times slower than equivalent SRAM solutions in updating the database. As a result, RLDRAM implementations choke at higher processing speeds.

The advantage of using a TCAM-based approach becomes more obvious when considering IPv6 addressing, which uses 128 bits. The TCAM memory size scales linearly with increasing key length. The Trie algorithm memory size, on the other hand, grows exponentially with increasing search key size because each possible result at one level needs a full lookup table at the next level.

The speed performance of multi-level Trie implementations depends on the number of levels used.
In the example above that uses IPv4 addresses (32-bit), the 16-4-4-4-4 implementation uses five levels. The algorithm requires one memory access per level, so the TCAM-based approach is five times faster in forwarding. If the memory subsystem is using a 250 MHz quad data rate bus, the Trie algorithm will achieve 50 MSPS, while the TCAM-based approach achieves 250 MSPS.

The performance difference increases when you consider the database's maintenance needs. The Trie algorithm keeps in its database the longest prefix match for a given search key. It must, however, maintain a prefix vector entry database that holds all entries with shorter matching prefixes that have "don't care" elements. If the system deletes the database entry, it must look at the prefix vector entries to see if a shorter match is available to take the deleted entry's place. For example, if, in a 16-bit stride factor design, an 8-bit prefix entry gets deleted, the system must make 2560 memory accesses to find a replacement. At a bus rate of 250 MHz, that allows only 97K updates/sec. Combine that bandwidth consumption with instruction overhead, and a network processor might not be able to maintain dynamic tables with 100K virtual route tables using a Trie approach. TCAM-based updates are as fast as the look-ups, so the update rate reaches 100 million updates/sec or higher.

Best Match ACL Tables

The ACL database has a wide key, as much as 269 bits, and up to 100K entries. The matching is much more complex than for LPM, because wildcard values can exist at multiple places within the table entry to represent ranges of network addresses and port numbers in both source and destination fields. Structures such as multi-level Tries can be used. However, due to the key widths and the potential number of range fields, the memory requirements can grow rather quickly. This forces the use of algorithms that are more memory efficient than the exponentially growing multi-level Trie. Algorithms like the Hierarchical Trie and Heuristic algorithms are more memory efficient. Unfortunately, they are not speed efficient.
Table 2 shows the number of memory cycles needed to perform a search using each of these algorithms as well as the TCAM-based approach. The performance calculations in the table are based on the number of memory accesses needed assuming a 300 MHz SRAM speed and a 250 MSPS NSE. The multi-level Trie, at 17.64 MSPS, is barely able to keep up with OC-48 speeds. The TCAM-based approach clearly outperforms the other alternatives.
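The relationship between memory accesses and search rate can be sketched as a simple division of the memory clock by the accesses per search. In the sketch below, the access count of 17 for the multi-level Trie is an assumption back-computed from the quoted 17.64 MSPS at 300 MHz; it is not read from Table 2, and the function name is hypothetical.

```python
SRAM_CLOCK_HZ = 300_000_000   # SRAM speed quoted in the text
NSE_RATE_MSPS = 250.0         # TCAM-based NSE search rate quoted in the text

# Assumption: 17 accesses per search, inferred from the 17.64 MSPS figure.
ASSUMED_TRIE_ACCESSES = 17


def searches_per_second_msps(clock_hz: int, accesses_per_search: int) -> float:
    """Search rate in MSPS if each memory access costs one clock cycle."""
    return clock_hz / accesses_per_search / 1e6


trie_msps = searches_per_second_msps(SRAM_CLOCK_HZ, ASSUMED_TRIE_ACCESSES)
print(f"Multi-level Trie: {trie_msps:.2f} MSPS")
print(f"TCAM advantage:   {NSE_RATE_MSPS / trie_msps:.1f}x")
```

Under that assumption the model gives a rate of about 17.65 MSPS, which matches the quoted 17.64 MSPS up to rounding, and shows the TCAM-based NSE running roughly an order of magnitude faster.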

Table 2: Memory Access Requirements of ACL Search Algorithms

The memory size needed for an ACL database depends on the number of rules it implements. Typical sizes range from 4K to 32K rules. Figure 2 shows the number of transistors needed for the three alternatives. The Heuristic approach looks promising for smaller rule sets, but it is inapplicable to larger ones. Heuristic algorithms are based on analysis of the rules in order to find patterns that allow the designer to structure the search mechanism. With large numbers of rules, the analysis becomes unworkable. Additionally, studies have shown that the time to compute new entries for larger rule sets grows geometrically.



Figure 2: Memory requirements for ACL search algorithms

System-level Analysis

All of the analysis so far considers each type of database independently. In practice, however, edge routers use a common memory to hold all the databases simultaneously. Edge routers aggregate multiple physical links, resulting in line cards with aggregate rates from 2.5 to 10 Gbps. The final analysis must compare the performance of RAM- and CAM-based designs under typical conditions. Table 3 shows the search key width requirements and the depth (number of rules) for all the databases, assuming a representative high-end scenario with traffic that is a mix of IPv4 and IPv6 packets.

Table 3: Database Requirements for System Analysis

The forwarding and flow-cache operations determine the traffic throughput the rest of the router must achieve. The greater of the IPv6 forwarding or the IPv4 forwarding with flow-cache acceleration thus sets the speed requirements for the ACL database. Using the relationships given in Table 1, we can model the total memory storage and access cycles needed to support the router’s overall performance. The analysis for the SRAM/RLDRAM approach, shown in Figure 3, chooses the highest-performing algorithm for each search type. For Trie algorithms, a 16-4-4-4-4 stride is assumed. The number of table entries and the key width are typical for each database.



Figure 3: SRAM configuration stride factor 4 (16-4-4-4-4)

The performance assessments use both 250 MHz and 333 MHz memory bus rates and are color coded to indicate their success. The green area indicates performance that is adequate for the given line rate with a 250 MHz bus, while yellow shows adequate performance at 333 MHz. The red areas indicate inadequate performance; that is, these configurations will not be able to keep pace with the given line rate. Along with the performance analysis, Figure 3 shows the number of memory cells needed for the implementation and the number of 36-Mbit devices needed to provide this amount of memory.

The analysis shows that SRAM-based designs are incapable of handling IPv6 forwarding and access control at line rates above 2.5 Gbps. They are adequate at 2.5 Gbps, but only if bus loading is not a problem. Connecting eight memory devices in parallel on the network processor’s bus may slow it down well below the anticipated 333 MHz data rate. For RLDRAM solutions, you may not have a bus-loading problem, as each device has a dedicated memory bus. In this solution, you need only four 256-Mbit RLDRAM devices, each on a separate bus. One benefit of this design is that it is not memory intensive. The memory devices running at the 333 MHz bus rate consume approximately 470 mW each, so this search database requires about 2 W of total power. There are eight devices, however, which require a considerable amount of board space.

TCAM-based NSE Architecture Outperforms

An analysis for the same design parameters, but using a TCAM-based NSE design instead of SRAM/RLDRAM, is shown in Figure 4. The TCAM-based NSE devices easily handle all the required operations, even at the 10 Gbps data rate. Bus loading is not a problem because the devices connect in series, not in parallel. The design is also compact, with the databases fitting into two 512Kx36 (18-Mbit) TCAM-based NSEs.
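The power and density figures quoted in this comparison can be reproduced with simple arithmetic. The sketch below is illustrative only: it reads the TCAM organization as 512K words of 36 bits (an assumption consistent with the stated 18-Mbit density), treats a megabit as 2^20 bits, and uses hypothetical helper names.

```python
MBIT = 2 ** 20  # assumption: binary megabit


def total_power_watts(device_count: int, milliwatts_each: float) -> float:
    """Total power in watts for identical devices."""
    return device_count * milliwatts_each / 1000


def capacity_mbit(words: int, bits_per_word: int) -> float:
    """Device capacity in Mbit for a words x bits organization."""
    return words * bits_per_word / MBIT


# Four RLDRAM devices at roughly 470 mW each, as quoted in the text.
rldram_power = total_power_watts(4, 470)
# One NSE organized as 512K x 36 (assumed reading of the part size).
nse_capacity = capacity_mbit(512 * 1024, 36)

print(f"RLDRAM search database power: {rldram_power:.2f} W")
print(f"Capacity per NSE: {nse_capacity:.0f} Mbit")
print(f"Capacity of two NSEs: {2 * nse_capacity:.0f} Mbit")
```

Four devices at 470 mW come to 1.88 W, in line with the "about 2 W" figure, and each 512K x 36 device works out to 18 Mbit.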

Figure 4: Diagram of a QDR TCAM configuration

On the surface, one drawback to the TCAM-based design is its power demand. If the application searches the entire memory array, these devices can require a significant portion of a system’s overall power budget. However, a database selection feature built into the IDT NSEs can help moderate the power consumption. The selection feature, known as dynamic database management, allows the system to shut down the sections of the chip that do not contain the active database. This selective power-down reduces average power needs by as much as 70 percent or more, bringing the TCAM-based NSE power requirements to an easily managed level. For larger forwarding tables, additional power consumption can be controlled by searching a subset of the table, using a few of the most significant bits to select the portion of the database to be searched.

The analysis indicates that SRAM/RLDRAM-based designs are adequate for modest stand-alone performance levels and for tables like flow and IPv4, but can quickly become unmanageable as search keys widen for IPv6 or if more complex matching of ACL tables is needed. At higher performance levels, SRAM-based designs simply cannot meet the requirements. When you consider applications with multiple databases, TCAM-based designs offer compact, high-performance designs with plenty of room for growth.

