Bridges, Switches, Routers
A VLAN (virtual local area network) logically partitions the equipment on a LAN into separate segments, realizing virtual workgroups on top of the physical data-exchange infrastructure. The technology is implemented mainly in switches (and in routers), but not every switch supports it: only switches that implement the VLAN protocol have this feature, which the switch manual will indicate.
Chapter 1  Bridges, Switches, Routers

1.1 Introduction

- Packet vs circuit (and virtual circuit) switching
- Network: mesh interconnection of links and switches
  - LANs (multiaccess, broadcast or shared-medium Ethernet: 10BT to 1000BT, Cat 3 UTP)
  - WANs: switches connected by point-to-point links
- Packet processors: bridges, routers, ATM switches

Figure 1.1: Packet processor functions may involve the data path or the control path. [Diagram: routing, congestion control, and reservation control sit on the control path; switching, policing, and scheduling sit on the per-packet data path.]

1.2 Packet processor functions

Routing: creating and distributing the information that defines paths between source and destination, and determining the best path.
Switching: per-packet forwarding decisions, and sending the packet towards its destination.
Other functions: congestion control, reservations, policing, scheduling.
Control functions are performed infrequently; datapath functions are performed per packet.

1.3 Transparent bridging (IEEE 802.1D)

Figure 1.2: Bridged extended LAN and the corresponding graph. [Diagram: bridges B1-B4 interconnect LAN segments L1-L5 with link costs 10 and 20; R marks the root port of a bridge, D the designated port for a LAN.] The bridge forwards frames along the spanning tree, according to the FDB. The spanning tree protocol (STP):
1. Determine the root bridge, and set its ports in forwarding mode.
2. Each bridge determines its root port, and sets it in forwarding mode.
3. Bridges determine the designated port for each LAN segment.
4. All other ports are put in the blocked state.

Ethernet LANs broadcast each packet to every device on the LAN, so the throughput per host decreases with the number of hosts connected to the LAN. See Problem 1. Transparent bridging prevents this by interconnecting LAN segments (collision domains) and forwarding unicast packets according to a filtering database (FDB). Broadcast, multicast, and unknown-unicast packets are flooded to all LANs.
So all segments form a single broadcast domain. A bridge has two or more ports. Packets from incoming ports are forwarded to outgoing ports along a spanning tree to prevent loops, according to the FDB. See Figure 1.2.

- Spanning tree algorithm: elect one root, then take shortest paths to the root.
- Learning process: builds the FDB by relating each MAC source address to its incoming port, and removing unrefreshed entries.

Bridges exchange configuration messages to establish the topology, and topology-change messages to indicate that the STA should be rerun. With a fixed number of bridge ports, throughput per LAN segment decreases with the number of segments in an extended LAN. See Problem 2.

1.4 LAN switches (IEEE 802.1Q)

Figure 1.3: LAN vs VLAN topology. Figure 1.4: VLAN tags.

A LAN switch is a bridge with as many ports as there are LAN segments, and with enough capacity to handle the traffic on all segments. Problem 2 is addressed through VLANs. A virtual LAN (VLAN) is a collection of LAN segments and attached devices with the properties of an independent LAN. Each VLAN is a separate broadcast domain: traffic on one VLAN is restricted from reaching another VLAN; traffic between VLANs goes through a router. VLAN tags carrying a VID (4-byte tags) are added to MAC frames so switches can forward packets to ports with the same VID. The FDB is augmented to include, for each VID, the ports (the member set) through which members of that VLAN can be reached.

The member set is derived from VLAN registration information: (i) explicitly, by management action, or (ii) by the GARP VLAN registration protocol (GVRP); GARP is the generic attribute registration protocol.

Multicast filtering. A VLAN is a single broadcast domain. If multicast messages are broadcast, throughput is limited by the slowest link: a switch with 124 10-Mbps ports has a capacity of 1.24 Gbps, but each port can carry at most 6 multicast video channels of 1.5 Mbps each.
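The learning and forwarding steps above can be sketched in a few lines. This is a minimal model, not a real bridge: port numbers and MAC strings are hypothetical, and FDB aging is omitted.

```python
# Minimal sketch of transparent-bridge learning and forwarding (hypothetical
# port numbers and MAC strings; a real FDB also ages out unrefreshed entries).
class Bridge:
    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}  # MAC address -> port on which it was last seen

    def receive(self, frame_src, frame_dst, in_port):
        self.fdb[frame_src] = in_port          # learning: relate SA to incoming port
        out = self.fdb.get(frame_dst)
        if out is None or out == in_port:      # unknown unicast (or local): flood
            return sorted(self.ports - {in_port})
        return [out]                           # known unicast: forward on one port

b = Bridge(ports=[1, 2, 3])
print(b.receive("aa", "bb", in_port=1))  # bb unknown: flood to [2, 3]
print(b.receive("bb", "aa", in_port=2))  # aa was learned on port 1: [1]
```

Note how the second frame is no longer flooded: the first frame's source address already populated the FDB.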
GARP Multicast Registration Protocol (GMRP, IEEE 802.1P) allows switches to limit multicast traffic along the spanning tree. (Compare IGMP.)

- JOIN: a host sends this message to express interest in joining a multicast group. The switch adds the port to the multicast group and forwards the multicast source to these ports. JOIN messages are resent once every JOINTIME timeout.
- LEAVE: sent by a host. The switch removes the port from the multicast group unless another host on that port sends a JOIN message before the LEAVETIME timeout.
- LEAVEALL: periodically sent by the switch.

When a host sends IP data to a multicast (class D) IP address, the host inserts the low-order 23 bits of the IP address into the low-order 23 bits of the multicast MAC address, so a NIC that is not part of the group ignores the data.

Figure 1.5: The IP header provides precedence and type-of-service fields.

Quality of service. The 3-bit precedence field allows 8 priority levels. The ToS bits are D (minimize delay), T (maximize throughput), R (maximize reliability), C (minimize cost). 802.1 provides no support for priority; 802.1P provides in-band QoS signalling with 8 CoS levels. A conforming bridge or switch maintains 8 queues. (VLAN tags may also carry priority information.)
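The 23-bit mapping from class D IP addresses to multicast MAC addresses can be computed directly; the multicast MAC prefix is 01:00:5e. Since only 23 of the 28 variable IP bits are copied, 32 IP groups share each MAC address.

```python
# Map a class D IP address to its multicast MAC by copying the low-order
# 23 bits of the IP into the low-order 23 bits of 01:00:5e:00:00:00.
import ipaddress

def multicast_mac(group_ip: str) -> str:
    addr = ipaddress.IPv4Address(group_ip)
    if not addr.is_multicast:
        raise ValueError("not a class D address")
    low23 = int(addr) & 0x7FFFFF
    mac = 0x01005E000000 | low23
    return ":".join(f"{(mac >> s) & 0xFF:02x}" for s in range(40, -8, -8))

print(multicast_mac("224.1.1.1"))    # 01:00:5e:01:01:01
print(multicast_mac("239.129.1.1"))  # also 01:00:5e:01:01:01: 32 groups per MAC
```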
Hint: try the Jackson network model of section 3.3.

3. Discuss the differences between STP and OSPF in terms of throughput or efficiency in link utilization.

Chapter 2  Processor architecture

2.1 Datapaths

When a packet arrives at a bridge:
- the DA is searched in the forwarding table (DA to output ports); if not found, the packet is broadcast to all output ports;
- if found, it is forwarded across the switching fabric to the appropriate output port (or ports, for multicast);
- the SA is learned and added to the forwarding table;
- during transfer to the fabric the packet may be stored, or dropped if storage is full;
- the packet is stored in the output port queue (usually FIFO) and eventually transmitted.

When a packet arrives at a router:
- the DA is searched in the forwarding table; if not found, the packet is dropped;
- if found, the next-hop MAC address is prepended, the TTL is decremented, a new header checksum is calculated, and the packet is forwarded across the switching fabric to the output port or ports;
- during transfer to the fabric the packet may be stored: if storage is full, this (or another) packet may be dropped;
- the packet is stored in the output queue (FIFO or more complex) and eventually transmitted.

When a cell arrives at an ATM switch:
- its VCI is searched in the forwarding table (the VC translation table: (VCI in, port in) to (VCI out, port out)); if not found, the cell is dropped;
- if the VCI is policed, the policing function determines whether the cell is conformant: if not, it may be dropped; if yes, the cell is forwarded across the switching fabric to the output port;
- during transfer, the cell may be stored: if storage is full, this or another cell may be dropped;
- the cell is stored in the output queue and eventually transmitted. The service discipline may be FIFO or very elaborate.
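The per-hop router work (decrement TTL, recompute the header checksum) can be made concrete. This sketch builds a minimal 20-byte IPv4 header by hand and applies the standard ones'-complement checksum; the addresses are arbitrary.

```python
# Per-hop header update a router performs: decrement TTL and recompute the
# IPv4 header checksum (ones'-complement sum over 16-bit words).
import struct

def checksum(header: bytes) -> int:
    s = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)   # fold carries
    return ~s & 0xFFFF

def forward_update(header: bytes) -> bytearray:
    header = bytearray(header)
    header[8] -= 1                      # TTL lives at byte offset 8
    header[10:12] = b"\x00\x00"         # zero the checksum field
    struct.pack_into("!H", header, 10, checksum(bytes(header)))
    return header

# Minimal header: version/IHL, TOS, length, id, flags/frag, TTL=64, proto, cksum, src, dst
hdr = bytearray(struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 64, 6, 0,
                            bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])))
struct.pack_into("!H", hdr, 10, checksum(bytes(hdr)))
out = forward_update(hdr)
print(out[8], checksum(bytes(out)))   # 63 0: TTL decremented, checksum verifies
```

A header with a correct checksum sums (with the checksum field included) to 0xFFFF, so verifying reduces to checking that `checksum` returns 0.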
Figure 2.1: Basic packet processor architectures. [Diagram: (A) a single CPU and packet memory on a shared bus; (B) several CPUs, each with memory, on a shared bus; (C) a CPU and memory per line card, on a shared bus; (D) a CPU and memory per line card, interconnected by a crossbar.]

- Throughput in A is limited by CPU speed;
- in B, there is a choice of which CPU to forward a packet to;
- in C, the packet travels the bus only once, so throughput is limited by bus speed;
- in D, several packets can be forwarded in parallel through the crossbar.

General-purpose CPUs are not well suited for applications in which packets flow through. CPUs are better when the same data are examined several times, making use of the cache.

Figure 2.2: Elaboration of datapath functions: forwarding decision, switching fabric, policing, scheduling.

2.2 Performance

The packet delay through a switch fabric consists of the time (1) for the forwarding decision, and (2) to transfer the packet across the switch. The packet delay through a processor consists of the time (1) for the policing decision, (2) for the forwarding decision, (3) to transfer across the switch, and (4) for the output scheduling decision.

Figure 2.3: Delay of switch and packet processor. [Diagram: forwarding decision time, switch transfer time, and output scheduling decision time versus packet arrival rate; the minimum back-to-back arrival time of header/packet sets the time budget.]

2.3 Forwarding decision

Criteria: (1) the speed of address lookups, which depends on the number of memory references; (2) the size of the memory.

ATM switches perform direct lookup (Figure 2.4). The VCI address space is 2^24 = 16M. Most switches contain far fewer entries, since it is the downstream switch that chooses a VCI that fits in the supported address space (PNNI).
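A VC translation table is just a direct-lookup structure; this toy version (hypothetical VCI and port values) shows the translate-or-drop decision. A real switch indexes a DRAM by VCI rather than using a hash map.

```python
# Sketch of an ATM VC translation table: (port_in, vci_in) -> (port_out, vci_out)
vc_table = {
    (1, 42): (3, 17),
    (2, 42): (3, 99),   # VCIs are only meaningful per input port
}

def forward_cell(port_in, vci_in):
    entry = vc_table.get((port_in, vci_in))
    if entry is None:
        return None      # unknown VCI: drop the cell
    return entry         # (port_out, vci_out): rewrite the VCI and forward

print(forward_cell(1, 42))  # (3, 17)
print(forward_cell(1, 7))   # None -> drop
```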
For multicast, the lookup returns a list of output ports, each with a different VCI.

Figure 2.4: ATM switches perform direct lookup. [Diagram: the VCI is used as a DRAM address; the data read out are (port, new VCI).]

Figure 2.5: CAM, or content-addressable memory. The 48-bit MAC address is presented. A successful parallel search asserts the "hit" signal and returns a pointer (log2 N bits, for a size-N memory) to the entry where the forwarding information for the MAC address is stored.

Bridge. The address space is 2^48, so direct lookup is not possible. Three indirect lookup techniques:

Associative memory (Figure 2.5). A typical CAM size is N = 1024 entries, which is not suitable for large LANs that must support tens of thousands of entries.

Figure 2.6: A 48-bit address is presented and the hashing function returns a pointer to one of N linked lists in DRAM. The search through a linked list takes a random time proportional to the length of the list.

Hashing. For large LANs hashing is an option. Suppose the LAN has M hosts. A hashing function h maps a host's 48-bit address into a forwarding table with, say, N entries, as in Figure 2.6. Two addresses x and y may collide: h(x) = h(y). The table entry points to a linked list of (MAC address, forwarding data) pairs for the MAC addresses that map into that entry. The list must be searched sequentially to locate the MAC address; the duration of the search is proportional to the length of the list.

Suppose h maps the M MAC addresses x_1, ..., x_M into the N linked lists 1, ..., N. Assume that h(x_1), ..., h(x_M) are independent and uniformly distributed over {1, ..., N}. The length of the i-th list is the random number

    n_i = sum_{j=1}^{M} 1{h(x_j) = i}.    (2.1)

Let alpha = M/N. If alpha is small (the number of lists larger than the number of addresses present), the lists will usually have 0 or 1 element. Problem 3 asks for the distribution of n_i.
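An empirical check of (2.1): hash M uniformly distributed addresses into N lists and look at the list lengths. The mean list length is exactly M/N = alpha, and with alpha small almost every list has 0 or 1 entries. (The parameters here are illustrative.)

```python
# Simulate the hash-table model of (2.1): M addresses, N linked lists.
import random

def list_lengths(M, N, seed=0):
    rng = random.Random(seed)
    lengths = [0] * N
    for _ in range(M):
        lengths[rng.randrange(N)] += 1   # h(x_j) uniform on the N lists
    return lengths

lengths = list_lengths(M=1000, N=16384)   # alpha = M/N ~ 0.06
print(sum(lengths) / len(lengths))        # exactly M/N = alpha
print(max(lengths))                       # occasionally 2 or 3, rarely more
```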
For N = M (alpha = 1), the mean length of a list is about 0.5(1 + alpha). However, the n_i being random, there is a chance that some lists (and the corresponding search times) may be very large. For real-time applications, you may store forwarding tables in such a way (e.g., as trees) that retrieval has a deterministic bound.

Figure 2.7: Forwarding table with CIDR.

    Prefix                Outgoing port
    184.108.40.206/16     1
    220.127.116.11/24     7
    18.104.22.168/32      3

IP routers. With CIDR, router forwarding-table entries are identified by a pair (route prefix, prefix length), with prefix length between 0 and 32 bits. See Figure 2.7; the first entry is a 16-bit prefix. The forwarding decision must find the longest prefix match between the packet's destination IP address and the prefixes in the forwarding table. CIDR reduces the table size, but the forwarding decision is more complex. With declining memory cost, it may be more economical to expand the prefixes and use simpler, exact-matching algorithms.

Caching. The forwarding-decision delay can be reduced by caching. The idea is that the IP destination addresses of successive packets are correlated. The cache stores the full source and destination IP addresses and the corresponding forwarding decision (including perhaps the entire replacement IP header). When a packet arrives, the SA and DA are used to do a full match in the local cache. If the addresses are not there, the packet is forwarded to a central routing processor. A cache replacement rule is needed when there is a cache miss. The improvement in delay depends on (1) the ratio of the cache size to the size of the forwarding table, and (2) temporal locality. The latter is likely to be higher in a campus router than in an edge router, and higher there than in a core router. See Problem 4.

Multicast. Some routers support multicast.
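Longest-prefix match can be illustrated with a linear scan over a small table (the prefixes here are illustrative; the addresses in Figure 2.7 are placeholders, and real routers use tries or TCAMs rather than a scan).

```python
# Minimal longest-prefix-match lookup over a CIDR table.
import ipaddress

table = [
    (ipaddress.ip_network("10.0.0.0/8"), 1),
    (ipaddress.ip_network("10.1.0.0/16"), 7),
    (ipaddress.ip_network("10.1.2.0/24"), 3),
]

def lookup(dst: str):
    best = None
    for net, port in table:
        if ipaddress.ip_address(dst) in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, port)      # keep the longest matching prefix
    return None if best is None else best[1]

print(lookup("10.1.2.9"))   # 3: /8, /16 and /24 all match; /24 is longest
print(lookup("10.9.9.9"))   # 1: only the /8 matches
print(lookup("192.0.2.1"))  # None: no match -> drop
```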
The simplest rule is RPF (reverse-path forwarding): if a multicast packet arrives on port P from source S, look up S in the forwarding table. If P is the best port to reach S, forward the packet on all ports except P; otherwise drop it.

Switching fabrics. We need some queuing models (Chapter 3).

2.4 Problems

1. For a commercial LAN switch, find the various times in Figure 2.3. Also give the throughput. See, for example, www.bcr.com/bcrmag/08/98p25.htm

2. If the forwarding decision, switch transfer, and output scheduling can be pipelined, what is the throughput of the processor?

3. Find the (marginal) distribution of the n_i given in (2.1), and calculate the mean length of a list. Show that for alpha = 1 the mean is approximately 0.5(1 + alpha). Find the joint distribution p(n_1, ..., n_N). Verify that it has the product form

    p(n_1, ..., n_N) = [prod_{i=1}^{N} p(n_i)] / Z,

where, since n_1 + ... + n_N = M, the denominator Z is the normalizing constant. Take M = N = 2^16. Find the probability that n_i >= 1000. Suppose a memory access takes 100 ns and alpha = 1. Consider back-to-back Ethernet packets. What is the average throughput of this switch, using the model of Figure 2.3 and ignoring the output scheduling decision delay?

4. The packets arriving at a line card belong to several multiplexed TCP connections.
(a) Formulate a model of packet arrivals with, say, M simultaneous connections, in which connections last a random amount of time with a geometric distribution and mean T.
(b) Suppose the size of the cache is N. If there is a cache miss, an existing entry is replaced by the missing entry. How would you calculate the hit ratio as a function of M, N, T?
(c) Suppose you are given a 'typical' trace of the addresses of packet arrivals, but no model of the arrival process. You want to know how big a cache you would need so that the hit ratio is a certain value, say 0.9. What would you do?
(d) The time to search the cache is T_c, the time to search the central forwarding table is T_f, and the hit ratio is h.
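The RPF rule reduces to one table lookup and a comparison; this sketch uses a hypothetical unicast table mapping sources to the best port toward them.

```python
# Sketch of reverse-path forwarding (RPF) for multicast.
best_port_to = {"S1": 2, "S2": 5}   # hypothetical unicast routing result

def rpf_forward(source, in_port, all_ports):
    """Ports to forward a multicast packet on, after the RPF check."""
    if best_port_to.get(source) != in_port:
        return set()                      # failed RPF check: drop
    return set(all_ports) - {in_port}     # flood away from the source

print(rpf_forward("S1", 2, {1, 2, 3, 4}))  # {1, 3, 4}
print(rpf_forward("S1", 3, {1, 2, 3, 4}))  # set(): arrived on the wrong port
```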
How would you decide if it is worth having a cache?

Chapter 3  Queuing

3.1 Discrete-time Markov chains

x = {x_n, n >= 0} is a Markov chain with x_n in a finite or countable state space X, stationary transition probability matrix P(i, j), i, j in X, and initial distribution pi_0(i), i in X. So

    P(x_0 = i_0, ..., x_n = i_n) = pi_0(i_0) P(i_0, i_1) x ... x P(i_{n-1}, i_n)    (3.1)

for all n and i_0, ..., i_n in X. Let pi_n be the marginal distribution of x_n, written as a row vector. From (3.1),

    pi_n = pi_0 P^n.    (3.2)

pi is invariant if it satisfies the balance equations

    pi = pi P.    (3.3)

x is irreducible if it goes from any state to any other state (with positive probability). Irreducible chains have at most one invariant distribution. The chain is positive recurrent if it has one invariant distribution. If x is irreducible with invariant distribution pi, then

    lim_{N->inf} (1/N) sum_{n=0}^{N-1} 1{x_n = i} = pi(i),  i in X,    (3.4)

i.e., pi(i) is the fraction of time x spends in state i. x is aperiodic if d = 1, where d = gcd{n >= 1 : P^n(i, i) > 0}; if d > 1, x is periodic with period d. If x is aperiodic and irreducible, with invariant distribution pi, then for any initial distribution,

    lim_{n->inf} pi_n = pi.    (3.5)

See Problems 1, 2.

Theorem. Suppose x is irreducible and V : X -> [0, inf). The drift of V at i is

    Delta(i) = E[ V(x_{n+1}) - V(x_n) | x_n = i ].

Suppose S is a finite subset of X and there are constants b < inf and eps > 0 so that Delta(i) <= b for i in S, and Delta(i) <= -eps for i not in S. Then x is positive recurrent. See Problem 4.

3.2 Continuous-time Markov chains

A random variable tau is exponentially distributed with rate lambda if

    P(tau > t) = e^{-lambda t},  t >= 0.

Its mean is E(tau) = 1/lambda, and tau is memoryless:

    P(tau > t + s | tau > s) = P(tau > t),  s, t >= 0.

A rate matrix Q = {q(i, j)} on a countable set X satisfies

    0 <= q(i, j) < inf for i != j,   q(i, i) = -q(i),  where q(i) = sum_{j != i} q(i, j).

Figure 3.1: Constructing a continuous-time Markov chain. [Diagram: a piecewise-constant sample path x_t jumping among states 0-4.]

Given a rate matrix Q and a distribution pi_0 on X, construct x = {x_t, t >= 0} thus:
1. Select x_0 with P(x_0 = i) = pi_0(i).
2. If x_0 = i, select tau exponential with rate q(i). Let x_t = i for 0 <= t < tau.
3. At t = tau, x takes a jump from i to j != i, independently of tau, with probability q(i, j)/q(i).
4. Return to step 3 with x_tau = j, independently of the process before tau.

Then x is a Markov process with right-continuous sample paths.
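Equations (3.2)-(3.5) can be checked numerically: iterating pi_{n+1} = pi_n P for an aperiodic, irreducible chain converges to the invariant distribution. The 3-state chain here is a toy example.

```python
# Iterate pi_{n+1} = pi_n P; by (3.5) the iterates converge to the invariant pi.
def invariant(P, iters=2000):
    n = len(P)
    pi = [1.0 / n] * n                 # any initial distribution works
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.4, 0.6]]
pi = invariant(P)
print([round(p, 4) for p in pi])   # [0.1509, 0.3774, 0.4717]
```

The result satisfies the balance equations (3.3): for this P, pi(0) = 0.4 pi(1) and pi(2) = 1.25 pi(1).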
Figure 3.1 shows a sample path. Q is regular if the jump times do not accumulate, i.e., the chain makes only finitely many jumps in finite time. Note that

    P(x_0 = i, x_t = j) = pi_0(i) q(i, j) t + o(t),  j != i,
    P(x_0 = i, x_t = i) = pi_0(i) (1 - q(i) t) + o(t).

Figure 3.2: A trajectory in the event A of the Markov-property theorem.

Theorem (Markov property). For any set A of trajectories,

    P({x_s, s >= t} in A | x_t, x_u, u < t) = P_{x_t}({x_s, s >= 0} in A).

Such a set A is of the form A = {x : x_{t_k} in S_k, k = 1, ..., K}, 0 <= t_1 < ... < t_K, S_k subset of X, K >= 1. See Figure 3.2.

Q is irreducible if from any state i one can reach any other state j along transitions with q > 0.

Theorem. Suppose x is a continuous-time Markov chain with rate matrix Q and initial distribution pi. Then:
1. pi is invariant, i.e. P(x_t = i) = pi(i) for all t >= 0 and i in X, iff the balance equations hold:

    sum_i pi(i) q(i, j) = 0 for all j.    (3.6)

2. x has at most one invariant distribution pi, and then

    lim_{t->inf} P(x_t = i) = pi(i),
    lim_{T->inf} (1/T) int_0^T 1{x_s = i} ds = pi(i),  i in X.

3. If x has no invariant distribution, both limits above are 0 for every i in X.

Theorem (time reversal). Suppose x is stationary, continuous-time Markov with rate matrix Q and distribution pi. The time-reversed process x~ = {x~_t = x_{T-t}, 0 <= t <= T} is stationary Markov, with distribution pi and rate matrix Q~, where

    q~(i, j) = pi(j) q(j, i) / pi(i).

Why? P(x_0 = i, x_t = j) = pi(i) q(i, j) t + o(t), and

    P(x_{T} = j, x_{T-t} = i) = P(x_0 = i, x_t = j) = pi(i) q(i, j) t + o(t) = pi(j) q~(j, i) t + o(t).

Figure 3.3: Diagrams for the M/M/1 system. Arrivals (rate lambda) and departures (rate mu) form Poisson processes.

3.3 M/M/1 model

See Figure 3.3. The balance equations (3.6) are

    lambda pi(0) = mu pi(1),
    (lambda + mu) pi(n) = lambda pi(n-1) + mu pi(n+1),  n >= 1,

which have a (unique) solution iff lambda < mu:

    pi(n) = (1 - rho) rho^n,  n >= 0,  with rho = lambda/mu.    (3.7)

The queue x_t is time-reversible, because pi(i) q(i, j) = pi(j) q(j, i) for all i, j, so the rate matrix of the time-reversed process x_{T-t} is the same as that of x_t. So the departures before time t form a Poisson process with rate lambda, independent of x_t. Surprise!

The mean queue length is

    E(x_t) = sum_{n>=0} n pi(n) = sum_{n>=0} n (1 - rho) rho^n = rho / (1 - rho).

For rho = 0.9, the mean is about 10 packets. Above,

    lambda = average number of packet arrivals per second,
    mu = average number of packets that can be transmitted per second.
The utilization is rho = lambda/mu = 1 - P(x_t = 0).

A packet arriving at time t sees x_t = n packets in queue with probability

    P(x_t = n | packet arrives in (t, t + eps))
        = P(packet arrives in (t, t + eps) | x_t = n) P(x_t = n) / P(packet arrives in (t, t + eps))
        = lambda eps pi(n) / (lambda eps) = pi(n),

so the average time between arrival and departure (including the packet's own service, i.e. transmission, time) is

    T = sum_{n>=0} ((n + 1)/mu) pi(n) = 1 / (mu - lambda).

Alternatively, T = (1 + E(x_t)) / mu.

Example. Consider a 10 Gbps link. Packet lengths are exponentially distributed with mean length 10,000 bits.[1] So mu = 10^10 / 10^4 = 10^6 packets/s, and 1/mu = 1 us per packet. Link utilization is 90 percent, i.e. rho = 0.9. Then the average number of packets in the buffer is rho/(1 - rho) = 9. The average delay faced by a packet, including its own service (transmission) time, is T = 1/(mu - lambda) = 10 us. If the packet goes through 10 nodes, the average delay is 100 us (assuming independence of the nodes).

For a 100 Mbps link with the same packet-length distribution, rho = 0.9, 1/mu = 100 us per packet, and the average delay is 1000 us per link. The probability of 100 or more packets in the buffer is

    sum_{n>=100} pi(n) = sum_{n>=100} (1 - rho) rho^n = rho^100 = 0.9^100, about 2.7 x 10^-5.

Compare the queuing delay with the propagation delay of 3,000 km x 5 us/km = 15 ms for a 3,000-km link. The possible number of bits in flight in the 3,000-km, 10 Gbps link is 15 x 10^-3 x 10^10 = 1.5 x 10^8.

[1] What is a more realistic distribution?

Alternative formulation. A = {A_t, t >= 0} is a Poisson counting process with rate lambda (the arrival process). S = {S_t, t >= 0} is a Poisson counting process with rate mu (the virtual service process). A and S are independent. The queue at t is given by

    x_t = x_0 + A_t - int_0^t 1{x_s > 0} dS_s.

The departure counting process D, with

    D_t = int_0^t 1{x_s > 0} dS_s,

is also Poisson. Moreover:
- future arrivals {A_s - A_t, s > t} and the current state x_t are independent;
- past departures {D_s, s <= t} and the current state x_t are independent.

Figure 3.4: Parameters of a Jackson network. [Diagram: node i serves at rate mu_i packets/s; external traffic enters node i at rate gamma_i packets/s; traffic leaving node j is routed to node i with probability r(j, i).]

Jackson network. See Figure 3.4.
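The numbers in the 10 Gbps example follow directly from the M/M/1 formulas:

```python
# M/M/1 example: 10 Gbps link, mean packet length 10,000 bits, rho = 0.9.
link_bps = 10e9
mean_pkt_bits = 1e4
mu = link_bps / mean_pkt_bits        # service rate, packets/s
rho = 0.9
lam = rho * mu                       # arrival rate

mean_queue = rho / (1 - rho)         # E[x] = rho/(1-rho)
delay_s = 1 / (mu - lam)             # T = 1/(mu - lambda), includes service
p_100_or_more = rho ** 100           # sum_{n>=100} (1-rho) rho^n = rho^100

print(mean_queue)        # about 9 packets
print(delay_s * 1e6)     # about 10 microseconds
print(p_100_or_more)     # about 2.7e-5
```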
Assumptions:
- independent, exponential service times with rate mu_i at node i;
- Markovian routing r(j, i);
- Poisson external arrivals at rate gamma_i packets/s at node i.

The aggregate arrival rate lambda_i into node i satisfies

    lambda_i = gamma_i + sum_j lambda_j r(j, i),  all i.    (3.8)

Let x_t = (x_t^1, ..., x_t^J) be the queue-length process. This is Markovian. Problem 5 asks for its rate matrix.

Theorem. Assume lambda_i < mu_i for all i. Then x has an invariant distribution of the product form

    pi(x^1, ..., x^J) = pi_1(x^1) x ... x pi_J(x^J),

where

    pi_i(n) = (1 - rho_i) rho_i^n,  n >= 0,  with rho_i = lambda_i / mu_i.

This is a surprising result: the departures from a node in a Jackson network need not be Poisson, unlike the case of a single M/M/1 system, yet in equilibrium each queue looks like an independent M/M/1 queue.

Figure 3.5: The M/M/m/inf system. [Diagram: arrivals at rate lambda are routed to the first free of m servers, each of rate mu; state n has departure rate n mu for n <= m and m mu for n >= m.]

3.4 Other M/M/m/n models

M/M/m, the m-server case. A received request is routed to the first of m available servers (Figure 3.5). The buffer is infinite. The balance equations are

    lambda pi(0) = mu pi(1),
    (lambda + n mu) pi(n) = lambda pi(n-1) + (n+1) mu pi(n+1),  0 < n < m,
    (lambda + m mu) pi(n) = lambda pi(n-1) + m mu pi(n+1),  n >= m.

This gives, with rho = lambda/(m mu) < 1 assumed,

    pi(n) = pi(0) (m rho)^n / n!,  n <= m,
    pi(n) = pi(0) rho^n m^m / m!,  n >= m.    (3.9)

pi(0) is obtained using sum_n pi(n) = 1:

    pi(0) = [ sum_{n=0}^{m-1} (m rho)^n / n! + (m rho)^m / (m! (1 - rho)) ]^{-1}.

A packet arriving at time t sees all servers busy (x_t >= m) with probability

    P(queue) = sum_{n>=m} pi(n) = pi(0) (m rho)^m / (m! (1 - rho)),

using (3.9) and the same conditioning argument as for M/M/1. The expected number of packets waiting in queue (not in service) is

    N(queue) = sum_{n>=0} n pi(n + m) = P(queue) rho / (1 - rho).

By Little's law (see below), the average waiting time in queue (not in service) is

    W = N(queue) / lambda,

and the total latency (waiting time) is

    T = 1/mu + W.
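The M/M/m quantities above (P(queue) is the Erlang-C probability) are easy to compute; the parameters here are illustrative.

```python
# Erlang-C sketch for the M/M/m formulas: probability an arrival must wait,
# mean number waiting, and mean waiting time.
from math import factorial

def erlang_c(lam, mu, m):
    rho = lam / (m * mu)
    assert rho < 1
    a = lam / mu                       # offered load, a = m*rho
    p0 = 1 / (sum(a**n / factorial(n) for n in range(m))
              + a**m / (factorial(m) * (1 - rho)))
    p_queue = p0 * a**m / (factorial(m) * (1 - rho))   # P(queue)
    n_queue = p_queue * rho / (1 - rho)                # N(queue)
    w = n_queue / lam                                  # Little's law
    return p_queue, n_queue, w

p_queue, n_queue, w = erlang_c(lam=8.0, mu=1.0, m=10)
print(round(p_queue, 4))   # about 0.41
```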
Suppose the queue is empty at t = 0 and at t = T. From Figure 3.6, the time average of the queue size is

    (1/T) int_0^T x(t) dt = (1/T) sum_{i=1}^{A(T)} W(i) = (A(T)/T) (1/A(T)) sum_{i=1}^{A(T)} W(i).

Taking limits as T -> inf, and if time averages equal ensemble averages, we get

    E(x) = lambda E(W).

Figure 3.6: Calculations for Little's law. [Diagram: the cumulative-arrival staircase A(t) and the queue size x(t); the area between them decomposes into the per-packet service times S_i and waiting times W_i.]

3.6 PASTA

We have used the PASTA property (Poisson arrivals see time averages) several times. Consider a stationary queuing system with deterministic service time 3 and periodic arrivals (period 10). A sample path with arrivals at 1, 2, 3, 11, 12, 13, 21, 22, 23, ... and queue process x(t) is shown in Figure 3.7. Let pi(n) be the probability that x(t) = n at any time t, and let p(n) be the probability that an arriving packet sees n packets in queue. For this system,

    pi(0) = 1/10, pi(1) = 4/10, pi(2) = 4/10, pi(3) = 1/10,
    p(0) = p(1) = p(2) = 1/3,

so the two probabilities are not the same.

Figure 3.7: The PASTA property does not hold in this deterministic queuing system. [Diagram: x(t) over one period, rising to 3 by t = 3 and draining to 0 by t = 10.]

Consider an M/G/1 system with stationary probabilities pi(n). Let p(n) be the probability that an arrival sees n packets in queue. Then

    p(n) = P(x(t) = n | packet arrives in (t, t + eps))
         = P(x(t) = n) P(packet arrives in (t, t + eps)) / P(packet arrives in (t, t + eps))
         = P(x(t) = n) = pi(n),

using Bayes' rule, the independence of arrivals after t from {x(s), s <= t}, and the independence of service times.

Figure 3.8: Deriving the Pollaczek-Khinchin formula. [Diagram: the remaining work W(t); packet i contributes a parallelogram of area S_i^2/2 + S_i W_i.]

3.7 Pollaczek-Khinchin formula

Consider an M/G/1 system with independent service times S_i, E(S) < inf, E(S^2) < inf, and Poisson arrivals with rate lambda. Let W(t) be the remaining waiting time, i.e. the amount of time needed to serve the packets in the system at t. Let S_i and W_i be the service time and waiting time of packet i; see Figure 3.8. The time average of the waiting time is

    (1/T) int_0^T W(t) dt = (1/T) sum_i area(i),

where area(i) is the parallelogram area for packet i, so area(i) = S_i^2/2 + S_i W_i.
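The PASTA counterexample can be reproduced by integrating the deterministic sample path over one period: arrivals at 1, 2, 3 (each seeing 0, 1, 2 packets, so p(n) = 1/3 for n = 0, 1, 2), departures at 4, 7, 10.

```python
# Measure pi(n) over one period of the deterministic example: arrivals at
# 1,2,3 (mod 10), service time 3, so departures at 4,7,10.
from fractions import Fraction

events = [(1, +1), (2, +1), (3, +1), (4, -1), (7, -1), (10, -1)]
time_in_state = {n: Fraction(0) for n in range(4)}
x, t_prev = 0, Fraction(0)          # queue empty on (0, 1)
for t, d in events:
    time_in_state[x] += Fraction(t) - t_prev
    x, t_prev = x + d, Fraction(t)

pi = {n: time_in_state[n] / 10 for n in range(4)}
print(pi)   # pi(0) = 1/10, pi(1) = pi(2) = 4/10, pi(3) = 1/10
```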
Substituting and taking limits as T -> inf,

    W = lambda ( E(S^2)/2 + E(waiting time faced by arriving packet) E(S) ).

By PASTA, E(waiting time faced by an arrival) = W. So

    W = lambda E(S^2) / (2 (1 - lambda E(S))) = lambda E(S^2) / (2 (1 - rho)),

where rho = lambda E(S) is the utilization.

Note: The formula

    (1/T) sum_{i=1}^{A(T)} area(i) = (A(T)/T) E[area(i)],

involving a random sum of A(T) terms, is sometimes called Wald's formula. A general version of Wald's formula is a consequence of the fact that {A(t) - lambda t, t >= 0} is a martingale. See Problem 8.

Determinism minimizes waiting. In general, E(S^2) = (E S)^2 + sigma^2, so

    W = lambda ((E S)^2 + sigma^2) / (2 (1 - rho)) >= lambda (E S)^2 / (2 (1 - rho)),

where the last expression is the waiting time for a deterministic service time (e.g., ATM cells).

3.8 Problems

1. How does (3.2) follow from (3.1)?

2. Give examples of Markov chains x with the following properties: (a) x is irreducible and has no invariant distribution; (b) x is finite with more than one invariant distribution; (c) x is finite but not irreducible; (d) x is infinite and positive recurrent; (e) x is finite, irreducible, and (3.5) does not hold.

3. Show that if X is finite the convergence in (3.5) is geometrically fast, i.e. |pi_n - pi| <= C r^n for some 0 < r < 1.

4. A packet processor takes 1 us to forward one packet. Packet arrivals are i.i.d.: in 1 us, k packets arrive with probability p_k. Let x_n be the number of packets at the beginning of the n-th us in the (infinite) buffer.
(a) Show that x = {x_n, n >= 0} is a Markov chain. Hint: express the evolution of x as a stochastic dynamical system of the form x_{n+1} = f(x_n, w_n), where w = {w_n, n >= 0} is an independent process. Show that in this case x is always Markov, and that if w is i.i.d., x has stationary transition probabilities.
(b) Show that x is irreducible.
(c) Write the balance equations.
(d) Find conditions on the p_k so that x is positive recurrent. How would you find the expected forwarding delay faced by a packet?
(e) Give an example of the p_k so that x is not positive recurrent. What happens to the queue size in this case?

5. Find the rate matrix of the queue-length process x_t of the Jackson network in Figure 3.4.

6.
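The "determinism minimizes waiting" claim is easy to check numerically by comparing the P-K waiting time for deterministic and exponential service with the same mean (toy parameters):

```python
# P-K formula: W = lam * E[S^2] / (2 * (1 - rho)).
def pk_wait(lam, ES, ES2):
    rho = lam * ES
    assert rho < 1
    return lam * ES2 / (2 * (1 - rho))

lam, ES = 0.5, 1.0                       # rho = 0.5
w_det = pk_wait(lam, ES, ES**2)          # deterministic: E[S^2] = (E S)^2
w_exp = pk_wait(lam, ES, 2 * ES**2)      # exponential:   E[S^2] = 2 (E S)^2
print(w_det, w_exp)   # 0.5 1.0: determinism halves the waiting time here
```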
In the feedback network in Figure 3.9, at each link a packet leaves the system with probability 0.5. For what values of gamma (in terms of mu_1, mu_2) is the system stable?

Figure 3.9: Network for Problem 6. [Diagram: external arrivals at rate gamma enter a tandem of two servers with rates mu_1 and mu_2; at each link a packet leaves with probability 0.5 and is fed back with probability 0.5.]

7. In Figure 1.2 suppose the bridges have a throughput of 1 Gbps, L1, L2, L4 and L5 are 10 Mbps LANs and L3 is a 100 Mbps LAN. Suppose traffic originating in each LAN is 50 percent of LAN capacity. Suppose 90 percent of the traffic originating in LAN Li is destined for a station in the same LAN, whereas 10 percent is destined for a station in LAN Lj, j != i, selected randomly.
(a) Can the network support this traffic?
(b) By what factor can the traffic increase before the network becomes unstable?

8. Let x(n), n >= 0, be a Bernoulli sequence with P(x(n) = 1) = p. Let N be a random number defined below. For each case explain why or why not

    E[ (1/N) sum_{n=0}^{N} x(n) ] = p.

(a) N = constant a.s.
(b) N = arg min {n : x(n + 1) = 1}.
(c) N = arg min {n : x(n) = 1}.
(d) N = arg min {n : sum_{k=0}^{n} x(k) = 100}.

Chapter 4  Switching

4.1 Packet switching

- Architectures
- IQ/HOL
- VoQ
- SQ

Figure 4.1: Packet switch architectures. [Diagram: input queuing suffers HOL blocking; output queuing needs a faster switch; virtual output queuing needs matching; shared queuing reduces buffer size.]

4.1.1 Architectures

The second-generation PRIZMA architecture is 32 x 32, with 2 Gbps ports ... all on one chip.
4.1.2 Input queues

Figure 4.2: Virtual HOL queue. [Diagram: the head-of-line cells destined to port 1 form a virtual queue B_t; inset plot: average delay in cell times vs load rho, blowing up near rho = 0.586.]

Assume:
- discrete time t, independent arrivals, uniform destinations with probability 1/N each;
- N large, so the total number of port-1 arrivals is Poisson:

    P(n port-1 arrivals) = e^{-lambda} lambda^n / n!.

The virtual HOL queue B_t of port-1 packets at the heads of the input queues evolves as

    B_{t+1} = (B_t - 1)^+ + A_t = B_t + A_t - 1{B_t > 0},    (4.1)

where A_t is the number of new port-1 packets that come to the head of unblocked queues. Suppose the equilibrium probability that a queue is unblocked is Phi. Then A_t is Poisson with mean lambda' = lambda Phi. Taking expectations in (4.1),

    E(B) = E(B) + E(A) - P(B > 0),

so P(B > 0) = E(A) = lambda'. Square (4.1) and take expectations:

    0 = E(A^2) + E(A) + 2 E(B) E(A) - 2 E(B) - 2 E(A)^2.    (4.2)

Since E(A^2) = lambda' + lambda'^2, this gives

    B-bar = E(B) = lambda' (2 - lambda') / (2 (1 - lambda')).

Also,

    E(# blocked queues) = N (1 - Phi) = N E(B_t - 1)^+ = N (B-bar - lambda').

For lambda = 1 (saturation) this gives lambda' = Phi and B-bar = 1, so

    lambda'^2 - 4 lambda' + 2 = 0,  lambda' = 2 - sqrt(2), about 0.586.

That is, roughly 41% of the switch bandwidth is not utilized.

Quick upper bound. Same switch, but at the end of each cycle the input queues are flushed. With lambda = 1, the per-port throughput is

    1 - P(no HOL cell for a given output) = 1 - (1 - 1/N)^N -> 1 - 1/e, about 0.63.

4.1.3 Virtual output queues

- Each input port has N VoQs, one per output port.
- If several input ports have packets for the same destination, which one should be served?
- Assume i.i.d. arrivals A_{ij}(t) with rates Lambda = {lambda_{ij}} such that sum_i lambda_{ij} < 1 for all j and sum_j lambda_{ij} < 1 for all i.
- Service S(t) = {S_{ij}(t)} such that sum_i S_{ij} <= 1 for all j and sum_j S_{ij} <= 1 for all i. Note: if equality holds throughout, S(t) is a permutation matrix over {1, ..., N}.
- Queue lengths L_{ij}(t) satisfy

    L(t + 1) = [L(t) - S(t)]^+ + A(t).

Question: given Lambda, find S(t), based on past arrivals {A(s), s <= t} and L(t), so that L is stable. Conjecture: there always exists a stabilizing matching S(t).
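The two saturation numbers derived above can be checked directly: the HOL throughput solves B-bar = 1, giving 2 - sqrt(2), and the flushed-queue bound tends to 1 - 1/e.

```python
# HOL saturation throughput and the flushed-input-queue upper bound.
from math import sqrt, exp

hol = 2 - sqrt(2)
print(round(hol, 4))                        # 0.5858: about 41% of bandwidth lost
# sanity check: lam'(2 - lam') / (2(1 - lam')) = 1 at saturation
print(hol * (2 - hol) / (2 * (1 - hol)))    # 1.0

for N in (4, 16, 256):
    print(N, round(1 - (1 - 1 / N) ** N, 4))   # tends to 1 - 1/e, about 0.632
```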
Figure 4.3: Matchings in bipartite graphs (above) and the counterexample (below). [Diagram: a bipartite graph G with inputs I, outputs O, edge weights w_{ij}, and a max-size matching M; below, the 3 x 3 counterexample with all demands 0.5 and its max-size matchings.]

Matching. Let G = (V, E) be a bipartite graph, i.e. V = I union O and E a subset of I x O. M, a subset of E, is a matching if no two edges in M have a common vertex. Edges can be weighted.

Max-size matching: M has the maximum number of edges. The best algorithm has running time O(N^{2.5}).
Max-weight matching: sum_{(i,j) in M} w(i, j) is maximum. The best algorithm is O(N^3 log N).

Conjecture: maximum-size matching is stabilizing.

Counterexample. Take N = 3, with lambda_{11} = lambda_{12} = 0.5 and lambda_{21} = lambda_{32} = 0.5, other rates 0. Suppose L_{11}(t) > 0 and L_{12}(t) > 0, and suppose A_{21}(t) = A_{32}(t) = 1 (which happens with probability 0.25). Then there are 3 max-size matchings, and input 1 will be selected with probability 2/3. So

    P(S_{11}(t) + S_{12}(t) = 1) = 0.25 x 2/3 + 0.75 x 1 = 11/12 < 1 = lambda_{11} + lambda_{12},

i.e. L_{11}(t) + L_{12}(t) grows without bound. Even if all rates are 0.5 - delta, one still gets instability with max-size matching for delta > 0 small.

Try providing more service when queue lengths are large, i.e. choose

    S(t) = arg max { sum_{ij} L_{ij}(t) S_{ij} : S is a permutation }.

Intuition. In a continuous-time approximation, neglecting the constraint at zero,

    dL(t)/dt = A(t) - S(t),

so (below, L, A, S are vectors or matrices as appropriate)

    d|L(t)|^2/dt = 2 L(t)^T (A(t) - S(t)),    (4.3)
    (d/dt) E|L(t)|^2 = 2 E[ L(t)^T (Lambda - S(t)) ].    (4.4)

So choose

    S*(t) = arg max { L(t)^T S : S is a permutation }.

Recall:

Theorem. The assignment problem max sum_{ij} c_{ij} s_{ij}, subject to sum_i s_{ij} = 1, sum_j s_{ij} = 1, s_{ij} >= 0, has an optimum solution that is a permutation.

Suppose in (4.4) that sum_i lambda_{ij} <= 1 - delta and sum_j lambda_{ij} <= 1 - delta. Then from the theorem,

    L(t)^T (Lambda - S*(t)) <= -(delta/N) |L(t)|,

so with policy S*(t), (d/dt) E|L(t)|^2 <= -(2 delta/N) E|L(t)|, from which stability follows by the drift criterion.
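For the small N in this section, the max-weight permutation S*(t) can be found by brute force (real schedulers use the O(N^3 log N) assignment algorithms mentioned above); the queue matrix here is illustrative.

```python
# Max-weight matching over permutations, brute force via itertools.
from itertools import permutations

def max_weight_permutation(L):
    """Return (p, w): permutation p maximizing w = sum_i L[i][p[i]]."""
    n = len(L)
    best_p, best_w = None, float("-inf")
    for p in permutations(range(n)):
        w = sum(L[i][p[i]] for i in range(n))
        if w > best_w:
            best_p, best_w = p, w
    return best_p, best_w

L = [[3, 0, 0],
     [5, 1, 0],
     [0, 4, 2]]
p, w = max_weight_permutation(L)
print(p, w)   # (2, 0, 1) 9: serve the long queues L[1][0] and L[2][1]
```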
Up to N packets may arrive at one time and up to N may be read at one time, so the memory bandwidth must be 2N times the line rate. Assuming a 100 ns DRAM access time and a 53-byte-wide bus, the total bandwidth is 53 × 8 bits / 100 ns = 4240 Mbps. For a 16-port ATM switch this gives a line rate of 4240/32 = 132 Mbps.

Let Q_t = size of the 1-list (packets destined to output 1) at the beginning of slot t, and A_t = number of 1-packets arriving in slot t, 0 ≤ A_t ≤ N. Then

  Q_{t+1} = (Q_t − 1)^+ + A_t = Q_t + A_t − 1(Q_t > 0)      (4.5)

Following the same argument that led to (4.2) gives

  Q̄ = λ + (σ² + λ² − λ) / (2(1 − λ)),

where λ = E A = P(Q_t > 0) and E A² = σ² + λ². For the Poisson case σ² = λ, so

  Q̄ = λ + λ² / (2(1 − λ))      (4.6)

Shared vs separate queues. Suppose each buffer is sized at its mean occupancy plus a multiple k of the standard deviation. With N separate output buffers the total size is N Q̄ + k N σ(Q); a single shared buffer needs only E(ΣQ) + k σ(ΣQ) = N Q̄ + k √N σ(Q) when the per-output queues are roughly independent. The ratio of the variability terms,

  separate / shared = N σ / (√N σ) = N^{1/2},

is the statistical multiplexing gain.

4.3 Output queue

In an output-queued switch the switch fabric must run at N times, and the output memory at (N + 1) times, the line rate. The queue length at output port 1 is again given by (4.5).

4.4 Problems

1. Assume that A_t is Poisson in (4.1) or (4.5), so the mean queue size is given by (4.6).
   (a) Is {X_t, t ≥ 0} Markov? Why?
   (b) If X_t is stationary, how would you find π(n) = P(X_t = n)?

Chapter 5

Matching

Crossbar switches need a controller to schedule the switch. The controller must find a good match, e.g. longest queue first, oldest cell first, etc. It is too expensive to run a centralized matching algorithm with complexity O(N^{2.5}) or O(N³). (A 40-byte packet at a line speed of 1 Gbps leaves only 320 ns per packet.) So one may have to be satisfied with a maximal matching, computed by a distributed algorithm. Note that for a fully-connected bipartite graph, a maximal matching is also maximum. In case of QoS, the matching must also satisfy some preferences.

  Man #  Preference list      Woman #  Preference list
  1      1 2 3 4              1        1 3 4 2
  2      2 1 4 3              2        3 4 1 2
  3      3 2 4 1              3        2 1 4 3
  4      3 4 2 1              4        1 2 3 4

5.1 The dating game

Consider a dating game with N men and N women and the preferences above. The stable marriage (SMP) algorithm by Gale and Shapley finds a "stable" match, e.g.
(1, 1), (2, 4), (3, 2), (4, 3).

The algorithm is
- iterative: it proceeds in a sequence of proposals and (tentative) accepts
- terminating: upon termination it returns a matching {(m, p(m))}
- it guarantees stability.

A matching is unstable if it contains pairs (m, p(m)), (m′, p(m′)) such that m prefers p(m′) to p(m) and p(m′) prefers m to m′; (m, p(m′)) is then a blocking pair. A stable matching has no blocking pair.

The GSA algorithm. Say that a man or woman is
- free, if she/he is not engaged or matched to any man/woman
- engaged, if she/he is temporarily matched to some man/woman
- matched, if she/he is terminally matched.

Figure 5.1: The GS algorithm. While some man m is free, m proposes to w, the first woman he has not yet proposed to. If w is free, they become engaged. If w is currently engaged to m′ and prefers m to m′, she engages m and sets m′ free; otherwise m remains free and continues proposing. When no man is free, the engaged pairs are matched.

The algorithm terminates: no man can be rejected by all women. A woman can reject a man only if she is engaged, and once engaged she stays engaged; so if every woman rejected m, all N women would be engaged while m is still free, which is impossible. Alternatively: in each iteration, a man moves to worse choices while a woman moves to better choices.

GSA finds a stable matching. Suppose (m, p(m)) and (m′, p(m′)) are matched, but m prefers w′ = p(m′) to p(m), and w′ prefers m to m′. Then m must have proposed to w′ before proposing to p(m), and w′ must have rejected m in favor of, say, m″, preferred by her to m. But women make better and better choices, so w′'s final match m′ must be better for her than m″, which is better than m, contradicting the assumption that w′ prefers m to m′.

The number of iterations is bounded by N²: there are N men and each makes at most N proposals.

There may be more than one stable matching. Suppose m1 prefers w1 to w2 and m2 prefers w2 to w1, while w1 prefers m2 to m1 and w2 prefers m1 to m2. Then {(m1, w1), (m2, w2)} and {(m1, w2), (m2, w1)} are both stable matchings.
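The flowchart of Figure 5.1 can be sketched directly in code. The following is an illustrative Python version (function and variable names are my own, not from the notes); preference lists are 0-indexed with the most-preferred partner first.

```python
def gale_shapley(men_pref, women_pref):
    """Man-proposing Gale-Shapley.  Returns wife[m] for each man m."""
    n = len(men_pref)
    # rank[w][m] = position of man m in woman w's list (lower = preferred)
    rank = [{m: r for r, m in enumerate(prefs)} for prefs in women_pref]
    next_prop = [0] * n          # next position on each man's list
    fiance = [None] * n          # fiance[w] = man w is currently engaged to
    free = list(range(n))        # free men
    while free:
        m = free.pop()
        w = men_pref[m][next_prop[m]]
        next_prop[m] += 1
        if fiance[w] is None:
            fiance[w] = m                    # w was free: engage
        elif rank[w][m] < rank[w][fiance[w]]:
            free.append(fiance[w])           # w trades up; old fiance freed
            fiance[w] = m
        else:
            free.append(m)                   # rejected; m proposes again later
    wife = [None] * n
    for w, m in enumerate(fiance):
        wife[m] = w
    return wife
```

With the preference tables of Section 5.1 (0-indexed), this returns wife = [0, 3, 1, 2], i.e. the match (1,1), (2,4), (3,2), (4,3) quoted above. The outcome is the same regardless of the order in which free men propose.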
Figure 5.2: One cycle of 4 × 4 RRM, showing the accept pointers a_i and grant pointers g_j, with L(1,1), L(1,2), L(3,2), L(3,4) > 0.

5.2 Round-robin matching

Each input i maintains an accept pointer a_i. Each output j maintains a grant pointer g_j.

RRM cycle.

Step 1 (request) Each input i requests every output j with L(i, j) > 0.

Step 2 (grant) Each output j grants the next requesting input at or after its current pointer value, i.e. the first input i* in cyclic order starting from g_j with L(i*, j) > 0, then increments g_j to i* + 1 (mod N).

Step 3 (accept) Each input i accepts the next granting output at or after its current pointer value, i.e. the first granting output j* in cyclic order starting from a_i. If a grant has been accepted, the input increments a_i to j* + 1 (mod N).

Figure 5.2 illustrates one RRM cycle. Initially all g_j = 1 and all a_i = 1. The requests are (1,1), (1,2) from input 1 and (3,2), (3,4) from input 3. Output 1 grants input 1; output 2 grants input 1 (the first requester at or after g_2 = 1); output 4 grants input 3. Input 1 accepts output 1 and input 3 accepts output 4. At the end of this cycle the match is {(1,1), (3,4)}, and the pointers have advanced to g_1 = g_2 = 2, g_4 = 4, a_1 = 2, a_3 = 1.

5.2.1 Analysis of RRM

Under heavy load, the grant pointers may become synchronized, reducing utilization. Consider N = 2 with L(i, j) > 0 for all i, j. Then it is possible that only one connection is made per slot, as follows. With g_1 = g_2 = 1, both outputs grant input 1; input 1 accepts only one grant, input 2 gets nothing, and both grant pointers advance in lockstep to 2. In the next slot both outputs grant input 2 and input 1 idles. The matches cycle through (1,1), (2,1), (1,2), (2,2), one per slot; at the end of the fourth cycle the situation repeats. Throughput is 50 percent.

Of course a TDM schedule is also possible and has a throughput of 100 percent: the matches {(1,1), (2,2)} and {(1,2), (2,1)} in alternating slots.

Under heavy load, if the grant pointers get synchronized at any time (i.e. have the same value), they stay synchronized forever. Under light load the grant pointers are randomly distributed, and the probability that a given request receives no grant is about

  (1 − 1/N)^N → 1/e ≈ 0.37.
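One RRM cycle can be sketched as follows (an illustrative 0-indexed implementation; the names are my own):

```python
def rrm_cycle(L, g, a):
    """One round-robin matching (RRM) cycle on queue lengths L[i][j].
    g[j] and a[i] are 0-indexed grant/accept pointers.  Note that g[j]
    advances whether or not its grant is accepted; iSLIP changes exactly
    this point."""
    n = len(L)
    grants = {}                                               # input -> granting outputs
    for j in range(n):
        reqs = [i for i in range(n) if L[i][j] > 0]           # Step 1: requests
        if reqs:
            i_star = min(reqs, key=lambda i: (i - g[j]) % n)  # Step 2: first at/after g[j]
            g[j] = (i_star + 1) % n                           # advances unconditionally
            grants.setdefault(i_star, []).append(j)
    match = []
    for i, outs in grants.items():                            # Step 3: accept
        j_star = min(outs, key=lambda j: (j - a[i]) % n)      # first at/after a[i]
        a[i] = (j_star + 1) % n
        match.append((i, j_star))
    return match
```

With all four VOQs of a 2 × 2 switch backlogged and the pointers starting synchronized, repeated calls produce exactly one connection per slot, reproducing the 50 percent throughput of Section 5.2.1.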
Figure 5.3: PIM can be unfair under heavy load: with λ(1,1) = λ(1,2) = λ(2,1) = 1, the service rates are μ(1,1) = 1/4, μ(1,2) = 3/4, μ(2,1) = 3/4.

5.3 Parallel iterative matching, PIM

Step 1 Each unmatched input i sends requests to every output j such that L(i, j) > 0.

Step 2 Each output j randomly picks one of the requests it received and grants it.

Step 3 Each input i randomly accepts one of the grants it received.

The I in PIM means that this cycle is iterated, over the still-unmatched inputs and outputs, to improve the match.

5.3.1 Analysis of PIM

It appears that with uniform iid traffic, PIM achieves a maximal match in about three iterations; in general, O(log N) iterations suffice on average. In heavy load, every input requests every output. Each output grants a given input with probability 1/N, so the probability that an input receives no grant in one round equals

  (1 − 1/N)^N → 1/e ≈ 0.37,

and a single iteration achieves only about 63 percent throughput under full load.

PIM can be unfair. Figure 5.3 gives a 2 × 2 case where the arrival rates are λ(1,1) = λ(1,2) = λ(2,1) = 1, so the requests (1,1), (1,2), (2,1) are made in every slot. Output 1 grants inputs 1 and 2 each with probability 1/2, while output 2 always grants input 1. Iterating to a maximal match, input 1 ends up accepting output 1 with probability μ(1,1) = 1/4 and output 2 with probability μ(1,2) = 3/4, while input 2 is served at rate μ(2,1) = 3/4. Thus even though the arrival rates for output port 1 are equal at input ports 1 and 2, the acceptance rates are not the same.

5.4 iSLIP matching

The detailed reference is McKeown's iSLIP paper (see the bibliography). RRM suffers from synchronization of the grant pointers. iSLIP modifies RRM slightly so that a grant pointer is incremented only if its grant is accepted. So Step 2 of RRM is modified:

Step 2 Each output j grants the next requesting input at or after its current pointer value, i.e. the first input i* in cyclic order starting from g_j with L(i*, j) > 0, and increments g_j to i* + 1 only if input i* accepts output j.

5.4.1 Analysis of iSLIP

Consider again the situation N = 2 with L(i, j) > 0 for all i, j. In contrast with RRM, inputs 1 and 2 now share the outputs in TDM fashion: after a first slot with the single match (1,1), the matches alternate between {(1,2), (2,1)} and {(1,1), (2,2)}, for 100 percent throughput.

5.4.2 Priority iSLIP

Suppose there are P priority levels.
Each input then maintains P × N VOQs, with L_p(i, j) the buffer occupancy of priority p for output j. Strict priority means that L_p(i, j) is served only if L_q(i, j) = 0 for every priority q higher than p. Each input maintains an accept pointer a_i^p and each output maintains a grant pointer g_j^p for each priority level p.

Step 1 Each input i selects, for each output j, the highest priority level P(i, j) with a non-empty queue, and requests output j at that level.

Step 2 Output j determines the highest requesting priority level P(j) = max_i P(i, j). The output then chooses one input among those that requested at level P(j), using the separate pointer g_j^{P(j)} in the same round-robin scheme, and notifies each input whether or not its request is granted. The pointer g_j^{P(j)} is incremented, to one past the granted input, only if the granted input accepts output j.

Step 3 If input i receives any grants, it determines the highest priority level among them, say p, and chooses one grant at that level according to the pointer a_i^p, which is then incremented. The input notifies each output whether or not its grant was accepted.

5.4.3 Threshold iSLIP

It may be better to select a weighted maximal match, with weights corresponding to queue lengths. If queue lengths are quantized by thresholds t_1 < t_2 < ··· < t_P, then priority p may be assigned when t_p ≤ L(i, j) < t_{p+1}.

5.4.4 Weighted iSLIP

Suppose the bandwidth from input i to output j is to be shared according to the ratios f(i, j) = n(i, j)/d(i, j), subject to Σ_i f(i, j) ≤ 1 and Σ_j f(i, j) ≤ 1. In iSLIP each pointer runs over the ordered circular list S = (1, ..., N). Now expand the list at output j to S(j) = (1, ..., W(j)), where W(j) is the lcd (least common denominator) of the fractions f(i, j), and let input i appear W(j) × n(i, j)/d(i, j) times in the list.
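The list expansion of Section 5.4.4 can be illustrated with a short sketch (a hypothetical helper, not from the notes; it assumes Python 3.9+ for `math.lcm`):

```python
from fractions import Fraction
from math import lcm

def expanded_grant_list(shares):
    """Build one output's expanded round-robin list for weighted iSLIP.
    shares[i] = Fraction of this output's bandwidth promised to input i
    (example values; the shares must sum to at most 1)."""
    W = lcm(*(f.denominator for f in shares))   # least common denominator
    out = []
    for i, f in enumerate(shares):
        # input i appears W * n(i,j)/d(i,j) times in the expanded list
        out.extend([i] * (W * f.numerator // f.denominator))
    return out
```

For shares (1/2, 1/4, 1/4) this yields W = 4 and the list [0, 0, 1, 2]: cycling the grant pointer over this list serves input 0 twice as often as inputs 1 and 2.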
Figure 5.4: Interconnection of the input and output arbiters to construct the iSLIP scheduler: the state of the input queues (N² bits) feeds N grant arbiters, whose decisions feed N accept arbiters and the decision register.

5.4.5 Implementation

Figure 5.4 shows how the iSLIP scheduler for an N × N switch is constructed from the input and output arbiters.
- The state memory records whether each input queue is empty. From this memory, an N²-bit wide vector presents N bits to each of the N output grant arbiters, representing Step 1 (request).
- The grant arbiters each select a single input among the contending requests, implementing Step 2 (grant).
- The grant decisions are presented to the N accept arbiters, each of which selects at most one output on behalf of its input, implementing Step 3 (accept).
- The final decision is stored in the decision registers and the values of the g and a pointers are updated. The decision register is used to notify each input which cell to transmit and to configure the crossbar switch.

Chapter 6

Network processors

Figure 6.1 is a logical diagram of how a network processor (NP) fits into a system design. The NP is located between the physical layer (MAC or framer) and the switch fabric; in the figure, the serializer/deserializer (SERDES) is the interface between the NP and the switch fabric. The framer or MAC presents a packet to the NP, which must examine it, parse it, do the necessary edits and database lookups to enforce various policies at layers 3-7 (forwarding, queueing, labels), and exchange messages with the switch controller. The NP is in the data path.

Figure 6.1: Location of the NP in a logical system diagram.

6.0.6 NP operation

Figure 6.2 shows a generic block diagram. Data from multiple physical interfaces or the switch fabric are transferred to and from the NP.
The bitstream processors receive the serial stream of packet data and extract the information needed to process the packet, such as the MAC or IP source/destination addresses, TOS bits, TCP port numbers, MPLS labels or VLAN tags. The packet is then written into the packet buffer memory. The extracted information is fed to the processor complex, the programmable unit of the NP. Under program control, the processor may extract additional information from the packet and submit the relevant fields to the search engine, which looks up the MAC or IP address, classifies the packet, or does a VCI/VPI lookup using the routing/bridging tables. Upon packet transmission back through the bitstream processor, the necessary modifications to the packet header are performed.

Figure 6.2: Generic NP architecture: bitstream processors to/from the PHY and switch fabric, a processor complex with search engine and hardware assists, packet buffer memory with a buffer manager/scheduler, routing and bridging tables, and a general-purpose CPU.

Figure 6.3: Time to process 40 B packets at different line rates.

6.0.7 Speed of operations

Figure 6.3 shows the time available to process back-to-back 40-byte packets at different line speeds. At 1 Gbps, the time to process one packet is 40 × 8 bits / 1 Gbps = 320 ns. Using 10-ns SRAM, this permits a maximum of 32 memory accesses per packet. Thus faster line rates can be accommodated only by processing several packets simultaneously, in a pipelined or parallel fashion.

6.0.8 Packet buffer memory

For the architecture of figure 6.2, each packet header byte may traverse the memory interface at least four times:
- write the inbound packet
- read the header into the processor complex
- write the edited header back to memory
- read the packet for outbound transmission

So for 40-byte back-to-back packets the required memory interface capacity is roughly four times the line rate: 10-160 Gbps for line rates of 2.5-40 Gbps.

Chapter 7

Distributed Switch

The single switch fabric architectures cannot scale beyond about 32 ports. Hence the need for distributed architectures.
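The arithmetic of Sections 6.0.7 and 6.0.8 can be bundled into a small helper (an illustrative sketch; names and the default parameters are my own):

```python
def packet_budget(line_rate_gbps, pkt_bytes=40, sram_access_ns=10):
    """Per-packet time budget for back-to-back packets at a given line rate,
    the number of SRAM accesses that fit in it, and the ~4x packet-buffer
    memory bandwidth implied by four traversals of the memory interface."""
    t_ns = pkt_bytes * 8 / line_rate_gbps   # bits / (Gbit/s) comes out in ns
    return {
        "ns_per_packet": t_ns,
        "sram_accesses": int(t_ns // sram_access_ns),
        "buffer_bw_gbps": 4 * line_rate_gbps,   # four memory traversals
    }
```

For example, at 1 Gbps the budget is 320 ns and 32 SRAM accesses; at 10 Gbps it shrinks to 32 ns, which is why pipelining or parallelism becomes unavoidable.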
We will study blocking and routing properties.

7.1 Blocking

A switch network is a graph of switches, each with a set of input and output ports, as in Figure 7.1. There is a set of N input nodes and a set of M output nodes. Each internal link has a capacity of 1. A configuration C is a set of input-output pairs (i_1, o_1), ..., (i_r, o_r) with distinct inputs and distinct outputs, together with disjoint routes connecting i_1 to o_1, ..., i_r to o_r. A distributed switch is strictly non-blocking (SNB) if, given a configuration C and a pair (i, o) not in C, there exists a route from i to o disjoint from C. It is rearrangeably non-blocking (RNB) if, given any partial permutation of input-output pairs, there is a configuration that includes those pairs. We first study modular architectures.

Figure 7.1: A distributed switch is a network of switches, each with a certain number of input and output ports, with N input nodes, M output nodes, and internal links of capacity 1.

7.2 Clos network

This is a 3-stage network, illustrated in Figure 7.2. The Clos network is specified by 5 numbers (IN, N1, N2, N3, OUT). There are N1, N2, and N3 switches in the three stages; the numbers of input and output ports and the connectivity of the switches are as shown.

Theorem A Clos network with RNB switch modules is RNB iff N2 ≥ max(IN, OUT). A Clos network with SNB switch modules is SNB iff N2 ≥ IN + OUT − 1.

The total number of input lines is IN × N1 and the total number of output lines is OUT × N3. The Clos network in the figure, Clos(3, 3, 5, 4, 2), has 9 input lines and 8 output lines, and is SNB since N2 = 5 ≥ IN + OUT − 1 = 4.

Figure 7.2: A Clos network is fully specified by (IN, N1, N2, N3, OUT); shown is Clos(3, 3, 5, 4, 2).

7.3 Recursive construction

We can recursively construct an N × N SNB switch with N = p × q input and output lines as in Figure 7.3. The result is a Clos(p, q, 2p − 1, q, p) switch. It is SNB if each module is SNB.
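The conditions of the Theorem in Section 7.2 can be packaged as a quick checker (an illustrative sketch with assumed names; it presumes the individual modules are themselves RNB/SNB):

```python
def clos_properties(inp, n1, n2, n3, out):
    """Non-blocking conditions for a Clos network (IN, N1, N2, N3, OUT):
    RNB iff N2 >= max(IN, OUT); SNB iff N2 >= IN + OUT - 1."""
    return {
        "input_lines": inp * n1,
        "output_lines": out * n3,
        "RNB": n2 >= max(inp, out),
        "SNB": n2 >= inp + out - 1,
    }
```

For the network of Figure 7.2, `clos_properties(3, 3, 5, 4, 2)` reports 9 input lines, 8 output lines, and SNB. The recursive construction of Section 7.3, `clos_properties(p, q, 2 * p - 1, q, p)`, satisfies the SNB condition for every p and q, since 2p − 1 = p + p − 1.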
Figure 7.3: Recursive construction of an SNB Clos network from N = p × q lines: q input modules of size p × (2p − 1), 2p − 1 middle planes of size q × q, and q output modules of size (2p − 1) × p.

Figure 7.4: Recursive construction of an RNB Clos network: q input modules of size p × p, p middle planes of size q × q, and q output modules of size p × p. Figure 7.4 is an N × N RNB switch if each module is RNB.

Figure 7.5: The Benes switch: an N × N switch built from two N/2 × N/2 switches sandwiched between input and output columns of 2 × 2 switches; recursing gives 2 log₂ N − 1 stages of N/2 2 × 2 switches each.

Figure 7.5 is an N × N RNB switch made up of 2 × 2 switch modules.

Figure 7.6 shows that a Benes switch is not SNB: with 1 → 1 and 4 → 4 routed as shown, the connection 2 → 3 cannot be accommodated without rearrangement.

Figure 7.7 illustrates an algorithm to rearrange existing connections in order to accommodate a new connection. Question 1: Can you supply a proof? Question 2: Is there an algorithm to accommodate new connections in an arbitrary network of Figure 7.1?

In a Benes switch, feasible flows may require multiple paths; Figures 7.8 and 7.9 show this for a 4 × 4 Benes switch carrying flows of rates ε, 2ε, 1 − ε, and 1 − 2ε. Note that the offered rate matrix, being doubly substochastic, can be written as a convex combination of permutation matrices (by the assignment Theorem of Chapter 4); each permutation alone is routable without splitting, but carrying the combination simultaneously forces some flows to split across both middle-stage paths.

Figure 7.8: Split flow 1.

Figure 7.9: Split flow 2.

Figure 7.10: The max flow for a single commodity is 3, and the flows are integer; in the multi-commodity case the max flows are 0.5 and non-integer.

In a Clos switch, permutations can be achieved without splitting flows.
In a general multi-commodity case this is not so. Figure 7.10 shows a network in which, treated as a single-commodity problem, the maximum flow is 3 and all flows are integer. However, for the three commodities 1 → 2, 2 → 3, 3 → 1, the maximum flows are 0.5 each, and non-integer.

Figure 7.11: Two copies of Figure 7.10 connected in parallel; achieving flows of 1, 1, 1 requires splitting.

Figure 7.11 shows that a feasible permutation may require splitting flows: the green and cyan flows must each be split across the two parallel copies, similarly to the red flow.

Bibliography

[1] J. Walrand and P. Varaiya. High-Performance Communication Networks, 2nd edition, Chapter 12, Switching. 2000.

[2] M.J. Karol, M. Hluchyj and S. Morgan. Input vs output queueing on a space-division packet switch. IEEE Trans. Communications, COM-35(12): 1347-56, Dec. 1987.

[3] T.E. Anderson, S. Owicki, J. Saxe and C.P. Thacker. High-speed switch scheduling for local area networks. ACM Trans. Computer Systems, 11(4): 319-52, Nov. 1993.

[4] N. McKeown. The iSLIP scheduling algorithm for input-queued switches. IEEE/ACM Trans. Networking, 7(2), April 1999.

[5] N. McKeown, V. Anantharam and J. Walrand. Achieving 100% throughput in an input-queued switch. Proc. IEEE Infocom '96, vol. 1: 296-302.

[6] B. Prabhakar and N. McKeown. On the speedup required for combined input and output queued switching. Automatica, 35(12), Dec. 1999.

[7] J.F. Hayes, R. Breault and M.K. Mehmet-Ali. Performance analysis of a multicast switch. IEEE Trans. Communications, COM-39(4): 581-87, April 1991.

[8] B. Prabhakar, N. McKeown and R. Ahuja. Multicast scheduling for input-queued switches. IEEE J. Selected Areas in Communications, 15(5): 855-66, June 1997.

[9] M. Waldvogel, G. Varghese, J. Turner and B. Plattner. Scalable high speed IP routing lookups. ACM Sigcomm '97, September 1997.

[10] A. Demers, S. Keshav and S. Shenker. Analysis and simulation of a fair queueing algorithm. ACM Sigcomm '89, Computer Communication Review, 19(4): 1-12, 1989.

[11] A. Parekh and R. Gallager. A generalized processor sharing approach to flow control in integrated services networks: the single-node case. IEEE/ACM Trans. Networking, 1(3): 344-57, June 1993.

[12] A. Parekh and R. Gallager. A generalized processor sharing approach to flow control in integrated services networks: the multiple-node case. IEEE/ACM Trans. Networking, 2(2): 137-50, April 1994.

[13] S. Floyd and V. Jacobson. Random early detection gateways for congestion avoidance. IEEE/ACM Trans. Networking, 1(4): 397-413, August 1993.

[14] I. Stoica, S. Shenker and H. Zhang. Core-stateless fair queueing: achieving approximately fair bandwidth allocations in high speed networks. ACM Sigcomm '98, 1998.

[15] W. Bux, W.E. Denzel, T. Engbersen, et al. Technologies and building blocks for fast packet forwarding. IEEE Communications Magazine, 39(1): 70-77, January 2001.

[16] P.R. Kumar and S. Meyn. Stability of queueing networks and scheduling policies. IEEE Trans. Automatic Control, 40(2), February 1995.

[17] A. Deb. Building a network-processor based system. Integrated Communications Design, December 2000. Available at www.icdmag.com.