EECS 262a
Advanced Topics in Computer Systems

Lecture 18
Software Routers/RouteBricks
October 29th, 2012

John Kubiatowicz and Anthony D. Joseph
Electrical Engineering and Computer Sciences
University of California, Berkeley
Slides Courtesy: Sylvia Ratnasamy

http://www.eecs.berkeley.edu/~kubitron/cs262


Today's Paper

• RouteBricks: Exploiting Parallelism To Scale Software Routers
  Mihai Dobrescu and Norbert Egi, Katerina Argyraki, Byung-Gon Chun, Kevin Fall,
  Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia Ratnasamy.
  Appears in Proceedings of the 22nd ACM Symposium on Operating Systems
  Principles (SOSP), October 2009

• Thoughts?

• Paper divided into two pieces:
   – Single-Server Router
   – Cluster-Based Routing
Networks and routers

[Figure: an example internetwork: UCB, HP, MIT, and NYU sites connected through an AT&T network]


Routers forward packets

[Figure: a packet consists of a payload and a header (e.g., destination 111010010 = MIT);
Router 1 looks up the destination in its route table and forwards the packet to the next-hop router]

   Router 1's route table:
      Destination Address    Next Hop Router
      UCB                    4
      HP                     5
      MIT                    2
      NYU                    3
Router definitions

[Figure: a router with N external ports (1, 2, ..., N), each running at R bits per second (bps)]

• N = number of external router `ports'
• R = line rate of a port
• Router capacity = N x R


Networks and routers

[Figure: the internetwork annotated by router placement: edge (enterprise) routers at sites such as
UCB and MIT, edge (ISP) routers serving homes and small businesses, and core routers inside carrier
networks such as AT&T]




Examples of routers (core)

Juniper T640
   • R = 2.5/10 Gbps
   • NR = 320 Gbps

Cisco CRS-1
   • R = 10/40 Gbps
   • NR = 46 Tbps
   • 72 racks, 1MW


Examples of routers (edge)

Cisco ASR 1006
   • R = 1/10 Gbps
   • NR = 40 Gbps

Juniper M120
   • R = 2.5/10 Gbps
   • NR = 120 Gbps
Examples of routers (small business)

Cisco 3945E
   • R = 10/100/1000 Mbps
   • NR < 10 Gbps


Building routers

• edge, core
   – ASICs
   – network processors
   – commodity servers → RouteBricks

• home, small business
   – ASICs
   – network, embedded processors
   – commodity PCs, servers




Why programmable routers

• New ISP services
   – intrusion detection, application acceleration
• Simpler network monitoring
   – measure link latency, track down traffic
• New protocols
   – IP traceback, Trajectory Sampling, …

Enable flexible, extensible networks


Challenge: performance

• deployed edge/core routers
   – port speed (R): 1/10/40 Gbps
   – capacity (NxR): 40Gbps to 40Tbps

• PC-based software routers
   – capacity (NxR), 2007: 1-2 Gbps [Click]
   – capacity (NxR), 2009: 4 Gbps [Vyatta]

• subsequent challenges: power, form-factor, …
A single-server router

[Figure: a commodity server with two sockets of cores, integrated memory controllers, and an I/O hub
connected by point-to-point links (e.g., QPI); Network Interface Cards (NICs) attached to the I/O hub
provide the N router ports/links]


Packet processing in a server

Per packet,
   1. core polls input port
   2. NIC writes packet to memory
   3. core reads packet
   4. core processes packet (address lookup, checksum, etc.)
   5. core writes packet to output port
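The five steps above map naturally onto a polling forwarding loop. A minimal sketch in C, assuming
hypothetical driver/helper routines (nic_rx_poll, read_dst_addr, lookup_route, update_ttl_and_checksum,
nic_tx) that are not part of the paper:

```c
/* Sketch of the per-packet forwarding path (steps 1-5 above).
 * The extern functions stand in for driver/helper code not shown here. */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

struct pkt { uint8_t *data; size_t len; };
struct route_table;

extern bool     nic_rx_poll(int in_port, struct pkt *p);              /* steps 1-2 */
extern uint32_t read_dst_addr(const struct pkt *p);                   /* step 3    */
extern int      lookup_route(struct route_table *rt, uint32_t dst);   /* step 4    */
extern void     update_ttl_and_checksum(struct pkt *p);               /* step 4    */
extern void     nic_tx(int out_port, const struct pkt *p);            /* step 5    */

void forward_loop(int in_port, struct route_table *rt) {
    struct pkt p;
    for (;;) {
        if (!nic_rx_poll(in_port, &p))     /* 1: core polls; 2: NIC has DMA'd packet to memory */
            continue;
        uint32_t dst = read_dst_addr(&p);              /* 3: core reads the packet header      */
        int out_port = lookup_route(rt, dst);          /* 4: address lookup                    */
        update_ttl_and_checksum(&p);                   /* 4: checksum, header updates          */
        nic_tx(out_port, &p);                          /* 5: core writes packet to output port */
    }
}
```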




Packet processing in a server

Assuming 10Gbps with all 64B packets (8 cores at 2.8GHz):
   – 19.5 million packets per second
   – one packet every 0.05 µsecs
   – ~1000 cycles to process a packet

Today: ~200Gbps memory bandwidth, ~144Gbps I/O bandwidth
Teaser: 10Gbps?

Suggests efficient use of CPU cycles is key!


Lesson#1: multi-core alone isn't enough

[Figure: `older' (2008) server with the memory controller in the `chipset' and a shared front-side bus,
vs. current (2009) server with per-socket integrated memory controllers and an I/O hub connected by
point-to-point links]

Hardware need: avoid shared-bus servers
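The packet-rate and cycle-budget figures above follow from simple arithmetic; a quick check, using only
the numbers on the slide:

```c
/* Back-of-the-envelope cycle budget for 10Gbps of 64B packets
 * on 8 cores at 2.8GHz (the slide's assumptions). */
#include <stdio.h>

int main(void) {
    double line_rate_bps  = 10e9;
    double pkt_bits       = 64 * 8;                         /* min-size packet           */
    double pkts_per_sec   = line_rate_bps / pkt_bits;       /* ~19.5 Mpps                */
    double ns_per_pkt     = 1e9 / pkts_per_sec;             /* ~51 ns ~= 0.05 usecs      */
    double cycles_per_sec = 8 * 2.8e9;                      /* 8 cores x 2.8GHz          */
    double cycle_budget   = cycles_per_sec / pkts_per_sec;  /* ~1150 cycles per packet   */

    printf("%.1f Mpps, %.0f ns/pkt, ~%.0f cycles/pkt\n",
           pkts_per_sec / 1e6, ns_per_pkt, cycle_budget);
    return 0;
}
```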
Lesson#2: on cores and ports

[Figure: cores in the middle poll input ports on the left and transmit to output ports on the right]

How do we assign cores to input and output ports?


Lesson#2: on cores and ports

Problem: locking (when multiple cores poll or transmit on the same port)

Hence, rule: one core per port




Lesson#2: on cores and ports

Problem: cache misses, inter-core communication
   – pipelined: packet (may be) transferred between cores, across L3 caches
   – parallel: packet stays at one core, always in one cache

Hence, rule: one core per packet


Lesson#2: on cores and ports

• two rules:
   – one core per port
   – one core per packet
• problem: often, can't simultaneously satisfy both
   – Example: when #cores > #ports, "one core per packet" and "one core per port"
     yield different assignments

• solution: use multi-Q NICs
Multi-Q NICs

• feature on modern NICs (for virtualization)
   – port associated with multiple queues on NIC
   – NIC demuxes (muxes) incoming (outgoing) traffic
   – demux based on hashing packet fields
     (e.g., source+destination address)

[Figure: Multi-Q NIC demuxing incoming traffic across queues; muxing outgoing traffic from queues]


Multi-Q NICs

• feature on modern NICs (for virtualization)
• repurposed for routing
   – rule: one core per port becomes one core per queue
   – rule: one core per packet
• if #queues per port == #cores, can always enforce both rules
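A rough model of the hash-based demux step, in C. The hash function and queue selection here are
illustrative stand-ins, not the NIC's actual RSS implementation:

```c
#include <stdint.h>

/* Illustrative model of multi-queue demux: hash the flow fields and pick
 * a queue, so all packets of a flow land on the same queue (and core). */
struct flow { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };

static uint32_t flow_hash(const struct flow *f) {
    /* toy mixing function; real NICs use e.g. a Toeplitz hash */
    uint32_t h = f->src_ip ^ (f->dst_ip * 2654435761u);
    h ^= ((uint32_t)f->src_port << 16) | f->dst_port;
    h ^= h >> 16;
    return h;
}

static unsigned pick_queue(const struct flow *f, unsigned num_queues) {
    return flow_hash(f) % num_queues;   /* one queue, hence one core, per flow */
}
```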




Lesson#2: on cores and ports

recap:
• use multi-Q NICs
   – with modified NIC driver for lock-free polling of queues
• with
   – one core per queue (avoid locking)
   – one core per packet (avoid cache misses, inter-core communication)


Lesson#3: book-keeping

Per packet,
   1. core polls input port
   2. NIC writes packet and packet descriptors to memory
   3. core reads packet
   4. core processes packet
   5. core writes packet to out port

problem: excessive per-packet book-keeping overhead
• solution: batch packet operations
   – NIC transfers packets in batches of `k'
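A sketch of what batched receive looks like at the driver boundary, assuming a hypothetical
nic_rx_burst() call (in the spirit of the modified driver, not its actual interface):

```c
#include <stdint.h>
#include <stddef.h>

#define BATCH_K 32   /* NIC transfers packets in batches of `k' */

struct pkt { uint8_t *data; size_t len; };

/* Hypothetical driver call: dequeue up to `max' packets in one shot,
 * amortizing descriptor-ring and register book-keeping over the batch. */
extern unsigned nic_rx_burst(unsigned queue_id, struct pkt *pkts, unsigned max);
extern void     process_packet(struct pkt *p);

void rx_loop(unsigned queue_id) {
    struct pkt batch[BATCH_K];
    for (;;) {
        unsigned n = nic_rx_burst(queue_id, batch, BATCH_K);
        for (unsigned i = 0; i < n; i++)
            process_packet(&batch[i]);   /* per-packet work; book-keeping was per batch */
    }
}
```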
Recap: routing on a server

Design lessons:
   1. parallel hardware
      » at cores and memory and NICs
   2. careful queue-to-core allocation
      » one core per queue, per packet
   3. reduced book-keeping per packet
      » modified NIC driver w/ batching

(see paper for "non-needs": careful memory placement, etc.)


Single-Server Measurements: Experimental setup

• test server: Intel Nehalem (X5560)
   – dual socket, 8x 2.80GHz cores
   – 2x NICs; 2x 10Gbps ports/NIC (max 40Gbps)

• additional servers generate/sink test traffic over the 10Gbps ports




• software: kernel-mode Click [TOCS'00]
   – with modified NIC driver (batching, multi-Q)
   – Click runtime and the modified NIC driver run packet processing on the cores

• packet processing
   – static forwarding (no header processing)
   – IP routing (sketch below)
      » trie-based longest-prefix address lookup
      » ~300,000 table entries [RouteViews]
      » checksum calculation, header updates, etc.

• input traffic
   – all min-size (64B) packets (maximizes packet rate given port speed R)
   – realistic mix of packet sizes [Abilene]
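For the IP-routing workload, the core per-packet operation is the longest-prefix lookup. A toy
binary-trie version in C (the sketch referenced above; the paper's actual lookup structure and the
RouteViews table are not reproduced here):

```c
/* Toy binary-trie longest-prefix match, in the spirit of the lookup the
 * IP-routing workload performs; not the paper's data structure. */
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

struct trie_node {
    struct trie_node *child[2];
    int next_hop;                 /* -1 if no route stored at this node */
};

static struct trie_node *node_new(void) {
    struct trie_node *n = calloc(1, sizeof *n);
    n->next_hop = -1;
    return n;
}

/* Insert prefix/len -> next_hop, walking address bits from the top. */
static void trie_insert(struct trie_node *root, uint32_t prefix, int len, int next_hop) {
    struct trie_node *n = root;
    for (int i = 0; i < len; i++) {
        int bit = (prefix >> (31 - i)) & 1;
        if (!n->child[bit])
            n->child[bit] = node_new();
        n = n->child[bit];
    }
    n->next_hop = next_hop;
}

/* Longest-prefix match: remember the deepest next_hop seen on the path. */
static int trie_lookup(const struct trie_node *root, uint32_t addr) {
    const struct trie_node *n = root;
    int best = -1;
    for (int i = 0; i < 32 && n; i++) {
        if (n->next_hop >= 0) best = n->next_hop;
        n = n->child[(addr >> (31 - i)) & 1];
    }
    if (n && n->next_hop >= 0) best = n->next_hop;
    return best;
}

int main(void) {
    struct trie_node *root = node_new();
    trie_insert(root, 0x80000000u, 1, 2);     /* 128.0.0.0/1    -> port 2 */
    trie_insert(root, 0xC0A80000u, 16, 5);    /* 192.168.0.0/16 -> port 5 */
    printf("%d\n", trie_lookup(root, 0xC0A80101u));  /* matches /16 -> 5 */
    printf("%d\n", trie_lookup(root, 0x8F000001u));  /* matches /1  -> 2 */
    return 0;
}
```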
Factor analysis: design lessons

Test scenario: static forwarding of min-sized packets

   pkts/sec (M):
   older shared-bus server                          1.2
   current Nehalem server                           2.8
   Nehalem + `batching' NIC driver                  5.9
   Nehalem w/ multi-Q + `batching' driver          19




Single-server performance

                              static forwarding     IP routing
   realistic pkt sizes            36.5 Gbps          36.5 Gbps
   min-size packets                9.7 Gbps           6.35 Gbps

   Bottleneck (realistic packet sizes): traffic generation
   Bottleneck (min-size packets): ?


Bottleneck analysis (64B pkts)

Recall: max IP routing = 6.35Gbps → 12.4 M pkts/sec

                        Per-packet load     Max. component capacity    Max. packet rate per component
                        due to routing      nominal (empirical)        nominal (empirical)
   memory               725 bytes/pkt       51 (33) Gbytes/sec         70 (46) Mpkts/sec
   I/O                  191 bytes/pkt       16 (11) Gbytes/sec         84 (58) Mpkts/sec
   Inter-socket link    231 bytes/pkt       25 (18) Gbytes/sec         108 (78) Mpkts/sec
   CPUs                 1693 cycles/pkt     22.4 Gcycles/sec           13 Mpkts/sec

   CPUs are the bottleneck

   Test scenario: IP routing of min-sized packets
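Each row's maximum packet rate is simply the component capacity divided by the per-packet load; a
quick check of the nominal column (sketch only):

```c
/* Max packet rate per component = capacity / per-packet load
 * (nominal numbers from the table above). */
#include <stdio.h>

int main(void) {
    printf("memory:       %.0f Mpkts/sec\n", 51e9   / 725  / 1e6);  /* ~70  */
    printf("I/O:          %.0f Mpkts/sec\n", 16e9   / 191  / 1e6);  /* ~84  */
    printf("inter-socket: %.0f Mpkts/sec\n", 25e9   / 231  / 1e6);  /* ~108 */
    printf("CPUs:         %.1f Mpkts/sec\n", 22.4e9 / 1693 / 1e6);  /* ~13.2: the lowest, so CPUs bind */
    return 0;
}
```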
Recap: single-server performance

                                          R              NR
   current servers
   (realistic packet sizes)          1/10 Gbps       36.5 Gbps

   current servers
   (min-sized packets)                   1            6.35
                                                  (CPUs bottleneck)

   With upcoming servers? (2010)
      4x cores, 2x memory, 2x I/O




Recap: single-server performance

                                          R              NR
   upcoming servers – estimated
   (realistic packet sizes)          1/10/40           146

   upcoming servers – estimated
   (min-sized packets)                 1/10            25.4


Project Feedback from Meetings

• Update your project descriptions and plan
   – Turn your description/plan into a living document in Google Docs
   – Share Google Docs link with us
   – Update plan/progress throughout the semester
• Later this week: register your project and proposal on class Website (through project link)

• Questions to address:
   – What is your evaluation methodology?
   – What will you compare/evaluate against? Strawman?
   – What are your evaluation metrics?
   – What is your typical workload? Trace-based, analytical, …
   – Create a concrete staged project execution plan:
      » Set reasonable initial goals with incremental milestones – always have
        something to show/results for project
Practical Architecture: Goal

• scale software routers to multiple 10Gbps ports

• example: 320Gbps (32x 10Gbps ports)
   – higher-end of edge routers; lower-end core routers


A cluster-based router today

[Figure: multiple servers, each with a 10Gbps external port, connected by an interconnect of
unspecified design (?)]




Interconnecting servers

Challenges
   – any input can send up to R bps to any output


A naïve solution

[Figure: servers connected in a full mesh by N^2 internal links, each of capacity R; each server also
has a 10Gbps external port]

   problem: commodity servers cannot accommodate NxR traffic
Interconnecting servers

Challenges
   – any input can send up to R bps to any output
      » but need a low-capacity interconnect (~NR)
      » i.e., fewer (<N), lower-capacity (<R) links per server
   – must cope with overload


Overload

[Figure: three 10Gbps inputs all sending to one 10Gbps output; 20Gbps must be dropped]

   drop at input servers?
      problem: requires global state (to drop fairly across input ports)
   drop at output server?
      problem: output might receive up to NxR traffic




Interconnecting servers

Challenges
   – any input can send up to R bps to any output
      » but need a lower-capacity interconnect
      » i.e., fewer (<N), lower-capacity (<R) links per server
   – must cope with overload
      » need distributed dropping without global scheduling
      » processing at servers should scale as R, not NxR

With constraints (due to commodity servers and NICs)
   – internal link rates ≤ R
   – per-node processing: cxR (small c)
   – limited per-node fanout

Solution: Use Valiant Load Balancing (VLB)
Valiant Load Balancing (VLB)

• Valiant et al. [STOC'81], communication in multi-processors

• applied to data centers [Greenberg'09], all-optical routers [Kesslassy'03],
  traffic engineering [Zhang-Shen'04], etc.

• idea: random load-balancing across a low-capacity interconnect


VLB: operation

Packets are forwarded in two phases:

   phase 1: packets arriving at an external port are uniformly load-balanced across all servers
      • N^2 internal links of capacity R/N
      • each server receives up to R bps

   phase 2: each server sends up to R/N (of the traffic received in phase 1) to the output server;
            the output server transmits the received traffic on its external port and drops any
            excess fairly
      • N^2 internal links of capacity R/N
      • each server receives up to R bps
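A toy sketch of the two-phase decision at a single server, in C. It randomizes per packet for
simplicity; the actual RouteBricks implementation balances at flowlet granularity to limit
reordering, and route_to_output()/send_internal() are hypothetical helpers:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

#define N 32   /* number of servers, one external port each */

/* Hypothetical helpers: destination lookup and internal-link transmit. */
extern int  route_to_output(uint32_t dst_addr);
extern void send_internal(int server_id, const void *pkt, size_t len);

/* Phase 1: spray each packet arriving on the external port to a uniformly
 * random intermediate server, so each internal link carries at most ~R/N
 * of this server's ingress traffic on average. */
void phase1_ingress(const void *pkt, size_t len) {
    int intermediate = rand() % N;
    send_internal(intermediate, pkt, len);
}

/* Phase 2: as an intermediate, look up the true output server and forward;
 * the output server transmits on its external port and drops excess fairly. */
void phase2_intermediate(const void *pkt, size_t len, uint32_t dst_addr) {
    int output = route_to_output(dst_addr);
    send_internal(output, pkt, len);
}
```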




VLB: operation

   phase 1+2 combined:
   • N^2 internal links of capacity 2R/N
   • each server receives up to 2R bps of internal traffic,
     plus R bps from its external port
   • hence, each server processes up to 3R
   • or up to 2R, when traffic is uniform [directVLB, Liu'05]


VLB: fanout? (1)

[Figure: reducing fanout by using fewer but faster links, or fewer but faster servers]

   Multiple external ports per server (if server constraints permit)
VLB: fanout? (2)

   Use extra servers to form a constant-degree multi-stage interconnect (e.g., butterfly)


Authors' solution:

• assign maximum external ports per server
• servers interconnected with commodity NIC links
• servers interconnected in a full mesh if possible
• else, introduce extra servers in a k-degree butterfly
• servers run flowlet-based VLB




Outline

• introduction
• routing on a single server
   – design
   – evaluation
• routing on a cluster
   – design
   – evaluation
• next steps
• conclusion


Scalability

• question: how well does clustering scale for realistic server fanout and
  processing capacity?

• metric: number of servers required to achieve a target router speed
Scalability

Assumptions
• 7 NICs per server
• each NIC has 6x 10Gbps ports or 8x 1Gbps ports
• current servers
   – one external 10Gbps port per server
     (i.e., requires that a server process 20-30Gbps)
• upcoming servers
   – two external 10Gbps ports per server
     (i.e., requires that a server process 40-60Gbps)


Example: 320Gbps

• R=10Gbps, N=32
• with current servers: 1x 10Gbps external port
   – target: 32 servers
   – 2R/N < 1Gbps → need: 1Gbps internal links
   – 8x 1Gbps ports/NIC → need: 4 NICs per server




Scalability (computed)

Number of servers required:

                       160Gbps   320Gbps   640Gbps   1.28Tbps   2.56Tbps
   current servers        16        32        128       256        512
   upcoming servers        8        16         32       128        256

   (transition from mesh to butterfly as the target capacity grows)

Example: can build a 320Gbps router using 32 `current' servers
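The arithmetic behind the 320Gbps/32-server row, as a quick check (full-mesh case, using the slide's
NIC assumptions):

```c
/* Sanity check of the 320Gbps example: R=10Gbps, N=32, full-mesh VLB. */
#include <stdio.h>

int main(void) {
    double R = 10.0;           /* Gbps per external port                       */
    int    N = 32;             /* servers, one external port each              */

    double internal_link = 2.0 * R / N;       /* VLB phase 1+2 load per link   */
    int    links_needed  = N - 1;             /* full mesh: a link to every other server */
    int    ports_per_nic = 8;                 /* 8x 1Gbps ports/NIC (slide assumption)   */
    int    nics_needed   = (links_needed + ports_per_nic - 1) / ports_per_nic;

    printf("internal link rate: %.3f Gbps (< 1Gbps, so 1Gbps links suffice)\n", internal_link);
    printf("NICs per server: %d\n", nics_needed);   /* 31 links over 8-port NICs -> 4 NICs */
    return 0;
}
```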
Implementation: the RB8/4

[Figure: 4x Nehalem servers, each with 2x 10Gbps external ports (Intel Niantic NICs)]

Specs.
   • 8x 10Gbps external ports
   • form-factor: 4U
   • power: 1.2KW
   • cost: ~$10k

Key results (realistic traffic)
   • 72 Gbps routing
   • reordering: 0-0.15%
   • validated VLB bounds


Is this a good paper?

• What were the authors' goals?
• What about the evaluation/metrics?
• Did they convince you that this was a good system/approach?
• Were there any red-flags?
• What mistakes did they make?
• Does the system/approach meet the "Test of Time" challenge?
• How would you review this paper today?

				