Networking The Cloud

Albert Greenberg
Principal Researcher
albert@microsoft.com
(work with James Hamilton, Srikanth Kandula, Dave Maltz, Parveen Patel,
Sudipta Sengupta, Changhoon Kim)
Agenda

• Data Center Costs
   – Importance of Agility
• Today’s Data Center Network
• A Better Way?




             Albert Greenberg, ICDCS 2009 keynote   2
 Data Center Costs
Amortized Cost*      Component                         Sub-Components
~45%                 Servers                           CPU, memory, disk
~25%                 Power infrastructure              UPS, cooling, power distribution
~15%                 Power draw                        Electrical utility costs
~15%                 Network                           Switches, links, transit

 • Total cost varies
       – Upwards of $1/4 B for mega data center
       – Server costs dominate
       – Network costs significant
The Cost of a Cloud: Research Problems in Data Center Networks.
Sigcomm CCR 2009. Greenberg, Hamilton, Maltz, Patel.
*3 yr amortization for servers, 15 yr for infrastructure; 5% cost of money
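
As a rough illustration of the amortization model behind the table (straight annuity payments at a 5% annual cost of money, 3-year life for servers, 15-year life for infrastructure), here is a minimal sketch. The capital outlays are invented purely to show the mechanics; they are not the figures behind the percentages above.

```python
# Hypothetical illustration of the amortization footnote above: annuity
# payments at 5% annual cost of money, 3-year servers, 15-year infrastructure.
# The capex numbers are made up for illustration only.

def monthly_payment(capex, years, annual_rate=0.05):
    """Standard annuity payment for capital amortized over `years`."""
    r = annual_rate / 12.0          # monthly cost of money
    n = years * 12                  # number of monthly payments
    return capex * r / (1.0 - (1.0 + r) ** -n)

servers_capex = 110e6               # assumed server spend for a mega data center
infra_capex   = 90e6                # assumed power/cooling infrastructure spend

print(f"servers: ${monthly_payment(servers_capex, 3)/1e6:.2f}M per month")
print(f"infrastructure: ${monthly_payment(infra_capex, 15)/1e6:.2f}M per month")
```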

                     Albert Greenberg, ICDCS 2009 keynote                                 3
Server Costs

Ugly secret: 30% utilization considered “good” in data centers
Causes include:
• Uneven application fit:
    – Each server has CPU, memory, disk: most applications exhaust
      one resource, stranding the others
• Long provisioning timescales:
    – New servers purchased quarterly at best
• Uncertainty in demand:
    – Demand for a new service can spike quickly
• Risk management:
    – Not having spare servers to meet demand brings failure just when
      success is at hand
• Session state and storage constraints
    – If the world were stateless servers, life would be good

                                                                         4
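
The resource-stranding point on this slide can be made concrete with a toy calculation; the server shape and per-instance demands below are invented for illustration, not measured values.

```python
# Toy illustration of resource stranding: an application exhausts one
# resource, leaving the rest of the server idle. Numbers are invented.

server = {"cpu_cores": 16, "memory_gb": 64, "disk_tb": 4}
app_demand_per_instance = {"cpu_cores": 1, "memory_gb": 8, "disk_tb": 0.1}

# How many instances fit? The tightest resource decides.
instances = min(server[r] // app_demand_per_instance[r]
                for r in server if app_demand_per_instance[r] > 0)

utilization = {r: instances * app_demand_per_instance[r] / server[r]
               for r in server}
print(instances, utilization)
# 8 instances: memory is 100% used while CPU sits at 50% and disk at 20%
```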
Goal: Agility – Any service, Any Server

• Turn the servers into a single large fungible pool
   – Let services “breathe” : dynamically expand and contract their
     footprint as needed
• Benefits
   – Increase service developer productivity
   – Lower cost
   – Achieve high performance and reliability

  The 3 motivators of most infrastructure projects




                                                                      5
Achieving Agility

• Workload management
   – Means for rapidly installing a service’s code on a server
   – Virtual machines, disk images 
• Storage Management
   – Means for a server to access persistent data
   – Distributed filesystems (e.g., blob stores) 
• Network
   – Means for communicating with other servers, regardless of where
     they are in the data center




                                                                       6
Network Objectives

Developers want a mental model where all their servers, and
only their servers, are plugged into an Ethernet switch
1. Uniform high capacity
   – Capacity between servers limited only by their NICs
   – No need to consider topology when adding servers
2. Performance isolation
   – Traffic of one service should be unaffected by others
3. Layer-2 semantics
   – Flat addressing, so any server can have any IP address
   – Server configuration is the same as in a LAN
   – Legacy applications depending on broadcast must work

                                                              7
Agenda

• Data Center Costs
   – Importance of Agility
• Today’s Data Center Network
• A Better Way?




                                8
The Network of a Modern Data Center
     [Figure: the data center connects to the Internet through L3 Core
     Routers (CR) and L3 Access Routers (AR); below the Layer 3 / Layer 2
     boundary, L2 Switches (S) and Load Balancers (LB) fan out to racks (A)
     of 20 servers, each with a Top of Rack switch; ~4,000 servers per pod.
     Ref: Data Center: Load Balancing Data Center Services, Cisco 2004]

• Hierarchical network; 1+1 redundancy
• Equipment higher in the hierarchy handles more traffic, is more
  expensive, and gets more availability effort → a scale-up design
• Servers connect via 1 Gbps UTP to Top of Rack switches
• Other links are mix of 1G, 10G; fiber, copper
                            Albert Greenberg, ICDCS 2009 keynote                                       9
Internal Fragmentation Prevents Applications from
Dynamically Growing/Shrinking
  [Figure: the same hierarchical topology as before, partitioned into VLANs
  rooted under pairs of Access Routers]

• VLANs used to isolate properties from each other
• IP addresses topologically determined by ARs
• Reconfiguration of IPs and VLAN trunks painful, error-
  prone, slow, often manual
                   Albert Greenberg, ICDCS 2009 keynote                                      10
No Performance Isolation

  [Figure: the same topology; a traffic surge in one subtree causes
  “collateral damage” to the other services sharing it]

• VLANs typically provide reachability isolation only
• One service sending/receiving too much traffic hurts all
  services sharing its subtree

                   Albert Greenberg, ICDCS 2009 keynote                                      11
Network has Limited Server-to-Server Capacity,
and Requires Traffic Engineering to Use What It Has
  [Figure: the same topology; links between layers are 10:1 over-subscribed
  or worse (80:1, 240:1)]

• Data centers run two kinds of applications:
   – Outward facing (serving web pages to users)
   – Internal computation (computing search index – think HPC)

                      Albert Greenberg, ICDCS 2009 keynote                                      12
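
The oversubscription ratios quoted on this slide follow from simple division of downstream capacity by upstream capacity; the sketch below uses illustrative (not measured) numbers to show how 10:1 at the ToR compounds to 80:1 across pods.

```python
# Illustrative oversubscription arithmetic (assumed, typical numbers):
# oversubscription at a layer = capacity facing down / capacity facing up.

servers_per_rack = 20
server_nic_gbps  = 1
tor_uplinks_gbps = 2 * 1            # e.g., two 1G uplinks from each ToR

tor_oversub = (servers_per_rack * server_nic_gbps) / tor_uplinks_gbps
print(f"ToR oversubscription: {tor_oversub:.0f}:1")        # 10:1

# Ratios compound up the tree: if the aggregation layer adds another 8:1,
# two servers in different pods see 10 * 8 = 80:1 between them.
agg_oversub = 8
print(f"cross-pod oversubscription: {tor_oversub * agg_oversub:.0f}:1")  # 80:1
```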
Network Needs Greater Bisection BW,
and Requires Traffic Engineering to Use What It Has
  [Figure: the same topology, annotated: dynamic reassignment of servers and
  Map/Reduce-style computations mean the traffic matrix is constantly
  changing; explicit traffic engineering is a nightmare]

• Data centers run two kinds of applications:
   – Outward facing (serving web pages to users)
   – Internal computation (computing search index – think HPC)

                     Albert Greenberg, ICDCS 2009 keynote                    13
What Do Data Center Faults Look Like?
• Need very high reliability near top of the tree
    – Very hard to achieve
    – Example: failure of a temporarily unpaired core switch affected ten
      million users for four hours
    – 0.3% of failure events knocked out all members of a network
      redundancy group

  [Figure: conventional hierarchical topology.
  Ref: Data Center: Load Balancing Data Center Services, Cisco 2004]
VL2: A Flexible and Scalable Data Center Network. Sigcomm 2009.
Greenberg, Jain, Kandula, Kim, Lahiri, Maltz, Patel, Sengupta.

                  Albert Greenberg, ICDCS 2009 keynote                                                                 14
Agenda

• Data Center Costs
   – Importance of Agility
• Today’s Data Center Network
• A Better Way?




             Albert Greenberg, ICDCS 2009 keynote   15
Agenda

• Data Center Costs
   – Importance of Agility
• Today’s Data Center Network
• A Better Way?
   – Building Blocks
   – Traffic
   – New Design




             Albert Greenberg, ICDCS 2009 keynote   16
Switch on Chip ASICs
  [Figure: ASIC floorplan of a switch-on-a-chip, paired with a general
  purpose CPU for the control plane. The ASIC contains forwarding tables, a
  forwarding pipeline, packet buffer memory, and SerDes transceivers.]
• Current design points
   – 24 port 1G + 4 port 10G, 16K IPv4 fwd entries, 2 MB buffer
   – 24 port 10G Eth, 16K IPv4 fwd entries, 2 MB buffer
• Future
   – 48 port 10G, 16K fwd entries, 4 MB buffer
   – Trends towards more ports, faster port speed
               Albert Greenberg, ICDCS 2009 keynote                                  17
Packaging

• Switch
   – Combine ASICs
        · Silicon fab costs drive ASIC price
        · Market size drives packaged switch price
   – Economize links: on chip < on PCB < on chassis < between chassis
   – Example: 144 port 10G switch, built from 24 port switch ASICs in a
     single chassis
• Link technologies
   – SFP+ 10G port: $100, MM fiber, 300 m reach
   – QSFP (Quad SFP): 40G port available today; 4 x 10G bound together
   – Fiber “ribbon cables”: up to 72 fibers per cable, to a single MT
     connector
               Albert Greenberg, ICDCS 2009 keynote                                     18
Latency

• Propagation delay in the data center is essentially 0
    – Light goes a foot in a nanosecond; 1000’ = 1 usec
• End to end latency comes from
    – Switching latency
         · 10G to 10G: ~2.5 usec (store & forward); 2 usec (cut-through)
    – Queueing latency
         · Depends on size of queues and network load
• Typical times across a quiet data center: 10–20 usec
• Worst-case measurement (from our testbed, not a real DC, with all-to-all
  traffic pounding and link utilization > 86%): 2–8 ms
• Comparison:
    – Time across a typical host network stack is 10 usec
• Application developer SLAs > 1 ms granularity


                 Albert Greenberg, ICDCS 2009 keynote                   19
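
A back-of-the-envelope version of the latency budget on this slide; the hop count is an assumption, not a measured path.

```python
# Illustrative latency budget (assumes a 5-hop path across a quiet data center).

path_feet      = 1000      # ~1000 feet of cabling -> ~1 usec of propagation
switch_hops    = 5
cut_through_us = 2.0       # per-hop cut-through switching latency
queueing_us    = 0.0       # ~0 on a quiet network; dominates under heavy load

network_us = path_feet / 1000.0 + switch_hops * cut_through_us + queueing_us
print(f"network latency ≈ {network_us:.0f} usec")   # ~11 usec, in the 10-20 usec range
```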
Agenda

• Data Center Costs
   – Importance of Agility
• Today’s Data Center Network
• A Better Way?
   – Building Blocks
   – Traffic
   – New Design




             Albert Greenberg, ICDCS 2009 keynote   20
Measuring Traffic in Today’s Data Centers

• 80% of the packets stay inside the data center
   – Data mining, index computations, back end to front end
   – Trend is towards even more internal communication
• Detailed measurement study of data mining cluster
   – 1,500 servers, 79 Top of Rack (ToR) switches
   – Logged: 5-tuple and size of all socket-level R/W ops
   – Aggregated into flows – all activity separated by < 60 s
   – Aggregated into traffic matrices every 100 s
        · Src, Dst, Bytes of data exchanged




              Albert Greenberg, ICDCS 2009 keynote            21
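
A rough sketch of the aggregation step described on this slide: socket-level reads/writes are grouped into flows when gaps stay under 60 s, and binned into 100 s traffic matrices. The record format and function names are assumptions, not the production pipeline.

```python
from collections import defaultdict

FLOW_GAP_S = 60    # activity separated by < 60 s belongs to the same flow
TM_BIN_S   = 100   # one traffic matrix per 100 s

def aggregate(records):
    """records: (t_sec, src, dst, nbytes) socket-level R/W logs, time-sorted.
    Returns (flows, traffic_matrices)."""
    flows = []                                   # (src, dst, start, end, bytes)
    open_flows = {}                              # (src, dst) -> [start, last, bytes]
    tms = defaultdict(lambda: defaultdict(int))  # bin -> {(src, dst): bytes}

    for t, src, dst, nbytes in records:
        key = (src, dst)
        f = open_flows.get(key)
        if f is None or t - f[1] >= FLOW_GAP_S:   # gap too long: close old flow
            if f is not None:
                flows.append((src, dst, f[0], f[1], f[2]))
            f = open_flows[key] = [t, t, 0]
        f[1], f[2] = t, f[2] + nbytes
        tms[int(t // TM_BIN_S)][key] += nbytes    # bin into 100 s traffic matrices

    flows += [(s, d, f[0], f[1], f[2]) for (s, d), f in open_flows.items()]
    return flows, tms
```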
Flow Characteristics
DC traffic != Internet traffic

  [Figure: flow size and concurrency distributions]
  – Most of the flows: various mice
  – Most of the bytes: within 100 MB flows
  – Median of 10 concurrent flows per server



              Albert Greenberg, ICDCS 2009 keynote                    22
Traffic Matrix Volatility

  [Figure: clustering of the 100 s traffic matrices over one day]
  – Collapse similar traffic matrices (over 100 s) into “clusters”
  – Need 50–60 clusters to cover a day’s traffic
  – Traffic pattern changes nearly constantly
  – Run length is 100 s at the 80th percentile; 800 s at the 99th
           Albert Greenberg, ICDCS 2009 keynote                            23
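
The slide does not spell out the clustering procedure, so the following is only a minimal greedy sketch under an assumed distance threshold, to make the “collapse similar matrices into clusters” step concrete.

```python
import math

def tm_distance(a, b):
    """Euclidean distance between two traffic matrices {(src, dst): bytes}."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def greedy_clusters(tms, threshold):
    """Assign each 100 s traffic matrix to the first cluster whose
    representative lies within `threshold`; otherwise open a new cluster."""
    reps, labels = [], []
    for tm in tms:
        for i, rep in enumerate(reps):
            if tm_distance(tm, rep) <= threshold:
                labels.append(i)
                break
        else:
            labels.append(len(reps))
            reps.append(tm)
    return reps, labels
```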
   Today, Computation Constrained by Network*




   Figure: ln(Bytes/10sec) between servers in operational cluster
   • Great efforts required to place communicating servers under the same ToR
        · Most traffic lies on the diagonal (w/o log scale all you see is the diagonal)
   • Stripes show there is need for inter-ToR communication
*Kandula, Sengupta, Greenberg, Patel
                           Albert Greenberg, ICDCS 2009 keynote                       24
  Congestion: Hits Hard When it Hits*




*Kandula, Sengupta, Greenberg, Patel
                          Albert Greenberg, ICDCS 2009 keynote   25
Agenda
• Data Center Costs
   – Importance of Agility
• Today’s Data Center Network
• A Better Way?
   – Building Blocks
   – Traffic
   – New Design

References:
• VL2: A Flexible and Scalable Data Center Network. Sigcomm 2009. Greenberg,
  Hamilton, Jain, Kandula, Kim, Lahiri, Maltz, Patel, Sengupta.
• Towards a Next Generation Data Center Architecture: Scalability and
  Commoditization. Presto 2008. Greenberg, Maltz, Patel, Sengupta, Lahiri.
• PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric.
  Sigcomm 2009. Mysore, Pamboris, Farrington, Huang, Miri, Radhakrishnan,
  Subramanya, Vahdat.
• BCube: A High Performance, Server-centric Network Architecture for Modular
  Data Centers. Sigcomm 2009. Guo, Lu, Li, Wu, Zhang, Shi, Tian, Zhang, Lu.
             Albert Greenberg, ICDCS 2009 keynote                                       26
VL2: Distinguishing Design Principles
• Randomizing to Cope with Volatility
   – Tremendous variability in traffic matrices
• Separating Names from Locations
   – Any server, any service
• Embracing End Systems
   – Leverage the programmability & resources of servers
   – Avoid changes to switches
• Building on Proven Networking Technology
   – We can build with parts shipping today
   – Leverage low cost, powerful merchant silicon ASICs,
     though do not rely on any one vendor

             Albert Greenberg, ICDCS 2009 keynote          27
What Enables a New Solution Now?

• Programmable switches with high port density
   – Fast: ASIC switches on a chip (Broadcom, Fulcrum, …)
   – Cheap: Small buffers, small forwarding tables
   – Flexible: Programmable control planes
• Centralized coordination
   – Scale-out data centers are not like enterprise networks
   – Centralized services already control/monitor health and role of each
     server (Autopilot)
   – Centralized directory and control plane acceptable (4D)
  [Photo: 20 port 10GE switch. List price: $10K]
               Albert Greenberg, ICDCS 2009 keynote                                       28
 An Example VL2 Topology: Clos Network
  [Figure: example Clos topology. D/2 intermediate node switches (used in
  VLB) at the top, each with D ports; D aggregation switches below, each
  with D/2 ports up and D/2 ports down; 10G links; Top of Rack switches
  with 20 server ports at the bottom, giving [D^2/4] * 20 servers.]

  Node degree (D) of available switches & # servers supported:

      D      # Servers in pool
      4                     80
     24                  2,880
     48                 11,520
    144                103,680

• A scale-out design with broad layers (see the sizing sketch after this
  slide)
   • Same bisection capacity at each layer → no oversubscription
   • Extensive path diversity → graceful degradation under failure
                                Albert Greenberg, ICDCS 2009 keynote                                           29
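
A minimal sketch of the sizing rule on this slide: with D-port switches and 20-port ToRs, the fabric supports [D^2/4] * 20 servers, which reproduces the table above.

```python
# Clos sizing rule from the slide: servers supported = (D^2 / 4) * 20.

def servers_supported(d, servers_per_tor=20):
    """Servers supported by the example Clos fabric for switch degree d."""
    return (d * d // 4) * servers_per_tor

for d in (4, 24, 48, 144):
    print(f"D={d:3d}: {servers_supported(d):,} servers")
# D=  4: 80 servers ... D=144: 103,680 servers, matching the table above
```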
 Use Randomization to Cope with Volatility
  [Figure: the same Clos topology and sizing table as on the previous slide]
• Valiant Load Balancing
   – Every flow “bounced” off a random intermediate switch
   – Provably hotspot free for any admissible traffic matrix
   – Servers could randomize flow-lets if needed
                                Albert Greenberg, ICDCS 2009 keynote                                           30
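
A minimal sketch of the per-flow “bounce” described above: each flow is mapped to a pseudo-random intermediate switch. Hashing the flow's 5-tuple keeps all packets of one flow on the same path; the hash choice and switch names here are illustrative assumptions, not VL2's exact mechanism.

```python
import hashlib

INTERMEDIATE_SWITCHES = [f"int-{i}" for i in range(12)]   # assumed pool

def pick_intermediate(five_tuple):
    """Deterministically map a flow (src, dst, sport, dport, proto) to one
    intermediate switch, approximating a uniform random choice per flow."""
    digest = hashlib.sha1(repr(five_tuple).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(INTERMEDIATE_SWITCHES)
    return INTERMEDIATE_SWITCHES[index]

print(pick_intermediate(("10.0.1.5", "10.0.9.7", 51122, 80, "tcp")))
```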
Separating Names from Locations:
How Smart Servers Use Dumb Switches
  [Figure: packet walk from Source (S) to Dest (D). (1) S sends the packet
  with stacked headers Dest: N / Dest: TD / Dest: D, Src: S, Payload to its
  ToR (TS); (2) TS forwards it to the Intermediate Node (N); (3) N strips
  its header and forwards Dest: TD / Dest: D toward the destination ToR
  (TD); (4) TD strips its header and delivers Dest: D, Src: S, Payload to D.]

• Encapsulation used to transfer complexity to servers
   – Commodity switches have simple forwarding primitives
   – Complexity moved to computing the headers
• Many types of encapsulation available
   – IEEE 802.1ah defines MAC-in-MAC encapsulation; VLANs; etc.

                        Albert Greenberg, ICDCS 2009 keynote                                             31
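
A toy sketch of the two-level encapsulation shown in the figure above, not VL2's actual packet format: the sender wraps the application packet first with the destination ToR address (TD), then with a VLB intermediate (N); each hop that terminates a header strips it.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    dst: str
    src: str
    payload: object          # inner packet or application bytes

def encapsulate(app_payload, src, dst, dst_tor, intermediate):
    inner = Packet(dst=dst, src=src, payload=app_payload)   # Dest: D
    mid   = Packet(dst=dst_tor, src=src, payload=inner)     # Dest: TD
    outer = Packet(dst=intermediate, src=src, payload=mid)  # Dest: N
    return outer

def decapsulate(pkt):
    """Performed at the intermediate node and again at the destination ToR."""
    return pkt.payload

pkt = encapsulate(b"hello", "S", "D", "TD", "N")
assert decapsulate(decapsulate(pkt)).dst == "D"
```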
 Embracing End Systems

  [Figure: server machine running the VL2 Agent. Application TCP/IP traffic
  to a remote IP passes through an encapsulator and a MAC resolution cache
  (taking the place of ARP) before reaching the NIC; cache misses are
  resolved by the Directory System, which also tracks server role, server
  health, and network health.]

 • Data center OSes already heavily modified for VMs,
   storage clouds, etc.
    – A thin shim for network support is no big deal
 • No change to applications or clients outside DC
                       Albert Greenberg, ICDCS 2009 keynote                       32
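
A rough sketch of the lookup path implied by the figure: the shim resolves a remote server IP through a local cache, falling back to the directory system on a miss. The directory client API shown here is hypothetical.

```python
class VL2Agent:
    """Sketch of the shim's address resolution: a local cache in front of
    the directory system (the directory client interface is assumed)."""

    def __init__(self, directory):
        self.directory = directory   # object with lookup(ip) -> (tor, intermediates)
        self.cache = {}              # remote IP -> (tor_addr, intermediate_addrs)

    def resolve(self, remote_ip):
        entry = self.cache.get(remote_ip)
        if entry is None:                      # miss: ask the directory service
            entry = self.directory.lookup(remote_ip)
            self.cache[remote_ip] = entry      # replaces what ARP would have done
        return entry
```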
VL2 Prototype




• 4 ToR switches, 3 aggregation switches, 3 intermediate switches
• Experiments conducted with both 40 and 80 servers


                 Albert Greenberg, ICDCS 2009 keynote               33
VL2 Achieves Uniform High Throughput




• Experiment: all-to-all shuffle of 500 MB among 75 servers – 2.7 TB
   • Excellent metric of overall efficiency and performance
   • All2All shuffle is superset of other traffic patterns
• Results:
   • Ave goodput: 58.6 Gbps; Fairness index: .995; Ave link util: 86%
• Perfect system-wide efficiency would yield aggregate goodput of 75G
   – VL2 efficiency is 78% of perfect
   – 10% inefficiency due to duplexing issues; 7% header overhead
   – VL2 efficiency is 94% of optimal
                 Albert Greenberg, ICDCS 2009 keynote                   34
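
One way to reproduce the efficiency figures quoted above, as a back-of-the-envelope check rather than the paper's exact accounting:

```python
# Back-of-the-envelope check of the shuffle numbers above (illustrative
# bookkeeping; rounding differs slightly from the quoted 94%).

servers          = 75
avg_goodput_gbps = 58.6
perfect_gbps     = servers * 1.0        # every 1G NIC saturated -> 75 Gbps

print(f"share of perfect: {avg_goodput_gbps / perfect_gbps:.0%}")    # ~78%

duplex_loss  = 0.10                     # duplexing issues
header_loss  = 0.07                     # encapsulation header overhead
optimal_gbps = perfect_gbps * (1 - duplex_loss) * (1 - header_loss)

print(f"share of optimal: {avg_goodput_gbps / optimal_gbps:.0%}")    # ~93-94%
```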
VL2 Provides Performance Isolation




  [Figure: goodput over time – Service 1 unaffected by Service 2’s activity]




          Albert Greenberg, ICDCS 2009 keynote                    35
VL2 is resilient to link failures




    - Performance degrades and recovers gracefully as
               links are failed and restored


             Albert Greenberg, ICDCS 2009 keynote       36
Summary

  Amortized Cost    Component        Sub-Components
  ~45%              Servers          CPU, memory, disk
  ~25%              Infrastructure   UPS, cooling, power distribution
  ~15%              Power draw       Electrical utility costs
  ~15%              Network          Switches, links, transit

  [Figure: the Clos/VLB topology from the earlier slides]

• It’s about agility
    – Increase data center capacity
         · Any service on any server, anywhere in the data center
• VL2 enables agility and ends oversubscription
    – Results have near perfect scaling
    – Increases service developer productivity
         · A simpler abstraction – all servers plugged into one huge Ethernet
           switch
    – Lowers cost
         · High scale server pooling, with tenant perf isolation
         · Commodity networking components
    – Achieves high performance and reliability
         · Gives us some confidence that design will scale out
              – Prototype behaves as theory predicts
                      Albert Greenberg, ICDCS 2009 keynote                                                                                                                                                                      37
Thank You




       Albert Greenberg, ICDCS 2009 keynote   38
Lay of the Land

  [Figure: users reach data centers across the Internet, via CDNs and ECNs]




          Albert Greenberg, ICDCS 2009 keynote                          39
Problem

• Scope
   – 200+ online properties
   – In excess of one PB a day to/from the Internet
• Cost and Performance
   – Cost : ~$50M spend in FY08
   – Performance [KDD’07; Kohavi]
        · Amazon: 1% sales loss for an extra 100 ms delay
        · Google: 20% sales loss for an extra 500 ms delay
        · Apps are chatty: N ∙ RTT can quickly get to 100 ms and beyond
• How to manage for the sweet spot of the cost/perf
  tradeoff?
   – Hard problem
        · 200 properties x 18 data centers x 250K prefixes x 10 alternate
          paths
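
The scale of the search space above, and the chattiness point, are just arithmetic; the RTT and round-trip count below are illustrative values, not measurements.

```python
# Rough scale of the search space mentioned above (pure arithmetic):
properties, data_centers, prefixes, alt_paths = 200, 18, 250_000, 10
combinations = properties * data_centers * prefixes * alt_paths
print(f"{combinations:,} (property, DC, prefix, path) combinations")  # 9,000,000,000

# Why chattiness hurts: N dependent round trips at a given RTT.
rtt_ms, round_trips = 25, 6            # illustrative values
print(f"user-perceived delay ≈ {rtt_ms * round_trips} ms")            # 150 ms
```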


                Albert Greenberg, ICDCS 2009 keynote
                                   Microsoft Confidential                    40
Perspective

• Good News
  – Extremely well-connected to the
    Internet (1,200 ASes!)
  – 18+ data centers (mega and micro)
    around the world
• Bad news
  – BGP (Border Gateway Protocol) picks one “default” path
       · oblivious to cost and performance
  – What’s best depends on the
    property, the data center, the
    egress options, and the user
  – Humans (GNS engineers) face an
    increasingly difficult task



                Albert Greenberg, ICDCS 2009 keynote
                                   Microsoft Confidential   41
Optimization

• Sweet Spot, trading off Perf and Cost




              Albert Greenberg, ICDCS 2009 keynote
                                 Microsoft Confidential   42
Solution

• Produce the ENTACT curve, for TE strategies
• Automatically choose optimal operating points under certain high-level
  objectives

  [Figure: cost vs. performance curve; Default Routing sits at one extreme
  and the Sweet Spot is marked on the curve]
               Albert Greenberg, ICDCS 2009 keynote
                                  Microsoft Confidential              43
Conventional Networking Equipment

• Modular routers
   – Chassis $20K
   – Supervisor card $18K
• Ports
   – 8 port 10G-X – $25K
   – 1 GB buffer memory
   – Max ~120 ports of 10G per switch
• Total price in common configurations: $150–200K (+ SW & maintenance)
• Power: ~2–5 KW
• Integrated Top of Rack switches
   – 48 port 1GBase-T
   – 2–4 ports 1 or 10G-X
   – $7K
• Load Balancers
   – Spread TCP connections over servers
   – $50–$75K each
   – Used in pairs


                Albert Greenberg, ICDCS 2009 keynote                                 44
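
For contrast with the merchant-silicon switch quoted earlier in the deck, a simple cost-per-port comparison can be made from these list prices. The midpoint and the pairing of the two price points are assumptions for illustration, not a quoted comparison.

```python
# Illustrative cost-per-10G-port comparison using the list prices above and
# the $10K 20-port 10GE commodity switch quoted earlier in the deck.

modular_router_cost  = 175_000     # assumed midpoint of the $150-200K range
modular_router_ports = 120         # max ~120 ports of 10G per switch

commodity_cost  = 10_000           # 20 port 10GE switch, list price
commodity_ports = 20

print(f"modular router: ${modular_router_cost / modular_router_ports:,.0f} per 10G port")
print(f"commodity ASIC switch: ${commodity_cost / commodity_ports:,.0f} per 10G port")
# roughly $1,458 vs $500 per port, before optics and redundancy
```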
VLB vs. Adaptive vs. Best Oblivious Routing




• VLB does as well as adaptive routing (traffic engineering
 using an oracle) on Data Center traffic
• Worst link is 20% busier with VLB, median is same
               Albert Greenberg, ICDCS 2009 keynote           45

								