Network Management (II):
Fault Diagnosis




Based on slides of Nick Feamster and Ramana R. Kompella
                                                     Part 3.5   1
Outline

  rcc [Feamster et al. 2005]
     Routing as a distributed program
     Apply static analysis to proactively detect BGP
      configuration faults before deployment
     Neither sound nor complete but useful



  SCORE [Kompella et al. 2005]
    Model shared risks among network elements
    Correlate faults from different network layers
    Occam’s Razor

Detecting BGP configuration
 faults with static analysis




Is correctness really that important?
  The Internet is increasingly becoming part of the mission-critical
   infrastructure (a public utility!).




      Big problem: Very poor understanding
                   of how to manage it.
Why does routing go wrong?

  Complex policies
     Competing / cooperating networks
     Each with only limited visibility

  Large scale
    Tens of thousands of networks
    …each with hundreds of routers
    …each routing to hundreds of thousands of IP
     prefixes




What can go wrong?
  Some things are out of the hands of networking research




But…
   Two-thirds of the problems are caused by
     configuration of the routing protocol
Categories of BGP Configurations

  Filtering: route advertisement (figure labels: Customer, Competitor)
  Ranking: route selection (figure labels: Primary, Backup)
  Dissemination: internal route advertisement

  …. More flexibility brings more COMPLEXITY!
These problems are real
 “…a glitch at a small ISP… triggered a major outage in
 Internet access across the country. The problem started
 when MAI Network Services...passed bad router
 information from one of its customers onto Sprint.”
                 -- news.com, April 25, 1997

 “Microsoft's websites were offline for up to 23
 hours...because of a [router] misconfiguration…it took
 nearly a day to determine what was wrong and undo the
 changes.”
                 -- wired.com, January 25, 2001

 “WorldCom Inc…suffered a widespread outage on its
 Internet backbone that affected roughly 20 percent of its
 U.S. customer base. The network problems…affected
 millions of computer users worldwide. A spokeswoman
 attributed the outage to "a route table issue."”
                 -- cnn.com, October 3, 2002

 “A number of Covad customers went out from 5pm today
 due to, supposedly, a DDOS (distributed denial of service
 attack) on a key Level3 data center, which later was
 described as a route leak (misconfiguration).”
                 -- dslreports.com, February 23, 2004
Faults on NANOG Mailing List
  [Figure: bar chart. y-axis: # threads over stated period (0 to 90).
   x-axis: filtering, route leaks, route hijacks, route instability,
   routing loops, blackholes. One bar per period: 1994-1997, 1998-2001,
   2001-2004.]

 Note: Only includes problems openly discussed on this list.
 Example: Blackhole

Date: Thu, 18 Jul 2002 06:05:10 -0400 (EDT)
From: Chad Oleary <col@pobox.com>
Subject: Re: problems with 701
To: <nanog@merit.edu>


We're starting to see the same issues with UUNet, again. Anyone else
seeing this? Trying to reach Qwest...

traceroute to 63.146.190.1 (63.146.190.1), 30 hops max, 38 byte packets
 1 esc-lp2-gw.e-solutionscorp.com (63.118.220.1) 1.167 ms 1.163 ms 1.142 ms
 2 500.Serial2-10.GW1.TPA2.ALTER.NET (157.130.149.9) 1.097 ms 1.059 ms 1.044 ms
 3 161.at-1-0-0.XL4.ATL1.ALTER.NET (152.63.81.190) 13.839 ms 14.108 ms 16.638 ms
 4 0.so-3-1-0.XL2.ATL5.ALTER.NET (152.63.0.238) 14.370 ms 14.587 ms 14.553 ms
 5 POS7-0.BR2.ATL5.ALTER.NET (152.63.82.193) 13.928 ms 14.099 ms 14.053 ms
 6 * * *
 7 * * *
  …


Why is routing hard to get right?

  Defining correctness is hard
  Interactions cause unintended consequences
     Each network independently configured
     Unintended policy interactions

  Operators make mistakes
    Configuration is difficult
    Complex policies, distributed configuration




Today: Tweak-N-Pray
  “What happens if I tweak this policy…?”

  [Figure: flowchart. Configure, then Observe, then check “Desired Effect?”.
   No: Revert. Yes: Wait for Next Problem.]

  Problems cause downtime
  Problems often not immediately apparent
Goal: Proactive Approach

  Idea: Analyze configuration before deployment

  [Figure: pipeline. Configure, then rcc (Detect Faults), then Deploy.]

  Many faults can be detected with static analysis.
Router Configuration Checker (rcc)

  A tool that finds faults in BGP configuration
   with static analysis
      Does not require additional work from operators
  Detects
    Path Visibility Faults
    Route Validity Faults
  Limitations
    Only detects faults in a single AS
    Only detects faults that cause persistent failures
What is so cool about rcc?

  Finds faults proactively
       before deployment
  Just convenient for now
    BGP might need a high level specification of
     policies in the future
    To do so,
         • High level specification language needed
         • Network operators need to learn and deploy
         • Even so, they may well write it incorrectly!
     No additional work from network operators!
rcc Overview
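  [Figure: rcc data flow. Distributed router configurations (single AS) are
   turned into a normalized representation; the correctness specification
   is mapped to constraints; “rcc” checks the normalized representation
   against the constraints and reports faults.]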


 Challenges
  Analyzing complex, distributed configuration
  Defining a correctness specification
  Mapping specification to constraints

 rcc Implementation

  [Figure: rcc pipeline]
  Input: distributed router configurations (offline; Cisco, Avici, Juniper,
   Procket, etc.)
    Preprocessor: produces a more parsable version
    Parser: produces the normalized representation
    Relational database (MySQL): stores the normalized representation
    Verifier: applies the constraints by running simple queries
     (select, join, etc.)
  Output: faults
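The verifier step can be pictured with one such query. This is a minimal sketch, assuming a hypothetical two-table schema (`sessions`, `route_maps`) and using Python's built-in sqlite3 in place of the MySQL database named on the slide. The schema, router names, and filter names are illustrative and not taken from rcc. The query looks for an undefined filter: a session that references a route map which the router's configuration never defines.

```python
import sqlite3

# In-memory database standing in for the normalized representation.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (router TEXT, neighbor TEXT, export_filter TEXT);
    CREATE TABLE route_maps (router TEXT, name TEXT);
    INSERT INTO sessions VALUES
        ('r1', '10.1.2.3', 'PEER'),
        ('r2', '10.4.5.6', 'PEER-OUT');
    INSERT INTO route_maps VALUES ('r1', 'PEER');
""")

# A session whose export filter has no matching route map on the same
# router is reported as a fault (left join with no match).
undefined = conn.execute("""
    SELECT s.router, s.neighbor, s.export_filter
    FROM sessions AS s
    LEFT JOIN route_maps AS m
      ON m.router = s.router AND m.name = s.export_filter
    WHERE m.name IS NULL
    ORDER BY s.router
""").fetchall()

print(undefined)  # [('r2', '10.4.5.6', 'PEER-OUT')]
```

Keeping each check as a plain select/join is what makes the constraints easy to add and audit.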
Which faults does rcc detect?


       Faults found by rcc

            Latent faults

           Potentially active faults
                    End-to-end failures




Correctness Specification
Safety
  The protocol converges to a stable path assignment for every possible
   initial state and message ordering
  The protocol does not oscillate

Path Visibility
  Every destination with a usable path has a route advertisement
  If there exists a path, then there exists a route
  Example violation: Network partition

Route Validity
  Every route advertisement corresponds to a usable path
  If there exists a route, then there exists a path
  Example violation: Routing loop
Path Visibility
If every router learns a route for every usable path,
then path visibility is satisfied.
      A usable path:
      - Reaches the destination
      - Corresponds to the path that packets take when using that route
      - Conforms to the policies of the routers on that path



Possible path visibility faults
  Dissemination
    - Partition in session-level graph that disseminates routes
  Filtering
    - Filtering routes for prefixes for usable paths
  Path Visibility: Internal BGP (iBGP)

 Default: don’t re-advertise iBGP-learned routes.
      Complete propagation requires “full mesh” iBGP. Doesn’t scale.

 “Route reflection” improves scaling.
    Client: don’t re-advertise iBGP routes.
    Route reflector: reflect non-client routes to all clients, and client
     routes to non-clients and other clients.

 [Figure: “iBGP” sessions among three route reflectors (RR) and four
  clients (c).]
Path Visibility: iBGP Signaling



  [Figure: iBGP signaling graph over routers R, W, X, Y, and Z, with route
   reflectors and clients marked.]

  No route to destination.
  Debugging nightmare!
Path Visibility: iBGP Signaling

  [Figure: the same iBGP signaling graph over routers R, W, X, Y, and Z,
   with route reflectors and clients marked.]

Theorem.
Suppose the iBGP reflector-client relationship graph contains
no cycles. Then, path visibility is satisfied if, and only if, the set
of routers that are not route reflector clients forms a full mesh.

       Condition is easy to check with static analysis.
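A sketch of how the theorem's condition might be checked, assuming the reflector-client graph is already known to be acyclic. The function name, the toy routers, and the session list are hypothetical and do not reproduce the figure's topology.

```python
from itertools import combinations


def missing_full_mesh_sessions(routers, ibgp_sessions, rr_clients):
    """Return pairs of non-client routers that lack an iBGP session.

    An empty result means the non-clients form a full mesh.
    """
    sessions = {frozenset(pair) for pair in ibgp_sessions}
    non_clients = sorted(r for r in routers if r not in rr_clients)
    return [
        (a, b)
        for a, b in combinations(non_clients, 2)
        if frozenset((a, b)) not in sessions
    ]


# Hypothetical AS: W is a client of X, Z is a client of Y,
# and the two reflectors X and Y have no session between them.
routers = ["W", "X", "Y", "Z"]
rr_clients = {"W", "Z"}
ibgp_sessions = [("W", "X"), ("Y", "Z")]

missing = missing_full_mesh_sessions(routers, ibgp_sessions, rr_clients)
print(missing)  # [('X', 'Y')]
```

Adding the session ("X", "Y") makes the list empty, which is the condition the theorem asks for.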
Path Visibility Faults in Practice
  Analysis of configuration from 17 ASes

  [Figure: bar chart, log-scale y-axis (1 to 1000), latent vs. benign faults
   by type. Bar annotations: iBGP signaling partition, 11 networks;
   duplicate loopback, 133 routers; incomplete iBGP session, 420 sessions.]
Route Validity
If every route that a router learns corresponds to a usable
path, then route validity is satisfied.
      A usable path:
      - Reaches the destination
      - Corresponds to the path that packets take when using that route
      - Conforms to the policies of the routers on that path


Possible route validity faults
  Filtering
    - Unintentionally providing transit service
    - Advertising routes that violate higher-level policy
    - Originating routes for private (or unowned) address space
  Dissemination
    - Loops and “deflections” along internal routing path
Route Validity: Consistent Export
 Rules of settlement-free peering:
    Advertise routes at all peering points
    Advertised routes must have equal “AS path length”




  [Figure: Sprint and AT&T exchanging “equally good” routes at their
   peering points.]

  Enables “hot potato” routing.
Route Validity: Consistent Export
 Possible Causes
   Malice/deception
   iBGP signaling partition
   Inconsistent export policy
 Policy normalization makes comparison easy.

neighbor 10.1.2.3                               neighbor 10.4.5.6
route-map PEER permit 10                        route-map PEER permit 10
  set prepend 123                                 set prepend 123 123

 Normalized representation (two tables):
   Neighbor   AS    Export          Export   Clause   Prepend
   10.1.2.3   456   1               1        1        123
   10.4.5.6   456   2               2        1        123 123
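Once policies are normalized, the consistent-export check reduces to comparing rows. A minimal sketch using the two sessions from this slide; the data layout and the function name are assumptions, not rcc's actual schema.

```python
from collections import defaultdict

# Normalized sessions: which export policy is applied to which neighbor.
sessions = [
    {"neighbor": "10.1.2.3", "asn": 456, "export": 1},
    {"neighbor": "10.4.5.6", "asn": 456, "export": 2},
]

# Normalized policies: export id -> list of (clause number, prepend).
policies = {
    1: [(1, "123")],
    2: [(1, "123 123")],
}


def inconsistent_exports(sessions, policies):
    """Return the neighbor ASes that receive differing export policies."""
    by_as = defaultdict(set)
    for session in sessions:
        by_as[session["asn"]].add(tuple(policies[session["export"]]))
    return sorted(asn for asn, distinct in by_as.items() if len(distinct) > 1)


print(inconsistent_exports(sessions, policies))  # [456]
```

AS 456 is flagged because one session prepends 123 once and the other prepends it twice, so the advertised AS path lengths differ.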
Route Validity Faults in Practice
  Analysis of configuration from 17 ASes

  [Figure: bar chart, log-scale y-axis (1 to 1000), latent vs. benign faults
   by type. Bar annotations: inconsistent export, 233 sessions; inconsistent
   import, 117 sessions; transit between peers, 6 sessions; undefined
   filter, 45 sessions; incomplete filter, 196 sessions.]
Operational Impact

  Downloaded by 70 network operators, some of
  whom shared their configurations
    Others were reluctant to share because configurations are proprietary
    ...and because they don’t like researchers finding faults
     on their network
  Detected more than 1000 previously undiscovered
  faults in 17 ASes
Operator Feedback

 “That’s wicked!”                   -- Nicolas Strina, ip-man.net

 “Thanks again for a great tool.” -- Paul Piecuch, IT Manager

 “...good to finally see more coverage of routing as distributed
 programming. From my experience, the principles of software
 engineering eliminate a vast majority of errors.”
                                   -- Joe Provo, rcn.com

 “I find your approach useful, it is really not fun (but critical for
 the health of the network) to keep track of the inconsistencies
 among different routers…a configuration verifier like yours can
 give the operator a degree of confidence that the sky won't fall
 on his head real soon now.”
                                    -- Arnaud Le Tallanter, clara.net
Summary: Faults across 17 ASes
  Every AS had faults, regardless of network size
  Most faults can be attributed to distributed configuration

  [Figure: bar chart. y-axis: number of ASes (0 to 10). Path visibility
   faults: iBGP signaling partition, duplicate loopback, incomplete iBGP
   session. Route validity faults: inconsistent export, inconsistent import,
   transit between peers, undefined filter, incomplete filter.]
rcc: Take-home lessons

 Better intra-AS route dissemination protocol
  needed
     Current route reflection causes many faults!
 BGP needs to be configured with a centralized
  higher-level specification language
   Current distributed, low-level nature introduces
    complexity, obscurity, and the possibility of
    misconfiguration
   But! There is a trade-off with flexibility and expressiveness
Discussion

  Strength
    Routing as a distributed program
    Static analysis uncovers many errors
    Identifies major causes of error
         • Distributed configuration
         • Intra-AS dissemination is too complex
         • Mechanistic expression of policy
       Real operational impact!
  Weakness
    rcc is neither sound nor complete
         • May be necessary against hairy real-world problems
         • Already quite useful as it is
IP fault localization via
     risk modeling




IP Network Fault-Tolerance

  Any failure that causes an IP link to fail is termed an “IP Fault”

  [Figure: Alice reaches Eve through a router and the Internet. One link on
   the path is marked X (IP Fault); an alternate path is also shown.]

IP Networks are designed to be fault-tolerant!
Fault Repair

  Fast Repair is necessary because
    Probability of a simultaneous failure increases
     with down-time
    Expensive to provision too many alternate paths



  Fault Localization is a bottleneck for fault
   repair!




What makes fault localization hard?

 A typical Tier-I ISP network has
    About a thousand routers
    A few thousand IP links
    Tens of thousands of optical components
    About 50-100 thousand miles of optical fiber
    Complicated topologies (mesh, ring etc.)
 Current alarms do not indicate root-cause
 Often problematic to monitor actual component failure
 Failure alerts can get lost


   Operators need an automated tool for
           fast fault localization
Key Ideas: Shared Risk!

  Risk modeling to localize faults across the IP
   and optical layers
  SRLG: Shared Risk Link Groups
       A physical object represents shared risk for a
        group of logical entities at IP layer
  SCORE: Spatial Correlation Engine
    cross-correlates dynamic fault information from
     two disparate network layers




Logical/Physical IP Network
  [Figure: map of the logical IP network (labels: “IP Network”, “QWEST”) with
   nodes at San Jose, Washington, Los Angeles, Atlanta, and Houston.]
 Logical/Physical IP Network
  Links that share a “Shared Risk” form a Shared Risk Link Group (SRLG)

  [Figure: two views of the same network (San Jose, Washington, Los Angeles,
   Atlanta, Houston). In one view, three links are marked X with the
   question “DWDM failed?”. In the other, the element they share is marked
   SHARED RISK. Legend: Router; DWDM; O-E-O Conversion.]
Various types of SRLGs

  Physical Shared Risks
    SONET (e.g. DWDM, ADM, Optical Amplifiers)
    Fiber
    Fiber Span
    Router
    Module
    Port

  Logical Shared Risks
    Autonomous System
    OSPF Areas
 SRLG Prevalence
  [Figure: CDF of SRLG cardinality (no. of links per group), log-scale x-axis
   from 1 to 1000. One curve each for fiber spans, fiber, SONET network
   elements, ports, router modules, routers, areas, and the aggregated
   database. Source: section of the AT&T backbone network.]

  At least 47% of all SRLGs have at least two links
  More than 85% of OSPF Areas have at least 10 links
Problem Formulation
  A set of links
     C = {c1, c2, … , cn}
  A set of risk groups
     G = {G1, G2, … , Gm}
     Gi = {ci1, ci2, … , cik}, s.t. cix are likely to fail simultaneously
  An observation
     O = {ce1, ce2, … , cem}
  Find hypothesis H
     H = {Gh1, Gh2, … , Ghk} which explains O
          • Every member of O belongs to at least one member of H and all
            the members of a given group Ghi belong to O
        Many Hs!

  Occam’s Razor:
     Let’s not assume more than what is necessary
     Simplicity is the best
SRLG Database
  [Figure: example topology. IP view: routers R0 to R4 joined by links
   L0 to L6. Physical view: the same routers with fibers F0 to F7 and DWDM
   elements D1 to D3.]

  Risk group – links that share it
    R0 – {L0,L1}
    R1 – {L0,L2,L3,L4}
    R2 – {L4,L5}
    R3 – {L3,L5,L6}
    R4 – {L1,L2,L6}
    D1 – {L0,L1,L2}
    D2 – {L3,L5,L6}
    D3 – {L3,L4,L5}
    F0 – {L0,L1}
    F1 – {L0,L2}
    …
     Bipartite Graph Formulation
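  Observation: temporally correlated link failures
  Hypothesis: possible explanation

  [Figure: bipartite graph. Top row: links L0 to L6, with the failed links
   marked X. Bottom row: risk groups R0 to R4, DWDM1 to DWDM3, Fiber Span0,
   and Fiber Span1, with the group that explains the failures marked X.]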
      Bipartite Graph Formulation
  [Figure: the same bipartite graph of links (L0 to L6) and risk groups,
   with four failed links marked X and two risk groups marked X as the
   explanation.]

Hypothesis: can contain multiple simultaneous failures
  Finding a minimum set cover of a given observation: NP-hard
Greedy Approximation
  [Figure: the same bipartite graph, with the failed links L0, L2, L4, and
   L5 marked X.]

   Hit Ratio of R0      = |Gi ∩ O| / |Gi| = 1/2 = 50%
   Coverage Ratio of R0 = |Gi ∩ O| / |O|  = 1/4 = 25%
Greedy Approximation
  Out of all groups with hit ratio 100%, pick the group with max coverage
  Prune links associated with this group and add this group to the
   hypothesis
  Repeat with the pruned observation until no observation is left
   unexplained

  [Figure: the same bipartite graph, annotated with (hit ratio, coverage
   ratio) for each group.]

R0=(50%,25%),R1=(75%,75%),R2=(100%,50%),R3=(33%,25%),R4=(33%,25%)
D1=(66%,50%),D2=(33%,25%),D3=(66%,50%),F0=(50%,25%),F1=(100%,50%)
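The greedy steps above can be sketched as follows, using the ten risk groups listed on the SRLG Database slide (its elided entries are left out) and the observation {L0, L2, L4, L5}. Function names are hypothetical. Ties in coverage are broken by group name, and ratios are recomputed against the pruned observation; both are assumptions the slides do not spell out.

```python
SRLG_DB = {
    "R0": {"L0", "L1"},
    "R1": {"L0", "L2", "L3", "L4"},
    "R2": {"L4", "L5"},
    "R3": {"L3", "L5", "L6"},
    "R4": {"L1", "L2", "L6"},
    "D1": {"L0", "L1", "L2"},
    "D2": {"L3", "L5", "L6"},
    "D3": {"L3", "L4", "L5"},
    "F0": {"L0", "L1"},
    "F1": {"L0", "L2"},
}


def hit_ratio(group, observation):
    """Fraction of the group's links that are in the observation."""
    return len(group & observation) / len(group)


def coverage_ratio(group, observation):
    """Fraction of the observation that the group explains."""
    return len(group & observation) / len(observation)


def greedy_localize(groups, observation):
    """Return (hypothesis, unexplained links) for the strict greedy rule."""
    remaining = set(observation)
    hypothesis = []
    while remaining:
        # Only groups whose links have all failed are candidates.
        candidates = [name for name in sorted(groups)
                      if hit_ratio(groups[name], remaining) == 1.0]
        if not candidates:
            break  # nothing explains the rest with a 100% hit ratio
        best = max(candidates,
                   key=lambda name: coverage_ratio(groups[name], remaining))
        hypothesis.append(best)
        remaining -= groups[best]  # prune the explained links
    return hypothesis, remaining


hypothesis, unexplained = greedy_localize(SRLG_DB, {"L0", "L2", "L4", "L5"})
print(hypothesis, unexplained)  # ['F1', 'R2'] set()
```

On this observation only R2 and F1 reach a 100% hit ratio, so the hypothesis contains exactly those two groups.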
Modeling Imperfections

  Ideally,
      If a shared component fails, all associated links fail
  Not true in practice sometimes (why?)
      Failure message could get lost! (transported by UDP)
      Inaccurate modeling of risk groups
  Solution: Use an error threshold for the hit ratios
      Accounts for losses in data
      Accounts for inaccurate modeling of SRLGs
Modified Greedy Approximation

  Select groups that have hit ratio > error
   threshold
  Out of these groups, identify the group with
   maximum coverage
  Prune the set of links that are explained by
   this group
  Recursively repeat the above steps until all
   links are fully explained
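A sketch of the modified algorithm on the example SRLG database, under two assumptions: the comparison is `>=` rather than the slide's `>`, so that a threshold of 1.0 reproduces the strict version, and hit ratios are recomputed against the pruned observation. The scenario is hypothetical: R1 fails, but the alarm for L3 is lost.

```python
SRLG_DB = {
    "R0": {"L0", "L1"}, "R1": {"L0", "L2", "L3", "L4"}, "R2": {"L4", "L5"},
    "R3": {"L3", "L5", "L6"}, "R4": {"L1", "L2", "L6"},
    "D1": {"L0", "L1", "L2"}, "D2": {"L3", "L5", "L6"},
    "D3": {"L3", "L4", "L5"}, "F0": {"L0", "L1"}, "F1": {"L0", "L2"},
}


def localize(groups, observation, threshold):
    """Greedy localization that tolerates hit ratios below 100%."""
    remaining = set(observation)
    hypothesis = []
    while remaining:
        best_name, best_overlap = None, 0
        for name in sorted(groups):
            overlap = len(groups[name] & remaining)
            hit = overlap / len(groups[name])
            # Max overlap is the same as max coverage ratio, because the
            # denominator |remaining| is shared by all groups.
            if hit >= threshold and overlap > best_overlap:
                best_name, best_overlap = name, overlap
        if best_name is None:
            break  # no group passes the threshold
        hypothesis.append(best_name)
        remaining -= groups[best_name]
    return hypothesis, remaining


lossy_observation = {"L0", "L2", "L4"}  # L3's alarm was lost

strict = localize(SRLG_DB, lossy_observation, threshold=1.0)
relaxed = localize(SRLG_DB, lossy_observation, threshold=0.7)
print(strict)   # (['F1'], {'L4'})
print(relaxed)  # (['R1'], set())
```

With the strict rule, L4 stays unexplained and the hypothesis points at the wrong component; with a 0.7 threshold, R1 (hit ratio 75%) explains the whole observation.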
SCORE Spatial Correlation Module

  Intelligence is built into the SRLG database
   and reflected in the SCORE queries
  Obtains minimum set hypothesis




  SCORE System Architecture
  [Figure: SCORE system architecture. SNMP traps, router syslogs, and SONET
   PM data each pass through a data translator into the fault localization
   policies, which issue multiple queries to the spatial correlation engine
   (SCORE), backed by the SRLG database. The figure also shows a WWW
   component and the API.]

  API
    Input : <Ckt1, Ckt2 ..>, Error Threshold
    Output : <Grp1, Grp2..>

  FAULT LOCALIZATION POLICIES
    1. Event Clustering
       - captures events close together in time
    2. Localization Heuristics:
       - uses multiple error thresholds, outputs H with min cost
         (|H|/eThresh)
       - queries clustered events with similar signature
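Event clustering could look like the sketch below, assuming events arrive as (timestamp, link) pairs and that a fixed time gap separates clusters. The 5-second gap and the sample events are made up for illustration; SCORE's actual clustering rule is not given on the slide.

```python
def cluster_events(events, max_gap):
    """Group (timestamp, link) events into sets of links.

    A new cluster starts whenever the gap to the previous event
    exceeds max_gap seconds.
    """
    clusters = []
    last_time = None
    for timestamp, link in sorted(events):
        if last_time is None or timestamp - last_time > max_gap:
            clusters.append(set())
        clusters[-1].add(link)
        last_time = timestamp
    return clusters


# Hypothetical alarms: three links fail within about a second,
# and a fourth fails a few minutes later.
events = [(100.0, "L0"), (100.4, "L2"), (101.1, "L4"), (250.0, "L6")]
clusters = cluster_events(events, max_gap=5.0)
print(len(clusters))  # 2
```

Each resulting cluster would then be passed to the spatial correlation step as one observation.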
Evaluation : Artificial Faults

  Artificially generated faults but real SRLG
   database from (a section of) AT&T backbone
   network
  Picked a set of components to fail
  Observation then fed to SCORE
      No losses in data, no database inconsistency
  Hypothesis compared with injected faults




Perfect Fault Notification
  [Figure: fraction of correct hypotheses (y-axis, 0.8 to 1) vs. number of
   simultaneously induced failures (x-axis, 0 to 20). One curve each for
   FIBERSPAN, PORT, MODULE, ROUTER, AREA, SONET, and Aggregated.]

  Accuracy greater than 95% for 5 failures
Imperfect Fault Notifications
  [Figure: fraction of correct hypotheses (y-axis, 0.65 to 1) vs. loss
   probability (x-axis, 0.05 to 0.3) at eThresh 0.6. One curve each for
   one, two, three, four, and five failures.]

  Almost linear accuracy tradeoff with loss probability
Evaluation : Real Faults
 A set of 18 faults studied and diagnosed
      Where the root cause was well known
 One Case Study
   OSPF Area-wide problem that affected ~70 links
   SCORE identified about 20 SRLG groups as
    hypothesis
   Further analysis revealed that the error was due to
    incorrect SRLG modeling
   Relaxing the error threshold to 0.7 brought it down to 4
   Only OSPF interfaces with MPLS enabled were
    affected by the protocol bug
Evaluation: Real Faults

  Similarly, SCORE uncovered
    Database problems
    Missing error reports from certain links
    Other inconsistencies

  Shows how error-thresholds are effective in
  uncovering these inconsistencies and data
  losses




Localization Precision
  [Figure: CDF (y-axis, 0 to 1) of localization fraction (x-axis, 0 to 1).]

  About 40% of faults could be localized to less than 5% of components
  About 80% of faults could be localized to less than 10% of components
Discussion

  Strength
     Captured the spatial correlation between IP links
     Database inconsistencies are resolved in SCORE using a
      simple error threshold scheme


  Weakness
    Fails to model either very high-level risk group or very low-
     level risk group
         • High level: e.g., all links in a PoP sharing a power grid
         • Low level: e.g., internal risk group within a router
       Extremely hard to select a single error threshold for all
        observations!
       Need more intelligent heuristics for the fault localization policy
Concluding remarks

  Fault diagnosis is challenging!
  We see two points in the whole solution space
    rcc: static analysis
    SCORE: spatial modeling
  Directions
    Centralized configuration
         • E.g. RCP, 4D
       Applying formal methods & model checking
         • What’s challenging here?
     Advanced statistical correlation & causal analysis
         • Ongoing work …
