WAN Monitoring - SLAC

					               WAN Monitoring

Prepared by Les Cottrell, SLAC, for the
Joint Engineering Taskforce Roadmap Workshop
            JLab April 13-15, 2004

    Partially funded by DOE/MICS Field Work Proposal on
   Internet End-to-end Performance Monitoring (IEPM), also
                      supported by IUPAP
  (Can’t manage what you can’t measure)
• Need measurements for both production networks &
  – Planning, setting expectations, policy/funding
  – Trouble-shooting: reliability & performance
     • Problems may not be logical, e.g. most Internet problems caused by
       operator error (Sci Am Jun’03), most LAN problems are Ethernet
       duplex, host config, bugs
     • Made hard by transparency, size & rate of change of network
     • A distributed system is one in which I can’t get my work done
       because a computer I never heard of has failed. Butler Lampson
  – Application steering (e.g. Grid data replication)
• E2E performance problem is THE critical user metric
E.g. Policy - trends
• S.E. Europe, Russia: catching up
• Latin Am., Mid East, China: keeping up
• India, Africa: falling behind
• C. Asia, Russia, S.E. Europe, L. America, M. East, China: 4-5 yrs behind
• India, Africa: 7 yrs behind

E.g. Changes in network topology (BGP) result in dramatic change in performance

[Figures: snapshot of traceroute summary table; samples of traceroute trees; label: Remote host]

   1. Caltech misrouted via Los-Nettos 100Mbps commercial net 14:00-17:00
   2. ESnet/GEANT working on routes from 2:00 to 14:00
   3. A previous occurrence went un-noticed for 2 months
   4. Next step is to auto detect and notify
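
The "auto detect and notify" step could be sketched roughly as below: compare the latest traceroute path against a stored baseline and produce an alert when any hop differs. This is only an illustrative sketch, not the IEPM implementation. The function names `diff_routes` and `format_alert` are hypothetical, and the hop labels are simplified stand-ins taken from the path names on the slide rather than real traceroute output.

```python
from itertools import zip_longest


def diff_routes(baseline, current):
    """Return (hop_number, old_hop, new_hop) for every hop that differs.

    Paths are lists of hop labels; a missing hop is reported as None.
    """
    return [
        (hop, old, new)
        for hop, (old, new) in enumerate(zip_longest(baseline, current), start=1)
        if old != new
    ]


def format_alert(dest, changes):
    """Build a plain-text notification for a changed route (hypothetical format)."""
    lines = [f"Route to {dest} changed at {len(changes)} hop(s):"]
    for hop, old, new in changes:
        lines.append(f"  hop {hop}: {old or '-'} -> {new or '-'}")
    return "\n".join(lines)


# Illustrative hop labels only (assumption): the original and misrouted paths.
baseline = ["SLAC", "CENIC", "Caltech"]
current = ["SLAC", "ESnet", "Los-Nettos", "Caltech"]

changes = diff_routes(baseline, current)
if changes:
    print(format_alert("Caltech", changes))
```

A real deployment would also need to tolerate benign variation (load-balanced hops, unresponsive routers) before notifying anyone.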

[Figure: ABwE measurements, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am]
• Plotted: Dynamic BW capacity (DBC), Cross-traffic (XT), Available BW = (DBC-XT)
• Drop in performance: from original path SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100Mbps)-Caltech
• Back to original path
• Changes detected by IEPM-Iperf and ABwE
• ESnet-LosNettos segment in the path (100 Mbits/s)
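
The relation Available BW = (DBC-XT) and the idea of spotting a drop in the series can be illustrated with a small sketch. The sample values below are made up for illustration (they are not the measurements from the plot), and the threshold rule (flag samples below half the median) is an assumption, not the method used by ABwE or IEPM.

```python
from statistics import median


def available_bandwidth(dbc_mbps, xt_mbps):
    """Available BW = DBC - XT, floored at zero."""
    return max(dbc_mbps - xt_mbps, 0.0)


def detect_drops(avail_series, factor=0.5):
    """Return indices of samples below factor * median of the series.

    The median-based threshold is an illustrative assumption.
    """
    threshold = factor * median(avail_series)
    return [i for i, value in enumerate(avail_series) if value < threshold]


# Hypothetical (DBC, XT) samples in Mbits/s; the middle two mimic a
# reroute through a 100 Mbits/s segment.
samples = [(900, 200), (900, 250), (100, 30), (100, 40), (900, 220)]

avail = [available_bandwidth(dbc, xt) for dbc, xt in samples]
drops = detect_drops(avail)
print(avail, drops)
```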
• Active Measurement probes:
  – Include: Ping, traceroute, owamp, pathload/abwe, major
    apps (e.g. bbftp, bbcp, GridFTP…)
  – Typically used for end-to-end testing
  – Inject data into network
• Passive tools:
  – Include: SNMP, NetFlow, OCxMon, NetraMet, cflowd, SCNM
  – Typically used at border or inside backbones
     • SNMP heavily used for utilization, errors on LAN & backbones
     • Flows for traffic characterization and intrusion detection
  – Need access to network devices (e.g. routers, taps)
• Need to put together data from multiple sources
  – Different probes, different sources & destinations, network-centric & end-to-end
Some Challenges for Active monitoring
• Bandwidth used, e.g. iperf etc. & apps
• For TCP tools: configuring windows at
  clients/servers and optimizing windows
• Some lightweight tools (e.g. packet pairs) not
  effective at >> 1Gbits/s
• Many tools tuned for shared TCP/IP nets not for
  dedicated circuits
• Simplifying use and understanding for end-user,
  automating problem detection & resolution,
  need close collaboration today
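
As background to the TCP window bullet above: a single TCP stream can keep at most one window of data in flight per round trip, so the window needed to fill a path is roughly the bandwidth-delay product (bandwidth × RTT). A minimal sketch of that arithmetic follows; the function names and the example path (1 Gbit/s, 100 ms RTT, 64 KiB window) are illustrative assumptions, not figures from the talk.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bytes: the window needed to fill the path."""
    return bandwidth_bps * rtt_ms / 8000


def window_limited_bps(window_bytes, rtt_ms):
    """Upper bound on single-stream TCP throughput for a given window."""
    return window_bytes * 8000 / rtt_ms


# Assumed example path: 1 Gbit/s bottleneck, 100 ms round-trip time.
path_bdp = bdp_bytes(1_000_000_000, 100)

# Assumed example window of 64 KiB on the same path.
small_window_limit = window_limited_bps(64 * 1024, 100)

print(f"window needed: {path_bdp / 1e6:.1f} MB")
print(f"64 KiB window caps throughput at {small_window_limit / 1e6:.2f} Mbits/s")
```

The gap between the two numbers is why window settings at both client and server matter for throughput tests on long, fast paths.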
• Many measurement projects with different emphases,
  different communities
   – Passive (usually requires network control, used at borders
     and on backbones, e.g. MICSmon/Netflow, ISP/SNMP)
   – Active
      • Lightweight (PingER, AMP, Surveyor, RIPE …)
      • Medium weight (PiPES, NWS, IEPM-Lite …)
      • Heavy weight/hi-perf (IEPM-BW, NTAF)
   – End-to-end vs net centric (skitter, macroscopic views)
   – Repetitive (PingER, AMP, IEPM, PiPES, NWS, NTAF, …)
   – On demand, or non-production (NDT, NIMI, PiPES …)
   – Dedicated hardware (AMP, RIPE, NDT, PlanetLab …)
   – Hierarchical (e.g. AMP) vs Full mesh (e.g. PingER)
• For a table comparing 13 public domain infrastructures, see:
                     NMI challenges
• Sustaining deployment/operation in multi-agency /
  international world
• Scaling beyond hundreds of hosts very hard over the
  long term:
  – Hosts change, upgrade, new OS
     • No control over shared hosts
        – Depend on friendly admin contacts who may be busy, uninterested,
          have moved etc.
     • Policy/fears at remote site can make dedicated changes painful
     • web100 upgrades not coordinated with Linux upgrades
     • New TCP kernel upgrades not coordinated with OS upgrades
  – Hosts age, become measurement bottleneck
     • Need constant upgrades for dedicated hosts
  – Access policies change (pings & ports filtered)
  – Probes (iperf etc.) change: new features, patches
• Appropriate security
                      So Recognize
• Unrealistic to think multiple admin domains will all
  deploy one and the same infrastructure
   – Scaling and differing interests make this unrealistic
• Multiple-domain, multi-infrastructures will be deployed
• Need to tie together heterogeneous collection of
  monitoring systems
   – Create a federation of existing NMIs
   – Infrastructures work together
   – Share data with peer infrastructures and others using a
     common set of protocols for describing, exchanging &
     locating monitoring data (e.g. GGF NMWG)
   – Enables much improved overall view of network using
     multiple measurement types from multiple sources
                 MAGGIE Proposal
• Measurement and Analysis for the Global Grid and
  Internet End-to-end performance
• Contribute to, utilize the GGF NMWG naming
  hierarchy and the schema definitions for network
• Develop tools to allow sharing
  – Web services based
  – Integrate information from multiple sources
• Brings together several major infrastructure
  participants: LBNL (NTAP, SCNM), SLAC (IEPM-
  PingER/BW), Internet2 (PiPES, NDT), NCSC (NIMI),
  U Delaware, ESnet
• Will work with others, e.g. MonALISA, AMP,
  UltraLight, PPDG, StarLight, UltraScienceNet
               Federation goals
• Appropriate security
• Interoperable
• Useful for applications, network engineers,
  scientists & end users
• Easy to deploy & configure
• As un-intrusive as possible
• As accurate & timely as possible
• Identify most useful features of each NMI to
  improve each NMI faster than working alone
                  NMI Challenges:
• Reduce “Wizard gap”
• Applications cross agency AND international funding
  boundaries (includes Digital Divide)
• Incent multi-disciplinary teams, including people close
  to scientists, operational teams
  – Make sure what is produced is used, tested in real
    environment, include deployment in proposals
• Network management research historically
  underfunded, because it is difficult to get funding
  bodies to recognize it as legitimate networking research
• Without excellent trouble-shooting capabilities, the
  Grid vision will fail
                       More Information
• Some Measurement Infrastructures:
   – CAIDA list:
   – AMP:, PMA
   – IEPM/PingER home site:
   – IEPM-BW site:
   – NIMI:
   – RIPE:
   – NWS:
   – Internet2 PiPES:
• Tools
   – CAIDA measurement taxonomy:
   – SLAC Network Tools:
• Internet research needs:
