					Monitoring Traffic Exits in a
Multihomed I2 Environment

         Joe St Sauver
     joe@oregon.uoregon.edu
      UO Computing Center

NLANR/I2 Joint Techs, Minneapolis
       May 18th, 2000
Today's Message in a Nutshell...
• What: You should pay attention to how
  application traffic is exiting your campus.
• Why: In an I2 multihomed environment
  there are important differences between the
  network exits you've got available.
• How: We have a simple and freely available
  tool written in perl which you can use to
  monitor your application exits
      Definition: Multihomed
• By saying that a site is "multihomed"
  we mean that the site is:

  "connected to two or more wide area
  networks, such as the commodity Internet
  and a high performance limited access
  network such as Abilene"
ALL I2 Schools Are Multihomed
• ALL I2 schools are multihomed, since all
  are required to have both commodity AND
  high performance connectivity
• In fact, many I2 schools have:
  -- multiple commodity transit providers, and
  -- multiple high performance exits, and
  -- local or national peerage, and
  -- local intraconsortia traffic…
  Many people don't know/care
    how their traffic flows
• "It just works"
• "I couldn't change how my traffic is routed
  even if I didn't like how it is exiting"
• "I wouldn't do anything different even if I
  could get my traffic did go a different way"
• "I can't tell the difference between <X> and
  <Y> connectivity, anyhow."
      Should They Care? Yes!
• Users need to be helped to understand that
  they should care about how traffic exits.
     (1) In a multihomed environment, all
         exits are not alike.
     (2) Users can adjust their applications
          and their collaborations to make
         efficient use of available exits, if
         they can determine what's going on,
         and if they are motivated.
 Some crucial differences among
  various exits you may have….
• Different network exits may have:

     (1) different available capacity (a
         primary motivator in our case)
     (2) different cost structures (also
         important to us, as at most places)
     (3) different latency and loss properties
     (4) different wide area reachability
     (1) Available capacity...
• Classic Example:

     I2 school with congested fractional DS3
     commodity Internet transit connectivity,
     but lightly loaded OC3 connectivity via
     Abilene. A huge difference in available
     capacity exists between those two
     exits.
(2) Different Cost Structures for
    Different Connections…
• Oregon, like most places, has different cost
  structures for different sorts of connectivity.
• For example, we pay $1K/Mbps/month for
  contracted inbound commodity Internet
  transit capacity (excluding some categories
  of traffic)
  Different Cost Structures for
 Different Connections… (cont.)
• While in the case of peerage or HPC
  connectivity, usage may have no direct
  incremental cost (except at step boundaries
  where capacity would need to be added),
  and in fact, there may be an effective
  "negative incremental cost" associated with
  shifting inbound traffic off of commodity
  transit links and over to HPC connectivity
  (at our rates, every Mbps shifted is roughly
  $1K/month of transit we don't have to buy).
Inbound Traffic? I Thought This
 Was About Monitoring Exits?
• We can't easily/systematically do reverse
  traceroutes to watch our commodity traffic
  inbound (even though that's what we pay for)
• We'll assume that if we've got outbound
  traffic going to you, you've probably got
  inbound traffic going to us
• We'll watch what we can and hope that the
  world is symmetric (hah hah, I know, I know)
        (3) Latency and Loss
• Different exits may have dramatically
  different latency and loss characteristics.
• Sustained throughput at a level which might
  be perfectly reasonable over a low latency
  and low loss circuit might be completely
  unrealistic to attempt in the face of higher
  latency or higher loss (see the rough
  estimate below).
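• As a rough rule of thumb (the well-known Mathis et al.
  TCP estimate; the numbers below are illustrative, not
  measurements from this talk), sustainable per-flow
  throughput scales as MSS/(RTT*sqrt(loss)):

  #!/usr/local/bin/perl
  # Mathis et al. (1997): rate <= (MSS/RTT) * (1/sqrt(p))
  $mss = 1460 * 8;                  # typical segment, in bits
  foreach $case ([0.015, 0.0001], [0.100, 0.01]) {
     ($rtt, $p) = @$case;
     $bps = ($mss / $rtt) / sqrt($p);
     printf "RTT %.0f ms, loss %.2f%%: ~%.1f Mbps ceiling\n",
            $rtt * 1000, $p * 100, $bps / 1e6;
  }

  A 15 ms, 0.01% loss path supports roughly 78 Mbps per flow;
  at 100 ms and 1% loss the ceiling drops to about 1.2 Mbps.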
            (4) Reachability
• In some cases, a change of exit may result
  in a material change in reachability. For
  example, consider a block of addresses that
  are advertised ONLY to high performance
  connectivity partners (the intent being that
  if the HPC connectivity falls over, the load
  will not flop over onto commodity transit
  circuits and swamp those exits).
   How Does All This Relate to
     Network Engineering?
• "Okay, sure it is true that there's a
  difference between the various exits that
  might be available.

  "But how does this relate to network
  engineering tasks? What knobs do I need to
  tweak on my Cisco?"
Focus is on the user/applications,
  not on tweaking the network
• We assume that the network configuration
  is "a given" and is NOT changeable -- e.g.,
  network engineers [are not|cannot|will
  not|should not] tweak routes for load
  balancing or other traffic management.
• We DO assume that end users can make
  application level decisions about what they
  do, or who they collaborate with.
… plus user network monitoring
• Assuming we really need and are really
  using our I2 connectivity, users who are
  running crucial applications over I2 need to
  know when it isn't there for a given
  collaboration, if only so they can call up and
  complain. :-)
   Why not just look at periodic
        route snapshots?
• For example, why not just check:
  http://www.vbns.net:8080/route/Allsnap.rt.html
• Answer: too much data
  Answer: HPC connectivity only
  Answer: only updates every half hour
  Answer: too hard to see what's changed
  Answer: ...
 Oregon's Particular Application
• We run news servers connecting with
  hundreds of Usenet News peers located all
  around the world, some normally I2
  connected, some connected via commodity
  transit, some connected via peerage.
• News traffic must not be allowed to
  interfere with other network traffic ==>
  we need to closely monitor and manage
  traffic flowing out these multiple exits
    Sample Decisions Made In Part
     Based on How Traffic Exits:
•   Should we peer with this site at all?
•   What should we feed them?
•   What should we have them feed us?
•   If we will be feeding this site over a high
    performance connection, do we want to
    ensure that that HPC traffic doesn't flop onto
    our commodity transit links in case HPC
    connectivity falls over?
  Usenet is NOT a unique case...
• Usenet News is not unique when it comes to
  distributing data to many different sites over
  potentially diverse exits. Examples include:

  -- Unidata's LDM weather data network
   -- web cache hierarchies
  -- IRC server traffic
  -- any coordinated server-to-server traffic
 Stage 1 Exit Selection Analysis:
  "I know, let's use traceroute..."

• Initially, we did what everyone else does
  when we wanted to figure out where traffic
  was exiting, and just said:

  % traceroute foo.bar.edu

  with our traffic involving five main exits...
                  Sample traceroute #1
•   % traceroute www.berkeley.edu
    traceroute to amber.Berkeley.EDU (128.32.25.12), 30 hops max, 40 byte packets
     1 cisco3-gw.uoregon.edu (128.223.142.1) 0.919 ms 0.740 ms 0.432 ms
     2 cisco7-gw.uoregon.edu (128.223.2.7) 0.491 ms 0.421 ms 0.414 ms
     3 ogig-uo.oregon-gigapop.net (198.32.163.1) 0.511 ms 0.578 ms 0.574 ms
     4 sac.oregon-gigapop.net (198.32.163.10) 10.123 ms 10.207 ms 10.148 ms
     5 198.32.249.61 (198.32.249.61) 13.597 ms 13.307 ms 13.551 ms
     6 BERK--SUNV.POS.calren2.net (198.32.249.13) 15.053 ms 14.485 ms 14.511 ms
     7 pos1-0.inr-000-eva.Berkeley.EDU (128.32.0.89) 14.907 ms 14.662 ms 14.483 ms
     8 pos5-0-0.inr-001-eva.Berkeley.EDU (128.32.0.66) 44.428 ms 14.964 ms 15.362 ms
     9 fast1-0-0.inr-007-eva.Berkeley.EDU (128.32.0.7) 15.910 ms 16.113 ms 16.807 ms
    10 f8-0.inr-100-eva.Berkeley.EDU (128.32.235.100) 19.274 ms 20.805 ms 15.958 ms
    11 amber.Berkeley.EDU (128.32.25.12) 17.755 ms 16.076 ms 16.479 ms

•   But wait, there's more… we also have HPC connectivity via Denver...
                  Sample traceroute #2
•   % traceroute www.bc.net
    traceroute to www.bc.net (142.231.112.3), 30 hops max, 40 byte packets
     1 cisco3-gw.uoregon.edu (128.223.142.1) 0.590 ms 0.644 ms 0.572 ms
     2 cisco7-gw.uoregon.edu (128.223.2.7) 0.526 ms 0.579 ms 0.448 ms
     3 ogig-uo.oregon-gigapop.net (198.32.163.1) 0.850 ms 0.520 ms 0.509 ms
     4 den.oregon-gigapop.net (198.32.163.14) 32.563 ms 32.279 ms 32.403 ms
     5 kscy-denv.abilene.ucaid.edu (198.32.8.14) 42.759 ms 42.956 ms 43.214 ms
     6 ipls-kscy.abilene.ucaid.edu (198.32.8.6) 51.965 ms 51.841 ms 52.306 ms
     7 205.189.32.98 (205.189.32.98) 56.422 ms 56.242 ms 57.130 ms
     8 win1.canet3.net (205.189.32.141) 73.884 ms 75.850 ms 73.620 ms
     9 reg1.canet3.net (205.189.32.137) 80.920 ms 80.823 ms 80.505 ms
    10 cal1.canet3.net (205.189.32.133) 89.457 ms 89.726 ms 89.757 ms
    11 van1.canet3.net (205.189.32.129) 104.529 ms 102.115 ms 102.287 ms
    12 c3-bcngig01.canet3.net (205.189.32.193) 102.085 ms 102.504 ms 102.459 ms
    13 policy-canet2.hc.BC.net (207.23.240.241) 102.771 ms 103.833 ms 103.055 ms
    14 cyclone.BC.net (142.231.112.2) 104.819 ms 104.113 ms 106.695 ms

•   Oregon to BC, via Indianapolis… Guess we still need a west coast HPC peering point… ;-)
                  Sample traceroute #3
•   % traceroute leao.ebonet.net
    traceroute to leao.ebonet.net (194.133.121.8), 30 hops max, 40 byte packets
     1 cisco3-gw.uoregon.edu (128.223.142.1) 0.757 ms 0.802 ms 0.592 ms
     2 cisco7-gw.uoregon.edu (128.223.2.7) 0.813 ms 0.626 ms 0.521 ms
     3 eugene-hub.nero.net (207.98.66.11) 3.230 ms 1.832 ms 2.019 ms
     4 eugene-isp.nero.net (207.98.64.41) 2.981 ms 2.588 ms 2.044 ms
     5 ptld-isp.nero.net (207.98.64.2) 6.059 ms 6.347 ms 6.340 ms
     6 906.Hssi8-0-0.GW1.POR2.ALTER.NET (157.130.176.33) 6.780 ms 18.495 ms
       Serial4-0.GW1.POR2.ALTER.NET (157.130.179.77) 6.313 ms
     7 121.ATM3-0.XR2.SEA4.ALTER.NET (146.188.200.190) 18.591 ms 21.871 ms 19.717 ms
     8 292.ATM2-0.TR2.SEA1.ALTER.NET (146.188.200.218) 18.572 ms 12.785 ms 12.776 ms
     9 110.ATM7-0.TR2.EWR1.ALTER.NET (146.188.137.77) 68.878 ms 69.940 ms 81.317 ms
    10 196.ATM7-0.XR2.EWR1.ALTER.NET (146.188.176.85) 86.459 ms 71.227 ms 100.778 ms
    11 192.ATM9-0-0.GW1.NYC2.ALTER.NET (146.188.177.61) 93.644 ms 73.741 ms 74.459 ms
    12 AngolaTel-gw.customer.ALTER.NET (157.130.4.242) 618.036 ms 641.660 ms 636.724 ms
    13 194.133.121.1 (194.133.121.1) 628.084 ms 626.335 ms 638.627 ms
    14 194.133.121.8 (194.133.121.8) 640.332 ms 637.330 ms 667.900 ms

•   Dual fractional DS3s==multiple gateway addresses we need to watch out for...
                  Sample traceroute #4
•   % traceroute www.rcn.com
    traceroute to www.rcn.com (207.172.3.77), 30 hops max, 40 byte packets
     1 cisco3-gw.uoregon.edu (128.223.142.1) 0.609 ms 0.473 ms 0.480 ms
     2 cisco7-gw.uoregon.edu (128.223.2.7) 0.595 ms 0.498 ms 0.544 ms
     3 eugene-hub.nero.net (207.98.66.11) 1.643 ms 1.390 ms 1.671 ms
     4 eugene-isp.nero.net (207.98.64.41) 1.915 ms 1.827 ms 1.721 ms
     5 xcore2-serial0-1-0-0.SanFrancisco.cw.net (204.70.32.5) 12.065 ms 12.759 ms 12.053 ms
     6 corerouter2.SanFrancisco.cw.net (204.70.9.132) 12.410 ms 17.748 ms 19.462 ms
     7 acr2-loopback.SanFranciscosfd.cw.net (206.24.210.62) 14.585 ms 19.097 ms 14.700 ms
     8 aar1-loopback.SanFranciscosfd.cw.net (206.24.210.2) 18.101 ms 14.707 ms 15.468 ms
     9 rcn.SanFranciscosfd.cw.net (206.24.216.190) 15.914 ms 17.202 ms 17.063 ms
    10 10.65.101.3 (10.65.101.3) 16.201 ms 15.708 ms 17.053 ms
    11 pos5-0-0.core1.lnh.md.rcn.net (207.172.19.253) 97.268 ms 95.429 ms 94.204 ms
    12 poet6-0-1.core1.spg.va.rcn.net (207.172.19.234) 96.311 ms
        poet5-0-0.core1.spg.va.rcn.net (207.172.19.246) 102.148 ms
        poet6-0-1.core1.spg.va.rcn.net (207.172.19.234) 100.850 ms
    13 fe11-1-0.core4.spg.va.rcn.net (207.172.0.135) 96.572 ms 98.770 ms 96.906 ms
    14 corporate-2.web.rcn.net (207.172.3.77) 98.016 ms 97.085 ms 98.480 ms
                  Sample traceroute #5
•   % traceroute www.digilink.net
    traceroute to www.digilink.net (205.147.0.100), 30 hops max, 40 byte packets
     1 cisco3-gw.uoregon.edu (128.223.142.1) 0.777 ms 0.850 ms 0.398 ms
     2 cisco7-gw.uoregon.edu (128.223.2.7) 0.515 ms 0.536 ms 0.563 ms
     3 verio-gw.oregon-ix.net (198.32.162.6) 1.373 ms 1.395 ms 1.063 ms
     4 d3-0-1-0.r01.ptldor01.us.ra.verio.net (206.163.3.133) 4.023 ms 4.379 ms 4.037 ms
     5 ge-5-0.r00.ptldor01.us.bb.verio.net (129.250.31.222) 12.074 ms 24.473 ms 25.085 ms
     6 p1-1-1-3.r01.ptldor01.us.bb.verio.net (129.250.2.38) 4.977 ms 18.582 ms 5.035 ms
     7 p4-4-2.r02.sttlwa01.us.bb.verio.net (129.250.3.37) 8.746 ms 9.627 ms 9.983 ms
     8 p4-1-0-0.r03.sttlwa01.us.bb.verio.net (129.250.2.230) 8.715 ms 9.062 ms 10.871 ms
     9 sea3.pao6.verio.net (129.250.3.89) 27.449 ms 27.315 ms 26.929 ms
    10 p4-1-0-0.r00.lsanca01.us.bb.verio.net (129.250.2.114) 35.008 ms 35.567 ms 34.807 ms
    11 p4-0-0.r01.lsanca01.us.bb.verio.net (129.250.2.206) 34.857 ms 34.972 ms 34.764 ms
    12 fa-6-0-0.a01.lsanca01.us.ra.verio.net (129.250.29.141) 124.100 ms 36.806 ms 34.957 ms
    13 uscisi-pl.customer.ni.net (209.189.66.66) 37.148 ms 36.604 ms 35.889 ms
    14 130.152.128.1 (130.152.128.1) 37.993 ms 37.343 ms 37.400 ms
    15 border1-100m-ln.digilink.net (207.151.118.22) 39.490 ms 38.334 ms 37.764 ms
    16 la1.digilink.net (205.147.0.100) 38.755 ms 37.997 ms 39.237 ms
  Couple of Slight Problem(s)...
• Manually tracerouting to several hundred
  hosts gets to be really tedious, really fast

• Traffic could (and did) shift w/o notice
  (sometimes dramatically, judging from our
  MRTG graphs), yet we'd only end up doing
  traceroutes on rare occasions, often missing
  interesting phenomena
     And once we were done
 tracerouting all over the place...
• We still needed to consolidate that
  information into a useful format (other than
  jotted notes on the back of recycled memos)

• We learned that we did badly when it came
  to noticing changes in exit behavior for one
  or two particular hosts out of hundreds
 Stage 2 Exit Selection Analysis:
     "Let's write a filter!!!"
• Simple repetitive task ==> create a filter
• Input: list of FQDNs or dotted quads whose
  exit selection we want to monitor
• Output: same as input list, but with exit info
  prepended to each entry
• Approach: do a traceroute, grep for the
  gateways, tag and print the exits
  accordingly. Write in perl. Sort by exit.
         Stage 2 Philosophy:
• Like Unix itself, build small simple tools
  that can be chained together to do larger,
  complex tasks
• "If we can just see where the traffic is
  going, that will be good enough…."
              Stage 2 code...
• #!/usr/local/bin/perl
  $Cmd = "/usr/etc/traceroute";
  # save STDERR, then divert traceroute noise to a scratch file
  open (SAVERR,">&STDERR");
  open (STDERR,">temp.tmp");
  while ($Host = <>) {
    chop $Host; open(IN,"$Cmd $Host |");
    while($line = <IN>) {
    # tag the host by which known gateway appears in its trace
    if ($line =~ m/198\.32\.163\.10/) {
       print "OGIG-S: $Host\n"; }
    elsif ($line =~ m/198\.32\.163\.14/) {
       print "OGIG-D: $Host\n"; }
          Stage 2 code (cont.)
• [continued]

  elsif ($line =~ m/157\.130\.176\.33/) {
       print "UUNet: $Host\n"; }
  elsif ($line =~ m/157\.130\.179\.7/) {
       print "UUNet: $Host\n"; }
  elsif ($line =~ m/204\.70\.32\.5/) {
       print "CWIX: $Host\n"; }
  elsif ($line =~ m/198\.32\.162\.6/) {
        print "Verio: $Host\n"; }
  }
  close(IN);
  }
      Sample Stage 2 Run...
• % cat hosts.foo | ./chexit

 Verio:     newsfeed.digilink.net
 CWIX:      news.ciateq.mx
 [etc.]

 and/or pipe it through sort.
       Lessons from Stage 2...
• Even automated, it can take a long time to
  traceroute all the way to several hundred
  sites, particularly if you go a full thirty hops
  (e.g., hit a firewall and loop); added -m 6
  option to the traceroute (max hops of 6)
• Default time-to-wait is too long (use -w 2)
• If doing this on an automated basis, there's
  no need to resolve addresses on the traces
  (add -n option)
   Lessons from Stage 2 (cont.)
• No need to send three packets; one will
  usually be enough (and will be faster than
  sending three) -- add the option -q 1
  (combined invocation sketched below)
• Still can't spot what's changed between runs
• Other folks want to see the output; need
  some way to share this data with others
• Phenomena existed that we hadn't expected
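• Putting those flags together (the revised invocation
  promised above), the Stage 2 filter's command line
  becomes a one-line change:

  # cap at 6 hops, wait at most 2 seconds per probe,
  # skip DNS resolution, and send 1 probe per hop
  $Cmd = "/usr/etc/traceroute -m 6 -w 2 -n -q 1";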
      Unexpected phenomena
           included...
• We learned that some peers oscillated
  routinely between commodity providers
• Needed to think about what to do when sites
  became completely unreachable
• Decided that maybe we should draw an
  inference when ALL hosts which had been
  using a given exit suddenly stopped
  doing so
 Stage 3 Exit Selection Analysis:
    The Great Webification...

• Abiel Reinhart joined us as an intern, and
  we needed a project for him to work on.
         Goals of Stage 3….
• We wanted the perl filter converted into a
  web cgi-bin with two main deliverables:

  (1) snapshot web page showing current
  mapping of specified peers to exits, and
  (2) change file web page showing what
  peers had changed exits (diff step sketched
  below)
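• A minimal sketch of the diff step behind deliverable (2);
  the file names and "EXIT: host" record format here are our
  assumptions, not the actual CGI code:

  #!/usr/local/bin/perl
  # Compare the previous run's host-to-exit map against the
  # current one; emit one line per peer whose exit changed.
  open(OLD,"snapshot.prev") || die "no previous snapshot";
  while (<OLD>) { ($exit,$host) = split; $old{$host} = $exit; }
  close(OLD);
  open(NEW,"snapshot.cur") || die "no current snapshot";
  while (<NEW>) {
     ($exit,$host) = split;
     if (defined($old{$host}) && $old{$host} ne $exit) {
        print "$host\t$old{$host} --> $exit\n";
     }
  }
  close(NEW);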
     Sample snapshot.html output
     (see http://darkwing.uoregon.edu/~joe/snapshot.html)
•   This report generated at 16:30:00 on 4/9/100

    CWIX (68)

                         feederbox.skynet.be
                              news-in.tvd.be
                  su-news-hub1.bbnplanet.com
                                      [etc.]
                    newnews.mikom.csir.co.za

    OGIG-D (62)

                        poseidon.acadiau.ca
                     newsflash.concordia.ca
                     cyclone-i2.mbnet.mb.ca
                                     [etc.]
                            news01.sunet.se
    [etc.]
    Interpreting snapshot.html
• Each exit has its own section
• Total number of peers using that exit is
  reported in parentheses
• Peers are alphabetized by reversed FQDN
  within section (sort sketched below)
• Updated every fifteen minutes (via cron)
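• A minimal sketch of the reversed-FQDN ordering (the sort
  routine name is ours; the actual CGI may differ):

  # Reversing the dot-separated labels makes hosts in the
  # same domain sort adjacently (*.uoregon.edu clusters, etc.)
  sub by_reversed_fqdn {
     join(".", reverse(split(/\./, $a))) cmp
     join(".", reverse(split(/\./, $b)));
  }
  @peers = sort by_reversed_fqdn @peers;

  # the fifteen-minute cycle is plain cron; path hypothetical:
  # 0,15,30,45 * * * * /usr/local/bin/chexit-snapshot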
     Sample changes.html output
     (see http://darkwing.uoregon.edu/~joe/changes.html)
•   15:15:00 sonofmaze.dpo.uab.edu     UUNet --> OGIG-D   [green]

    14:45:00 feeder.nmix.net           CWIX --> UUNet     [black]
    [etc.]
    12:45:00 news.maxwell.syr.edu      CWIX --> OGIG-D    [green]
    [etc.]
    06:30:00: Note: OGIG-D is now being used.             [green]
    06:30:00: Note: OGIG-S is now being used.             [green]
    [etc]
    06:30:00 hardy.tc.umn.edu          UUNet --> OGIG-D   [green]
    06:30:00 news-ext.gatech.edu       CWIX --> OGIG-D    [green]
    06:30:00 news-feeder.sdsmt.edu     Verio --> OGIG-D   [black]
    06:30:00 news.alaska.edu           Verio --> OGIG-S   [black]
    [etc]
    06:15:00: Warning: OGIG-D is not being used.          [red]
    06:15:00: Warning: OGIG-S is not being used.          [red]
    06:15:00 hardy.tc.umn.edu          OGIG-D --> UUNet   [red]
    06:15:00 news-ext.gatech.edu       OGIG-D --> CWIX    [red]
    06:15:00 news-feeder.sdsmt.edu     OGIG-D --> Verio   [black]
    06:15:00 news.alaska.edu           OGIG-S --> Verio   [black]
     Interpreting changes.html
• Entries are in reverse chronological order
• "Positive" changes are green, "negative"
  changes are red, neutral changes are black
• Spaces separate clumps of entries (by time)
• Example shows actual routing changes from
  4/9 to real news peers
• Also captured a 6:15-6:30 OGIG
  maintenance window (Greg loading
  120-10.3.S1 on the GSR...)
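• The coloring above is consistent with a simple rule (our
  reconstruction from the sample, not necessarily the actual
  code): paid transit --> HPC is "positive", HPC --> paid
  transit is "negative", and everything else is neutral:

  # Exit classes are guesses inferred from the sample output
  %class = ("UUNet" => "transit", "CWIX" => "transit",
            "Verio" => "peer",
            "OGIG-S" => "hpc",    "OGIG-D" => "hpc");
  sub change_color {
     my($from, $to) = @_;
     return "green" if ($class{$from} eq "transit" &&
                        $class{$to}   eq "hpc");
     return "red"   if ($class{$from} eq "hpc" &&
                        $class{$to}   eq "transit");
     return "black";
  }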
       Lessons from Stage 3:
• Abiel did a great job. :-)
• Need to set up a policy to handle growth of
  the changes page (we set a max file size of
  1000 lines; trimming sketch below)
• Running the code from a subnet other than
  the one that you're really interested in is
  okay, but not ideal (misses local breakage,
  and any applicable locally applied static
  routes for hosts with multiple interfaces)
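• One way to implement the 1000-line cap mentioned above (a
  sketch; the real policy code may differ):

  # changes.html is newest-first, so trim from the bottom
  open(IN,"changes.html") || die;
  @lines = <IN>;
  close(IN);
  if (@lines > 1000) {
     open(OUT,">changes.html") || die;
     print OUT @lines[0 .. 999];
     close(OUT);
  }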
   Lessons from Stage 3 (cont.)
• Selection of hosts to trace to may be crucial
  (and should be the actual host(s) you really
  care about, not "straw dogs" such as
  www.<domain>.edu) since at least some I2
  schools are selectively advertising just
  portions of their network blocks
   Lessons from Stage 3 (cont.)
• Assumption of symmetric routes is a bad
  one to make. Practically speaking, this
  means that if you nail up I2-only routes
  (based on exit data or other info), you do so
  at your peril. This problem may be
  relatively widespread; see, for example,
  Hank Nussbacher's paper:
  http://www.internet-2.org.il/i2-asymmetry/index.html
   Lessons from Stage 3 (cont.)
• General case: you can build useful network
  monitoring tools by automating simple
  interactively accessible building blocks
• Next project likely to be focused on
  generating daily SNMP-ish input/output
  octet summaries for all connectors of
  interest via the Abilene core node router
  proxy web interface
Want to try monitoring your exits?
• Send email to joe@oregon.uoregon.edu and
  we'll be glad to share our code with you.
• Code includes some UO specific stuff that
  you'll need to manually tweak (it isn't really
  productized for automatic/"zombie"
  installation); you'll also see it does some
  extra stuff we haven't talked about (like
  generating I2-only static route entries), but
  you can simply comment that out.
