Internet2 Network Performance Workshop Exercises

   1. NDT Testing
        a. Go to http://www.measurementlab.net
        b. Select "Test Your Internet Connection"
        c. Select "Network Diagnostic Tool"
        d. Select "Run the network diagnostic tool"
        e. Run the test by clicking "START"
        f. Questions to answer:
                 i. What is the performance from client to server?
                ii. What is the performance from server to client?
               iii. What does NDT think is the cause of the network performance
                    problems (if any)?
               iv. Look at the “Statistics” and “More Details” buttons. What
                    additional information are you learning?

   2. NPAD Testing
        a. Go to http://www.measurementlab.net
        b. Select "Test Your Internet Connection"
        c. Select "Network path & application diagnostics"
        d. Select "Run NPAD test"
        e. Select an RTT and Bandwidth number (e.g. try 30ms and 10Mbps).
        f. Run the test by clicking "Start Test"
        g. Questions To Answer:
                i. Was your host able to send at the requested rate/latency?
               ii. What caused the failures, if any?
              iii. Change the latency or bandwidth until you are able to complete
                   a successful test. What latency/bandwidth did you choose?

For all subsequent questions you will need to log in to the hosts; login
information will be distributed in class. With the username/password, use SSH
to log in to "npw.internet2.edu". Then open a web browser and point it to
"npw.internet2.edu". Note that this is a deliberately "broken" network: there
are problems that you will be attempting to locate using the tools.

   3. NTP Investigation
        a. Investigate the NTP configuration on the "head" host.
        b. You will need to use the "ntpq" command to do this. You can learn
            about ntpq by using "man ntpq". In particular, learn what commands
            will display the NTP peers.
      c. Questions To Answer:
            i. What are the peers of the "head" host?
   [zurawski@head ~]$ ntpq -p
        remote           refid      st t when poll reach   delay   offset jitter
   ==============================================================================
   +nms-rlat.newy32 .PPS.            1 u   80 1024 377    18.861   -0.137   0.008
   *nms-rlat-eth1.w .IRIG.           1 u 192 1024 377     13.659   -0.064   0.025
    nms-rlat.chic.n 64.57.16.34      2 u   44 1024 377    30.265   -0.200   0.001
   +nms-rlat.hous.n .IRIG.           1 u 1018 1024 377    50.543   -0.048   0.075
   -nms-rlat.salt.n 132.163.4.103    2 u    5 1024 377    65.336    0.786   0.062
   -nms-rlat.losa.n 64.57.16.162     2 u 1023 1024 377    82.511    1.021   0.226


              ii. Which host is the "head" host synchronizing against?

   The 2nd host (denoted with a “*”). The 1st and 4th hosts are candidates
   (denoted with a “+”), and the 5th and 6th hosts have been ruled out (denoted
   with a “-”). Note that these rankings will change as new data comes in from
   each peer on the “poll” interval of 1024 seconds. The “when” column shows
   how many seconds have elapsed since that peer was last polled.
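The peer-selection markers can be read mechanically. As a quick illustration, a minimal sketch of picking out the system peer from "ntpq -p" output (the column layout is assumed from the sample above; this is not an official parser):

```python
# Minimal sketch of reading ntpq -p tally codes (layout assumed from the
# output above; not an official parser).
TALLY = {
    "*": "system peer (currently synchronized against)",
    "+": "candidate",
    "-": "outlier (ruled out)",
}

def sync_peer(ntpq_output):
    """Return the 'remote' field of the line tallied '*', or None."""
    for line in ntpq_output.splitlines():
        if line.startswith("*"):
            # Strip the tally character, then take the first column.
            return line[1:].split()[0]
    return None

sample = """\
+nms-rlat.newy32 .PPS.            1 u   80 1024 377   18.861  -0.137  0.008
*nms-rlat-eth1.w .IRIG.           1 u  192 1024 377   13.659  -0.064  0.025
-nms-rlat.salt.n 132.163.4.103    2 u    5 1024 377   65.336   0.786  0.062
"""
print(sync_peer(sample))  # -> nms-rlat-eth1.w
```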

             iii. What is the "stratum" of the "head" host?

   The Head host is synchronized against a Stratum 1 host, making it a Stratum
   2.

4. Testbed Layout
      a. There are 4 hosts in our network:
              i. head
             ii. red-pc1
            iii. green-pc1
            iv. blue-pc1




                             Figure 1 - Network Map
   b. Your key will work on each, so feel free to SSH as needed. Check the
      status of NTP on each host.

We don’t need to SSH to each host to do this; ntpq accepts a hostname
argument:
[zurawski@head ~]$ ntpq -p red-pc1
     remote            refid     st t when poll reach   delay   offset jitter
==============================================================================
+nms-rlat.newy32 .PPS.            1 u 131 1024 377     18.874    0.151   0.069
*nms-rlat-eth1.w .IRIG.           1 u 251 1024 377     13.653    0.350   0.092
 nms-rlat.chic.n 64.57.16.34      2 u 139 1024 377     30.286    0.320   0.086
+nms-rlat.hous.n .IRIG.           1 u   99 1024 377    50.572    0.244   0.129
-nms-rlat.salt.n 132.163.4.103    2 u   27 1024 377    65.369   -0.041   0.048
-nms-rlat.losa.n 64.57.16.162     2 u 176 1024 377     82.557    1.268   0.468

[zurawski@head ~]$ ntpq -p blue-pc1
     remote           refid      st t when poll reach   delay   offset jitter
==============================================================================
 s250.internet2. .GPS.            1 u 16h 1024     0    1.370    0.466   0.000
+nms-rlat.newy32 .PPS.            1 u 212 1024 377     18.867   -0.353   0.026
*nms-rlat-eth1.w .IRIG.           1 u 239 1024 377     13.637   -0.302   0.764
 nms-rlat.chic.n 64.57.16.34      2 u 915 1024 377     30.245   -0.554   0.127
+nms-rlat.hous.n .IRIG.           1 u 988 1024 377     50.561   -0.394   0.116
-nms-rlat.salt.n 132.163.4.103    2 u 968 1024 377     65.326    0.618   0.063
-nms-rlat.losa.n 64.57.16.162     2 u 204 1024 377     82.516    0.584   0.285

[zurawski@head ~]$ ntpq -p green-pc1
     remote           refid      st t when poll reach   delay   offset jitter
==============================================================================
 s250.internet2. .GPS.            1 u 17h 1024     0    1.405    0.369   0.000
+nms-rlat.newy32 .PPS.            1 u    9 1024 377    20.355    0.494   0.033
*nms-rlat-eth1.w .IRIG.           1 u 948 1024 377     13.654   -0.202   0.028
 nms-rlat.chic.n 64.57.16.34      2 u 947 1024 377     30.242   -0.370   3.920
+nms-rlat.hous.n .IRIG.           1 u 894 1024 377     50.548   -0.185   0.010
-nms-rlat.salt.n 132.163.4.103    2 u 915 1024 377     65.345    0.720   0.022
-nms-rlat.losa.n 64.57.16.162     2 u 971 1024 377     82.527    0.620   0.207


   c. Use "ping" to investigate how the various hosts are connected. You
      can find out more information about ping by running "man ping"
   d. Questions To Answer:
          i. Are all of the clocks stable? If not, which host(s) is(are)
             suspect?

Currently the clocks seem stable. To facilitate discussion, here is an example
from Red-pc1 before the clock had a chance to stabilize. We can try another
command (e.g. ntptime) to see the error calculation:
[zurawski@red-pc1 ~]$ ntptime
ntp_gettime() returns code 0 (OK)
  time d11b5c2a.84038000 Fri, Mar 4 2011 7:40:10.515, (.515678),
  maximum error 245715 us, estimated error 3060 us
ntp_adjtime() returns code 0 (OK)
  modes 0x0 (),
  offset 3440.000 us, frequency -103.558 ppm, interval 1 s,
  maximum error 245715 us, estimated error 3060 us,
  status 0x1 (PLL),
  time constant 2, precision 1.000 us, tolerance 512 ppm,

Note the “estimated error” of 3060 microseconds, as well as the offset of
3440 microseconds. Compare this with head (which is stable):
   [zurawski@head ~]$ ntptime
   ntp_gettime() returns code 0 (OK)
     time d11b5c0e.eb99a000 Fri, Mar 4 2011 7:39:42.920, (.920313),
     maximum error 366705 us, estimated error 519 us
   ntp_adjtime() returns code 0 (OK)
     modes 0x0 (),
     offset -71.000 us, frequency 122.796 ppm, interval 1 s,
     maximum error 366705 us, estimated error 519 us,
     status 0x1 (PLL),
     time constant 6, precision 1.000 us, tolerance 512 ppm,

   The measurement tools can use the NTP error estimation to give the user a
   better estimate of the validity of a measurement, but with the clocks this
   far off we would not be able to trust results to and from red-pc1 at face
   value. Once red-pc1 stabilized (its current state, if you were to check
   now), measurement is possible and reliable.
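This rule of thumb can be mechanized. The sketch below flags a clock whose ntptime offset or estimated error would dominate sub-millisecond one-way delays; the 1 ms threshold is an illustrative assumption, not an NTP-defined limit:

```python
# Sketch: flag a clock whose offset/estimated error (in microseconds, as
# reported by ntptime) would dominate one-way delay measurements.  The 1 ms
# threshold is an illustrative assumption, not an NTP-defined limit.
def clock_usable(offset_us, est_error_us, threshold_us=1000.0):
    return abs(offset_us) < threshold_us and est_error_us < threshold_us

# Values taken from the ntptime outputs above:
print(clock_usable(3440.0, 3060.0))  # red-pc1 before stabilizing -> False
print(clock_usable(-71.0, 519.0))    # head -> True
```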

   Possible reasons for NTP being far off:

      - Bad host clock
      - Network congestion affecting NTP results
      - Connectivity loss to clocks (unlikely; note that the “reach” value
        from ntpq is high)


              ii. Annotate the above diagram of all the hosts, add IP addresses
                  and hostnames, their links to other hosts, and the round-trip
                  time between each host.




                              Figure 2 - RTT Values

5. OWAMP Investigation
     a. Use the "owping" tool to discover the one-way latencies between the
        hosts. You can find out more information on owping by running "man
        owping".
     b. Questions To Answer:
           i. Using the information from above, draw another diagram of all
              the hosts, and the one-way delay between each host on all
              links. Take note of any loss or duplicate packets seen (May
              need to investigate by running the tool more than once, or
              running with a longer number of packets than the default of
              100)




                        Figure 3 - OWD Values

Collection notes:

   - The loss between head and blue-pc1 is very small; a longer test
     (e.g. owping -c 1000) was required to detect it.
   - The duplication between red-pc1 and green-pc1 was also small; a longer
     test was needed.
   - The packet corruption between head and red-pc1 is tricky: OWAMP will
     report these simply as losses (i.e. it can’t read the packets). Another
     tool, such as SNMP monitoring (or something lower, perhaps at the
     optical layer), would be needed to reveal why the packets were being
     lost.
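The need for longer tests follows directly from probability: with loss rate p, an n-packet test sees zero losses with probability (1 - p)^n. A sketch of the sample size required (illustrative math, not owping's own logic):

```python
import math

# With loss probability p, an n-packet owping sees no loss with probability
# (1 - p)**n.  This finds the smallest n giving a `confidence` chance of
# observing at least one loss (illustrative; not owping's own calculation).
def packets_needed(p, confidence=0.95):
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

print(packets_needed(0.001))  # ~0.1% loss: far more than the default 100
```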
           ii. Based on the results of the NTP comparison, how will the
               stability of the clocks change the measurement results?

In our example network, there is not an issue with NTP on any hosts. To
fabricate an example, consider the following from head and red-pc1 (which
did have a failing clock at the time):
[zurawski@head ~]$ owping red-pc1
Approximately 13.0 seconds until results available

--- owping statistics from [head]:43803 to [red-pc1]:47627 ---
SID:   c0a80002d11b60d750ad9ad84221d0d9
first: 2011-03-04T08:00:08.561
last: 2011-03-04T08:00:18.081
100 sent, 8 lost (8.000%), 0 duplicates
one-way delay min/median/max = 39.2/39.8/40.6 ms, (err=4.61 ms)
one-way jitter = nan ms (P95-P50)
TTL not reported
no reordering


--- owping statistics from [red-pc1]:47744 to [head]:50238 ---
SID:   c0a80001d11b60d75708bb906a28aedd
first: 2011-03-04T08:00:08.488
last: 2011-03-04T08:00:20.319
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = -19.6/-18.3/-17.3 ms, (err=4.61 ms)
one-way jitter = 0.3 ms (P95-P50)
TTL not reported
no reordering


There are two things to note from the output:

   - The red-pc1 to head direction (2nd output) features “negative” latency.
   - Both directions feature a high error estimate (4.61 ms).

Taken alone, these values are worthless, although some of the error can be
mitigated: “adding” the negative and positive latencies together recovers the
full RTT (in this case 39 + -19 = 20 ms). This doesn’t help us much (we need
one-way latency) and highlights the fact that we require stable clocks.
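The cancellation can be sketched numerically. Assume (hypothetically) that red-pc1's clock runs 29 ms ahead of head's and the true one-way delay is 10 ms each way:

```python
# Hypothetical numbers: red-pc1's clock is 29 ms ahead of head's, and the
# true one-way delay is 10 ms in each direction.
theta = 29.0   # clock offset, ms (assumed for illustration)
d = 10.0       # true one-way delay, ms (assumed for illustration)

fwd = d + theta  # head -> red-pc1: arrival stamped by the fast clock
rev = d - theta  # red-pc1 -> head: departure stamped by the fast clock

# The offset cancels in the sum, recovering the true RTT.
print(fwd, rev, fwd + rev)  # -> 39.0 -19.0 20.0
```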

          iii. Based on the latency, duplicates and loss rates, which links do
               you think will perform best and which will perform the worst?

Let's consider some aspects of TCP to answer this question:

TCP will “ramp up” slowly until it experiences a loss. Whenever TCP takes a
loss, it cuts its sending rate in half, then slowly ramps up again until it
takes more loss. This is called “additive increase, multiplicative decrease”
(AIMD), and helps TCP be “fair” to all users. On a short path this behavior
is masked by the small RTT; on a longer path, the time it takes to reach a
higher speed grows with the RTT. This implies that longer paths with even
small amounts of loss will perform much worse than shorter paths with a large
amount of loss.
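This RTT/loss tradeoff is commonly summarized by the Mathis et al. approximation, throughput <= MSS / (RTT * sqrt(p)). A sketch with hypothetical numbers, contrasting a short lossy path with a long, nearly clean one:

```python
import math

# Mathis et al. approximation of TCP's sustained throughput ceiling:
#   rate <= MSS / (RTT * sqrt(p))
# Numbers below are hypothetical, chosen to contrast a short lossy path
# with a long, nearly clean one.
def mathis_mbps(mss_bytes, rtt_s, loss_rate):
    return (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate)) / 1e6

short_lossy = mathis_mbps(1448, 0.005, 0.01)    # 5 ms RTT, 1% loss
long_clean  = mathis_mbps(1448, 0.150, 0.0001)  # 150 ms RTT, 0.01% loss
print(round(short_lossy, 1), round(long_clean, 1))  # -> 23.2 7.7
```

Note the counterintuitive outcome: the short path carrying 1% loss still beats the long path with one-hundredth that loss rate.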

   Path Performance Predictions:

      - Head and red-pc1: The short RTT will most likely mask the high loss
        in one direction, and will make it unnoticeable in the other.
      - Head and green-pc1: The medium RTT has no reported problems. This
        path may be affected by network congestion.
      - Head and blue-pc1: The long RTT will experience performance problems
        due to the small amount of loss. Note that we see the loss in one
        direction only; this will influence the results if we think about how
        TCP works. E.g. is it bad to lose data packets? Is it bad to lose ACK
        packets? Think about how this will shape the throughput measurements.
      - Red-pc1 and green-pc1: The relatively short RTT will help to mask
        the small amount of duplication. Performance will not be as good as
        it could be, but will be close to expectation. Because of this, think
        about how long this problem might go unnoticed.
      - Red-pc1 and blue-pc1: The long RTT has no reported problems. This
        path may be affected by network congestion.
      - Green-pc1 and blue-pc1: The medium RTT has no reported problems.
        This path may be affected by network congestion. Think about the
        possible cause of the asymmetry as well: if this were a routing
        asymmetry, we may be using different network paths, and congestion on
        one versus the other may influence performance measurements (consider
        this a similar case to seeing loss in one direction).

6. BWCTL Investigation
     a. Use the "bwctl" tool to perform bandwidth tests between the hosts.
        You can find out more information on bwctl by running "man bwctl".
     b. Do ‘TCP’ testing for now.
     c. Questions To Answer:
            i. How long did you have to wait to get the results?

   Depending on which hosts you tested against, and how many others were
   testing at the same time, this wait could have been significant. BWCTL
   queues requests, serving only one test at a time.
           ii. Draw yet another diagram of all the hosts using the
               information from above, and note the bandwidth (in Mbps, use
               the ‘–f m’ option to get this format) between each host. Did the
               bandwidth match your expectations based on the information
               you found above?




                          Figure 4 - BW Values

Summary:

   - Head and red-pc1: The red-pc1->head direction will always perform well
     because we are not corrupting data packets; we are corrupting ACKs. TCP
     commonly uses “cumulative” and “selective” ACK schemes, so the loss of
     an ACK is not a big deal in terms of keeping up performance. The other
     direction is problematic. Due to the short RTT, TCP can recover from
     data loss quickly; there are times we could get full performance (e.g.
     for short uses of the network, such as a 5-10 second transfer), but
     longer transfers will suffer due to the constant loss. To place a
     real-world spin on this problem, consider that the problem was on a
     campus and affected your customers: if you had a population of about 95%
     people “downloading” and 5% “uploading”, a problem that affects the
     minority may go unreported.
   - Head and green-pc1: Performance is symmetric and as expected; there are
     minor blips due to congestion.
   - Head and blue-pc1: Performance in the direction without the loss meets
     expectations given the long path, but could be higher were it not for
     the loss of ACKs. Performance in the direction with the loss (head to
     blue-pc1) is extremely poor due to the loss combined with the long RTT.
   - Red-pc1 and green-pc1: Due to the short RTT and the duplication (a
     condition not as severe as loss) there is hardly any effect on the
     traffic.
   - Red-pc1 and blue-pc1: The lack of loss/duplication implies that this
     path is clean; congestion has some minor effects due to the long RTT.
   - Green-pc1 and blue-pc1: The lack of loss/duplication implies that this
     path is clean, and the asymmetry is close enough that RTT is not
     severely affected. Congestion may play a role in problems.

7. NDT Command Line
     a. Use the "web100clt" tool to perform ndt tests between the hosts. You
         can find out more information on web100clt by running "man
         web100clt" or “web100clt -h”.
     b. Launch these tests from the head node ONLY to the other hosts
     c. Use the “-d” and “-l” flags (sometimes more than once…) to get more
         information.
     d. Questions To Answer:
             i. NDT will deliver an answer on bandwidth that is similar to
                BWCTL, but with more information. What sort of information
                are you seeing, and does this agree with previous
                observations?

   For brevity, I will only attach one test, between blue-pc1 and head:
   [zurawski@blue-pc1 ~]$ web100clt -l -n head
   Testing network path for configuration and performance problems -- Using IPv4
   address
   Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
   checking for firewalls . . . . . . . . . . . . . . . . . . . Done
   running 10s outbound test (client to server) . . . . . 77.31 Mb/s
   running 10s inbound test (server to client) . . . . . . 1.82 Mb/s
   Server unable to determine bottleneck link type.
   Information: Other network traffic is congesting the link
   Information [C2S]: Packet queuing detected: 38.28% (local buffers)
   Information [S2C]: Packet queuing detected: 92.43% (local buffers)
   Server 'head' is not behind a firewall. [Connection to the ephemeral port was
   successful]
   Client is not behind a firewall. [Connection to the ephemeral port was successful]

          ------   Web100 Detailed Analysis   ------

   Web100 reports the Round trip time = 153.80 msec;the Packet size = 1448 Bytes; and
   There were 68 packets retransmitted, 285 duplicate acks received, and 351 SACK
   blocks received
   Packets arrived out-of-order 22.11% of the time.
   This connection is sender limited 76.71% of the time.
   This connection is network limited 22.00% of the time.

       Web100 reports TCP negotiated the optional Performance Settings to:
   RFC 2018 Selective Acknowledgment: ON
   RFC 896 Nagle Algorithm: ON
   RFC 3168 Explicit Congestion Notification: OFF
   RFC 1323 Time Stamping: ON
   RFC 1323 Window Scaling: ON; Scaling Factors - Server=13, Client=13
   The theoretical network limit is 1.88 Mbps
   The NDT server has a 32768 KByte buffer which limits the throughput to 1664.50
   Mbps
   Your PC/Workstation has a 656 KByte buffer which limits the throughput to 33.32
   Mbps
   The network based flow control limits the throughput to 20.97 Mbps

   Client Data reports link is ' -1', Client Acks report link is ' -1'
   Server   Data reports link is ' -1', Server Acks report link is ' -1'
   Packet   size is preserved End-to-End
   Server   IP addresses are preserved End-to-End
   Client   IP addresses are preserved End-to-End

   NDT reports similar throughput numbers and RTT. It is able to report that
   there are retransmissions, indicative of packet loss. The report of
   queuing in the local buffers is spurious; it is normally due to slow
   measurement operation through the kernel.
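The buffer-limit lines in the report follow from the window/RTT relationship: a sender can keep at most one buffer's worth of data in flight per round trip. A sketch using the figures above (exact agreement with NDT is not expected, since NDT averages RTT internally):

```python
# Throughput ceiling imposed by a socket buffer: at most one buffer of data
# can be in flight per round trip.  Units as in the NDT report above.
def buffer_limit_mbps(buffer_kbytes, rtt_ms):
    return (buffer_kbytes * 1024 * 8) / (rtt_ms / 1000.0) / 1e6

# The 656 KByte client buffer over the ~154 ms RTT caps throughput in the
# low-to-mid 30s of Mbps, in line with the 33.32 Mbps NDT reported.
print(round(buffer_limit_mbps(656, 153.8), 1))
```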

                ii. Are there any problems (e.g. buffer sizes, queueing) noted
                    between the hosts?

   Reports of “local” queueing are common because the NDT tool must send its
   measurements through the kernel; there is often a delay between when a
   packet is formed, passed through the kernel, and placed onto the wire via
   the NIC.

   Buffer sizes on all hosts are tuned from a ‘server’ point of view, but
   note that the NDT client will use auto-tuning.
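"Tuned" here means the server-side socket buffers can hold a full bandwidth-delay product (BDP). A sketch of the sizing rule (the target rate and RTT below are hypothetical examples):

```python
# Buffer needed to sustain a target rate over a given RTT (the BDP rule).
# Target rate and RTT below are hypothetical examples.
def bdp_kbytes(target_mbps, rtt_ms):
    return (target_mbps * 1e6 / 8) * (rtt_ms / 1000.0) / 1024

# A 1 Gbps flow over a ~150 ms path needs roughly an 18 MByte buffer.
print(round(bdp_kbytes(1000, 150)))
```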

8. NPAD Command Line
     a. Use the “pathdiag.py” tool to perform NPAD tests between the hosts.
        You can find out more information on pathdiag by running
        “/opt/npad/pathdiag.py -h”
     b. Questions to Answer:
             i. What local infrastructure problems has NPAD identified?

   For brevity, I will only attach one test, between blue-pc1 and head:
   [zurawski@blue-pc1 ~]$ /opt/npad/pathdiag.py -H head 150 90 -P 8000
   Parameters based on 151 ms initial RTT
   peakwin=1690555 minpackets=3 maxpackets=43 stepsize=4
   Target run length is 2771751 packets (or a loss rate of 0.00003608%)
   Test 1a (11 seconds): Coarse Scan
   Test 1b (11 seconds): ...
   …
   Test 2a (9 seconds): Search for the knee
   Accumulate loss statistics, no more than 510 seconds:

   The results describe the problems in achieving the throughput (e.g.
   excessive loss over a long RTT, plus a lack of sufficient buffering).

                ii. Does this meet what we have seen with the other tools?

   The results back up the observations from the other tools. Note that if
   we had started with either NPAD or NDT, their results would give us a
   reason to use the other tools for confirmation, the reverse of the
   approach we are taking here.

9. perfSONAR Tools Investigation
      a. As a way to verify your findings, we're running regular performance
         tests between the various hosts, and recording the results.
   b. Browse to http://npw.internet2.edu/toolkit/
           i. Select the "One-Way Latency" option from the "Service Graphs"
                 1. Look at each host, use a "4 hour" or “12 Hour” graph.
                      Does the performance match what you saw?
          ii. Select the "Throughput" option from the "Service Graphs"
                 1. Look at each host, use a “1 Hour” graph. Does the
                      performance match what you saw in the diagnostic
                      section? What can you notice if you look at the “1
                      Month” graph?
         iii. Select the "Head Ping Latency" option from the "Service
              Graphs"
                 1. Examine the graphs for host pairs that you perceive to
                      have a problem
         iv. Select the "Red PC Ping Latency" option from the "Service
              Graphs"
                 1. Examine the graphs for host pairs that you perceive to
                      have a problem
          v. Select the "Green PC Ping Latency" option from the "Service
              Graphs"
                 1. Examine the graphs for host pairs that you perceive to
                      have a problem
         vi. Select the "Blue PC Ping Latency" option from the "Service
              Graphs"
                 1. Examine the graphs for host pairs that you perceive to
                      have a problem
   c. Based on the knowledge of using the tools in a diagnostic fashion, how
      do the regular monitoring results compare?

The results confirm what we have seen. Automated alarming on these paths
could detect problems much sooner.

								