Network Anomaly Diagnosis
• Network performance metrics
• Active monitoring tools
• Preparation of canonical dataset
• Anomaly detection
• Anomaly diagnosis
Network performance metrics
• One-Way Delay (OWD)
– Serialization Delay
– Propagation Delay
– Queuing Delay
– Forwarding Delay
• Round-Trip Time (RTT)
• Delay Variation (Jitter)
• Packet Loss
• Packet Reordering
• Maximum Transmission Unit (MTU)
• Bandwidth Delay Product (BDP)
One-Way Delay (OWD)
• The time it takes for a packet to reach its end-to-
• It can be broken down into:
– per-hop one-way delays, and these in turn into:
• per-link and
• per-node delay components
• The per-link component of one-way delay consists
of two sub-components: propagation delay and
• The per-node component of one-way delay also
consists of two sub- components: forwarding
delay and queuing delay.
• Its the time taken to separate a packet into
sequential link transmission units (bits).
• It is obtained by dividing the packet size (in
bits) by the capacity of the link (in bits per
• Nowadays, as links increasingly have a
higher bit rate, serialization delay is less
• Propagation delay is the duration of time for
signals to move from the transmitting to the
receiving end of a link.
• On simple links, this is the product of:
– the link's physical length and
– the characteristic propagation speed of media.
• On high-speed wide-area network (WAN)
paths, delay is usually dominated by
• Queuing delay is defined as the time a
packet has to spend inside a node such as a
router while waiting for availability of the
• It depends on
– the amount of traffic competing to send packets
towards the output link and
– on the priorities of the packet
• Processing at the node
– Reading forwarding-relevant information
• Dest address + other headers
– Forwarding decision
• Based on the routing table
Round-Trip Time (RTT)
• Its the sum of the one-way delays from
source to destination plus time it takes B to
formulate the response.
• Large RTT values can cause problems for
TCP and other window-based transport
• The round-trip time influences the
achievable throughput, as there can only be
a window's worth of unacknowledged data
in the network.
Delay Variation (Jitter)
• OWD is not constant on a real network
because of competing traffic and contention
for processing resources.
• The difference between a given packet‘s
actual and average OWD is termed ‗delay
variation‘ or jitter.
• It only compares the delays experienced by
packets of equal size, as OWD is dependent
on packet size because of serialization delay
• Packet loss is determined as the probability of a
packet being lost in transit from a source A to a
• Applications requiring reliable transmission e.g.
Bulk data transfers, use retransmission, which
• In addition, congestion-sensitive protocols such as
standard TCP assume that packet loss is due to
congestion, and respond by reducing their
transmission rate accordingly.
• Congestion and errors are the two main reasons
for packet loss.
• When the offered load exceeds the capacity of a
part of the network, packets are buffered in
• Since these buffers are also of limited capacity,
congestion can lead to queue overflows, which
leads to packet losses.
• Congestion can be caused by
– moderate overload condition maintained for an
extended amount of time or
– by the sudden arrival of a very large amount of traffic
• Another reason for loss of packets is
corruption, where parts of the packet are
• When such corruptions happen on a link
(due to noisy lines etc.), this is usually
detected by a link-layer checksum at the
receiving end, which then discards the
• The Internet Protocol (IP) does not guarantee the
• Packet reordering concerns packets of different
– larger packets take longer to transfer and may be
overtaken by smaller packets in transit.
• It can be measured by:
– injecting the same traffic pattern via traffic generator
and calculating the reordering.
– To measure maximal reordering: short burst of long
packets immediately followed by a short burst of short
Maximum Transmission Unit (MTU)
• It describes the maximum size of an IP
packet that can be transferred over the link
• Common MTU sizes are:
– 1500 bytes (Ethernet, 802.11 WLAN)
– 4470 bytes (FDDI, common default for POS
and serial links)
– 9000 bytes (Internet2 and GÉANT convention,
limit of some Gigabit Ethernet adapters)
– 9180 bytes (ATM, SMDS)
Bandwidth Delay Product (BDP)
• The Bandwidth Delay Product (BDP) of an end-to-
end path is the product of the bottleneck bandwidth
and the delay of the path.
• It is often useful to think of BDP as the "memory
capacity" of a path, i.e. the amount of data that fits
entirely into the path between two end-systems.
• This relates to throughput, which is the rate at
which data is sent and received.
• Network paths with a large BDP are called Long
Fat Networks or LFNs.
• BDP is an important parameter for the
performance of window-based protocols such as
• dynamic bandwidth capacity (Cap) by
analyzing the minimum inter-packet delay;
• the cross-traffic (Xtr) byanalyzing the inter-
packet delay variability;
• and the available bandwidth
– (Abw = Cap – Xtr)
Active Probing Tools
• Some publicly available active network monitoring
• Throughput & Delay Measurement Tools
• Path Characterization & Bandwidth Estimation
• Uses ICMP echo mechanism
– Packet loss
• Source & Destination use timestamps to calculate
• Packet loss can be measured by sending a set of
packets from a source to a destination and
comparing the number of received packets against
the number of packets sent.
Repeat count Packet size
syrup:/home$ ping -c 6 -s 64 thumper.bellcore.com
PING thumper.bellcore.com (184.108.40.206): 64 data bytes
72 bytes from 220.127.116.11: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 18.104.22.168: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 22.214.171.124: icmp_seq=3 ttl=240 time=1447.4 ms Missing seq #
72 bytes from 126.96.36.199: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 188.8.131.52: icmp_seq=5 ttl=240 time=482.1 ms
--- thumper.bellcore.com ping statistics --- 6 packets transmitted, 5
packets received, 16% packet loss round-trip min/avg/max =
• Determines the route a packet takes through the
Internet to reach its destination; i.e. the number of
"hops" it takes.
• It sends "probe" packets with TTL values
incrementing from one, and uses ICMP "Time
Exceeded" messages to detect "hops" on the way
to the specified destination.
• It also records "response" times for each hop, and
displays losses and other types of failures
• Used to measure
– hop-by-hop connectivity and
• UDP/ICMP tool to show route packets take from local to
remote host Max hops
17cottrell@flora06:~>traceroute -q 1 -m 20 lhr.comsats.net.pk
traceroute to lhr.comsats.net.pk (184.108.40.206), 20 hops max, 40 byte packets
1 RTR-CORE1.SLAC.Stanford.EDU (220.127.116.11) 0.642 ms
2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (18.104.22.168) 0.616 ms
3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (22.214.171.124) 0.716 ms
4 snv-slac.es.net (126.96.36.199) 1.377 ms
5 nyc-snv.es.net (188.8.131.52) 75.536 ms
6 nynap-nyc.es.net (184.108.40.206) 80.629 ms
7 gin-nyy-bbl.teleglobe.net (220.127.116.11) 154.742 ms
8 if-1-0-1.bb5.NewYork.Teleglobe.net (18.104.22.168) 137.403 ms
9 if-12-0-0.bb6.NewYork.Teleglobe.net (22.214.171.124) 135.850 ms
10 126.96.36.199 (188.8.131.52) 128.648 ms No response:
11 184.108.40.206 (220.127.116.11) 762.150 ms Lost packet or router
12 islamabad-gw2.comsats.net.pk (18.104.22.168) 751.851 ms ignores
14 lhr.comsats.net.pk (22.214.171.124) 827.301 ms
• It reports bandwidth, delay jitter, datagram loss.
• It has two modes
• In TCP mode it is used to measure the bandwidth.
• It reports MSS (Maximum Segment Size) and the
observed read sizes. It supports TCP window size
adjustment via socket buffers.
• In UDP mode packet loss and delay jitter can be
• Thrulay stands for THRUput and deLAY.
• It measures the capacity of a network by
sending a bulk TCP stream over it.
• One-way delay can be measured by sending
a Poisson stream (arrivals at random) of
very precisely positioned UDP packets.
• Pathload is based on the technique of Self-Loading
Periodic Streams (SLoPS), for measuring available
• A periodic stream in SLoPS consists of K packets of size
L, sent to the path at a constant rate R.
• If the stream rate R is higher than the available bandwidth,
the one-way delays of successive packets at the receiver
show an increasing trend.
• In pathload, sender uses UDP for packet stream and TCP
connection for controlling measurements.
• Sender timestamps each packet upon transmission, which
provides relative one-way delay at the receiver side.
• pathChirp is used to calculate unused
capacity or available bandwidth.
• Based on the concept of self-induced
– Probing rate < available bw no delay increase
– Probing rate > available bw delay increases
• Unique to pathChirp is an exponentially
spaced chirp probing train.
• ABwE is a light weight available bandwidth
measurement tool based on the packet pair
• It uses fixed size packets with specific delay
between each packet pair.
• The observed packet pair delay is converted into
available bandwidth calculations.
• The tool can be used in continuous mode and
detects all substantial bandwidth changes caused
by improper routing or by congestions.
• Netperf is a benchmark that can be used to
measure the performance of many different types
• It provides tests for both unidirectional
throughput, and end-to-end latency.
• Essentially, these tests will measure how fast one
system can send data to another and/or how fast
that other system can receive it.
• Request/response performance is the second area
that can be investigated with netperf.
Selection of tools
• Shriram et al. concludes his comparison of public End-to-
End bandwidth estimation tools on High-Speed Links with
– ―Pathload and Pathchirp are the most accurate. Iperf requires
maximum buffer size and is sensitive to small packet loss.‖
• Ribeiro et al. in comparisons of pathChirp with pathload
and TOPP, clearly show that pathChirp performs far batter
than the other two tools in single-hop, multi-hop scenarios
and on real internet as well.
• Cottrel et al. states ―Ping utility does not require any extra
installation/deployment efforts and has become a de-facto
standard for measuring RTT and packet loss.‖
• Tools selected are pathChirp, pinger, thrulay and
• Cottrell L., Internet Monitoring, Presented at NUST
Institute of Information Technology (NIIT) Rawalpindi,
Pakistan, March 15, 2005
• Cottrell L., Matthews W. and Logg C., Tutorial on Internet
Monitoring & PingER at SLAC
• Jacobson V., ―pathchar — a tool to infer characteristics of
• Ribeiro V., Riedi R., Baraniuk R., Navratil J., Cottrell L.
pathChirp: Efficient Available Bandwidth Estimation for
• Jain M., Dovrolis C., End-to-End Available Bandwidth:
Measurement Methodology, Dynamics, and Relation with
TCP Throughput. ACM SIGCOMM 2002.
• Navratil J. and. Cottrell L., ABwE: A Practical Approach
to Available Bandwidth Estimation.
• Beginner's Guide to Network-Distributed Resource Usage
• Cottrell, L., Logg, C.: Overview of IEPM-BW Bandwidth Testing of Bulk
Data Transfer. In: Sc2002: High Performance Networking and Computing.
• Shriram A., Murray M., Hyun Y., Brownlee N., Broido A., Fomenkov M.,
claffy k., ―Comparison of Public End-to-End Bandwidth Estimation tools on
• Prasad R. S., Dovrolis C., ―Capacity estimation using packet dispersion
techniques : recent progress and open issues‖ Presentation
• Labit Y., Owezarski P., Larrieu N., Evaluation of Active Measurement Tools
for Bandwidth Estimation in Real Environment.
• Dovrolis C., Jain M., ―End-to-end available bandwidth estimation using
• claffy k., Dovrolis C., ―Bandwidth estimation: measurement methodologies
and applications‖ Presentation
• GÉANT2 Performance Enhancement and Response Team (PERT) User Guide
and Best Practice Guide
Anomaly Detection Techniques
• Exponential weighted moving average (EWMA)
• Holt-Winters forecasting algorithm
• Plateau Algorithm
• Principal Component Analysis (PCA)
• ARIMA-based Box-Jenkins forecasting models
• Signal analysis techniques, e.g. the wavelet
• EWMA, a forecasting based techniques
• EWMA predicts the next value in a given
timeseries based on recent history.
• Specifically, if zt denotes the traffic at time t,
then the EWMA prediction for time t+1 is
given by ^zt+1 as:
• Anomalies can then be quantified by taking
the difference between the forecasted and the
actual value, i.e.,
Holt-Winters (HW) Algorithm
• Service network variable time series exhibit the
– A trend over time (i.e., a gradual increase in application
daemon requests over a two month period due to
increased subscriber load).
– A seasonal trend or cycle (i.e., every day bytes per
second increases in the morning hours, peaks in the
afternoon and declines late at night).
– Seasonal variability. (i.e., application requests fluctuate
wildly minute by minute during the peak hours of 4-8
pm, but at 1 am application requests hardly vary at all).
Holt-Winters (HW) Algorithm
• The Holt-Winters (HW) algorithm uses a triple
Exponential Weighted Moving Average (EWMA)
approximation to characterize the time series behavior
as a superposition of three components: a baseline, a
linear trend and a seasonal effect.
• It is used to detect a significant change in the base RTT
of a path.
• Base RTT means a change in most of the RTT values
measured, new values are based on the new, higher level.
• The plateau detector is based around two windows,
– the summary window and
– the sample window.
• Summary window is used to characterize the normal
state of the path which is based on the previous samples.
• The number of samples stored in the summary window is
a user settable parameter.
• Samples are first stored in summary window,
trigger selector decides if a sample forms part of a
• In this case, the sample is stored in the sample
• This decision is made on the value of the sample,
the mean and variance of the summary window,
and on parameters that the user has chosen.
– RTT > samplemean + (samplevariance* sensitivty)
• A trigger is generated when a number of samples
(a user settable parameter) meet the above
equation are received.
• Each time a sample, that fulfils the above
equation, is received a counter is incremented.
• For a non-zero counter when a sample that does
not meet the above equation is received the
counter is decremented.
• If the counter reaches the user parameter trigger
duration a trigger is generated. If the counter
returns to zero the trigger is aborted.
• In either case, data in the sample window is passed
to the summary window.
Principal Component Analysis (PCA)
• PCA is a coordinate transformation method that maps a given set of
data points onto new axes. These axes are called the principal axes or
• When working with zero-mean
• data, each principal component has the property that it points in
• the direction of maximum variance remaining in the data, given
• the variance already accounted for in the preceding components.
• As such, the first principal component captures the variance of the
• data to the greatest degree possible on a single axis. The next principal
• components then each capture the maximum variance among
• the remaining orthogonal directions. Thus, the principal axes are
• ordered by the amount of data variance that they capture.
• McGregor A.J. and Braun H-W,
‖Automated Event Detection for Active
Measurement Systems‖, Proceedings of
PAM2001, Amsterdam, Netherlands,
Preparation of Canonical dataset
• Fetch data from monitoring nodes
(Pathchirp, Thrulay, Ping, Traceroute) -
• Study/Analyze Data - Adnan, Ali
• Run Run-period - Adnan
• Analyze Alerts - Adnan, Ali
• Design a format - Adnan, Ali
• Rearrange data according to new format -
Topology to be used
Monitoring Monitoring Monitoring Monitoring
Node Node Node Node
Monitored Nodes = 21
Data being collected using this
• Pinger: Round trip Time (Min, Max, Avg)
• Pathchirp: Throughput (Avg, Max, Min, Stdev)
• Thrulay: Throughput (Avg, Max, Min, Stdev)
RTT (Avg, Max, Min, Stdev)
• trace route summaries
• Host Monitoring data (e.g. data from LISA,
Nagios, Ganglia) --- candidate
Format of canonical data set
• Not sure, needs some more investigation of
available data from multiple sources.
• Data is fetched from 1st Jan, 06 to 28th Feb