White Paper - Bill Alderson
Network and Application Performance
Across the Theatre of War
Abstract. Analyzing communications networks and optimizing their performance is a complex and
difficult task, even at the best of times. When this work takes place in multiple
countries with extreme topography, communication links and networks of unpredictable and
varying quality, climatic extremes, and a backdrop of the longest war in US history, the
challenge becomes even more daunting. In this paper, we discuss the forensic approach to the
performance diagnosis of a mission-critical application deployed by US Central Command in
the Middle East.
BACKGROUND
US Military biometric and other application owners aided their warfighter users by applying
state-of-the-art root cause diagnosis and performance optimization tools.
We were asked by military leadership to diagnose the root causes of issues affecting
application performance and, thereby, the military mission. The US Government enlisted the
expertise of our Critical Problem Resolution team rather than taking the time to
procure new tools, deploy them, and train warfighters to address the problem.
Procurement, deployment and training, plus the time needed to gain experience, is a process that
could take five years, so leadership used our expertise to diagnose the high-visibility Biometrics
application. The application is capable of sifting through millions of biometric identities to
find the exact individuals involved in IED placement or other insurgent activities. The
military collects biometric enrollments of fingerprints and iris scans during contact with the
public at checkpoints and during military operations.
This biometrics capability is credited with assisting the surge success in both Iraq and
Afghanistan.
The Catalyst—Rapid Critical Problem Resolution (CPR)
Our rapid response deployable team members began learning the components and
characteristics of the Biometrics application, working with software developers, operations
organizations, and various entities around the world through which the application’s network
packets and transactions pass.
White Paper: NAPTW_v1_092611 page 1 of 7
We traced the path packets follow from the warfighters who collect biometric identities in the
field to the servers, and we analyzed the steps necessary to match a biometric enrollment to an
individual insurgent. Watch lists enable thousands of warfighters to distinguish insurgents
from ordinary citizens.
Our CPR analysts identified the application architecture and its components, including the
Transmission Control Protocol (TCP) port numbers and the associated communicating processes
the application uses, so that the application could be identified in operation across the
network. TCP ports provide one signature of the Biometric payload communicating across the
network as a transaction. Understanding the processes and application flow prepared our
analysis teams to analyze biometric transactions on the network during short onsite visits
in theatre.
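As a minimal sketch of how a TCP port signature can pick application traffic out of a capture, the following assumes packets have already been decoded into simple records. The port numbers, addresses and the `Packet` record are hypothetical; the paper does not state which ports the Biometrics application uses.

```python
from collections import namedtuple

# Hypothetical port signature; the real application ports are not given in the paper.
BIOMETRIC_PORTS = {8443, 9001}

# Hypothetical decoded-packet record (length in bytes).
Packet = namedtuple("Packet", "src_ip src_port dst_ip dst_port length")

def matches_signature(pkt, ports=BIOMETRIC_PORTS):
    """Return True if either endpoint of the packet uses a signature port."""
    return pkt.src_port in ports or pkt.dst_port in ports

def application_bytes(packets, ports=BIOMETRIC_PORTS):
    """Sum the bytes carried by packets that match the port signature."""
    return sum(p.length for p in packets if matches_signature(p, ports))

sample = [
    Packet("10.0.0.5", 51000, "10.1.1.9", 8443, 1200),  # client to server
    Packet("10.1.1.9", 8443, "10.0.0.5", 51000, 400),   # server to client
    Packet("10.0.0.5", 51001, "10.2.2.2", 53, 80),      # unrelated traffic
]
print(application_bytes(sample))  # 1600
```

A port match is only one signature; in practice it would be combined with knowledge of the communicating processes, as described above.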
Network analyzers and monitoring tools need to be placed in the actual path to collect the
packet captures. Network diagrams (often dated, and not particularly complete or
accurate) were used to identify application component locations so that packet captures between
servers could be performed. Packet captures enable the subsequent microanalysis of the
sessions between servers: this is where knowledge of protocols and skill are required to
understand both network and application behavior. Some mistakenly believe that
ownership of analyzers or expensive tools, used by trained interface operators, will ferret out
the problems. Not so! Rather, problems are most often solved by extensive
application and protocol knowledge combined with years of experience analyzing critical
problems; this experience is found only among those who practice the art and science of
network and application forensics. Analysis training and capability are on the rise in the
military environment but have yet to become a focus in the military schoolhouse.
An application transaction typically comprises thousands of packets, and sometimes
involves more than one communications session. Transactions are analyzed to identify
performance limitations from errors or other issues in the environment. Protocols have
automatic recovery behavior, separate from the application’s behavior, that is designed to overcome
the high latency inherent in long distance terrestrial or satellite systems and to retransmit lost or
severely delayed packets. In other words, the underlying protocols try to solve problems by
retransmitting packets at lower layers to insulate the application from the task of error
recovery. As these lower layer functions attempted to insulate the application, we found severe
data redundancy consuming link bandwidth because of the behavior of the automatic recovery
algorithms. Bandwidth consumption doubled as a result, and in a
bandwidth constrained environment that was an expensive penalty.
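The bandwidth penalty of retransmission can be estimated from a capture by comparing the bytes sent on the wire with the unique bytes delivered. The sketch below is a simplification, not the method used in the engagement: it assumes one direction of one TCP session, and it treats a segment as a retransmission only when the same sequence number and length appear again (partial overlaps are ignored). The sample numbers are illustrative.

```python
def retransmission_overhead(segments):
    """segments: iterable of (sequence_number, payload_length) in capture order.

    Returns (unique_bytes, total_bytes, ratio), where ratio is the bytes sent
    on the wire divided by the unique payload bytes.
    """
    seen = set()
    unique_bytes = 0
    total_bytes = 0
    for seq, length in segments:
        total_bytes += length
        if (seq, length) not in seen:  # first time this segment is seen
            seen.add((seq, length))
            unique_bytes += length
    ratio = total_bytes / unique_bytes if unique_bytes else 0.0
    return unique_bytes, total_bytes, ratio

# Illustrative capture: the first two segments are each sent twice.
capture = [(1, 1000), (1001, 1000), (1, 1000), (1001, 1000), (2001, 500)]
print(retransmission_overhead(capture))  # (2500, 4500, 1.8)
```

A ratio of 2.0 would correspond to the doubling of bandwidth consumption described above.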
Root Causes Identified from the Limited “Slice” Analysis
The initial analysis amounted to a few weeks of detailed packet level analysis. Both the
environment and application were found to contribute to the degraded performance. Root
causes were explained and mitigation recommendations provided. Environmental issues in the
limited area analyzed were resolved on site. Analysis results were shared with the communities
of interest, and additional analysis was performed with the developers in their simulation labs.
Findings included:
· Route change induced packet loss.
· Server slow response time.
· Intermittent low throughput and high latency.
· Component issues, QoS incongruities, changing performance across links.
· Many network and application issues requiring packet level capture.
Expanding to a more comprehensive analysis—a full End-to-End Analysis—the same slice
analysis performed 24/7 across the entire Theatre.
We next responded to the request to comprehensively monitor and analyze Biometric
applications across the entire theatre, including CONUS locations through which biometric
enrollments are processed and stored and watch lists are [...] team of analysts applying a portfolio of
other industry tools to diagnose root cause across the Theatre of War. This effort required
over a year to complete.
DATASHEET
USCENTCOM’s tiered communications architecture connects operating bases and tactical
brigades to other bases in the Middle Eastern war AOR (Area of Responsibility), Europe,
Africa and CONUS locations via DoD’s DISA GIG infrastructure. USCENTCOM ties
monitoring systems to their tiered architecture providing a centralized worldwide gateway for
monitoring data consolidation and dissemination.
A Portfolio of Tools Deployed to Instrument the Theatre of War.
The idea was to instrument the war, moving toward the
instrumentation capability of the Apollo 13 lunar module, so that thousands of CONUS experts
could help manage the environment remotely. The military already pays higher skilled people
in CONUS, including experienced military personnel on rotation, but they are unable to assist in the
meaningful and successful ways the Apollo 13 ground personnel could because of a lack
of effective instrumentation. The future of Network Centric Warfare depends upon our
ability to execute network management in the theatre of war without the massive deployment
of contractors into the field. Imagine sending thousands of network management personnel to
the moon on our next trip to manage the lunar module network.
Each monitoring system performs local data reduction and analysis. This process ensures only
relevant resultant data is sent to the databases from which local user reports are queried and
delivered in user-facing web pages. This strategy provides world-class capability locally,
with each system having some local metric reporting, but the real power is the centralized
database providing theatre-wide statistics on applications and network components in one or
all areas.
Six different monitoring capabilities provide:
1. Route Analytics to identify route instability, and conditions that may cause poor
performance across slower secondary circuits or end-to-end path MTU negotiation
problems.
2. Server Response Time, generating baselines and reporting deviation with events
identifying packet capture details required to definitively diagnose problems.
3. NetFlow-based rate and volume of application traffic, indicating how much bandwidth a
client or group of clients is consuming and giving hints about conditions limiting users’
application performance.
4. SNMP device status monitoring, extending traditional SNMP to include CBQoS
and ipSLA tests, which generate small payloads between router ports, synthetically
emulating application traffic and measuring layer three performance.
5. Packet capture at key hubs. Items 1-4 address the general problem and enable rapid
access to the problem packets needed to diagnose the definitive root cause of more
detail-oriented problems.
6. All these systems provide information for potential automated system documentation.
Route Analytics
Many problems are related to routing and switching changes. One way to improve reliability
is to use redundant equipment and paths. To use multiple paths through
multiple components, spanning tree (data-link, layer two) and routing (network, layer three)
protocols dynamically choose the optimal or preferred path based on a variety of cost metrics and
settings. When conditions necessitate a path change, the change affects many other paths.
When conditions change rapidly and in multiple places, the result can be flapping (frequent
alternating changes) or chaos that prevents stability for any period of time.
Convergence is the term used to describe quiescence, a lack of changes for a period of time.
Some networks rarely converge because of changes or problems causing continual changes. Flapping
at both data-link layer two and network layer three together makes diagnosing problems difficult, as the two
layers are not managed using the same techniques or tools. Changing paths pose another
challenge for path MTU (maximum transmission unit, the data link layer payload size): if the path
changes to one with a smaller MTU, a new size must be negotiated, adjusting the sending station’s
TCP MSS (maximum segment size).
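Flapping as described above can be flagged from a time-ordered log of route changes by counting changes per prefix inside a sliding window. This is a minimal sketch, not the Route Analytics product's algorithm; the event format, the 60-second window and the threshold of four changes are assumptions chosen for illustration.

```python
from collections import defaultdict, deque

def find_flapping(events, window_s=60.0, threshold=4):
    """events: iterable of (timestamp_s, prefix) route changes, sorted by time.

    Returns the set of prefixes that changed at least `threshold` times
    within some window of `window_s` seconds.
    """
    recent = defaultdict(deque)  # prefix -> timestamps still inside the window
    flapping = set()
    for ts, prefix in events:
        q = recent[prefix]
        q.append(ts)
        while ts - q[0] > window_s:  # drop changes older than the window
            q.popleft()
        if len(q) >= threshold:
            flapping.add(prefix)
    return flapping

# Hypothetical route-change log.
events = [
    (0, "10.1.0.0/16"), (10, "10.1.0.0/16"), (20, "10.1.0.0/16"),
    (25, "10.1.0.0/16"), (30, "10.2.0.0/16"), (400, "10.2.0.0/16"),
]
print(find_flapping(events))  # {'10.1.0.0/16'}
```

A prefix that never leaves the flapping set over a long observation period is one that is not converging.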
Problems are compounded when firewalls, load balancers, TCP accelerators, proxies and
other “our own man in the middle” (MITM) devices must adjust along the path. Matters are further
complicated when firewalls block the ICMP Type 3 Code 4 packets used for path MTU discovery,
preventing IP from negotiating the path MTU. One can imagine what happens when flapping paths have varying
MTUs or permissions to negotiate path MTU. These conditions are often present because
large networks with many controlling silos are inconsistent in applying, or collaborating on,
settings that work consistently from end to end.
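The arithmetic behind the path MTU problem can be sketched as follows. The path MTU is the smallest link MTU along the path, and for IPv4 and TCP headers without options the MSS is the MTU minus 40 bytes. The link MTU values are hypothetical, and the black-hole check assumes the sender sets the Don't Fragment bit, as path MTU discovery requires.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options

def path_mtu(link_mtus):
    """The path MTU is the smallest MTU of any link on the path."""
    return min(link_mtus)

def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES

def is_black_holed(sender_mss, new_path_mtu, icmp_allowed):
    """True when full-size segments no longer fit the new path and the ICMP
    Type 3 Code 4 message that would tell the sender so is blocked."""
    packet_size = sender_mss + IPV4_HEADER_BYTES + TCP_HEADER_BYTES
    return packet_size > new_path_mtu and not icmp_allowed

primary = [1500, 1500, 1500]    # hypothetical primary path
secondary = [1500, 1400, 1500]  # hypothetical backup path with a smaller link

mss = mss_for_mtu(path_mtu(primary))
print(mss)                                                      # 1460
print(mss_for_mtu(path_mtu(secondary)))                         # 1360
print(is_black_holed(mss, path_mtu(secondary), icmp_allowed=False))  # True
```

In the black-holed case the sender keeps retransmitting 1500-byte packets that the smaller link silently drops, which shows up in a capture as repeated retransmissions of full-size segments while small packets pass.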
As routes change, packets already in transit are momentarily caught without any place to go
and are sent along the default route from the point at which they were orphaned.
This causes packets to loop until the time-to-live (TTL) expires and they are discarded by the
router that sees the TTL reach zero. Packets orphaned in this manner are considered lost by the
transport (TCP) or application layer protocol in charge, which retransmits the data after its dynamic
timeout value is reached. After a number of TCP transmissions (typically 6 including
the first), each waiting twice as long as the previous one, the session is considered closed.
If the application layer opens a new TCP session and retries (typically not more than 3
times, if at all), it goes through another TCP attempt cycle, and if the attempts remain incomplete
the application too will abort and the user process will fail. In a large environment like the
military, where many routing protocols are used in a decentralized manner, Route Analytics is
invaluable for addressing the root cause of packet loss and ineffective routing.
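To give a feel for how long a user waits before such a failure, the doubling timeouts can be summed. The sketch below uses the figures quoted above (6 transmissions per session, up to 3 application retries) and assumes an initial retransmission timeout of 1 second; real stacks derive the timeout from measured round-trip time, so the absolute numbers are illustrative only, and connection setup time is ignored.

```python
def backoff_schedule(initial_rto_s=1.0, transmissions=6):
    """Timeout waited after each transmission, doubling every time."""
    return [initial_rto_s * 2 ** i for i in range(transmissions)]

def time_to_abort(initial_rto_s=1.0, transmissions=6, app_retries=3):
    """Worst-case seconds until the application gives up: one original
    session plus `app_retries` new sessions, each running the full cycle."""
    per_session = sum(backoff_schedule(initial_rto_s, transmissions))
    return per_session * (1 + app_retries)

print(backoff_schedule())  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
print(time_to_abort())     # 252.0
```

With a longer initial timeout, as would be expected on a high-latency satellite path, the total grows proportionally.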
Server Response Times
When a group of users experiences slow response from a server, it may be caused by a) the
network specific to their location, b) the server or application they are using,
or c) increasing latency across the network. Groups of servers are monitored by one local
collection component and the data is intelligently reduced before being sent to the central
database console.
The Response Time component delineates responsibility for degraded performance into three
areas:
a) server response time,
b) data transfer time, an indication of bandwidth starvation when increasing from
baseline levels,
c) network latency, with the components graphed for rapid identification of the component
responsible.
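The baseline-and-deviation idea behind this breakdown can be sketched in a few lines. The component names follow the list above; the sample values, the baseline history and the three-standard-deviation threshold are assumptions for illustration, not the monitoring product's actual rule.

```python
from statistics import mean, stdev

def dominant_component(sample):
    """sample: dict of component name -> seconds. Returns the largest contributor."""
    return max(sample, key=sample.get)

def deviates(baseline, value, k=3.0):
    """True when value exceeds the baseline mean by more than k standard deviations."""
    return value > mean(baseline) + k * stdev(baseline)

# Hypothetical baseline of server response times in seconds.
baseline_server = [0.20, 0.22, 0.19, 0.21, 0.20]

# Hypothetical measurement taken during a reported slowdown.
slow = {"server": 0.90, "transfer": 0.30, "network": 0.25}

print(dominant_component(slow))                   # server
print(deviates(baseline_server, slow["server"]))  # True
print(deviates(baseline_server, 0.21))            # False
```

A deviation event of this kind is what would trigger the investigations described in this section.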
Figure 2: Response Time Components Chart
Legend: Red - Server Response Time; Yellow - Data Transfer Time (bandwidth); Green - Network
latency; Blue - Retransmission time
If a slowdown or problem occurs, a variety of investigations can be carried out, including:
• Application Rate & Volume corroboration from NetFlow charts.
• SNMP ipSLA test corroboration between nearby router ports.
• SNMP data collected on internetwork components in the path at the same time
the performance degradation occurred.
• Packet captures triggered by response time monitoring.
An important element of response time monitoring is the set of investigations that allow
definitive diagnosis and subsequent problem mitigation. Often the response time data
itself definitively identifies the root cause. When additional data is required, a triggered
investigation is needed to automatically capture packets, or to identify the range of packets in a
long-term packet capture cache, so that additional packet trace application
analysis can be performed. When problems happened in diverse locations and at diverse times, the
automatic packet capture investigation proved useful for bounding the packets, providing the IP address
and TCP application session data that make analysis productive at determining root cause.
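Bounding the packets of interest in a long-term capture cache amounts to filtering by a time window around the event and by the session's endpoints. The sketch below assumes the cache is available as decoded records; the `Packet` record, the addresses, the port and the 30-second margin are hypothetical.

```python
from collections import namedtuple

# Hypothetical decoded-packet record (ts in seconds).
Packet = namedtuple("Packet", "ts src_ip src_port dst_ip dst_port")

def bound_packets(cache, event_ts, client_ip, server_ip, server_port, margin_s=30.0):
    """Return cached packets within margin_s of the event that belong to the
    conversation between client_ip and server_ip on server_port."""
    lo, hi = event_ts - margin_s, event_ts + margin_s
    endpoints = {client_ip, server_ip}
    return [
        p for p in cache
        if lo <= p.ts <= hi
        and {p.src_ip, p.dst_ip} == endpoints          # either direction
        and server_port in (p.src_port, p.dst_port)
    ]

cache = [
    Packet(100.0, "10.0.0.5", 51000, "10.1.1.9", 8443),
    Packet(105.0, "10.1.1.9", 8443, "10.0.0.5", 51000),
    Packet(105.5, "10.0.0.7", 51010, "10.1.1.9", 8443),  # different client
    Packet(500.0, "10.0.0.5", 51000, "10.1.1.9", 8443),  # outside the window
]
hits = bound_packets(cache, 110.0, "10.0.0.5", "10.1.1.9", 8443)
print(len(hits))  # 2
```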
Retransmissions across high latency satellite circuits take so long that TCP stacks often
fail to recover gracefully. Compounding the problem are flapping routes that exacerbate
the condition. Our own Man In The Middle (MITM) devices such as proxies, bandwidth
optimizers, load balancers, firewalls and server and client TCP/IP stack offload engines
create interactions that require advanced analysis to diagnose.
The response time component provides both the user performance degradation points and
the detailed packet traces required for definitive diagnosis and remediation.
NetFlow Application User Rate and Volume Data
NetFlow provided by routers builds an intrinsic monitoring system without the need for probes
at each location in the network. Instead the routers themselves are used as the probes. The
resultant time-synced database of NetFlow data is used to correlate slow users, groups, servers
and applications. The power is enhanced by tightly integrated solutions that allow queries of
response time, NetFlow and SNMP / ipSLA data so that triangulation and corroboration of
problems can be displayed together. When the Response Time system indicates a slowdown, it
performs a path discovery at the time the problem occurred. This capability enables the
path to be examined at the time the user experienced a problem, bringing together
the notice of the user problem and its correlation with a component in the infrastructure.
Viewing NetFlow data at this same time at various locations across the network allows
pinpointing of the performance problem.
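Rate and volume per user can be derived from flow records by summing bytes per source address over a reporting interval. This is a minimal sketch: real NetFlow records carry many more fields, and the simplified `(src_ip, byte_count)` tuples, the addresses and the 300-second interval here are assumptions.

```python
from collections import defaultdict

def top_talkers(flows, interval_s):
    """flows: iterable of (src_ip, byte_count) records from one interval.

    Returns a list of (src_ip, total_bytes, average_bits_per_second),
    sorted by volume, largest first.
    """
    volume = defaultdict(int)
    for src_ip, byte_count in flows:
        volume[src_ip] += byte_count
    ranked = sorted(volume.items(), key=lambda kv: kv[1], reverse=True)
    return [(ip, total, total * 8 / interval_s) for ip, total in ranked]

# Hypothetical flow records exported during a 300-second interval.
flows = [("10.0.0.5", 3_000_000), ("10.0.0.7", 500_000), ("10.0.0.5", 1_500_000)]
report = top_talkers(flows, interval_s=300)
print(report[0])  # ('10.0.0.5', 4500000, 120000.0)
```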
Traditional SNMP with enhancements
SNMP queries network components and stores the resultant data in a database at appropriate intervals.
When tightly integrated solutions are brought together, all five data
points can be exploited in a time-coordinated manner:
1. Route Analytics
2. NetFlow
3. Response-time baseline / investigations
4. Extended SNMP / ipSLA / CBQoS
5. Packet capture details
The result has allowed the US Military to address problems that happen in diverse areas at all
hours in a time-coordinated manner to pinpoint degraded user performance across multiple
theatres of war.
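Time-coordinated use of the data sources reduces, at its simplest, to selecting the events from every source that fall close to the moment of a user slowdown. The sketch below assumes all sources share a synchronized clock; the event tuples, source names and 60-second window are hypothetical.

```python
def correlate(slowdown_ts, events, window_s=60.0):
    """events: iterable of (timestamp_s, source, description).

    Returns a dict mapping each source to the events it reported within
    window_s seconds of the slowdown.
    """
    grouped = {}
    for ts, source, description in events:
        if abs(ts - slowdown_ts) <= window_s:
            grouped.setdefault(source, []).append((ts, description))
    return grouped

# Hypothetical events from three of the five data sources.
events = [
    (1000.0, "route", "prefix 10.1.0.0/16 withdrawn"),
    (1010.0, "snmp", "interface errors rising"),
    (5000.0, "netflow", "volume spike"),
]
related = correlate(1020.0, events)
print(sorted(related))  # ['route', 'snmp']
```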
Conclusion
Many organizations are using end users to monitor application performance. Not only are end
users the most expensive monitors due to lost productive time experienced in poor
performance, but they are also subjective in their analysis and tire of reporting issues. Once a
consistent, scientific automated solution is in place its cost is quickly recouped and the
feedback loop is complete allowing leadership and technologists to work together effectively to
continually improve user performance. This reduces lost productive time and enables resources
to be allocated in pinpoint fashion to improve performance.
The ultimate aim is to provide a network and application performance feedback loop that enables
leadership to continually optimize application response time. The diagram depicts the participation of
leadership, technologists and the automated solutions designed to meet the mission objective.
For additional reading authored by the presenter consider: “People Practices and Paradigms”, a
Network Management White Paper 1995.
REFERENCES
[1] B. Alderson, “People Practices and Paradigms”, a Network Management White Paper 1995
Bill Alderson practices global network and application critical problem resolution, including assisting the Pentagon
communications recovery immediately following 9/11, oversight of the re-architecture of Pentagon
communications following 9/11, and network and application instrumentation and analysis of the Iraq and Afghanistan
theatres.