
Network and Application Performance Across the Theatre of War


Shared by: yunyi
posted:
11/24/2011
language:
English
pages:
7









White Paper - Bill Alderson


 









Network and Application Performance Across the Theatre of War



Abstract. Communications network analysis and performance optimization is a complex and

difficult task, even at the best of times. However, when this process takes place in multiple

countries with extreme topography, communication links and networks of unpredictable and

varying quality, climatic extremes, and with a backdrop of the longest war in US history, the

challenge becomes even more daunting. In this paper, we discuss the forensic approach to the

performance diagnosis of a mission-critical application deployed by US Central Command in

the Middle East.





BACKGROUND



US Military biometric and other application owners aided their warfighter users by applying

state-of-the-art root cause diagnosis and performance optimization tools.



We were asked by military leadership to diagnose the root causes of issues affecting

application performance and, thereby, the military mission. The US Government enlisted the

expertise offered by our Critical Problem Resolution team rather than taking the time to

procure new tools, deploy them, and train warfighters to address the problem.



Procurement, deployment and training, plus time to gain experience, is a process that could

take five years, so leadership used our expertise to diagnose the high-visibility Biometrics

application. The application is capable of sifting through millions of Biometric identities to

find the exact individuals involved in IED placement or other insurgent activities. The

military uses biometric enrollments of fingerprints and iris scans during contact with the

public at checkpoints and during military operations.



This biometrics capability is credited with assisting the surge success in both Iraq and

Afghanistan.





The Catalyst—Rapid Critical Problem Resolution (CPR)



Our rapid response deployable team members began learning the components and

characteristics of the Biometrics application, working with software developers, operations

organizations, and various entities around the world through which the application’s network

packets and transactions pass.







White Paper: NAPTW_v1_092611 page 1 of 7






We traced the path packets follow from the warfighters that collect biometric identities in the

field to the servers, and we analyzed the steps necessary to match a biometrics enrollment to an

individual insurgent. Watch lists enable thousands of warfighters to distinguish insurgents

from ordinary citizens.



Our CPR analysts identified the application architecture and its components, including the Transmission Control Protocol (TCP) port numbers and the associated communicating processes employed by the application, so that the application could be identified in operation across the network. TCP ports provide one signature of the Biometric payload communicating across the network as a transaction. Understanding the processes and the application flow prepared our analysis teams to analyze biometric transactions on the network during short onsite stays in theatre.
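As an illustration of using TCP ports as a signature, the following is a minimal sketch that picks the application's flows out of a list of observed connections. The port numbers and the flow tuple layout are hypothetical placeholders chosen for the example; they are not the ports or data format of the actual Biometrics application.

```python
from collections import Counter

# Hypothetical port signature: placeholder values, not the real application's ports.
BIOMETRIC_PORTS = {8443, 9001}


def is_biometric_flow(src_port, dst_port, ports=BIOMETRIC_PORTS):
    """A TCP flow matches the signature if either endpoint uses a known port."""
    return src_port in ports or dst_port in ports


def count_matching_flows(flows, ports=BIOMETRIC_PORTS):
    """Count matching flows per (src_ip, dst_ip) pair.

    Each flow is a tuple: (src_ip, src_port, dst_ip, dst_port).
    """
    matches = Counter()
    for src_ip, src_port, dst_ip, dst_port in flows:
        if is_biometric_flow(src_port, dst_port, ports):
            matches[(src_ip, dst_ip)] += 1
    return matches
```

A port match is only one signature; in practice it would be combined with knowledge of the communicating processes before a flow is attributed to the application.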



Network analyzers and monitoring tools need to be placed in the actual path to collect the packet captures. Network diagrams, which were often dated and not particularly complete or accurate, were used to identify application component locations so that packet captures between servers could be performed. Packet captures enable the subsequent microanalysis of the sessions between servers; this is where knowledge of protocols and skill are required to understand both network and application behavior. Some mistakenly believe that ownership of analyzers or expensive tools, used by trained interface operators, will ferret out the problems. Not so! Rather, problems are most often solved by extensive application and protocol knowledge combined with years of experience analyzing critical problems; this experience is found only among those who practice the art and science of network and application forensics. Analysis training and capability are on the rise in the military environment but have yet to become a focus in the military schoolhouse.



An application transaction typically comprises thousands of packets and sometimes involves more than one communications session. Transactions are analyzed to identify performance limitations caused by errors or other issues in the environment. Protocols have automatic recovery behavior, separate from the application's own behavior, that is designed to overcome the high latency inherent in long-distance terrestrial or satellite systems and to retransmit lost or severely delayed packets. In other words, the underlying protocols try to solve problems by retransmitting packets at lower layers, insulating the application from the task of error recovery. As these lower-layer functions attempted to insulate the application, we found severe data redundancy consuming link bandwidth as a result of the behavior of the automatic recovery algorithms. We found bandwidth consumption doubled by these issues, and in a bandwidth-constrained environment that was an expensive penalty.
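To make the bandwidth penalty concrete, the sketch below estimates how much of the traffic in one direction of a TCP session is redundant. It is an illustrative calculation, not the tooling used in theatre: it assumes the capture has already been reduced to (sequence number, payload length) pairs, and it ignores sequence-number wraparound.

```python
def retransmission_overhead(segments):
    """Return (unique_bytes, total_bytes, overhead_ratio) for one direction.

    segments: iterable of (sequence_number, payload_length) as seen on the wire.
    A byte counts as redundant if its sequence offset was already observed.
    Tracking every byte offset is memory hungry; fine for a small sketch only.
    """
    seen = set()  # byte offsets already observed
    total = 0
    for seq, length in segments:
        total += length
        seen.update(range(seq, seq + length))
    unique = len(seen)
    ratio = total / unique if unique else 0.0
    return unique, total, ratio
```

A ratio of 2.0 corresponds to the doubling described above: every useful byte crossed the link twice.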



Root Causes Identified from the Limited “Slice” Analysis



The initial analysis amounted to a few weeks of detailed packet level analysis. Both the

environment and application were found to contribute to the degraded performance. Root

causes were explained and mitigation recommendations provided. Environmental issues in the

limited area analyzed were resolved on site. Analysis results were shared with the communities

of interest, and additional analysis was performed with the developers in their simulation labs.



Findings included:

· Route change induced packet loss.
· Server slow response time.
· Intermittent low throughput and high latency.
· Component issues, QoS incongruities, changing performance across links.
· Many network and application issues requiring packet level capture.



Expanding to a more comprehensive analysis: a full End-to-End Analysis, the same slice analysis performed 24/7 across the entire Theatre.



We next responded to the request to comprehensively monitor and analyze Biometric

applications across the entire theatre including CONUS locations through which biometric

enrollments are processed, stored and watch lists are team of analysts applying a portfolio of

other industry tools to diagnose root cause across the Theatre of War. This effort required

over a year to complete.










USCENTCOM’s tiered communications architecture connects operating bases and tactical

brigades to other bases in the Middle Eastern war AOR (Area of Responsibility), Europe,

Africa and CONUS locations via DoD’s DISA GIG infrastructure. USCENTCOM ties

monitoring systems to their tiered architecture providing a centralized worldwide gateway for

monitoring data consolidation and dissemination.



A Portfolio of Tools Deployed to Instrument the Theatre of War.



The idea was to instrument the war, moving toward the instrumentation capability of the Apollo 13 lunar module, so that thousands of CONUS experts could help manage the environment remotely. The military already pays highly skilled people in CONUS, including experienced military personnel on rotation, but they are unable to assist in the meaningful and successful ways the Apollo 13 ground personnel did because effective instrumentation is lacking. The future of Network Centric Warfare depends upon our ability to execute network management in the theatre of war without the massive deployment of contractors into the field. Imagine sending thousands of network management personnel to the moon on our next trip to manage the lunar module network.



Each monitoring system performs local data reduction and analysis. This process ensures that only relevant resultant data is sent to the databases from which local user reports are queried and delivered in user-facing web pages. This strategy provides world-class capability locally, with each system having some local metric reporting, but the real power is the centralized database providing theatre-wide statistics on applications and network components in one or all areas.
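The local data reduction described above can be pictured with a small sketch: raw measurements are collapsed into one summary row per server per interval, and only those rows would travel to the central database. The sample layout, field names and 60-second interval are assumptions made for illustration and do not describe the deployed products.

```python
from statistics import mean


def reduce_samples(samples, interval_s=60):
    """Collapse raw (timestamp_s, server, response_ms) samples into one
    summary row per server per interval."""
    buckets = {}
    for ts, server, response_ms in samples:
        # Align each sample to the start of its interval.
        key = (int(ts // interval_s) * interval_s, server)
        buckets.setdefault(key, []).append(response_ms)
    return [
        {
            "interval_start": start,
            "server": server,
            "count": len(values),
            "mean_ms": mean(values),
            "max_ms": max(values),
        }
        for (start, server), values in sorted(buckets.items())
    ]
```

Sending a handful of summary rows instead of every sample is what keeps monitoring traffic small on bandwidth-constrained links.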



Six different monitoring capabilities provide:

1. Route Analytics to identify route instability and conditions that may cause poor performance across slower secondary circuits, or end-to-end path MTU negotiation problems.
2. Server Response Time monitoring, which generates baselines and reports deviations, with events identifying the packet capture details required to definitively diagnose problems.
3. NetFlow-based rate and volume of application traffic, which indicates how much bandwidth a client or group of clients is consuming and gives hints about conditions limiting users' application performance.
4. SNMP device status monitoring, extending traditional SNMP to include CBQoS and ipSLA tests, which generate small payloads between router ports, synthetically emulating application traffic and measuring layer three performance.
5. Packet capture at key hubs. Items 1–4 address the general problem, empowering rapid access to the problem packets to diagnose the definitive root cause of more detail-oriented problems.
6. Information from all of these systems that can potentially feed automated system documentation.
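The baseline-and-deviation reporting mentioned in item 2 can be sketched as follows. The rule used here (flag a value more than k standard deviations above the baseline mean) is one common convention chosen for illustration; the source does not state which rule the deployed system applies.

```python
from statistics import mean, stdev


def deviates(baseline, value, k=3.0):
    """Return True if value exceeds the baseline mean by more than k
    sample standard deviations. baseline needs at least two samples."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return value > mu + k * sigma
```

A deviation event of this kind is what would trigger the closer investigation, such as a packet capture, described later in the paper.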



Route Analytics



Many problems are related to routing and switching changes. One way to improve reliability is to use redundant equipment and paths. To use multiple paths through multiple components, spanning tree (data link, layer two) and routing (network, layer three) protocols dynamically choose the optimal or preferred path based on a variety of cost metrics and settings. When conditions necessitate a path change, the change affects many other paths. When conditions change rapidly and in multiple places, the result can be flapping (frequent alternating changes) or chaos that prevents stability for any period of time.
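One simple way to express flapping is as a count of route changes per prefix inside a sliding time window. The sketch below assumes route-change events have already been collected as (timestamp, prefix) pairs; the 60-second window and the threshold of four changes are arbitrary illustrative values, not figures from the engagement.

```python
def find_flapping(changes, window_s=60, threshold=4):
    """Return the set of prefixes with at least `threshold` route changes
    inside any window of `window_s` seconds.

    changes: list of (timestamp_s, prefix) route-change events.
    """
    by_prefix = {}
    for ts, prefix in changes:
        by_prefix.setdefault(prefix, []).append(ts)

    flapping = set()
    for prefix, times in by_prefix.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Shrink the window from the left until it spans <= window_s.
            while times[end] - times[start] > window_s:
                start += 1
            if end - start + 1 >= threshold:
                flapping.add(prefix)
                break
    return flapping
```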



Convergence is the term used to describe a quiescent state, a lack of changes for a period of time. Some networks never converge because of changes or problems that cause continual changes. Flapping at both the data link layer (layer two) and the network layer (layer three) together makes diagnosing problems difficult, as the two layers are not managed using the same techniques or tools. Changing paths pose another challenge for path MTU (maximum transmission unit, the data link layer payload size): if traffic moves to a path with a smaller MTU, a new size has to be negotiated, adjusting the sending station's TCP MSS (maximum segment size).
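The relationship between path MTU and TCP MSS is simple arithmetic, sketched below. It assumes IPv4 and TCP headers of 20 bytes each with no options; the link MTU values in the usage note are illustrative only.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def path_mtu(link_mtus):
    """The path MTU is the smallest MTU of any link along the path."""
    return min(link_mtus)


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - TCP_HEADER
```

For example, a path of links with MTUs 1500, 1500 and 1500 supports an MSS of 1460 bytes; if a route change swaps in a 1400-byte link, the usable MSS drops to 1360 and the sender must learn this before full-size segments can get through.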



Problems are compounded when firewalls, load balancers, TCP accelerators, proxies and other "our own man in the middle" (MITM) devices must adjust along the path. Matters are further complicated when firewalls block the ICMP Type 3 Code 4 ("fragmentation needed") messages used for path MTU discovery, preventing IP from negotiating the path MTU. One can imagine what happens when flapping paths have varying MTUs or varying permissions to negotiate path MTU. These conditions are often present because large networks with many controlling silos are inconsistent in handling, or collaborating on, settings that work consistently from end to end.



As routes change, packets already in transit are momentarily caught without any place to go, which sends them along the default route from the point at which they were orphaned.

This causes packets to loop until their time-to-live (TTL) expires and they are discarded by the router that sees the TTL reach zero. Packets orphaned in this manner are considered lost by the transport (TCP) or application layer protocol in charge, which retransmits the data once its dynamic timeout value is reached. After so many TCP retransmissions (typically 6 including the first), each waiting twice as long as the one before, the session is considered closed.










If the application layer does open a new TCP session and retry (typically not more than 3 times, if at all), it goes through another TCP attempt cycle; if the attempts remain incomplete, the application too will abort and the user process will fail. In a large environment like the military, the use of many routing protocols in a decentralized manner makes Route Analytics invaluable for addressing the root cause of packet loss and ineffective routing.
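The doubling timeouts described above add up quickly. The sketch below totals the waiting time for one TCP attempt cycle and for repeated application retries. The 1-second initial timeout is an assumption for illustration (real stacks derive it from measured round-trip time and usually cap it), and the counts of 6 and 3 are the typical values quoted in the text, read here as 6 timeouts per cycle.

```python
def backoff_schedule(initial_rto_s=1.0, attempts=6):
    """Timeout waited after each attempt, doubling every time."""
    return [initial_rto_s * 2 ** i for i in range(attempts)]


def time_until_abort(initial_rto_s=1.0, attempts=6):
    """Total time one TCP session waits before it gives up."""
    return sum(backoff_schedule(initial_rto_s, attempts))


def worst_case_failure_time(app_attempts=3, initial_rto_s=1.0, attempts=6):
    """Time before the user process fails if every application-level
    attempt runs through a complete TCP attempt cycle."""
    return app_attempts * time_until_abort(initial_rto_s, attempts)
```

Under these assumptions a single cycle waits 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds, and three application attempts keep the user waiting roughly three minutes before the process fails.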










Server Response Times



When a group of users experiences slow response from a server, it may be caused by a) the network specific to their location, b) the server or the application they are using, or c) increasing latency across the network. Groups of servers are monitored by one local collection component, and the data is intelligently reduced before being sent to the central database console.



The Response Time component delineates responsibility for degraded performance into three areas:

a) server response time,
b) data transfer time, an indication of bandwidth starvation when it increases from baseline levels, and
c) network latency, with components graphed for rapid identification of the component responsible.
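The three areas above can be separated from packet timestamps roughly as follows. This is a simplified sketch, not the commercial product's method: it assumes a capture taken near the client, uses the TCP handshake round trip as the network latency estimate, and the field names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    syn: float           # time the client SYN was sent
    syn_ack: float       # time the SYN-ACK was received
    request_sent: float  # time the last request packet was sent
    first_byte: float    # time the first response byte arrived
    last_byte: float     # time the last response byte arrived


def decompose(t):
    """Split one transaction's elapsed time into the three areas."""
    network = t.syn_ack - t.syn
    # Server think time: wait for the first byte, less one network round trip.
    server = max(0.0, (t.first_byte - t.request_sent) - network)
    transfer = t.last_byte - t.first_byte
    return {
        "network_latency": network,
        "server_response": server,
        "data_transfer": transfer,
    }


def responsible_component(t):
    """Name of the area contributing the most time."""
    parts = decompose(t)
    return max(parts, key=parts.get)
```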


















Figure 2: Response Time Components Chart

Red - Server Response Time
Yellow - Data Transfer Time (bandwidth)
Green - Network latency
Blue - Retransmission time





If a slowdown or problem occurs, a variety of investigations can be carried out, including:

• Application Rate & Volume corroboration from NetFlow charts.
• SNMP ipSLA test corroboration between nearby router ports.
• SNMP data collected on internetwork components in the path at the same time the performance degradation occurred.
• Packet captures triggered by response time monitoring.



An important element of response time monitoring is the investigations that allow definitive diagnosis and subsequent problem mitigation. Often the response time data itself definitively identifies the root cause. When additional data is required, a triggered investigation is needed, either to capture packets automatically or to identify the range of packets in a long-term packet capture cache, so that additional packet trace application analysis can be performed. When problems happened in diverse locations and at diverse times, the automatic packet capture investigation proved useful for bounding the packets, providing the IP address and TCP application session data that make analysis productive at determining root cause.
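Bounding the packets in a long-term capture cache amounts to filtering by time window and session endpoints. The sketch below assumes packets have already been decoded into dictionaries with the field names shown; a real capture appliance exposes its own query interface, which is not reproduced here.

```python
def select_packets(packets, start, end, ip, port):
    """Return the packets between start and end (inclusive) that involve
    the given IP address and TCP port on either side of the session.

    packets: iterable of dicts with keys
             ts, src_ip, dst_ip, src_port, dst_port.
    """
    return [
        p for p in packets
        if start <= p["ts"] <= end
        and ip in (p["src_ip"], p["dst_ip"])
        and port in (p["src_port"], p["dst_port"])
    ]
```

Narrowing a large cache down to the packets of one slow session is what makes the later manual analysis tractable.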



Retransmissions across high latency satellite circuits take so long that TCP stacks often

fail to recover gracefully. Compounding the problem are flapping routes that exacerbate

the condition. Our own Man In The Middle (MITM) devices such as proxies, bandwidth

optimizers, load balancers, firewalls and server and client TCP/IP stack offload engines

create interactions that require advanced analysis to diagnose.



The response time component provides both the user performance degradation points and

the detailed packet traces required for definitive diagnosis and remediation.













NetFlow Application User Rate and Volume Data



NetFlow provided by routers builds an intrinsic monitoring system without the need for probes at each location in the network. Instead, the routers themselves are used as the probes. The resultant time-synchronized database of NetFlow data is used to correlate slow users, groups, servers and applications. The power is enhanced by tightly integrated solutions that allow queries of response time, NetFlow and SNMP / ipSLA data so that triangulation and corroboration of problems can be displayed together. When the Response Time system indicates a slowdown, it performs a path discovery at the time the problem occurs. This capability enables problems across the path to be examined at the time the user experienced them, bringing together the notice of the user problem and its correlation with a component in the infrastructure. Viewing NetFlow data at this same time at various locations across the network allows pinpointing of the performance problem.
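Rate and volume per client can be derived from flow records with a few lines of aggregation. The record layout below is a simplified stand-in for exported NetFlow fields, assumed for illustration, and the average rate is computed over each client's whole observation span.

```python
def rate_and_volume(records):
    """Summarize flow records per client.

    records: iterable of (client_ip, byte_count, start_s, end_s).
    Returns {client_ip: {"volume_bytes": int, "avg_bps": float}}.
    """
    stats = {}
    for client, nbytes, start_s, end_s in records:
        s = stats.setdefault(client, {"bytes": 0, "first": start_s, "last": end_s})
        s["bytes"] += nbytes
        s["first"] = min(s["first"], start_s)
        s["last"] = max(s["last"], end_s)

    result = {}
    for client, s in stats.items():
        duration = s["last"] - s["first"]
        # Bits per second over the span; zero-length spans report 0.0.
        bps = s["bytes"] * 8 / duration if duration > 0 else 0.0
        result[client] = {"volume_bytes": s["bytes"], "avg_bps": bps}
    return result
```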










Traditional SNMP with enhancements



SNMP queries network components and stores the resultant data in a database at appropriate intervals. When tightly integrated solutions are brought together, all five data points can be exploited in a time-coordinated manner:



1. Route Analytics

2. NetFlow

3. Response-time baseline / investigations

4. Extended SNMP / ipSLA / CBQoS

5. Packet capture details



The result has allowed the US Military to address problems that happen in diverse areas at all

hours in a time-coordinated manner to pinpoint degraded user performance across multiple

theatres of war.
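Time coordination across the five data sources can be pictured as gathering every event that falls near the moment a user slowdown was recorded. The sketch below assumes each source's events have been normalized to (timestamp, source, description) tuples with synchronized clocks; the 30-second window is an arbitrary illustrative choice.

```python
def correlate(slowdown_ts, events, window_s=30):
    """Group events that occurred within window_s seconds of a slowdown.

    events: iterable of (timestamp_s, source, description).
    Returns {source: [(timestamp_s, description), ...]}.
    """
    grouped = {}
    for ts, source, description in events:
        if abs(ts - slowdown_ts) <= window_s:
            grouped.setdefault(source, []).append((ts, description))
    return grouped
```

Seeing, for example, a route change and interface errors reported within seconds of the same slowdown is the kind of corroboration the integrated view is meant to provide.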



Conclusion



Many organizations are using end users to monitor application performance. Not only are end

users the most expensive monitors due to lost productive time experienced in poor

performance, but they are also subjective in their analysis and tire of reporting issues. Once a

consistent, scientific automated solution is in place its cost is quickly recouped and the

feedback loop is complete allowing leadership and technologists to work together effectively to

continually improve user performance. This reduces lost productive time and enables resources

to be allocated in pinpoint fashion to improve performance.



The ultimate aim is to provide a network and application performance feedback loop that enables

leadership to continually optimize application response time. The diagram depicts the participation of

leadership, technologists and the automated solutions designed to meet the mission objective.



For additional reading authored by the presenter consider: “People Practices and Paradigms”, a

Network Management White Paper 1995.



REFERENCES

[1] B. Alderson, “People Practices and Paradigms,” Network Management White Paper, 1995.



Bill Alderson practices global network and application critical problem resolution, including assisting with the Pentagon communications recovery immediately following 9/11, oversight of the re-architecture of Pentagon communications following 9/11, and network and application instrumentation and analysis of the Iraq and Afghanistan theatres.


