IS ZERO DOWNTIME ACHIEVABLE?




New Frontiers Paper
Contents


Foreword
Part I – The need for continuous availability
       Imagine a scenario…
       Greater IT dependency
       Manual backup is no longer an option
Part II - The cost of IT failure
       A lost selling opportunity
       Higher operating costs
       A strategic issue
Part III – How are we going to eliminate downtime?
       Existing technologies
       Breakthrough technologies
       Improved processes and design
       Enhanced service provision
       End-to-end monitoring
Part V – Conclusion – Is zero downtime achievable?
Notes and references




© SITA 2009                                                                                                                                 IS ZERO DOWNTIME ACHIEVABLE 3
Foreword


The economic crisis is the latest in a series of seismic events to impact air transport over the last decade. Each one brings
into sharper focus the global interdependencies that our industry relies on for business stability and business continuity.
Prudent companies continue to invest to mitigate their consequences and limit the risks, even during these events.

I believe information and communication technology has now reached the point where it should be treated as one of those
dependencies. Every day we see more technology being put in front of passengers and employees while some airlines are
building their whole business model around e-commerce. Continuous availability of IT infrastructure should therefore be
seen as an essential component of business continuity plans. Ultimately, the aim must be for zero downtime.

However, achieving such a long term vision will require significant investment in time and resources. That means it will
initially remain the preserve of network-critical sites or technology hotspots, such as hub airports where the business risk is
highest. It will also require a fundamental change in the way service provision for IT is delivered. Service providers will need
to evolve their offerings by providing service levels and pricing that reflect the criticality of IT to different parts of the
customer’s business.

There will be huge challenges, not least the ability to achieve continuous availability end-to-end: along the whole process
or service, over a mixture of customer and third-party equipment and software. Today, neither service providers nor
customers have sufficient visibility to proactively identify and address problems end-to-end, before they arise.

To tackle this there needs to be much greater cooperation between IT partners and customers in the sharing of
performance data. Achieving this should eventually enable service providers to offer integrated end-to-end Service Level
Agreements (SLAs) based on end-user availability and response times.

The solution is not to build perfect IT systems, but to design systems that are instead defect-tolerant. The evolution of
emerging technologies, such as virtualization and self-healing systems that monitor, diagnose and repair their own internal
problems, is a major step in the right direction. Even so, eliminating downtime will need to be accomplished in the context
of a continuous improvement cycle.

This New Frontiers paper provides a glimpse into the future, when the ability of airlines and airports to support an
‘always-on’ environment will be critical.

Enjoy the read.



Dave Bakker
Senior Vice President
SITA Global Services




Part I – The need for continuous availability


Over the next five years, the air transport industry’s dependence on IT-based processes and operations will increase.
It will be matched by a heightened expectation from users for greater reliability of that technology. Tolerance of downtime
will diminish.

Meeting that heightened expectation will require IT providers – including internal IT departments – to change their mindset
to focus on ‘continuous availability’ rather than the ‘high availability’ that has previously been the norm. It means going
beyond relying on good planning and risk analysis to catch the majority of downtime events. Ultimately, it means aiming for
zero downtime.

Imagine a scenario…
It is the peak travel season. A computer switch fails, knocking out the airport communications system. The problem needs
to be diagnosed and located; the switch needs to be replaced. In the meantime check-in kiosks stop functioning, as do
desktop terminals.

Automated boarding gates cannot validate the passenger’s right to board so cease operation. Disembarking passengers
cannot be processed by border management agencies, so need to remain airside. Aircraft are left parked at the gates
unable to depart. Arriving aircraft need to be diverted to other airports. Airline schedules are disrupted across connected
parts of the route network.

Today, hundreds of airport and airline workers would need to be mobilized to enact manual backup plans for processing
tens of thousands of passengers and pieces of luggage. It is a time consuming and frustrating process for all affected.
Security and safety can be compromised, and the cost burden can be enormous.

When a US Customs computer outage struck Los Angeles airport (LAX) in August 2007, on a peak summer travel
day with nearly 25,000 international passengers arriving, border management agents were not willing to take on the risk of
processing passengers manually.

It took 10 hours to fix the computer problem. Seventy-three flights were affected with 17,000 inbound passengers left for
up to nine hours on aircraft. A further 16,000 outbound passengers were affected. The outage forced some planes to sit on
the tarmac for so long that airport workers had to refuel them to keep their power units and air conditioners running.
Maintenance trucks drove around the airport, with workers hooking up tubes to aircraft to service aircraft toilets.

That was over two years ago and manual backup plans proved no substitute. But what of the future when there will be
higher passenger numbers and greater dependency on information technology?

Greater IT dependency
The list of emerging technology trends within the air transport industry highlights the situation.
• Heavier reliance on e-commerce/e-business and the use of the Internet
• Greater use of kiosks/mobile phone/web check-in
• Real-time biometric security screening
• Automated boarding gates
• Airport-based employees armed with digital handheld devices
• RFID and sensor technologies for baggage/cargo/asset tracking
• E-enabled aircraft downloading and uploading operational information




Manual backup is no longer an option
The result will be dramatic increases in both network traffic volumes and the complexity of IT hardware and software. This
will put an increasing strain on IT systems within the industry and as a consequence, the probability of experiencing failures
or unavailability will also increase.

Dealing with it requires good disruption management, but as technology gradually supplants people, relying on manual
backup systems will be increasingly untenable.

The situation is further aggravated by the industry’s growing reliance on external providers, such as public networks.

The five 9s - 99.999% - are now a standard telecom industry benchmark for network availability, so major outages are rare,
but they do happen. A severed fibre optic cable in December 2008, under the Mediterranean, caused severe disruption to
countries in the Middle East for over two days, with up to 70 per cent of all internet traffic and telephone communications
between Europe and Africa affected. Internet traffic had to be rerouted through Asia and the US to keep people connected.
A similar outage halted communications between Europe, Africa and Asia earlier in 2008.
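
To put the "five 9s" benchmark in concrete terms, the short sketch below converts an availability percentage into the downtime it allows per year. It is an illustration only: the function name is hypothetical and a 365-day year is assumed.

```python
# Minimal sketch: annual downtime allowed by an availability target.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the minutes of downtime per year that a target permits."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% availability allows {minutes:.1f} minutes of downtime a year")
```

At 99.999% the allowance is a little over five minutes a year, which is why an outage lasting two days stands out so sharply against the benchmark.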

Technology trends within the air transport industry, such as cloud computing and Software-as-a-Service (SaaS), will
exacerbate the dependence on third party suppliers. Infrastructure problems affecting ‘the cloud’ will become an issue.
Amazon, whose S3 service is the most-used ‘cloud’ utility service, has had its share of outages. In 2008, there were two
major ones in February and July. The one in July lasted eight hours. Microsoft also experienced nearly 22 hours of
downtime on its fledgling Azure Services Platform in March 2009.

Furthermore, airlines are making their websites more interactive. This brings new opportunities, but also provides new
potential for downtime as websites become stacked with third-party coding and applications, any one of which may fail and
cause knock-on consequences for the host website in terms of performance or downtime. When the website tracking and
counter company SiteMeter incorrectly updated its script in August 2008 – many websites have it embedded into their
pages to get visitor statistics – a number of popular sites reportedly crashed.




Integrating external services into the industry's IT landscape is adding a new level of vulnerability.




Part II - The cost of IT failure


IT downtime can be very disruptive with the consequences often disproportionate to the actual event as lost selling
opportunities and operational problems start to accumulate.

A lost selling opportunity
Airlines, in particular, are increasingly positioning themselves as e-commerce companies. Currently, around 27% of ticket
sales globally are derived from online sources [1], of which around 90% are generated on the airlines’ own websites [2]. The trend
is increasing. In mature markets, such as North America, online sales represent around 50%-60% of total ticket sales,
making online the primary sales channel. Some carriers have set themselves targets in excess of 90% of sales. Ryanair has set a
goal of 100% [3]. This will put airlines in the position of e-commerce companies, such as eBay and Amazon, that rely almost
exclusively on their websites being available.

A four-month study by Pingdom on the availability of 42 international airline websites found that average uptime was
99.49%, equating to 14 hours and 52 minutes of downtime per website [4].
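
The relationship between an uptime percentage and hours of downtime can be checked with a few lines of arithmetic. The sketch below assumes the study window is the period 19 November 2008 to 19 March 2009 given with the Pingdom graphic; the function name is hypothetical.

```python
# Minimal sketch: convert an uptime percentage into hours of downtime.
# Assumption: the measurement window runs from 19 Nov 2008 to 19 Mar 2009.
from datetime import date


def downtime_hours(uptime_pct: float, start: date, end: date) -> float:
    """Hours of downtime implied by an uptime percentage over a date range."""
    window_hours = (end - start).days * 24
    return window_hours * (1 - uptime_pct / 100)


hours = downtime_hours(99.49, date(2008, 11, 19), date(2009, 3, 19))
whole_hours = int(hours)
minutes = round((hours - whole_hours) * 60)
print(f"about {whole_hours} hours {minutes} minutes of downtime per website")
```

The result is close to, but slightly below, the 14 hours and 52 minutes reported. The gap would be consistent with the published uptime figure having been rounded to two decimal places, although that explanation is an assumption rather than something the study states here.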




[Figure: Accumulated downtime of airline websites during period 19 November 2008 – 19 March 2009. Graphic courtesy of Pingdom.com]




In today’s context for major airlines, the roughly 25% of tickets sold through their own websites [5] means downtime could
equate on average to as much as US$ 9,000 a minute, or more than US$ 500,000 an hour, in lost online selling opportunity [6].
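
As a rough illustration of how the per-minute estimate scales, the sketch below multiplies it by an outage length. The rate is the SITA estimate quoted above; the function name and the assumption that loss grows linearly with outage duration are simplifications made for this example.

```python
# Minimal sketch: scale the quoted per-minute estimate to an outage length.
# Assumption: lost opportunity grows linearly with outage duration.
COST_PER_MINUTE_USD = 9_000  # SITA estimate quoted in the text


def lost_opportunity_usd(outage_minutes: float,
                         cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimated lost online selling opportunity for an outage."""
    return outage_minutes * cost_per_minute


one_hour = lost_opportunity_usd(60)
print(f"One hour of downtime: about US$ {one_hour:,.0f}")
```

One hour at this rate comes to US$ 540,000, in line with the "more than US$ 500,000 an hour" quoted above.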

That lost opportunity is not just restricted to ticket sales. Ancillary revenues initiated from the website can also be hit and
website downtime can have a ripple effect on the revenues of online partners.

A more intangible cost is damage to the brand. Repeated downtime for critical functions, such as baggage management or
check-in, can negatively impact the customer’s perception and loyalty. Underperforming websites for example, frustrate
customers, driving them to the competition. So site downtime does not just prevent current transactions – it can impact
future revenues as well.

Higher operating costs
While lost opportunities to generate sales are relatively easy to calculate, the operational cost when IT fails is less
quantifiable. Getting inbound and outbound passengers through the airport involves executing a tight set of parallel and
connected activities to meet a demanding flight schedule. A failure at any point can have knock-on consequences that
cause costs to multiply upwards, fast.

Staff overtime and lost productivity come into the mix. But what is clear is that airline and airport employees are becoming
more IT dependent, which means they will be less able to function normally in times of outages.

Consumer protection is also a variable in the equation. Passengers in some parts of the world now have enhanced rights to
compensation in the event of disruption to their journey or their baggage’s journey.

A strategic issue
What is clear is that IT uptime and business performance are now inextricably linked. As such, IT resilience needs to
become part of good risk management discussed at the highest levels of decision making rather than left to the IT
department. It is now no longer a technology issue, but a strategic issue.

As the sharp falls in the share prices of both BAA owners Ferrovial and British Airways following the problematic opening of
T5 at Heathrow Airport underline, IT is now critical not just to the operational performance of an airport or airline, but also
to the business performance [7].

Nevertheless, achieving continuous availability carries with it a cost – the cost of over-engineering systems and processes,
the cost of implementing best practices, the cost of new tools for predictive maintenance and condition-based monitoring
to mitigate what are relatively infrequent events. Mapping the criticality of IT to the level of business risk if it fails is therefore
a key requirement for air transport organizations.




Part III – How are we going to eliminate downtime?

A quick look in the rear-view mirror shows what can happen when IT fails, but how do we mitigate this going forward?
The evolution of existing and new technologies, as well as improved processes and use of best practice and standards,
are all going to play a part.

Existing technologies
A number of technologies for increasing the robustness and resilience of the industry’s computing infrastructure are already
a fundamental part of data centre and network design.

Redundancy
The most common approach is to provide multiple levels of redundancy or backup systems for the critical components
along the chain, including end-user devices, network connections and interfaces, servers and storage devices. That
includes redundant power supplies and cooling systems. This eliminates many of the single points of failure that could
cause downtime.
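
The effect of redundancy on availability can be sketched with a simple probability calculation. The example assumes that redundant copies fail independently and that the service stays up as long as at least one copy works; real systems share failure causes (power, software faults, human error), so this is an upper bound rather than a prediction. The function name is hypothetical.

```python
# Minimal sketch: availability of N redundant copies of one component.
# Assumptions: copies fail independently; one working copy is enough.
def parallel_availability(component_availability: float, copies: int) -> float:
    """Probability that at least one of `copies` components is available."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component_availability) ** copies


for copies in (1, 2, 3):
    combined = parallel_availability(0.99, copies)
    print(f"{copies} x 99% component(s): {combined:.6f}")
```

Under these assumptions, duplicating a 99% component lifts availability to 99.99%, which is why removing single points of failure is usually the first step.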

Cables represent a major point of failure, even for wireless-based systems, which generally use wired networks for
backhaul. The use of more diverse network routings can limit the impact, while faster traffic rerouting techniques are helping
to limit downtime and expedite service restoration.

Server load balancing
Load balancing has been used to reduce downtime for many years and most of the methods are well understood. Load
balancing enables services to still be provided even if there is a failure in individual parts of the linked servers. Most load
balancing is done as an integral part of the Application Delivery Controller (ADC), which incorporates some level of ‘health
monitoring’ as well as directing network traffic and ensuring application uptime.

At the global level, server load balancing involves geographically distributed data centres. This provides an additional degree
of availability by accounting for site-level disruptions and outages.
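
The idea of load balancing combined with health monitoring can be sketched as a round-robin selector that skips servers marked as unhealthy. This is a simplified illustration, not the behaviour of any particular Application Delivery Controller; the class, method and server names are hypothetical.

```python
# Minimal sketch: round-robin load balancing that skips unhealthy servers.
# All names here are hypothetical and used for illustration only.
from itertools import cycle


class RoundRobinBalancer:
    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        """Record a failed health check for a server."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Record a passed health check for a known server."""
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        """Return the next healthy server in rotation."""
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        # One full turn of the ring is enough to find a healthy server.
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
balancer.mark_down("server-b")  # simulate a failed health check
picks = [balancer.next_server() for _ in range(4)]
print(picks)
```

Requests keep flowing to the remaining servers while the failed one is out of rotation, and it rejoins as soon as a health check marks it up again.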

Link aggregation
Link aggregation is a method of combining multiple physical network links into a single logical link. If one physical link goes
down, the others can still handle the traffic. After the failed link is repaired, the system automatically reconfigures to use all
active network links. The change of link is transparent to the end user, who experiences no downtime.

Mirroring
Mirroring technology can significantly boost data redundancy capabilities. The mirroring system creates a duplicate database
in a separate location and synchronizes the data instantaneously. In the event that one database is disabled, the other one
takes over. Mission-critical applications can be mirrored across multiple data centres, providing disaster recovery options.




Data centre improvements through new technologies such as virtualization will significantly reduce downtime.




Breakthrough technologies
While these techniques are well established and their benefits well documented, there are some new advances that are
now moving out of the R&D lab and into mainstream production environments.

Self-healing systems
Autonomic and self-healing systems that monitor, diagnose and repair their own internal problems have attracted a lot of
interest for reducing server downtime. Initial results appear promising. Self-healing architectures can trigger automated
reactions such as initiating backup systems, ordering replacement parts, or downloading fixes from online collaboration
software, while ensuring users of the system do not notice problems.

In addition, system administrators can be provided with a more informative and effective means to identify and prevent
future critical errors. Over time, the learning nature of self-healing components should ensure always-on systems.

Major vendors, such as Sun Microsystems and IBM, are investing in self-healing technology. “…autonomic technology
allows us to envision a day…when all IT problems are resolved in a fraction of the time it takes today. This has the potential
to unleash enormous productivity gains from such a dramatic decrease in downtime,” said Dr. Kazuo Iwano, a vice president at
the IBM Software Laboratory in Yamato, Japan [8].

Virtualization
The promise of significant cost savings from the consolidation of servers has made virtualization an emerging trend within
the air transport industry. According to the 2009 Airline IT Trends Survey, 90% of airlines will have made some investment in
virtualization within the next few years [9].

In effect, virtualization frees software from the underlying servers running it. This makes it particularly useful for rapid
disaster recovery as hardware failures no longer cause application downtime. The software is automatically moved from
one server to another without any transaction discontinuity or loss, making the change transparent to a customer or
employee. This capability works equally across different geographical locations providing an additional degree of protection
from site-level disruptions.

Improved processes and design
Technology advances will only take the industry so far. Sustained reliability improvements require a critical coupling between
technology, processes and design. That means embedding best practices into the management of IT operations.

The effectiveness of a network system is determined by the effectiveness of the network’s processes. In essence, the
implementation of best practices is “the single most effective means” to reduce and/or mitigate the impact of outages.
So concludes the US Network Reliability Steering Committee (NRSC) [10], having studied network outage frequency since its
creation by the FCC in 1992.

IT best practices today are largely governed by the IT Infrastructure Library (ITIL), a set of recommendations covering areas
such as incident management, problem management, change management, release management and the service desk.
At the most fundamental level ITIL helps IT organizations provide reliable and consistent service to end users. As such, it
has become the de facto global standard for high quality IT service management. Now in its third version, ITIL reinforces the
importance of a lifecycle approach for reducing IT downtime.

Another way of reducing downtime is through good design. This gives start-up airlines and low cost carriers with their
simplified IT structures an advantage. For established players it is much harder, with most air transport processes and
systems so tightly interlinked that new technologies, more often than not, have to be bolted onto older systems. Rather
than reducing downtime this introduces greater failure risks.




Enhanced service provision
The primary goal for service provision in any organization must be to ensure that IT infrastructure remains available at all
times. The air transport industry is no exception, but the value that IT brings air travel makes a vision of zero downtime a
greater imperative. To achieve it though requires a re-think in the way service provision for IT is delivered.

Take the typical airport setup illustrated below. Each of the six individual components has an uptime of 99% or more. But
end-to-end availability can be no better than the weakest link, and outages in different components accumulate, so when
viewed from an end-to-end perspective it translates into uptime of only 80% for the end-user.

End-to-end service management: today’s solution

[Figure: timeline from 05:00 to 16:00 showing the periods when each component is available or unavailable]

     Component                  % available
     Agent                      99%
     Airport                    99%
     CUTE Servers               99.9%
     Airport                    99.9%
     Host DCS                   99.99%
     Data Centre                99.9%

     End-user experience        80%
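
How component availabilities combine along a chain can be sketched with a simple product. The calculation assumes that the end-user needs every component at once and that components fail independently; both are simplifications made for this example, and the list labels are only there to tell the two "Airport" entries apart.

```python
# Minimal sketch: end-to-end availability of components in series.
# Assumptions: every component is required; failures are independent.
from math import prod

components = [
    ("Agent", 0.99),
    ("Airport (first entry)", 0.99),
    ("CUTE Servers", 0.999),
    ("Airport (second entry)", 0.999),
    ("Host DCS", 0.9999),
    ("Data Centre", 0.999),
]

end_to_end = prod(availability for _, availability in components)
weakest = min(availability for _, availability in components)

print(f"Weakest component: {weakest:.2%}")
print(f"End-to-end (series) availability: {end_to_end:.2%}")
```

Under these assumptions the long-run end-to-end figure falls below that of any single component. The 80% shown in the illustration is lower still; it presumably reflects the specific outages drawn in the 05:00 to 16:00 window, though the paper does not say how it was derived.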

Even when it is up and running the service may not be performing at an acceptable level. Service degradation can often be
viewed as ‘hidden’ downtime when it deteriorates to a level that makes it difficult for customers, employees and business
partners to perform the activities they want. It is therefore important to measure availability and response time from the
end-user perspective.
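
One way to make 'hidden' downtime visible is to count a slow response as unavailable when measuring from the end-user's side. The sketch below is illustrative: the sample data, the two-second threshold and the function names are all assumptions made for this example.

```python
# Minimal sketch: availability as the end-user experiences it.
# Assumptions: samples are response times in seconds, None means the
# request failed, and anything slower than the threshold counts as down.
def raw_availability(samples):
    """Share of requests that received any response at all."""
    return sum(1 for t in samples if t is not None) / len(samples)


def effective_availability(samples, max_response_s):
    """Share of requests answered within the acceptable response time."""
    ok = sum(1 for t in samples if t is not None and t <= max_response_s)
    return ok / len(samples)


# Hypothetical probe results: one failure and two very slow responses.
samples = [0.4, 0.6, None, 5.2, 0.5, 0.7, 9.8, 0.3, 0.5, 0.4]

raw = raw_availability(samples)
effective = effective_availability(samples, max_response_s=2.0)
print(f"Responded at all: {raw:.0%}; responded acceptably: {effective:.0%}")
```

In this made-up sample the service answered 90% of requests, but only 70% were answered quickly enough to be useful, which is the gap that end-user measurement is meant to expose.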

End-to-end monitoring
Central to the issue is end-to-end monitoring. But it is complex. Network connectivity, end-user devices, software and
applications can all be sourced from different vendors. Getting visibility of these third-party systems will require service
providers to cooperate and work much more closely with airline and airport customers and actively engage in the two-way
sharing of data.

The technology to do it is already becoming available. Next-generation monitoring tools that can capture the performance of
multi-vendor components are starting to be deployed by service providers, giving a snapshot of end-to-end visibility, although
it will be a number of years before they become a common feature of service management within the air transport industry.

The data from those tools will provide a single, cogent view to monitor and manage linkages between IT systems and
business processes enabling service providers to proactively identify and address problems before they become major
ones, along the entire IT chain. It will be an iterative process. Improved data quality will drive better predictive capabilities,
leading to incremental progress in reducing downtime.

Another beneficial outcome will be the ability to monitor and report with real-time data on service performance. That should
eventually enable external service providers to offer integrated end-to-end SLAs based on end-user availability and
response times. It is a capability that will be increasingly expected by airlines and airports.



Part V – Conclusion – Is zero downtime achievable?

Aiming for zero IT downtime is a tough challenge. It is clear that even major companies such as Google, eBay or Amazon,
which depend on technology for their existence, are not immune to downtime. But the future business requirements of the air
transport industry will make it more reliant on networked technology, so it is a challenge that must be faced.

The way forward is not easy. Achieving zero downtime carries with it a cost to mitigate what are relatively infrequent events.
As such, IT resilience is no longer a technology issue, but a strategic issue. Airlines and airports will need to treat it as part
of good risk management, deciding in which part of their operations continuous availability makes sense and where it does
not. Airlines may see downtime as an ‘acceptable risk’ at their outstations, but not at their hub airports or at other network
critical sites.

That makes it a board level responsibility outside the remit of the IT department. Investing the time and money to implement
policies and technologies that will move towards zero downtime requires leadership from the highest levels of decision making.

External service providers will also need to adapt by offering a variety of service and charging schemes to reflect the level of
criticality of IT to the customer’s business. In effect, airlines and airports will pay an ‘insurance’ premium to service providers
for continuous availability in locations where the business risk justifies it.

The aim is to design systems that can survive defects and so ensure continuous uptime of the industry’s IT infrastructure.
In five years’ time advanced tools and predictive techniques, coupled with more effective best practices and service
management, will make the complex IT systems underlying the industry more reliable than they are today.

But the longer term vision of zero downtime will only be achieved if it is shared not just by
third-party IT providers but by the airlines, airports and other stakeholders themselves. Failure to understand the cost of not
acting or investing in it carries the real risk that a business will become less competitive over time.




Notes and references


Note 1: 2009 Airline IT Trends Survey, conducted annually by Airline Business and SITA

Note 2: 2008 Airline IT Trends Survey

Note 3: Ryanair press release, 10 March 2009

Note 4: Report available from:
        http://www.pingdom.com/_img/press/pingdom_20090423_report_airline_websites_downtime.pdf

Note 5: 2008 Airline IT Trends Survey

Note 6: SITA estimate

Note 7: The share prices of British Airways and Ferrovial (owner of BAA, which owns Heathrow Airport) were down more
        than 8% and 3% respectively during the two trading days following the opening of T5 on March 27, 2008.

Note 8: CNET 2006

Note 9: 2009 Airline IT Trends Survey, conducted by Airline Business and SITA. Results are available at
        www.sita.aero

Note 10: The Network Reliability Council (NRC) was established by the FCC in January 1992, to bring together
         leaders of the telecommunications industry and telecommunications experts from academic and
         consumer organizations to explore and recommend measures that would enhance network reliability. It
         produces a biennial report.




For further information, please contact SITA by telephone or e-mail:

Africa
+27 11 5177000
info.africa@sita.aero

East & Central Europe
+41 22 747 6000
info.east.central.europe@sita.aero

Latin America & Caribbean
+55 21 2111 5800
info.latin.america.and.caribbean@sita.aero

Middle East & Turkey
+961 1 657200
info.middle.east.turkey@sita.aero

North America
+1 770 850 4500
info.northamerica@sita.aero

North Asia & Pacific
+65 6545 3711
info.north.asiapacific@sita.aero

North Europe
+44 (0)20 8756 8000
info.northeurope@sita.aero

South Asia & India
+65 6545 3711
info.south.asia.india@sita.aero

South Europe
+39 06 965111
info.southeurope@sita.aero




© SITA 09-THW-048-1. All trademarks acknowledged. Specifications subject to change without prior notice. This literature provides
outline information only and (unless specifically agreed to the contrary by SITA in writing) is not part of any order or contract.

				