DataCenterDowntime Feb2011

The Truth and Consequences of Data Center Downtime




         © 2011 Emerson Network Power
Emerson Network Power: The global leader in enabling Business-Critical Continuity

[Figure: data center infrastructure diagram. Labels: Automatic Transfer Switch; Paralleling Switchgear; Fire Pump Controller; Surge Protection; Uninterruptible Power Supplies & Batteries; Integrated Racks; Perimeter Precision Cooling; Row Based Precision Cooling; Extreme-Density Precision Cooling; Cold Aisle Containment; Cooling; Rack; Rack Power Distribution Unit; KVM Switch; Monitoring; UPS; Power Distribution Units; Data Center Infrastructure Management]




Emerson Network Power –
An organization with established customers




Presentation topics


• Emerson Network Power overview

• National Survey on Data Center Downtime: Frequency,
  Duration and Cost, Dr. Larry Ponemon, Founder and President,
  Ponemon Institute

• Preventing the Most Common Causes of Downtime: Root
  Cause Analysis, Best Practice Prevention and Technology,
  Peter Panfil, Vice President and General Manager, Liebert North
  America AC Power, Emerson Network Power

• Question and Answer session




National Survey on Data
Center Downtime:
Frequency, Duration and
Cost
Dr. Larry Ponemon
Founder and President
Ponemon Institute




About the Ponemon Institute

• The Institute is dedicated to advancing
  responsible information management
  practices that positively affect privacy,
  data protection and information security in business and government
• The Institute conducts independent research, educates leaders from
  the private and public sectors and verifies the privacy and data
  protection practices of organizations
• The Institute is a member of the Council of American Survey
  Research Organizations (CASRO), and Dr. Ponemon serves as
  chairman of the Government and Public Affairs Committee of
  CASRO’s Board
• The Institute has assembled more than 50 leading multinational
  corporations into the RIM Council, which focuses on the development
  and execution of ethical principles for the collection and use of
  personal data about people and households



About the studies

• Purpose: Determine the frequency and
  cost of unplanned data center outages
• Study 1: 453 individuals in U.S.
  organizations who have responsibility for data center operations
    – Perceptions about data center criticality, availability and outages
    – Perception differences between executives and associates
• Study 2: Develop an activity-based costing model derived from
  actual meetings or site visits for 41 data centers that experienced
  a complete or partial unplanned data center outage, to capture both
  direct and indirect costs related to:
    –   Damage to mission critical data
    –   Impact of downtime on organizational productivity
    –   Damages to equipment and other assets
    –   Cost to detect and remediate systems and core business processes
    –   Legal and regulatory impact, including litigation defense cost
    –   Lost confidence and trust among key stakeholders


Perceptions about data center availability




           Agree: Combines strongly agree and agree responses
           Disagree: Combines strongly disagree, disagree and
                           unsure responses
Perception differences between
senior management and operators




                  Supervisor and below
                   Director and above
Experience with unplanned data center outages

Left chart: Experienced one or more unplanned data center outages over the past 24 months
Right chart: Frequency of unplanned data center outages over the past 24 months




                 Total data center outage: Entire facility is down
                Partial outage: Limited to individual rows and racks
                Device-level outage: Individual servers and IT units
Extrapolated duration of data center
outages in minutes




            Total data center outage: Entire facility is down
           Partial outage: Limited to individual rows and racks
           Device-level outage: Individual servers and IT units
Extrapolated frequency of complete data center
outages by square footage
    Frequency
    Duration




Extrapolated frequency of complete data center
outages by industry




             Extrapolated frequency of unplanned outages
                             over two years
Study 2: Activity-based cost framework for the cost of
data center outages




           Interviewed and audited 41 data center managers who
                     experienced an unplanned outage
Cost loadings from ABC Framework

Cost activity centers          Direct cost   Indirect cost   Opportunity cost   Total
Detection                          52%            48%               0%           100%
Equipment cost                     60%            40%               0%           100%
IT productivity loss               23%            77%               0%           100%
End-user productivity loss         22%            78%               0%           100%
Third parties                      35%            41%              24%           100%
Recovery                           22%            78%               0%           100%
Ex-post response                   53%            47%               0%           100%
Lost revenue                       33%            26%              41%           100%
Business disruption                24%            30%              45%           100%
Average contribution               36%            52%              12%

                       Interviewed and audited 41 data center managers who
                                 experienced an unplanned outage
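
The "Average contribution" row can be reproduced from the nine cost activity centers in the table above. The following is a minimal Python sketch, assuming the averages are simple unweighted means of the rows; the names `COST_LOADINGS` and `average_contribution` are illustrative and not part of the study.

```python
# Cost loadings in percent, copied from the table above:
# (direct, indirect, opportunity) for each cost activity center.
COST_LOADINGS = {
    "Detection": (52, 48, 0),
    "Equipment cost": (60, 40, 0),
    "IT productivity loss": (23, 77, 0),
    "End-user productivity loss": (22, 78, 0),
    "Third parties": (35, 41, 24),
    "Recovery": (22, 78, 0),
    "Ex-post response": (53, 47, 0),
    "Lost revenue": (33, 26, 41),
    "Business disruption": (24, 30, 45),
}


def average_contribution(loadings):
    """Unweighted mean of each cost type across activity centers.

    Treating the average as unweighted is an assumption of this sketch.
    """
    count = len(loadings)
    columns = zip(*loadings.values())  # regroup rows into three columns
    return tuple(round(sum(column) / count) for column in columns)


direct, indirect, opportunity = average_contribution(COST_LOADINGS)
print(f"Direct {direct}%, indirect {indirect}%, opportunity {opportunity}%")
```

Under that assumption the sketch yields the same 36% / 52% / 12% split shown in the table.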
Average cost by category




            Results shown are derived from the analysis of
             41 data centers located in the United States
Total cost by industry sector




               The average duration of the outage for the
                   41 data centers was 102 minutes
Total cost for partial and total shutdown




           Results shown are derived from the analysis of 41 data
                    centers located in the United States
Preventing the Most Common
Causes of Downtime: Root
Cause Analysis, Best Practice
Prevention and Technology
Peter Panfil
Vice President and General Manager
Liebert North America AC Power
Emerson Network Power



Were the unplanned outages during
the 24 months preventable?




Total cost by industry sector




             Data centers experienced multiple outages during
                      the 24-month period surveyed
#1: Battery failure

• 65% of outages caused by battery failure
• How?
   – A single bad cell among thousands can take down a facility
   – Batteries have a limited life expectancy
   – False confidence; no indication of problems until needed
• Service life of a battery varies, dependent on:
   – Frequency of usage
   – Ambient temperatures
   – Quality of connections and terminals
• The weakest link in critical power



#1: Battery failure

 Best Practice: Preventive Maintenance
• Service contracts for inspections and testing
   – Monthly, quarterly and annual actions need to be taken




#1: Battery failure

 Best Practice: Real-Time Monitoring
• Measure the internal DC resistance of all battery cells
• Combination of hardware and software
   – Alarm management via email and SMS
   – Measures the reliability of the entire battery
      • Strap
      • Inter-tier connections
      • Plates
      • Battery connection posts/terminals
• Proactively identify and replace bad batteries


                    White Paper: Implementing Proactive Battery
                              Management Strategies
                       to Protect Your Critical Power System
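
As a rough illustration of the monitoring idea on this slide, the sketch below flags cells whose measured internal resistance has drifted above a baseline value. The 25% threshold, the cell IDs and the readings are hypothetical values chosen for the example, not figures from this presentation or from any monitoring product; real alarm limits should come from the battery manufacturer.

```python
def find_suspect_cells(readings_micro_ohm, baseline_micro_ohm, max_rise=0.25):
    """Return the IDs of cells whose resistance exceeds baseline by more than max_rise.

    max_rise=0.25 (a 25% rise) is an assumed alarm threshold for illustration only.
    """
    limit = baseline_micro_ohm * (1 + max_rise)
    return sorted(
        cell_id
        for cell_id, resistance in readings_micro_ohm.items()
        if resistance > limit
    )


# Hypothetical readings for one battery string, in micro-ohms.
readings = {"cell-01": 3100, "cell-02": 3250, "cell-03": 4200, "cell-04": 3050}
suspects = find_suspect_cells(readings, baseline_micro_ohm=3000)

# A monitoring system would raise an email or SMS alarm for these cells.
print(suspects)
```

Tracking every cell individually matters because, as noted earlier in the deck, a single bad cell can take down the whole string.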
#2: UPS capacity exceeded

• 53% of outages caused by lack of UPS capacity
• How?
   – IT gets added without knowledge of infrastructure impact
   – Redundant UPS loaded over 50%; should a UPS or battery failure occur,
     the remaining UPS cannot support 101% of the load
   – IT usage is variable, not static
• IT growth outpaces AC Power infrastructure growth
• Disconnect between Facilities and IT
   – The owner of the UPS might not be IT
• Battery runtime is also dependent on how much load is being
  supported
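
The 50% figure on this slide is simple arithmetic: in a redundant pair, the surviving unit must carry the entire load by itself. A minimal sketch, assuming identical UPS modules that share the load equally (the function names and the 500 kW rating are illustrative, not taken from the presentation):

```python
def per_module_load_percent(load_kw, module_kw, modules):
    """Load on each module as a percentage of its rating, with all modules online."""
    return 100.0 * load_kw / (module_kw * modules)


def survives_single_failure(load_kw, module_kw, modules):
    """True if the remaining modules can carry the full load after one fails.

    Assumes identical modules that share the load equally.
    """
    if modules < 2:
        return False  # a single module has no redundancy
    return load_kw <= module_kw * (modules - 1)


# Two hypothetical 500 kW modules in a redundant pair.
for load_kw in (480, 520):
    percent = per_module_load_percent(load_kw, 500, 2)
    ok = survives_single_failure(load_kw, 500, 2)
    print(f"{load_kw} kW load: {percent:.0f}% per module, survives one failure: {ok}")
```

Once each module in the pair runs above 50%, the total load exceeds what one module can carry, so redundancy is lost even though neither unit looks overloaded.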



#2: UPS capacity exceeded

 Best Practice: Additional UPS Cores for
 capacity and redundancy
• Keep redundant UPS capacities at 30% - 40%
   – IT load must not exceed the total capacity of a single UPS
   – Efficiency of the Liebert NXL optimized at partial loads
• Size the new UPS system on best-case growth
• Real-time capacity monitoring to manage load balancing
• UPS configured in a parallel redundant configuration




               Some data centers willing to trade redundancy
                     for capacity – analyze the costs,
                            risks and benefits
#2: UPS capacity exceeded

• Options for parallel redundant UPS

   – N+1: UPS cores connected through a System Control Cabinet with a centralized STS to the IT load
      • Centralized static transfer switch
      • System-level control, fault tolerant
      • Size of STS determines total capacity
   – 1+N: UPS cores, each with its own SS, connected through a Paralleling Cabinet to the IT load
      • Distributed static switches
      • Individual cores manage load transfers
      • Cannot parallel different sized UPS

                   White Paper: High-Availability Power Systems, Part II:
                                  Redundancy Options
#3: Accidental EPO / Human error

• 51% of outages caused by user error
• How?
   – Pushing the EPO thinking it’s a light switch
   – Improper equipment operation could drop the entire facility
   – Careless installation of servers damages infrastructure


• Many people involved in data center operation
   – Too many cooks…
   – Alarms and control panels everywhere
• 100% preventable
• Most cost-effective root cause to solve



#3: Accidental EPO / Human error

 Best Practice: Documentation, Standard
 Procedures, Training and Remote Monitoring

• Shield EPO
• Documented Maintenance Procedures
• Labeling One-Lines
• Follow Processes; No Short Cuts
• Personnel Training
• Keep it Clean
• No Food or Drink
• Escort Visitors
• Infrastructure Monitoring

#3: Accidental EPO / Human error

• Best practices for EPO
   –   A / B EPO in A / B data centers
   –   Separate EPO from the fire alarm
   –   Remove local EPO from UPS and PDUs
   –   Provide physical protection
   –   Provide maintenance and test features
   –   Document and label
   –   Training
• 2011 code changes
   – NFPA 70 – 645-10, Disconnecting Means




#4: UPS equipment failure

• 49% of outages caused by UPS failure
• How?
   – UPS has components with a finite life; some need to be replaced
   – UPS repaired with non-OEM parts
   – Blame the UPS when it’s really the batteries


• A UPS is only as reliable as its shortest-lived
  component
   – Liebert design philosophy addresses this issue by reducing the number
     of parts, thus decreasing the chance of a failure
• UPS designed to prevent outages, not cause them



#4: UPS equipment failure

 Best Practice: Preventive Maintenance by an
 experienced technician
• At least two PM visits per year
• OEM technician using OEM parts and calibration
• MTBF for units that received two PM visits per year is 23 times higher
  than for units with no PM service events
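
To see why a higher MTBF matters, steady-state availability is commonly estimated as MTBF / (MTBF + MTTR). The sketch below applies the 23x factor from this slide to a hypothetical baseline. The baseline MTBF and the repair time are made-up numbers for illustration only; they are not data from the presentation or the white paper.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the unit is in service."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_minutes_per_year(mtbf_hours, mttr_hours):
    """Average unavailable minutes per year implied by the availability figure."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR * 60


BASELINE_MTBF_HOURS = 50_000  # hypothetical MTBF with no PM visits
MTTR_HOURS = 4                # hypothetical mean time to repair
PM_FACTOR = 23                # MTBF multiplier quoted on the slide

for label, mtbf in (
    ("no PM", BASELINE_MTBF_HOURS),
    ("two PM visits per year", BASELINE_MTBF_HOURS * PM_FACTOR),
):
    minutes = expected_downtime_minutes_per_year(mtbf, MTTR_HOURS)
    print(f"{label}: availability {availability(mtbf, MTTR_HOURS):.6f}, "
          f"about {minutes:.1f} min/year of expected downtime")
```

The absolute numbers depend entirely on the assumed baseline; the point is only that expected downtime shrinks roughly in proportion to the MTBF gain.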




               White Paper: The Effect of Regular, Skilled Preventive
                 Maintenance on Critical Power System Reliability
#5: Heat- and water-related

• 35% of outages caused by water incursion
• 33% of outages are heat-related
• How?
   – Cooling leaks and chilled water distributed in-row
   – Repairs to in-row cooling cause chilled water leaks
   – Server densities are rising, and so is the heat


• As densities increase, cooling is brought closer to the IT load
   – For some in-row cooling products, water is on top of, next to and below
     critical electrical equipment
   – Solving the heat problem, but causing a water problem




#5: Heat- and water-related

 Best Practice: Utilize refrigerants, easier
 maintenance and leak detection monitoring
• R410A and glycol for row-based units
    – Eliminate the need for water in the row
• Monitor for leaks under the floor




[Images: Refrigerant-based high density cooling; Point or zone detection; Front and rear parts access]
• Importance of easy maintenance for row CW units
    – Do you need to remove the in-row unit for repair?

#5: Heat- and water-related

 Best Practice: Optimized airflow
• Containment
   – Increases cooling capacity and energy efficiency
• Temperature sensors
   – Supply and return
   – Rack-level
• Utilize temperature data to control
  and optimize cooling output
   – Variable Speed Drives
   – Digital Scroll Compressors


                White Paper: Combining Cold Aisle Containment with
                 Intelligent Control to Optimize Data Center Cooling
                                        Efficiency
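
One simple way to use temperature data to control cooling output is a proportional mapping from rack inlet temperature to fan speed. The sketch below is only an illustration under stated assumptions: the setpoints, the minimum speed and the linear mapping are hypothetical and do not describe any specific Liebert control algorithm.

```python
def fan_speed_percent(inlet_temp_c, low_c=20.0, high_c=27.0, min_speed=30.0):
    """Map rack inlet temperature to a fan speed command in percent.

    At or below low_c the fan idles at min_speed; at or above high_c it runs
    at 100%. In between, speed rises linearly. All thresholds are assumed values.
    """
    if inlet_temp_c <= low_c:
        return min_speed
    if inlet_temp_c >= high_c:
        return 100.0
    fraction = (inlet_temp_c - low_c) / (high_c - low_c)
    return min_speed + fraction * (100.0 - min_speed)


# Hypothetical rack-level sensor readings in degrees Celsius.
for temp in (18.0, 23.5, 30.0):
    print(f"{temp:.1f} C inlet -> {fan_speed_percent(temp):.1f}% fan speed")
```

A production controller would typically add filtering and rate limits so the fans do not hunt around the setpoint; this sketch leaves those out.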
#5: Heat- and water-related

• Optimized airflow not only prevents heat-related outages, it
  improves cooling efficiency

[Figure labels: Digital Compressor; Variable Speed Fan]

• Requires less fan power per kW of cooling
• Leverages variable fan speed control
• Operates with digital scroll technology for variable capacity control
• Up to 33% efficiency gain
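
The saving from variable fan speed follows from the fan affinity laws, under which fan power scales approximately with the cube of fan speed. A short sketch, assuming an ideal fan and ignoring motor and drive losses (the function name is illustrative, and the figures it prints are not the source of the efficiency number quoted on this slide):

```python
def relative_fan_power(speed_fraction):
    """Fan power relative to full speed, using the cube-law approximation."""
    if not 0.0 <= speed_fraction <= 1.0:
        raise ValueError("speed_fraction must be between 0 and 1")
    return speed_fraction ** 3


for speed in (1.0, 0.9, 0.8, 0.7):
    power = relative_fan_power(speed)
    print(f"{speed:.0%} speed -> {power:.1%} of full-speed fan power")
```

Because of the cube relationship, even a modest speed reduction cuts fan power substantially, which is why matching airflow to the actual heat load pays off.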
What could be done to prevent unplanned
outages in the future?




           How to make the case for more resources
                        and budget?
                What can be done short-term?
Next steps

1. Educate your senior leaders on frequency and impact of downtime
   on your business
    – 56% of senior leaders think downtime doesn’t happen often
2. Utilize Cost of Downtime data to justify infrastructure improvements
    – Develop a business case or your own ABC model
3. Grab the “low-hanging fruit”
    – No cost to ensure IT staff doesn’t bring a Big Gulp onto the server floor
4. Conduct assessments and audits
    – Assess batteries, capacity and airflow; vendors can help
5. Talk to your infrastructure vendors
    – Service contracts, new technology, more best practices




Q & A, further reading




Dr. Larry Ponemon, Founder and President, Ponemon Institute
• National Survey on Data Center Outages
• Coming Soon: Cost of Data Center Outages

Peter Panfil, Vice President and General Manager, Liebert North America AC Power, Emerson Network Power
• Addressing the Leading Root Causes of Downtime


				