Case Study: How to Write a Telecom Disaster Prevention and Recovery Plan
Linda Henning Telecommunication Manager GBMC - 2007
GBMC – Greater Baltimore Medical Center
GBMC includes:
Greater Baltimore Medical Center
The 292-bed Medical Center, located on a beautiful 106-acre suburban campus, is Central Maryland’s leading community hospital. It employs 3,000 people, serves nearly 22,000 inpatients annually, and handles approximately 60,000 emergency room visits.
Babies are us! We have delivered 22,200 babies in the last 5 years. GBMC performs 40,146 inpatient and outpatient surgeries per year.
Hospice of Baltimore
Provides comfort and care to patients with life-limiting illnesses. Hospice workers care for an average of 376 patients a day.
GBMC Foundation
Supports the GBMC mission by managing fundraising efforts.
GBMC Telecom Staff
(with 140 years combined telecom experience!)
Consists of:
• Linda Henning – Manager, former ROLM trainer, 1986. Reports to VP-CIO of MIS.
• Bertha James – Telephone Operator Supervisor of 10 hospital operators covering 24/7 by 365.
• Sandy V. – Telecommunication Specialist, former ROLM trainer, 1984.
• Don Walker – PBX Engineer, former ROLM ATAC, 1982.
• Milt Webb – PBX Engineer, former ROLM engineer, 1980.
• Mark Brenner – PBX Engineer, former ROLM system designer, 1983.
• Kathy P. – 20+ years Bell System/Verizon billing expertise.
GBMC - Telecom at-a-glance
MAIN CAMPUS
Siemens Model 80
• System 1 – 4976 ports
• System 2 – 2899 ports
Cornet to the 14-year-old, 100% reliable, wonderful Siemens Model 70 CBX – 4720 ports
REMOTE SITES
• Model 10 CBX – 200 ports
• Model 10 CBX – 150 ports
• Model 30 HiCom – 200 ports
• Model 30 HiCom – 100 ports
• 1 HiCom 150
• 12 Key Systems
• 6 nodes of Phonemail – 50+ channels
• 3 Xpressions servers
• 950 voice users converted from Phonemail
• 200 Unified Messaging
• 200 call processing menus still in Phonemail
• 2 Agile servers
• SDC – Intellidesk, Intellispeech, Webservices
• Spectralink – 400 phones and counting
Why subject yourself to the pain of writing a Disaster Prevention & Recovery Plan?
The Auditors are Coming! The Auditors are Coming!
WHY?
• Since September 11th, auditors no longer accept our word that everything will be okay.
• Legislators have instituted the Sarbanes-Oxley Act as well as HIPAA.
Top 3 Things on the Auditors’ “Hit” List
1. Disaster Recovery Plan
2. Data Center security
3. Documented change control
This GBMC disaster plan was written as a result of NOT finding any comprehensive plan for Telecom on the internet, in a textbook, or in “for sale” templates. This is a condensed version of a 79-page plan.
SO……..
Start writing one now or a consultant will be writing one for you!
Now to the content portion of this presentation………
There are 3 parts to this plan:
I. Prevention Plan
II. Recovery Plan
III. Business Resumption
Part I
Prevention Plan
Write a Mission Statement
GBMC’s Mission Statement
It is the mission of the Telecommunications Department to provide quality telecommunication services to support the Company's business goals. This Plan has been developed under the direction of Linda Henning, Telecommunications Manager. With this Plan, the Telecommunications Department will:
1. Ensure that critical telecommunications systems and facilities are sufficiently backed up and protected so that critical Company telecommunications equipment will be recovered within 4 hours of an outage occurring, depending upon the severity level of the outage. During this four-hour window, telecommunication disaster recovery equipment may be deployed.
2. Provide telecommunications recovery in the most economical way possible, covering essential applications and operations based on their relationship to the business.
3. Restore normal telecommunications operations as soon as possible after a disaster.
4. Protect employees, equipment, facilities, and data involved.
5. Coordinate telecommunications recovery activities with applicable Company recovery plans and local, state and government disaster recovery plans.
What are the plan objectives?
The objectives of the Telecommunications Department disaster recovery plan are as follows:
1. To protect human life;
2. To minimize risk to the hospital;
3. To prepare to recover critical operations;
4. To safeguard the hospital against lawsuits;
5. To protect the hospital’s competitive position;
6. To preserve patient confidence and goodwill;
7. To define what is at stake;
8. To make a preliminary business impact analysis;
9. To form a synopsis of recovery strategy.
What form of disasters could occur in your area?
What is a disaster?
Natural Causes: Fire, Flood, Lightning, Earthquake, Hurricane, Tornado, Temperature
Human Error: Programming Errors, Improper Maintenance, Unauthorized Personnel, Lack of Training, Carelessness, Cable Cuts
Intentional Causes: Sabotage, Terrorism, Vandalism, Computer Viruses, Disgruntled Employees, Theft, Union Activities
What are your Critical Telecom assets?
Device/Asset                                                      Vendor         Customer Number
Telecommunication Staff
Model 70 CBX                                                      Siemens
HiCom Model 80 – System 1 & System 2                              Siemens
Phonemail – 3 systems                                             Siemens
Spectralink                                                       Siemens
Intellidesk – Intellispeech – Webservices 2 Servers in Data Center  SDC
Xpressions – 3 Servers in Data Center                             Siemens
Agile – 2 Servers in Data Center                                  Siemens
Zetron                                                            Comm-Tronics
What are the basic levels of a disaster?
Four Basic Severity Levels of a Disaster
Minor - A minor interruption to telecommunications operations, e.g., hardware, software, facilities or personnel, which has a negligible effect on the Company. For example, unplugged telephones, which produce switch errors.
Intermediate - An interruption that causes the telecommunications operations center to activate alternative communications strategies and closely monitor the situation. For example, Spectralink, Phonemail alarm, Intellidesk problems.
Major - An interruption to operations which will cause an extended (but recognized, usually through previous experience) delay in user services. For example, a motherboard in the switch, PRI card, Phonemail, Xpressions 1 or 2, Intellidesk.
Critical - An interruption which forces the communications center to shut down; people, hardware, software and facilities are impacted. For example, a power outage has occurred or a PBX is lost; however, backups can restore the system database. The level depends on the cause and severity of the problem, such as loss of power, flood, or cable cut.
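The four levels form an ordered scale, so anyone tracking incidents in software can compare them directly. The following is a minimal Python sketch, not part of GBMC's plan: the numeric ranks, the `EXAMPLE_INCIDENTS` mapping, and the `overall_severity` helper are all illustrative assumptions built from the examples given for each level.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The four severity levels, ranked from least to most severe.

    The numeric ranks are an assumption made only so levels can be compared.
    """
    MINOR = 1
    INTERMEDIATE = 2
    MAJOR = 3
    CRITICAL = 4


# Illustrative mapping drawn from the examples listed for each level.
EXAMPLE_INCIDENTS = {
    "unplugged telephone": Severity.MINOR,
    "Phonemail alarm": Severity.INTERMEDIATE,
    "failed PRI card": Severity.MAJOR,
    "loss of power to the PBX": Severity.CRITICAL,
}


def overall_severity(incidents):
    """Return the highest severity among the named incidents."""
    return max(EXAMPLE_INCIDENTS[name] for name in incidents)


print(overall_severity(["unplugged telephone", "failed PRI card"]).name)
# prints "MAJOR"
```

When several problems are reported at once, treating the event at its worst level keeps the response on the cautious side.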
Decision Criteria
Specific decision criteria which management will use to decide on the disaster status of an individual event include the following:
• Determine if loss of human life or severe injury is possible.
• Determine what impact the loss of communications will represent to the affected department(s), division(s), business unit(s), etc.
• Determine to what extent backup systems and/or facilities are readily available to be used in the outage.
• Determine if spare components, backup software, etc. are readily available to facilitate system recovery in advance of vendor/carrier support.
• Obtain input from police, fire, building maintenance staff, and other knowledgeable sources as to the potential damage.
Review what you have in place now…. Do you Have?
• By-pass Phones
• Switch redundancy
• Halon or environmentally friendly equivalent
• No sprinklers in Switch room - so no flood
• Cell Phones and PDA Type Phones with charged batteries
• UPS back-up batteries
• DC power plant and gel batteries
• Access controlled Switchroom – no vandalism
• Back-up Generators
• 2 Way Radios
• DISA (Direct Inward System Access) turned off
• Custom Redirect from Verizon or other carrier to reroute calls
• 24/7 In-House Help Desk
You now have the foundation of your very own plan!
Part II
Recovery Plan
Scope of the Disaster Recovery Plan
GBMC’s Telecommunication Business Resumption Plan is designed to respond to a disaster at the main campus or other facilities under the oversight of the Telecommunications Manager. A disaster is defined as an incident that damages the facility, equipment or any mission-critical functions, such as the implementation of emergency codes. This plan provides the recovery task checklists, forms and procedures required to effect a timely recovery.
Low-Probability, High-Impact Types of Disasters
In the event of a Telecom disaster, the following specific recovery events need to be initiated, depending on the type and severity of the disaster. There are several types of outages that are beyond GBMC and Telecom’s prevention planning and preparedness. These are:
• A major cut or damage to fiber optic or copper cable on GBMC’s campus or in the Towson area.
• Verizon Central Office failure.
• Water damage to the PBX Switchroom, resulting in a total room outage of the telephone switches.
In the event of a major cable cut, GBMC will still have internal phone services.
Disaster Recovery Site
• Is there another location that can handle the calls? If so, arrange for design of Custom/Switch Redirect from your dial tone vendor.
• This would be your hot site. MIS usually has a location like SunGard in Philadelphia, PA.
• For Telecom it is not so easy, unless you have a fully functional duplicate PBX at the hot site.
GBMC’s Disaster Recovery site
GBMC has a live back-up Disaster Recovery site. This location is our Patient Accounting Office in Timonium, MD. This location was chosen for the following reasons:
• Similar telephone systems, which telecommunications staff have the technical training to reprogram for the changes required to accommodate GBMC’s main campus telephone calls.
• Service is provided out of a different Verizon Central Office than the hospital.
• Short driving distance but off campus.
• Ease of deploying telecom staff to help answer phone calls.
Summary of Activities – The 4 R’s
React, Recover, Repair and Resume
GBMC’s Telecommunications Department can institute the following disaster recovery activities based on the severity of the disaster, in accordance with the GBMC Incident Command Protocol.
REACT
• Deploy full-scale disaster program
• Notify backup communications center, hot/cold site: Patient Accounting
• Alert vendors, carriers, suppliers: Verizon, Paetec, Siemens
• Establish communications command center
• Re-deploy telecommunications staff, e.g., attendants, to Disaster Recovery Site - Patient Accounting, Timonium, MD
• Activate alternate facilities, systems: Custom Redirect
• Distribute cell phones
• Determine how long company can operate in recovery mode
RECOVER
• Begin recovery to backup center, if needed
• Re-establish local dial tone
• Re-establish 800 service, other switched services
• Reroute critical analog/digital circuits
• Recover commercial power, backup power sources
• Recover PBX/Key/Voicemail/ACD systems
• Recover communication system software, databases, etc.
REPAIR AND TEST
• Replace damaged outside facilities
• Replace damaged communications systems, software
• Re-cable main distribution frame, intermediate frames, if required
• Test recovered systems for proper operation
• Test recovered network assets for proper operation
RESUME
• Re-establish and verify network integrity
• Re-establish and verify security
• Employee, operational logistics maintained
• Begin cleanup
• Continue to inform employees, management, customers, and media of recovery status
• Conduct recovery review; document analysis of recovery
Order of Service Restoration
This took a very long time to prioritize.
• Communications/MIS Department
• Emergency Department
• Lab
• Pharmacy
• Respiratory
• Nursing Units
• Unit 59 - SICU
• Unit 25, 26, 27, L&D NICU & Newborn Nursery
• Unit 34, 35, 36, 37, 38
• Unit 43, 45, 46, 48
• Unit 54, 57, 58
• Plant Ops
• Security
Recovery Sequence
MAIN CAMPUS RECOVERY SEQUENCE
#    Hardware/Applications     Expected Recovery Window*   Level of Disaster   Individual(s) Responsible for Validation
1.   HiCom 80                  12-24                       Critical            Telecom PBX Engineer
2.   CBX Model 70              6-12                        Critical            Telecom PBX Engineer
3.   Zetron                    1-6                         Critical            Database Administrator
4.   Intellidesk – SQL Server  1-6                         Major               Database Administrator
5.   Spectralink               24+                         Major               Telecom PBX Engineer
6.   Phonemail                 12-24                       Major               Telecom PBX Engineer
7.   Xpressions Server 1       12-24                       Major               Xpressions Database Administrator
8.   Xpressions Server 2       12-24                       Major               Xpressions Database Administrator
9.   Agile server 1            24+                         Intermediate        Telecom PBX Engineer
10.  Agile server 2            24+                         Intermediate        Telecom PBX Engineer
11.  Intellispeech             1-6                         Minor               Database Administrator
12.  IntegraTRAK               24+                         Minor               Call Accounting Database Admin
13.  Web Services              1-6                         Minor               Database Administrator
14.  Xpressions Test App 2     24+                         Intermediate        Xpressions Database Administrator
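A recovery sequence like the one above can also be kept as structured data, which makes it easy to pull out the systems at a given disaster level or to see how many validations fall on each role. The following Python sketch is only an illustration: the `RecoveryItem` tuple and the two helper functions are hypothetical names, and the window values are kept as the labels shown on the slide because the footnote defining their unit is not reproduced here.

```python
from collections import Counter, namedtuple

# Hypothetical record type; field names are chosen for this sketch only.
RecoveryItem = namedtuple("RecoveryItem", "order system window level validator")

RECOVERY_SEQUENCE = [
    RecoveryItem(1, "HiCom 80", "12-24", "Critical", "Telecom PBX Engineer"),
    RecoveryItem(2, "CBX Model 70", "6-12", "Critical", "Telecom PBX Engineer"),
    RecoveryItem(3, "Zetron", "1-6", "Critical", "Database Administrator"),
    RecoveryItem(4, "Intellidesk - SQL Server", "1-6", "Major", "Database Administrator"),
    RecoveryItem(5, "Spectralink", "24+", "Major", "Telecom PBX Engineer"),
    RecoveryItem(6, "Phonemail", "12-24", "Major", "Telecom PBX Engineer"),
    RecoveryItem(7, "Xpressions Server 1", "12-24", "Major", "Xpressions Database Administrator"),
    RecoveryItem(8, "Xpressions Server 2", "12-24", "Major", "Xpressions Database Administrator"),
    RecoveryItem(9, "Agile server 1", "24+", "Intermediate", "Telecom PBX Engineer"),
    RecoveryItem(10, "Agile server 2", "24+", "Intermediate", "Telecom PBX Engineer"),
    RecoveryItem(11, "Intellispeech", "1-6", "Minor", "Database Administrator"),
    RecoveryItem(12, "IntegraTRAK", "24+", "Minor", "Call Accounting Database Admin"),
    RecoveryItem(13, "Web Services", "1-6", "Minor", "Database Administrator"),
    RecoveryItem(14, "Xpressions Test App 2", "24+", "Intermediate", "Xpressions Database Administrator"),
]


def systems_at_level(level):
    """List systems at the given disaster level, in recovery order."""
    return [item.system for item in RECOVERY_SEQUENCE if item.level == level]


def validations_per_role():
    """Count how many systems each role is responsible for validating."""
    return Counter(item.validator for item in RECOVERY_SEQUENCE)


print(systems_at_level("Critical"))
# prints ['HiCom 80', 'CBX Model 70', 'Zetron']
print(validations_per_role()["Telecom PBX Engineer"])
# prints 6
```

A count like this is one way to spot a role that carries most of the validation work before the plan is tested.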
Recovery & Restoration Time Frames
You may be held accountable for these, so be careful what you put in writing! Be very careful with what you can deliver, because you will have to test the plan to prove it works. GBMC tested our plan on November 29, 2006.
The action(s) taken immediately following a disaster of any proportions fall into four timed phases. Listed below are the primary events that should occur in each phase. Many of the events in one phase will occur concurrently as a result of the efforts of various members of telecommunications disaster recovery teams.
1-6 Hours After Being Notified
1. Protect human lives;
2. Assess damages;
3. Notify vendors, carriers, users: Siemens, SDC, Verizon, Comm-Tronics, Paetec;
4. Establish command center, if needed;
5. Notify senior management: Administrator-on-Call;
6. Notify Help Desk.
6-12 Hours After Being Notified
1. Notify users so they can assist in the recovery, if necessary; telecommunication operator needed at Patient Accounting to answer phones;
2. Establish hardware, software, and facility requirements;
3. Order necessary equipment and supplies;
4. Move off-site tapes, documentation to backup site;
5. Move emergency components to backup site.
12-24 Hours After Being Notified
1. Transportation system fully operational;
2. Establish operations at backup site;
3. Activate and test operating system software, i.e. test call processing, incoming calls;
4. Activate and test system databases;
5. Test/verify all new/replacement equipment;
6. Test/verify all transmission facilities: Verizon Custom Redirect;
7. Restore disk files using backup tapes;
8. Modify call routing.
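The timed phases above can be turned into a simple lookup that tells a recovery team which checklist applies at a given moment. This Python sketch rests on assumptions that the slides do not state: each boundary hour belongs to the later phase, the first phase begins at notification, and anything past 24 hours is labeled "24+ hours" (a window that appears in the recovery sequence but has no checklist of its own here). The `phase_for` function name is hypothetical.

```python
import bisect

# Upper bounds (in hours) of the three phases listed in the plan.
PHASE_UPPER_BOUNDS = [6, 12, 24]

# Labels for each phase; "24+ hours" is an assumed catch-all for later work.
PHASE_LABELS = ["1-6 hours", "6-12 hours", "12-24 hours", "24+ hours"]


def phase_for(hours_since_notification):
    """Return the label of the phase that covers the elapsed time.

    Assumption: a boundary value (6, 12, 24) counts toward the later phase.
    """
    if hours_since_notification < 0:
        raise ValueError("elapsed time cannot be negative")
    index = bisect.bisect_right(PHASE_UPPER_BOUNDS, hours_since_notification)
    return PHASE_LABELS[index]


print(phase_for(2.5))   # prints "1-6 hours"
print(phase_for(6))     # prints "6-12 hours"
print(phase_for(30))    # prints "24+ hours"
```

Because many events in one phase run concurrently, a lookup like this only indicates which checklist should be in progress, not which single task is next.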
Recovery Procedures
What would you do if…..?
• Major Loss of External Connectivity
• Major Loss of Internodal or Interswitch Connectivity
• Major Loss of Campus Telephony
• Loss of Cable Infrastructure
• Evacuation of Switch Room
• Extended Loss of Electrical Power
• Physical Damage to Switch Room
• Physical Damage to PBX Hardware
This is what GBMC would do……
Major Loss of External Connectivity
Response: Immediately upon detection
Procedure [Responsibility]
1. Determine what sites are impacted. [Telecom Manager]
2. Isolate what circuits are down. [PBX Engineer]
3. Reroute outgoing calls to back-up circuits & alternate pathways. [PBX Engineer]
4. Verify status of switch hardware – if source of failure, replace from stock or place service order. [PBX Engineer]
5. Verify integrity of cable infrastructure – if source of failure, initiate repair procedures & determine source liability. [PBX Engineer]
6. Notify Telco vendor of circuit failure and diagnostic findings. Open trouble ticket. [Telecom Staff]
7. Interact with vendor until circuit restored. [Telecom Staff]
8. Verify circuit functionality, after service restored. [Telecom Staff]
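Each recovery procedure pairs a numbered step with a responsible role, so it can be stored as a checklist and regrouped by role when handing out assignments. The Python sketch below uses the external connectivity procedure as its data; the `steps_by_role` helper and the list name are hypothetical and only illustrate one way to organize such a checklist.

```python
from collections import defaultdict

# (procedure step, responsible role) pairs, in the order they are performed.
EXTERNAL_CONNECTIVITY_LOSS = [
    ("Determine what sites are impacted.", "Telecom Manager"),
    ("Isolate what circuits are down.", "PBX Engineer"),
    ("Reroute outgoing calls to back-up circuits & alternate pathways.", "PBX Engineer"),
    ("Verify status of switch hardware.", "PBX Engineer"),
    ("Verify integrity of cable infrastructure.", "PBX Engineer"),
    ("Notify Telco vendor of circuit failure; open trouble ticket.", "Telecom Staff"),
    ("Interact with vendor until circuit restored.", "Telecom Staff"),
    ("Verify circuit functionality, after service restored.", "Telecom Staff"),
]


def steps_by_role(procedure):
    """Group step numbers (starting at 1) under the role responsible for them."""
    grouped = defaultdict(list)
    for number, (_step, role) in enumerate(procedure, start=1):
        grouped[role].append(number)
    return dict(grouped)


print(steps_by_role(EXTERNAL_CONNECTIVITY_LOSS))
# prints {'Telecom Manager': [1], 'PBX Engineer': [2, 3, 4, 5], 'Telecom Staff': [6, 7, 8]}
```

The same structure would fit the other procedures in this part of the plan, since they follow the same step-and-responsibility layout.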
Major Loss of Internodal or Interswitch Connectivity
Response: Immediately upon detection
Procedure [Responsibility]
1. Determine which equipment is impacted. [PBX Engineer Staff]
2. If collocated nodes, isolate problem and replace failing component. [PBX Engineer Staff]
3. If distributed node, if source of failure, initiate repair procedures & determine source liability. [PBX Engineer Staff]
4. If HiCom link, verify integrity of cable infrastructure - if source of failure, initiate repair procedures. [PBX Engineer Staff]
5. Verify Cornet circuit and hardware – if source of failure, replace from stock or place service order. [PBX Engineer Staff]
6. Verify connectivity hardware – if source of failure, place service order & expedite. [PBX Engineer Staff]
7. Interact with vendor until circuit restored. [PBX Engineer Staff]
8. Verify functional connectivity. [PBX Engineer Staff]
Major Loss of Campus Telephony
Response: Immediately upon detection
Procedure [Responsibility]
1. Determine extent of service loss. [Telecom Manager & PBX Engineer]
2. Isolate node(s), or site as source failure – if physical damage call Verizon. [Telecom Staff]
3. If power or cooling, go to Plant Ops. [Telecom Staff]
4. If Common Control hardware, replace from stock or place service order. [PBX Engineer Staff]
5. If service order placed, interact with vendor until hardware restoration. If HiCom expedite. [PBX Engineer Staff]
6. Verify system functionality. [PBX Engineer Staff]
Loss of Cable Infrastructure
Response: Immediately upon detection
Procedure [Responsibility]
1. Determine location of cable disruption. [Telecom Manager & PBX Engineer]
2. Issue stop work orders to perpetrator of damage, if on campus, secure area, maintain safety standards. [PBX Engineer]
3. Ascertain extent of damage, initiate repair procedures & determine source of liability. [Contact Data Center Tech]
4. If extended repair, reroute essential functions at impacted area to temporarily restore service. [PBX Engineer Staff]
5. Interact with vendor until service is restored. [PBX Engineer]
6. Verify service restoration. [PBX Engineer]
Evacuation of Switch Room
Response: Immediately upon notification or upon detection
Procedure [Responsibility]
1. Maintain safety of staff & comply with all safety officer instructions. [Telecom Manager]
2. Secure switch room door on exiting.
3. Establish remote login of PBX.
4. Maintain close contact with emergency responders to ascertain ongoing conditions & ability to reoccupy space.
5. Upon access, ascertain any damage to system and/or infrastructure & initiate proper response.
6. If extended repair, reroute essential functions at impacted area to temporarily restore service.
7. Interact with vendor until service is restored.
8. Verify service restoration.
Extended Loss of Electrical Power
Response: Immediately upon detection
Procedure [Responsibility]
1. Verify functioning of DC power plant, back-up generator and UPS. [Telecom Manager & PBX Engineer]
2. Call GBMC Plant Ops to ascertain status of repair and restoration efforts. [PBX Engineer]
3. Monitor Switch Room functions and backup power/cooling supply. [PBX Engineer Staff]
4. Monitor transition back to normal power supply and verify normal functionality. [PBX Engineer Staff]
5. Verify service restoration. [PBX Engineer]
Physical Damage to Switch Room
Response: Immediately upon detection
Procedure [Responsibility]
1. Ascertain any damage to equipment or support infrastructure. [Telecom Manager & PBX Engineer]
2. Determine impact of damage on immediate operations. [PBX Engineer]
3. Coordinate repair efforts with vendor to minimize operational disruptions. [PBX Engineer and Plant Ops]
Physical Damage to PBX Hardware
Response: Immediately upon detection or access
Procedure [Responsibility]
1. Ascertain full extent of damage, loss of service and salvage potential. [Telecom Manager & PBX Engineer]
2. Restore critical and essential services on existing capacity; initiate emergency by-pass service if necessary. [PBX Engineer]
3. Relocate hardware and expedite delivery of needed replacement components to restore basic DID and essential outbound services. [Telecom Staff]
4. If loss is significant & extended, arrange for Custom Redirect. [Telecom Staff]
5. Arrange for DID intercept message with details and bypass numbers from Verizon. [Telecom Manager]
6. Initiate recovery plans with vendor; coordinate vendor implementation team; establish milestone and timeline. [Telecom Staff]
7. Interface with vendor during recovery activity. [Telecom Staff]
8. Integrate restored services into operation during off times to minimize disruptions. [Telecom Staff]
Part III
Business Resumption Plan
BUSINESS RESUMPTION
Restore Critical Business Functions
• Coordinate and restore the original site.
• Restore hardware systems.
• Restore software systems.
• Restore power/UPS.
• Replace fire detection and suppression systems.
• Address additional security concerns.
• Rewire the facility.
• Restore original LAN configuration.
• Restore original wide-area network configuration.
• Test new hardware and software.
• Train operations personnel on new equipment.
• Train employees on new equipment.
• Schedule migration back to original site.
• Coordinate return to original site.
Wrap up Activities
1. Review critical events log.
2. Evaluate vendor performance.
3. Recognize extraordinary achievements.
4. Prepare final review and activity report.
5. Aid in liability assessments.
6. Schedule compensatory time off.
7. Schedule the party!
Not a good note taker?????
This may just make your day!
The Data Disk includes:
• This PowerPoint presentation
• A copy of GBMC’s Disaster Prevention Recovery & Business Resumption Plan
Questions?
Resources
• The Definitive Guide to Business Resumption Planning – Leo A Wrobel
• Telecommunications Disaster Recovery Plan Template – Paul F Kirvan
• Twenty Years of My Life in Telecom – Linda Henning
One More Thing……..
Please fill out your session evaluation!
Have a safe trip home!!