
Disaster Recovery Test Case Study

Disaster Recovery Test Case Study Part One: Planning
Last month I had the opportunity to lead a cross-functional team at a financial services company in planning and conducting a major disaster recovery exercise. Knowing how popular case studies are with readers of the IT Management Guide, I decided to chronicle the key activities of this effort in this four-part series. In this initial segment I discuss how the project came about, how I assembled the team, and how we developed the objectives and assumptions for the exercise. In subsequent segments I will discuss the planning meetings and simulated walk-through; the results of the exercise itself; and the many lessons learned from the activity. Where appropriate I have replaced names of individuals, departments, applications and locations with fictitious versions.

The project was born out of IT's desire to test the restoration of one of the company's most critical application systems at the firm's newly built out-of-state recovery facility. IT managers asked me to coordinate what we described as an operational disaster recovery exercise. We would simulate the primary data center being completely disabled, and proceed to restore business operations supported by this critical application system. The system consisted of dozens of separate applications and databases.

The first decision was to determine how many applications to include in this exercise. After careful review we decided on 17 of the most critical applications. Managers wisely chose not to include all of them, in order to keep the scope of the project to a reasonable level, to be able to complete it within six weeks, and because this was the first attempt at a recovery exercise of this magnitude.

The second decision was to include several business users to ensure the team had functionally restored all applications and that the applications could support the users' business processes.
This turned out to reap huge benefits later on in terms of building a spirit of trust, credibility and shared responsibility between the IT and business departments.

The third issue was deciding how to realistically simulate the primary data center being down without adversely impacting 24-hour production that needed to run. The technical teams accomplished this by revising access control lists in routers and using modified host files to route traffic to the recovery site. These decisions all combined to help with the next one, which was determining who would participate in the exercise, and what their roles would be.

Objectives
The overall purpose of this exercise was to flush out issues associated with the recovery of a key application system at an out-of-state recovery data center. It was not designed to be a Pass or Fail test. This was a recovery exercise intended for the staff to learn about, build upon, and make improvements to our overall recovery strategies. The team identified 13 specific objectives for this recovery exercise, listed below in Figure 1.

1. Determine to what degree the 17 key applications can be recovered to the location xyz recovery data center.
2. Determine the minimum time needed to recover the entire system (recovery time objective, or RTO).
3. Determine the minimum amount of data that cannot be recovered (recovery point objective, or RPO).
4. Improve the partnering and teambuilding among appropriate individuals from the business units, IT and business continuity by collaborating on the development and execution of a successful recovery plan.
5. Conduct a simulation walk-through one week prior to the recovery exercise to validate the sequence, thoroughness, dependencies and estimated timeframes of required recovery activities.
6. Demonstrate that two separate branch offices can have all of their processing done using only the location xyz recovery data center.
7. Demonstrate that a branch office can access web applications.
8. Verify that there is no single point of failure associated with the primary data center in recovering systems at the location xyz recovery data center.
9. Evaluate the development and execution process of the exercise by analyzing the results of surveys submitted by all participants.

10. Conduct a lessons learned session to identify and prioritize mistakes to avoid and improvements to implement.
11. Develop action plans to implement improvements.
12. Compile a final report on the results of the exercise.
13. Determine the priority and sequence of bringing up applications.

Figure 1 Recovery Exercise Objectives

Assumptions
The IT sponsoring managers and I developed an initial set of assumptions for this exercise. I discussed these with the entire cross-functional team during subsequent planning meetings. Figure 2 lists our final set of assumptions.

1. Critical application system xx will be tested for recovery and will include 17 specific applications (names removed here for proprietary reasons).
2. Specific application qq6 was not initially available at the recovery data center during preparations for this exercise, but portions of it were available by May 13th.
3. This exercise will also test for the recovery of any support applications that are needed to run the primary applications listed above. These support applications include, but are not limited to, the following:
   - Internet Explorer
   - Outlook/Exchange
4. Citrix will be used in this exercise by loading software onto the Citrix servers at the recovery data center.
5. This exercise will simulate outside access from an outside party to the appropriate website and will include a test submission from an outside party partner.
6. QA will test the appropriate applications as part of this.
7. Replication between the primary and recovery data centers will be stopped for this exercise.
8. Network connections will be severed between the participating branch offices and the primary data center.
9. The branch offices will be used in this exercise to process test data.
10. The test data will be created using all normal processes up to, but not including, final processing; the DBA and QA groups will be involved to ensure the test data is backed out from the system.
11. The scenario selected for this exercise will be viable, practical and meaningful; the scenario will be described in sufficient detail so as to simulate an actual event.
12. Branch office xyz will be used in this exercise to test application qq2.
13. The Marketing and Business Development departments will not be included in this exercise.
14. The routing of data from branch office aaa to the recovery data center will be pre-tested prior to May 1st.
15. The loading of the software on the recovery data center Citrix servers will be pre-tested prior to May 1st.
16. Full database consistency checks (DBCC) will be performed on all 13 database servers to ensure data integrity. If any of the DBCCs fail, the data will not be re-replicated.
17. Citrix will be loaded on the recovery data center desktops prior to the exercise.
18. Employee A will shadow the activities of the users at the branch offices from her desktop at the primary data center.
19. The QA rep will be using a laptop at home for this exercise.
20. We do not need to block traffic from the location xyz sales office to the primary data center.
21. There will be interruptions to the Exchange servers at the recovery data center and to email services; the Help Desk will issue advisories about these interruptions.

Figure 2 Recovery Exercise Assumptions

Next Steps
In part two of this series, I discuss the weekly planning meetings I conducted in preparation for the exercise, and the simulated walk-through we performed to verify the sequence, thoroughness, dependencies and estimated timeframes of required recovery activities.

This is the second of a four-part case study on planning and conducting an operational recovery exercise. In part one I described how the project began, the decisions that the executive sponsors and I made at the outset, and the initial objectives and assumptions we developed. In this segment I describe how I effectively conducted weekly planning meetings, and a simulated walk-through of the exercise.

Weekly Planning Meetings
The executive sponsors from IT helped me determine the composition of the cross-functional team that would plan and execute the recovery exercise. The kickoff meeting would be the key to a strong start, so some extra planning went into it. I sent out a meeting invitation two weeks in advance and selected an optimal day, time and location to ensure a good turnout. These logistical items may seem trivial but actually go a long way toward gaining the all-important support and buy-in. For meetings of this type I assign a scribe to take minutes and a timekeeper to keep us on pace. Figure 1 shows the agenda I used. I employed a similar agenda for all subsequent weekly planning meetings.

Not all meeting facilitators prefer to use such a tight, time-oriented agenda, but I found this approach very helpful. When upwards of twenty highly opinionated IT professionals meet to discuss details of a recovery exercise, a tightly enforced agenda becomes very valuable.

Figure 1 Agenda for Recovery Exercise Kickoff Meeting

Every week managers added a few more participants, which usually brought fresh and meaningful perspective to items such as our objectives, assumptions and estimated time-frames. Spirited discussions from these perspectives usually led to numerous new action items to follow up on each week. Table 1 shows a sample of types of action items captured and reviewed, and the format we used to track them.

Table 1

Table 2

Simulated Walk-Through

Two weeks prior to the test, the recovery team and I scheduled and conducted a simulated walk-through. By now we had identified all the required steps needed to recover the 17 applications to our remote, out-of-state recovery data center. There were several reasons for conducting this walk-through:

- identify any missing steps
- ensure recovery steps were in the proper sequence
- validate that all predecessor and successor steps were correctly identified
- verify that the responsible person for each step was properly assigned
- revise any time estimates for steps involving start, end and duration times
- identify any additional issues that needed to be resolved prior to the exercise
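Several of these walk-through checks (missing steps, proper sequence, correct predecessors) are mechanical enough to sketch in code. The following is a minimal illustration in Python, not the method the team used. The task names and dependencies are hypothetical, since the case study does not list the actual 29 recovery tasks.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical recovery tasks, each mapped to the tasks that must finish first.
# These names are illustrative assumptions, not the tasks from the exercise.
predecessors = {
    "stop_replication": set(),
    "mount_storage": {"stop_replication"},
    "start_databases": {"mount_storage"},
    "run_consistency_checks": {"start_databases"},
    "start_applications": {"run_consistency_checks"},
    "reroute_branch_traffic": {"start_applications"},
}


def check_plan(planned_order, predecessors):
    """Return a list of problems found in a planned recovery sequence."""
    problems = []

    # Missing steps: tasks that have declared dependencies but are not in the plan.
    for task in sorted(set(predecessors) - set(planned_order)):
        problems.append(f"missing step: {task}")

    # Circular dependencies would make any sequence impossible.
    try:
        TopologicalSorter(predecessors).prepare()
    except CycleError as exc:
        problems.append(f"circular dependency: {exc.args[1]}")

    # Sequence: every task must come after each of its predecessors.
    position = {task: i for i, task in enumerate(planned_order)}
    for task, preds in predecessors.items():
        for pred in sorted(preds):
            if task in position and pred in position and position[pred] > position[task]:
                problems.append(f"{task} is scheduled before its predecessor {pred}")
    return problems
```

Calling `check_plan` with a planned order returns an empty list when every step is present and correctly sequenced; otherwise each entry describes one problem to raise during the walk-through.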

The walk-through was well attended and accomplished all of its objectives. All necessary tasks and assignments had been correctly identified, and only a few of the time estimates needed to be revised.

Next Steps
The final steps of preparation involved testing the special network and server procedures. These procedures would ensure that all branch office traffic generated during the exercise would route to the recovery data center, even though normal production at the primary data center would still be processing. Once these procedures tested successfully, we were ready for the actual exercise.

In part three of this series, I describe the results of the operational disaster recovery exercise conducted on Saturday, May 13, 2006.

This is the third part of a four-part case study of an actual disaster recovery exercise conducted in May 2006. In this segment I describe the results of the exercise by sharing key portions of the final report on the exercise. I begin with the executive summary, followed by the scenario used, and end with the estimated and actual time-frames of the 29 recovery tasks. Where appropriate I have replaced names of individuals, departments, applications and locations with fictitious versions.

Executive Summary
This report summarizes the results of the disaster recovery exercise conducted on May 13, 2006. The overall purpose of this exercise was to identify and resolve issues associated with the recovery of key applications at the xyz data center. By all accounts the exercise was a success and provided much useful information for employees to learn about, build upon, and make improvements to our overall recovery strategies. The following are among the highlights of this exercise:

- All 13 specific objectives of this exercise were met (100%)
- 11 of the 17 applications were tested successfully by QA (65%)
- 11 of the 17 applications were tested successfully by Users (65%)
- Total of 22 individuals from 10 departments took part in the exercise
- Business customers from business unit xx and business unit yy participated in the exercise

This report consists of seven sections and ten appendices (not all of which are shown here). Following the executive summary are the status of the objectives and their methods of verification, and the final version of the assumptions used in this exercise. Next are the observations and issues documented during the exercise, followed by the lessons learned and their resulting post-exercise action items to implement improvement suggestions. In the lessons learned section (which will be in Part Four of this series), the responses are listed in priority order based on the voting by the respondents. The distribution of the voting is also shown. It is expected that another similar recovery exercise of these applications will be conducted in mid-October 2006, and that many of these improvement suggestions will be implemented by that time. The appendices include such items as the lists of participants, the scenario used, the pre-exercise action items, meeting attendee roster, and the variety of recovery tasks performed along with estimated and actual duration times.

Scenario Used
At approximately 7:25am on Saturday, May 13, 2006, a small fire is reported at the location xx facility of company yy. The fire stems from faulty wiring in a server cabinet in the location xx campus data center. For reasons unknown the fire suppression system does not activate and the fire quickly spreads to several other cabinets. By this time the fire department has been notified of the incident and has trucks rolling to the site. At 7:37am the first truck arrives, and by 7:53am the fire is extinguished.

Employee A of the Enterprise Storage group of IT is notified of the fire by the Network Operations Center (NOC) engineers at 7:32am. The engineers assess the damage and find it is limited to servers supporting application system qq, and the large scale storage array that houses all of its data. Employee A contacts his manager at 7:48am and they agree that the application system qq needs to be recovered immediately to the location xyz recovery data center. At approximately 8:00am, employee A contacts employee B and employee C at the recovery data center and advises them to initiate recovery actions for the application system qq.

Recovery Tasks
This section describes the 29 tasks required to recover the critical application system involved with this exercise.
Table 1 shows the description of each task, the dependent tasks associated with each one (sometimes referred to as upstream/downstream or input/output), the person responsible for performing each task, the estimated and actual times, the variance (delta) between estimated and actual times, and any comments. As the comments section shows, there were initial problems with bringing up some of the databases in that they were designated as 'suspect'. The problem was eventually traced to a script that had failed, undetected, during database shutdowns the night before.

The table also shows that the majority of tasks were completed ahead of schedule. This was important for two reasons. One is that the overall recovery time is a key measure of business continuity and relates to many factors concerning true business impact. The second is that it helps to estimate more accurately in future exercises. The overall expected recovery time was 3 hours 40 minutes. If we subtract out time for unexpected troubleshooting (not likely to occur again), the actual recovery time was 3 hours 20 minutes.

In Part Four I discuss the lessons learned from this exercise and the follow-up actions that resulted from them.

Table 1 Recovery Tasks (1 of 2)

Table 1 Recovery Tasks (2 of 2)
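
The variance (delta) column described in the Recovery Tasks section is a simple subtraction of estimated from actual time. As a small illustrative sketch in Python (the function names are my own, not from the report), here it is applied to the two overall figures given above: 3 hours 40 minutes expected, and 3 hours 20 minutes actual once troubleshooting time is subtracted.

```python
from datetime import timedelta


def variance(estimated, actual):
    """Return actual minus estimated; a negative value means ahead of schedule."""
    return actual - estimated


def fmt(delta):
    """Format a timedelta as a signed H:MM string."""
    sign = "-" if delta < timedelta(0) else "+"
    minutes = abs(int(delta.total_seconds())) // 60
    return f"{sign}{minutes // 60}:{minutes % 60:02d}"


# Overall figures from the report; the actual time excludes troubleshooting.
expected_total = timedelta(hours=3, minutes=40)
actual_total = timedelta(hours=3, minutes=20)

print(fmt(variance(expected_total, actual_total)))  # -0:20
```

The same two functions could be applied row by row to the estimated and actual times of individual tasks to reproduce a delta column.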

This is the final installment of the four-part case study on conducting an operational recovery exercise. In part one I discussed how the project began and the initial objectives and assumptions we developed. In part two I described the weekly planning meetings and the simulated walk-through of the exercise, and in part three I shared the compiled results of the exercise. In this part I show how the recovery exercise team captured, analyzed and followed up on several lessons learned.

Lessons Learned – What We Did Well
The recovery team conducted a lessons learned session within one week of the exercise. I facilitated this meeting in a manner described in the earlier section. This included a round robin method to solicit input on what we did well, and a nominal group technique to prioritize the responses. After compiling the feedback, I wrote and distributed a brief analysis of the results. Table 1 displays this information.

I asked each participant to rank their top seven responses. Responses ranked first received seven points, those ranked second received six points, and so on. The far right column of Table 1 shows the distributions of these rankings. The total points received by each response served to prioritize them, as shown by the second column of the table.

Table 1 Prioritized Actions Done Well

What Did We Do Well?

# | Pts | Response | Distribution
1 | 45 | Exercise preparation uncovered production problems. | 7,7,7,6,5,5,5,2,1
2 | 34 | Good communication prior to exercise. | 7,7,7,6,4,3
3 | 32 | Identified what needs to be fixed. | 7,6,5,4,4,4,1,1
T4 | 27 | Set reasonable expectations prior to the exercise. | 6,6,5,4,2,2,2
T4 | 27 | Achieved the goals of the exercise. | 7,7,6,4,2,1
T4 | 27 | Good communication during the exercise. | 7,6,5,5,4
7 | 26 | Good identification of issues prior to exercise. | 7,5,5,4,3,2
8 | 23 | Good participation from cross-functional teams. | 6,5,5,4,2,1
9 | 21 | Identified the right owners to the right pieces. | 6,5,5,3,1,1
T10 | 19 | Pre-planning and assumptions done well. | 6,3,3,3,2,1,1
T10 | 19 | Good response to problems. | 7,4,4,2,2
12 | 17 | Agendas, action items and daily meetings were fruitful. | 6,6,2,2,1
13 | 14 | It was correct to make it a functional test. | 7,3,3,1
14 | 10 | Involved the right customers. | 6,3,1
15 | 9 | Daily and weekly meetings complemented each other. | 3,3,3
16 | 5 | Video communication was a plus. | 4,1
17 | 4 | Good leadership was displayed. | 4
18 | 3 | Geographic logistics added realism to the test. | 3
19 | 2 | Made good use of the bridge line. | 2
20 | 0 | Pulled servicing in early to get a feel for the process. | (none)
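
The scoring behind Table 1 can be reproduced with a few lines of code. The sketch below, in Python, is only an illustration of the point scheme described above (seven points for a first-place ranking, six for second, and so on) and of the "T" labels used for ties. The function names are my own, and the sample data is the first seven rows of Table 1.

```python
from collections import Counter


def points_for_rank(rank, top_n=7):
    """Rank 1 earns top_n points, rank 2 earns top_n - 1, down to 1 point."""
    if not 1 <= rank <= top_n:
        raise ValueError(f"rank must be between 1 and {top_n}")
    return top_n - rank + 1


def prioritize(votes):
    """Turn {response: [points, ...]} into (place, total, response) rows.

    Tied responses share a place and are labeled with a leading "T".
    """
    totals = {response: sum(points) for response, points in votes.items()}
    counts = Counter(totals.values())
    rows = []
    # sorted() is stable, so tied responses keep their original order.
    for response, total in sorted(totals.items(), key=lambda item: -item[1]):
        # Place is one more than the number of responses with a higher total.
        place = 1 + sum(1 for other in totals.values() if other > total)
        label = f"T{place}" if counts[total] > 1 else str(place)
        rows.append((label, total, response))
    return rows


# Point distributions copied from the first seven rows of Table 1.
votes = {
    "Exercise preparation uncovered production problems.": [7, 7, 7, 6, 5, 5, 5, 2, 1],
    "Good communication prior to exercise.": [7, 7, 7, 6, 4, 3],
    "Identified what needs to be fixed.": [7, 6, 5, 4, 4, 4, 1, 1],
    "Set reasonable expectations prior to the exercise.": [6, 6, 5, 4, 2, 2, 2],
    "Achieved the goals of the exercise.": [7, 7, 6, 4, 2, 1],
    "Good communication during the exercise.": [7, 6, 5, 5, 4],
    "Good identification of issues prior to exercise.": [7, 5, 5, 4, 3, 2],
}

for label, total, response in prioritize(votes):
    print(label, total, response)
```

Run on these rows, the totals come out as 45, 34, 32, 27, 27, 27 and 26, with the three 27-point responses sharing the label T4, matching the table.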

Analysis
The fact that the preparation for this exercise uncovered some unknown production problems was by far the highest rated response in this category with 45 points. The next two highest ranked entries, with 34 and 32 points, involved actions prior to (good communication) and during (identifying fix-it needs) the exercise. The next three responses tied for fourth place with 27 points and also involved activities prior to (reasonable expectations), during (communication) and after (goals all met) the exercise. The next response ranked just under these three with 26 points and involved good identification of issues beforehand.

The next four ranked responses were grouped closely and rounded out the top ten with 23, 21 and a tie for 19 points. These entries involved favorable impressions of participation, planning and problem-solving. The remaining responses touched on such items as meeting management, video teleconferencing and geographic logistics.

Lessons Learned – What We Could Do Better
In a manner similar to collecting input on 'What We Did Well', I facilitated gathering feedback on 'What Could We Do Better?', shown in Table 2. This information was extremely important in learning how to improve future exercises, and led to several follow-up action items, shown in Table 3.

This concludes the case study on an actual operational disaster recovery exercise. The effort was successful in that the team restored critical software applications at a designated recovery site within reasonable time-frames, and that business users were able to verify the functional operation of the software.

A number of improvement suggestions came out of the exercise, and these will be put into practice during upcoming months. Another similar exercise will be conducted in about six months.

Table 2 Prioritized Improvement Suggestions

What Could We Do Better?

# | Pts | Response | Distribution
1 | 49 | Extend time between system build out and exercise. | 7,7,7,6,6,6,5,3,1,1
T2 | 45 | Provide more time for application build out. | 7,7,7,6,6,6,5,1
T2 | 45 | Evaluate timelines/scope for better schedule estimates. | 7,7,5,5,4,4,4,3,3,2,1
4 | 34 | Validate and pre-test systems prior to the exercise. | 7,6,6,6,5,3,1
5 | 30 | Improve assignment of roles. | 7,6,6,5,4,2
6 | 24 | Provide more clearly defined handoffs. | 5,5,4,3,3,2,2
7 | 20 | Limit those who want to act as leaders during exercise. | 6,4,3,2,2,2,1
8 | 19 | Build out applications sequentially, not all at once. | 7,5,3,3,1
T9 | 18 | Increase participation by Enterprise Architecture. | 7,5,4,2
T9 | 18 | Provide more time to plan/execute exercise properly. | 6,5,4,3
11 | 16 | Improve Friday meetings. | 5,5,3,2,1
12 | 13 | Ensure attendees prepare better for Friday meetings. | 4,4,2,2,1
13 | 12 | Test host files more thoroughly prior to exercise. | 4,4,3,1
14 | 10 | Increase attendance at weekly planning meetings. | 7,2,1
T15 | 4 | Improve testing of VPN concentrator POC. | 4
T15 | 4 | Improve terminology to clarify purpose of exercise. | 3,1
17 | 3 | Provide more machines for shadowing. | 2,1
18 | 0 | Improve use of bridge line. | (none)

Analysis
The three highest ranked responses (with 49, 45 and 45 points) all dealt with time issues. It is clear that the majority of participants feel more time is needed in future exercises to ensure that software applications are fully built out and successfully tested prior to the exercise. This is included as one of the follow-up action items. Two other responses, ranked slightly lower in 8th and 9th place (tied) with 19 and 18 points, also dealt with time issues.

Two other follow-up action items are a result of lessons learned improvement suggestions. The fourth highest ranked response with 34 points was to validate and pre-test systems, and the 13th entry with 12 points was to test host files more thoroughly. Responses ranked 5th and 6th with 30 and 24 points, respectively, both involved exercise management, and consisted of improving role assignments and providing more clearly defined handoffs. These responses also resulted in follow-up improvement items.

Table 3 Follow-up Action Items

# | Date Asgnd | Description | Resp Person | ECD | Rev ECD | ACD
1 | 6/09 | Resolve issue of security badge not allowing access into building D. | Associate A | 8/01 | |
2 | 6/09 | Resolve Citrix access problem with home laptop computer. | Associate B | 8/01 | | 6/09
3 | 6/09 | Ensure adequate time is provided to staff for application build-outs. | Associate C | 8/01 | |
4 | 6/09 | Ensure script that shuts down clusters is checked for successful completion. | Associate D | 8/01 | |
5 | 6/09 | Ensure that host files are tested prior to exercise. | Associate E | 8/01 | |
6 | 6/09 | Improve and clarify the assignment of roles for next exercise. | Associate F | 8/01 | |
7 | 6/09 | Provide more clearly defined handoffs for next exercise. | Associate G | 8/01 | |
8 | 6/09 | Emphasize proper handoffs during WalkThru simulation. | Associate H | 8/01 | |
9 | | | | | |
10 | | | | | |

ECD = Estimated Completion Date; Rev ECD = Revised Completion Date; ACD = Actual Completion Date


				