Stuff Happens
How to Assess Risks and Set Objectives for Business Continuity Plans

Presented by Jon William Toigo
CEO of Toigo Partners International
Founder of The Data Management Institute

Abstract
• Many disaster recovery planning efforts get bogged down when planners become preoccupied with identifying all of the risk scenarios that might impact company operations. Truth be told, enumerating natural and man-made hazard potentials, then assigning quantitative values to their likelihood of occurrence, is a fool's errand.
• What is needed is a straightforward assessment of assets -- data and infrastructure -- to determine what needs to be protected, supplemented by an estimation of the cost of an interruption of access to assets (from whatever the cause) for 24, 48 and 72 hours.
• The sources for this information are business stakeholders: so, yes, you will need to talk to end users. (Audible groans expected.)

A Beginning is a Delicate Thing
• Welcome
• Most disaster recovery planning projects fail at inception, not at execution.
• Why?
  • Asset criticality is improperly assessed
  • Objectives are improperly formed
  • Testing methods are an afterthought
• And as a practical matter…

Novice Planners Tend to Get Bogged Down in…
• Endless scenario construction
• Exacting vulnerability assessment
• Mindless risk quantification

Scenario-Driven Plans Don’t Work…
• If you are striving to write the perfect disaster scenario, you are in the wrong line of business. Try Hollywood or maybe the SciFi Channel instead. There’s potentially a lot more money in it…

Disasters Don’t Play Out According to Script
• Hurricane Hugo taught us not to rely on cell phones…
• Hurricane Katrina didn’t destroy New Orleans. The storm had passed hours before the levees broke…
• Hurricane Elena didn’t shut down Tampa Bay.
  Businesses and state emergency management did…

More Lessons from Experience
• There are many things that threaten business operations…
  Flooding; User/Equipment Error; Snow & Ice; Fire; Tropical Weather; Cyclones/Tornadoes; Lightning; Earthquakes; Malicious Software; Hackers/Disgruntled Employees; Industrial Accidents; Nuclear Events; Biological Events; Chemical Events; Explosions; Civil Unrest

Following an Incident…
• The cause of the fire is largely irrelevant
  • Was it electrical wiring?
  • A Molotov cocktail?
  • Paper refuse near a copying machine?
  • A lightning strike?
  • Isn’t that what we pay those CSI guys to figure out?
• The goal is to restore access to business critical data as quickly as possible.

The Real Value of Vulnerability Assessment
• Risk avoidance planning
  • To identify hazards so appropriate disaster prevention strategies can be developed
  • Also, to avoid wasting hard-to-come-by dollars on disaster recovery capabilities that make little sense
• Examples:
  • Point in Time Split Mirrors
  • Expensive fire suppression systems for facilities that are (1) unlikely to have fires and/or (2) unlikely to receive effective protection from them

Quantifying Risk is a Waste of Time
• No actuarial tables for disaster
• We have absolutely no idea when or if a disaster will occur
• Those who say otherwise are speaking with forked tongue
• That hasn’t stopped the institutionalized BS around risk quantification

Case in Point: Annual Loss Expectancy
• ALE: Developed in 1979 by the National Bureau of Standards
• An effort to establish a basis for Return on Investment analyses on DR program costs (e.g., to cost-justify disaster recovery planning)
• My silliness detector is on ORANGE

To calculate simple risk exposure, two variables
• P(L) is the probability of loss, and it is a threat frequency value
• S(L) is the severity of the potential loss
In theory, by factoring these two components together, we can determine a risk exposure numeric.
To summarize:
• P(L) x S(L) = R(E)
• R(E) = the total risk exposure

SLE: An Improvement? Really?
• Single Loss Expectancy (SLE) = cost of a single event on a given target
• ALE = P(L) x SLE, except that probability of loss is unknown
• Substitute a Standard Annual Frequency Estimate (SAFE) Value or some other agreed upon probability factor for P(L)
• ALE = SAFE x SLE
• Example: a disaster event cripples 80% of a critical database valued at $100,000, so SLE = $80,000. It is expected that such an event will occur once in two years, thus ALE = .5 x $80,000 = $40,000
• How much should you spend to prevent a potential $40,000 loss?

SAFE Value    Frequency of Occurrence
.01           Once every 100 years
.02           Once every 50 years
.1            Once every 10 years
.2            Once every 5 years
.5            Once every 2 years
1             Once a year
10            10 times a year
20            20 times a year

• Still seems pretty arbitrary, but this is how most risk assessment tools in the market do the math

What Quantitative Analysis is Really For…
• Building a FUD case that sounds “scientific” to justify a capability that in the best of circumstances never needs to be used…
• Justify expenditures by reference to RTO/RPO
  • Recovery time objectives (RTOs) still make sense (see below)
  • Recovery point objectives (RPOs) fail the logic test: if you need to set RPOs at all, you really need to go to real-time load balancing with failover (aka “Always On” strategies)
Time to data is what really counts!
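
For reference, the ALE arithmetic from the SLE slide can be sketched in a few lines of Python. This is an illustration only: the function names and the SAFE_TABLE dictionary are hypothetical, and the only figures used are the ones on the slide. Running the same loss through every SAFE value shows how completely the result depends on which frequency you agree to assume.

```python
# Minimal sketch of ALE = SAFE x SLE, using the slide's example figures.
# Function names and SAFE_TABLE are hypothetical, for illustration only.

SAFE_TABLE = {
    0.01: "Once every 100 years",
    0.02: "Once every 50 years",
    0.1: "Once every 10 years",
    0.2: "Once every 5 years",
    0.5: "Once every 2 years",
    1: "Once a year",
    10: "10 times a year",
    20: "20 times a year",
}


def single_loss_expectancy(asset_value, percent_lost):
    """Cost of a single event on a given target."""
    return asset_value * percent_lost / 100


def annual_loss_expectancy(safe_value, sle):
    """ALE = SAFE x SLE, with SAFE standing in for the unknown P(L)."""
    return safe_value * sle


# Slide example: 80% of a $100,000 database, once every two years.
sle = single_loss_expectancy(100_000, 80)
ale = annual_loss_expectancy(0.5, sle)
print(f"SLE = ${sle:,.0f}, ALE = ${ale:,.0f}")

# Same loss, every SAFE value: the answer swings with the assumed frequency.
for safe_value, frequency in SAFE_TABLE.items():
    print(f"{frequency:>22}: ALE = ${annual_loss_expectancy(safe_value, sle):,.0f}")
```

The spread between the smallest and largest SAFE values is a factor of 2,000, which is the point of the "pretty arbitrary" remark.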
Which leads me to the following…

Forget the FUD
• Focus instead on business process-centric impact analysis
  • To define time-to-data requirements and data restore criticality
  • To provide a foundation for good plan objectives, which in turn become the foundation for intelligent selection of appropriate recovery techniques
• Since the world is an imperfect place, understand that requirements alone will not dictate the recovery strategy adopted
  • Budget constraints and testing efficacy will temper the “perfect” strategy
  • That is the essence of a risk profile

So, What is BP-Centric Impact Analysis?
• Common sense
• First, ask management which business processes (BPs) they believe to be most critical to the organization
  • Winnows down the list of BPs to a manageable number to start
  • Might be wrong, might overlook BP interdependencies, but you will discover those
• Next, get with business unit experts to “deconstruct” processes
  • Break processes into tasks, tasks into workflows
  • Identify data assets associated with each workflow

Why Do I Have to Talk to Users?
• Business unit managers can estimate the cost of an outage of 24, 48 and 72 hours…
  • Not necessarily accurate
  • But, when you aggregate the results, the cost estimate of outage impact at least has the merit of coming from management’s own lieutenants
• Understand the BP to understand data protection targets
  • Data inherits its criticality like so much DNA from the process that it serves
  • Metadata like “date last accessed” or “date last modified” lacks sufficient granularity for proper analysis

“Know thy Data” is the First Commandment of Effective DR Planning

Okay. So How Do We Do It?
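
One way to picture the deconstruction is as a small hierarchy: business process, then tasks, then workflows, then data assets. The Python below is a rough sketch under stated assumptions, not a prescribed method: every class, field and sample value is hypothetical, the 1-to-3 criticality scale is invented for the example, and averaging the managers' 24/48/72-hour estimates is just one possible way to "aggregate the results".

```python
from dataclasses import dataclass, field
from statistics import mean

OUTAGE_HOURS = (24, 48, 72)


@dataclass
class Workflow:
    workflow_id: str
    data_assets: list = field(default_factory=list)


@dataclass
class Task:
    name: str
    workflows: list = field(default_factory=list)


@dataclass
class BusinessProcess:
    name: str
    criticality: int  # assumed scale: 1 = most critical, per management
    tasks: list = field(default_factory=list)
    # One dict per stakeholder interview: {24: cost, 48: cost, 72: cost}
    outage_estimates: list = field(default_factory=list)

    def aggregate_outage_cost(self):
        """Average the stakeholder estimates per outage duration (an assumption)."""
        return {h: mean(e[h] for e in self.outage_estimates) for h in OUTAGE_HOURS}

    def data_criticality(self):
        """Every data asset inherits the criticality of the process it serves."""
        return {
            asset: self.criticality
            for task in self.tasks
            for workflow in task.workflows
            for asset in workflow.data_assets
        }


# Invented sample data for illustration only.
order_entry = BusinessProcess(
    name="Order entry",
    criticality=1,
    tasks=[
        Task("Take order", [Workflow("1.1.1", ["price list", "customer orders"])]),
        Task("Bill customer", [Workflow("1.2.1", ["invoices"])]),
    ],
    outage_estimates=[
        {24: 10_000, 48: 25_000, 72: 60_000},
        {24: 14_000, 48: 35_000, 72: 80_000},
    ],
)

print(order_entry.aggregate_outage_cost())
print(order_entry.data_criticality())
```

The numbers that come out are only as good as the interviews that go in, which is why the next slides are about how to conduct them.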
Seems Simple, But There are Always Challenges
• Challenge 1: Users aren’t always forthcoming with information
  • Waterboarding has been outlawed; other techniques depend on technology that is only available in sci-fi movies
  • Your inquiries are often viewed with suspicion
• Solution
  • First, senior management must give the data collection effort visible backing
  • Second, you must conduct your research in a professional manner
    • Schedules and preparation
    • Prepared questions: no fishing
    • Courtesy and respect: no criticism

Pave the Way for Later Participation
• Challenge 2: Finding experts
  • The manager is not always the real expert, may not have the time to engage, or may try to delegate
  • Delegation is okay, provided that the surrogate is knowledgeable; do not settle for procedures manuals (usually out of date)
• Cultivate a “stakeholder”
  • For each business process – preferably for each task, or workflow (the more granular the better)
  • Not only as a source of analytical data, but also as a liaison for change management and testing later

Avoid Landmines
• Challenge 3: Set expectations
  • Users get nervous if questions are overly technical or imply criticism of procedural efficiencies or costs: stay on point
• You want to
  • Get a snapshot of the procedure itself
  • Understand what immediate resources are used to perform tasks (only end user computing resources such as applications, workstations, telephony, office equipment, forms, etc.)
  • Learn how many users perform the task or workflow, characteristics of workload, access particulars, security procedures
  • Learn what data is used to perform work and what data is produced by the work

Details Count
• Challenge 4: It is unrealistic to believe that you will become an expert on every business process, or even most of them
• Solution:
  • Use simple flowcharts to model workflow steps and annotate with resources (these can be checked with IT and defined in greater detail later)
  • Use a simple numbering scheme to map data to workflows, tasks and BPs
  • Create some forms that might simplify data collection and normalization: a spreadsheet, database or web form can also work, but paper forms may be less frightening to users
  • Arrange to follow up with the stakeholder later, should questions arise

Fleshing Out the Details
• Normalize your data about BPs, tasks, workflows and data in/outputs
• Embellish each data in/output with any special requirements
  • Retention (how long must data be retained, does it have a “stale” date?)
  • Encryption (must data be encrypted for storage?)
  • Volatility (does data get updated frequently?)
  • Accessibility (is data used by multiple users, applications, processes?)

Simple Classification Scheme
[Figure: spreadsheet with column headings Data Name, Workflow ID, Criticality, Security Requirement, Other Requirement, Other, Update Frequency, Access Frequency, Stale by, Special Retention, Deletion Requirement, Reference Data?]
Just sort the spreadsheet and basic data classes separate out like a parfait…

Shampoo, Rinse, Repeat
• Often heard complaints…
  • What if there are a lot of tasks, workflows and data inputs and outputs?
  • I have thousands of business processes: you really think we can do this in a timely way?
  • This is your classic waterfall approach, which has already come under criticism in fields like application development. Isn’t it easier just to backup/mirror everything?
  • Why is man born only to suffer and die?
• OMG, you folks complain a lot…

But I Feel Your Pain…
• Truth be told, there aren’t a lot of short cuts
• Hazards of NOT classifying data by business process-related criteria include
  • Application of inappropriate or overly expensive protection strategies to relatively unimportant or low priority data assets
  • Conversely, the failure to include mission critical data assets in protection schemes
  • Lack of predictable time-to-data outcomes from ANY data protection scheme
  • None of the extra business value that should derive from data analysis
• Any way you cut it, the results can’t be good

A Few Tools that Can Help
• Tek-Tools Storage Profiler
  • Lightweight deployment
  • Good storage assessment reporting for identifying dupes and dreck
• Digital Reef
  • First “deep blue math” e-discovery algorithm that works
  • Aimed at information governance and storage capacity reclamation
• NSM (Novell Storage Manager)
  • Classify data by user profile
  • Microsoft Active Directory support improving
  • Robust reporting on the way

Just a Gol’ Dern Second…
• You are talking about an “SRM” tool (Tek-Tools), an information governance tool (Digital Reef) and a file lifecycle management tool (Novell): what has that got to do with business impact analysis?
• Each provides the means to
  • Locate data assets on infrastructure
  • Cull out the junk data and facilitate the introduction of intelligent data management to develop policy-based schemes for classifying data for data protection service provisioning
  • Simplify manual classification activity, especially with respect to user files, which are the largest component of overall data stored by most companies today

In the Process
• Do a bit more digging…
• Work with IT to discover
  • Where and how data is currently hosted on infrastructure
  • The downtime record of the hosting platform
  • What services are currently provided to the data assets (e.g., de-duplication, on-array tiering, on-array PIT mirroring, thin provisioning, data protection including CDP, backup, mirroring, etc.)
  • Estimated costs of administration and maintenance
• Work with accounting to discover
  • Current hosting facility power costs
  • Depreciated asset value of hardware platform, annual maintenance contracts, software licenses
  • Current DR program costs (off-site storage, hot site, WANs for replication, media costs…)

With the Information You Have Gathered…
• Not only can you determine…
  • Estimated outage costs (for budgetary justification)
  • Data asset criticality (for recovery task prioritization)
  • Hosting platform replacement requirements (for strategy building)
• Not only can you develop preliminary detailed continuity planning objectives, providing…
  • Tasks to be performed
  • Conditions for performing them
  • Standards for evaluating success
• You also have the ingredients to build a pretty effective data model of your business!
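
As a rough illustration of how the gathered information sorts itself into classes, here is a minimal Python sketch of the classification spreadsheet. Everything in it is an assumption made for the example: the rows are invented, only a few of the spreadsheet columns are shown, criticality is scored 1 (highest) to 3, and the workflow ID is assumed to follow a "BP.task.workflow" numbering scheme.

```python
from itertools import groupby

# Hypothetical spreadsheet rows; all names and values are invented.
ROWS = [
    {"data_name": "Customer orders", "workflow_id": "1.2.1",
     "criticality": 1, "update_frequency": "high"},
    {"data_name": "Marketing collateral", "workflow_id": "3.1.2",
     "criticality": 3, "update_frequency": "low"},
    {"data_name": "Payment records", "workflow_id": "1.3.1",
     "criticality": 1, "update_frequency": "high"},
    {"data_name": "Price list", "workflow_id": "1.1.1",
     "criticality": 2, "update_frequency": "low"},
]

FREQUENCY_RANK = {"high": 0, "medium": 1, "low": 2}


def parse_workflow_id(workflow_id):
    """Split an assumed 'BP.task.workflow' ID into its three numbers."""
    bp, task, workflow = (int(part) for part in workflow_id.split("."))
    return bp, task, workflow


def sort_key(row):
    """Most critical first, then most volatile, then by workflow number."""
    return (
        row["criticality"],
        FREQUENCY_RANK[row["update_frequency"]],
        parse_workflow_id(row["workflow_id"]),
    )


def classify(rows):
    """Sort the rows, then let the layers separate out by criticality."""
    ordered = sorted(rows, key=sort_key)
    return {
        level: [row["data_name"] for row in group]
        for level, group in groupby(ordered, key=lambda row: row["criticality"])
    }


classes = classify(ROWS)
for level, names in classes.items():
    print(f"Criticality {level}: {', '.join(names)}")
```

Each layer can then be matched to a protection service and a time-to-data target, and the workflow ID traces every data asset back to the business process it serves.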
[Figure: ACME Company data model, with Summary and Detailed views, labeled with Interdependencies; Criticality; Outage Costs; Volume, Volatility, Access Frequency; Regulatory Reqs (e.g., Privacy, Preservation, Protection, Discovery); and Hard and Soft Costs]

Over Time and With Effort… Imparting to Continuity Planning a Full Business Value Case
• Data model can offer insights to guide archiving and disk hygiene strategies that reduce storage costs while improving data protection efficiency…
• Data model can yield improved data protection strategy design and reduced downtime risk… information governance and security planning also facilitated…
• Data model can improve user productivity by enabling an archiving strategy that reduces file system clutter; better advice for business decision makers…

Others are Taking the Journey
www.c4project.org
email@example.com

Questions?
Thanks!