20070826095926Ir by wuzhenguang


									Disaster Recovery Planning
   DRP is the process of regaining access to the data,
    hardware and software necessary to resume critical
    business operations after a natural or human-induced
    disaster.
   A disaster recovery plan (DRP) should also include
    plans for coping with the unexpected or sudden loss
    of key personnel, although this is not covered in this
    article, the focus of which is data protection.
   DRP is part of a larger process known as business
    continuity planning (BCP).
         What is the difference between DRP
         and BCP? (1/2)
   Disaster recovery is the process by which you
    resume business after a disruptive event.
   The event might be
       something huge, like an earthquake or the terrorist
        attacks on the World Trade Center
       something small, like malfunctioning software caused by
        a computer virus.
   Given the human tendency to look on the bright
    side, many business executives are prone to
    ignoring "disaster recovery" because disaster seems
    an unlikely event.
        What is the difference between DRP
        and BCP? (2/2)
   "Business continuity planning" suggests a more
    comprehensive approach to making sure you can
    keep making money.
   Often, the two terms are married under the acronym
    BC/DR.
   At any rate, DR and/or BC determines how a
    company will keep functioning after a disruptive
    event until its normal facilities are restored.
          What do these plans include
   All BC/DR plans need to encompass
       how employees will communicate
       where they will go
       how they will keep doing their jobs.

   The details can vary greatly, depending on the
    size and scope of a company and the way it
    does business.
          What do these plans include
   For example, the plan at one global
    manufacturing company would:
       restore critical mainframes with vital data at a
        backup site within four to six days of a disruptive
        event
       obtain a mobile PBX unit with 3000 telephones
        within two days
       recover the company's 1000-plus LANs in order of
        business need
       set up a temporary call center for 100 agents at a
        nearby training facility.
        Events that necessitate
        disaster recovery
   Natural disasters
   Fire
   Power failure
   Terrorist attacks
   Organized or deliberate disruptions
   Theft
   System and/or equipment failures
   Human error
   Computer viruses
   Testing
          Prevention against data loss
   Backups sent off-site at regular intervals
       Include software as well as all data,
        to facilitate recovery
   Create an insurance copy on microfilm or
    similar and store the records off-site.
       Use a remote backup facility if possible to
        minimize data loss
   Storage Area Networks (SANs) over multiple
    sites make data immediately available without
    the need to recover or synchronize it
        Prevention against data loss
   Surge protectors, to minimize the effect of
    power surges on delicate electronic equipment
   Uninterruptible power supply (UPS) and/or
    backup generator
   Fire prevention: more alarms, accessible
    fire extinguishers
   Anti-virus software and other security
    measures
        Techniques and technology
   Mirroring
       Disk mirroring: RAID 1 (redundant array of inexpensive disks, level 1)
       Server mirroring: web / FTP / email
   RAID: levels 0 to 6 and nested combinations
   On-site data storage
       Backup: tape / optical disk
   Off-site data storage (backup site)
       Cold sites
       Warm sites
       Hot sites
   Mirroring can occur locally or remotely.
       Locally means that a server has a second hard drive that
        stores a copy of the data. The second drive is called a
        mirrored drive.
       A remote mirror means that a remote server contains an
        exact duplicate of the data.
   When a write request is issued, data is written to the
    original drive and then copied to the mirrored drive,
    providing a mirror image of the primary drive.
   If one of the hard drives fails, the data is still
    available on the other drive and is protected
    from loss.
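   The write path described above can be sketched in a few lines of
    Python. This is only a toy model under stated assumptions: the
    class name MirroredVolume, the drive names "primary" and "mirror",
    and the in-memory lists standing in for drives are all hypothetical
    and are not taken from any real mirroring product.

```python
class MirroredVolume:
    """Toy model of a mirrored pair: every write goes to both drives."""

    def __init__(self, num_blocks: int) -> None:
        # Each "drive" is a list of blocks; None marks an unwritten block.
        self.primary = [None] * num_blocks
        self.mirror = [None] * num_blocks
        self.failed = set()  # names of drives that have failed

    def write(self, block: int, data: bytes) -> None:
        # Write to the original drive, then copy to the mirrored drive.
        if "primary" not in self.failed:
            self.primary[block] = data
        if "mirror" not in self.failed:
            self.mirror[block] = data

    def fail(self, drive: str) -> None:
        # Simulate the loss of one drive ("primary" or "mirror").
        self.failed.add(drive)

    def read(self, block: int) -> bytes:
        # Serve the read from whichever drive is still healthy.
        if "primary" not in self.failed:
            return self.primary[block]
        if "mirror" not in self.failed:
            return self.mirror[block]
        raise IOError("both drives have failed")
```

   After a write, calling fail("primary") and then read() still returns
    the data, because the mirrored drive holds the same block.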
         Disk mirroring (RAID 1)
   Disk mirroring is the replication of logical
    disk volumes onto separate
    physical hard disks in real
    time to ensure continuous
    availability, currency and
    accuracy.
   A mirrored volume is a
    complete logical
    representation of separate
    volume copies.
           Server mirroring
   Mirror sites are most commonly used to provide multiple
    sources of the same information, and are of particular value
    as a way of providing reliable access to large downloads.
   Mirroring is a type of file synchronization
   Web server
       To preserve a website or page, especially when it is closed or is about
        to be closed.
       To counteract censorship and promote freedom of information
   Email server
       To protect against loss of email data
   FTP server
       To allow faster downloads for users at a specific geographical location
       Load balancing
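   Since mirroring is a type of file synchronization, a one-way
    mirror of a directory tree can be sketched as below. This is a
    minimal illustration, not how any particular mirror site works:
    the function names file_digest and mirror_tree are hypothetical,
    files are compared by SHA-256 digest, and files deleted from the
    source are deliberately left alone on the mirror.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def mirror_tree(source: Path, mirror: Path) -> List[str]:
    """Copy new or changed files from source to mirror; return what was copied."""
    copied = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = mirror / rel
        # Copy when the mirror lacks the file or its contents differ.
        if not dst.exists() or file_digest(src) != file_digest(dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(rel.as_posix())
    return copied
```

   Running mirror_tree twice in a row copies nothing the second time,
    which is the property that makes repeated synchronization cheap.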
         Redundant arrays of
         inexpensive disks (RAID)
   RAID distributes the data across multiple
    smaller disks, offering protection from a crash that
    could wipe out all data on a single, shared disk.
   Benefits of RAID include the following:

       Increased storage capacity per logical disk volume
       High data transfer or I/O rates that improve information
       Lower cost per megabyte of storage
       Improved use of data center floor space
   RAID Level 0 (also known as a stripe set or
    striped volume) splits data evenly
    across two or more disks (striped)
    with no parity information for
    redundancy.
   It is important to note that RAID 0
    provides zero data redundancy.
   RAID 0 is normally used to increase
    performance.
   A RAID 0 can be created with disks
    of differing sizes, but the storage
    space added to the array by each
    disk is limited to the size of the
    smallest disk
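   As a rough illustration of striping and of the smallest-disk rule,
    the sketch below deals fixed-size blocks round-robin across the
    disks and computes usable capacity. The function names stripe and
    raid0_capacity and the block size are assumptions made for this
    example only.

```python
def stripe(data: bytes, num_disks: int, block_size: int) -> list:
    """Split data into blocks and deal them round-robin across the disks."""
    disks = [[] for _ in range(num_disks)]
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for index, block in enumerate(blocks):
        disks[index % num_disks].append(block)
    return disks


def raid0_capacity(disk_sizes: list) -> int:
    """Usable capacity when each disk contributes only the smallest disk's size."""
    return min(disk_sizes) * len(disk_sizes)
```

   For instance, stripe(b"ABCDEFGH", 2, 2) puts AB and EF on the first
    disk and CD and GH on the second, and raid0_capacity([100, 120, 250])
    returns 300, since each of the three disks contributes only 100.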
   A RAID 1 creates an exact
    copy (or mirror) of a set of
    data on two or more disks.
   This is useful when read
    performance or reliability is
    more important than data
    storage capacity.
   Such an array can only be as
    big as the smallest member
    disk.
   A classic RAID 1 mirrored pair
    contains two disks, which increases
    reliability over a single disk.
   A RAID 2 stripes data at the bit (rather than block) level, and uses a
    Hamming code for error correction.
   Extremely high data transfer rates are possible.
   RAID 2 is the only standard RAID level which can automatically recover
    accurate data from single-bit corruption in data.
   At the moment, there are no commercial implementations of RAID 2.
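   To show how a Hamming code recovers from single-bit corruption,
    here is a sketch of the small Hamming(7,4) code. The choice of the
    (7,4) variant and the function names are assumptions made for
    illustration; the text above only states that RAID 2 uses a
    Hamming code, not which one or how it is laid out across disks.

```python
def hamming74_encode(d: list) -> list:
    """Encode 4 data bits as a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code: list) -> list:
    """Fix at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4   # 0 means no error detected
    if error_position:
        c[error_position - 1] ^= 1           # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]]
```

   Flipping any one of the seven bits of a codeword still decodes to
    the original four data bits, because the three parity checks
    together spell out the position of the flipped bit.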
   RAID Level 3 uses byte-level
    striping with a dedicated parity
    disk.
   RAID 3 is very rare in practice.
   One of the side-effects of RAID
    3 is that it generally cannot
    service multiple requests
    simultaneously.
   This comes about because any
    single block of data will, by
    definition, be spread across all
    members of the set and will
    reside in the same location.
   So, any I/O operation requires
    activity on every disk.
   RAID Level 4 uses block-level striping
    with a dedicated parity disk.
   This allows each member of the set to
    act independently when only a single
    block is requested.
   RAID 4 looks similar to RAID 3 except
    that it stripes at the block level, rather
    than the byte level.

   For example, if blocks A1 and B1 both reside on
    disk 0 and block B2 resides on disk 1, a read
    request for block A1 would be serviced by disk 0.
    A simultaneous read request for block
    B1 would have to wait, but a read
    request for B2 could be serviced
    concurrently by disk 1.
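   The reasoning in this example can be written down as a tiny sketch.
    The block labels follow an assumed layout in which the letter names
    the stripe and the number names the position within the stripe, so
    block N of every stripe sits on data disk N - 1; the function names
    are hypothetical.

```python
def data_disk(block: str) -> int:
    """Map a block label such as 'A1' or 'B2' to the data disk that holds it.

    Assumed layout: block N of every stripe sits on disk N - 1, and the
    dedicated parity disk is not involved in single-block reads.
    """
    return int(block[1:]) - 1


def can_serve_concurrently(first: str, second: str) -> bool:
    """Two single-block reads can overlap only if they hit different disks."""
    return data_disk(first) != data_disk(second)
```

   Under this layout, can_serve_concurrently("A1", "B1") is False because
    both blocks live on disk 0, while can_serve_concurrently("A1", "B2")
    is True.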
   A RAID 5 uses block-level striping with
    parity data distributed across all
    member disks.
   RAID 5 has achieved popularity due to
    its low cost of redundancy.
   A minimum of 3 disks is generally
    required for a complete RAID 5
    configuration.
   For example, if blocks A1 and B1 both reside on
    disk 0 and block B2 resides on disk 1, a read
    request for block A1 would be serviced by disk 0.
    A simultaneous read request for block
    B1 would have to wait, but a read
    request for B2 could be serviced
    concurrently by disk 1.
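   The low cost of redundancy comes from parity: one extra block per
    stripe is enough to rebuild any single lost block. The sketch below
    uses XOR parity, the usual choice for RAID 5. The function names and
    the particular rule for rotating the parity position are assumptions
    for illustration; real controllers use several different rotation
    layouts.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def build_stripe(data_blocks, stripe_index):
    """Return one stripe: the data blocks plus a parity block whose position rotates."""
    num_disks = len(data_blocks) + 1
    # Assumed rotation: parity starts on the last disk and moves left each stripe.
    parity_disk = (num_disks - 1 - stripe_index) % num_disks
    stripe = list(data_blocks)
    stripe.insert(parity_disk, xor_blocks(data_blocks))
    return stripe, parity_disk


def rebuild_block(stripe, lost_disk):
    """Recover the block on a failed disk by XOR-ing the surviving blocks."""
    survivors = [block for disk, block in enumerate(stripe) if disk != lost_disk]
    return xor_blocks(survivors)
```

   Whichever single disk is lost, rebuild_block returns exactly the
    block that disk held, whether it was data or parity.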
   A RAID 6 extends RAID 5 by
    adding an additional parity
    block; thus it uses block-level
    striping with two parity blocks
    distributed across all member
    disks.
   The second parity block improves reliability:
    the array can survive the failure of two disks.
   Like RAID 5, the parity is
    distributed in stripes, with the
    parity blocks in a different place
    in each stripe.
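   The capacity cost of redundancy for the levels above can be compared
    with a short calculation. This sketch assumes equal-sized member
    disks and models only RAID 0, 1, 5 and 6; the function name
    usable_disks is hypothetical.

```python
def usable_disks(level: int, num_disks: int) -> int:
    """Number of disks' worth of usable space for equal-sized member disks."""
    minimum = {0: 2, 1: 2, 5: 3, 6: 4}
    if level not in minimum:
        raise ValueError("only RAID 0, 1, 5 and 6 are modeled here")
    if num_disks < minimum[level]:
        raise ValueError(f"RAID {level} needs at least {minimum[level]} disks")
    if level == 1:
        return 1  # every disk holds the same copy
    parity_overhead = {0: 0, 5: 1, 6: 2}  # disks' worth of space spent on parity
    return num_disks - parity_overhead[level]
```

   With four disks, RAID 0 yields four disks of usable space, RAID 5
    yields three, RAID 6 yields two, and a four-way RAID 1 mirror
    yields one.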
Nested RAID
Storage Model
        Storage Area Network
   The Storage Networking Industry Association (SNIA)
    defines the SAN as a network whose primary
    purpose is the transfer of data between computer
    systems and storage elements.

   A SAN consists of a communication infrastructure,
    which provides physical connections; and a
    management layer, which organizes the
    connections, storage elements, and computer
    systems so that data transfer is secure and robust.
        SAN's definition
   Put in simple terms, a SAN is a specialized,
    high-speed network attaching servers and
    storage devices.
   It is sometimes referred to as “the network
    behind the servers.”
   A SAN introduces the flexibility of networking
    to enable one server or many heterogeneous
    servers to share a common storage utility,
    which may comprise many storage devices,
    including disk, tape, and optical storage.
          SAN components
   SAN connectivity
       The connectivity of storage and server components,
        typically using Fibre Channel (FC).
   SAN storage
       Tape / RAID / ESS (Enterprise Storage System)
        / JBOD (Just a Bunch of Disks) / SSA (Serial Storage
        Architecture)
   SAN servers
       Windows / Unix / Linux, etc.
        Switched Fabric
   An infrastructure specially designed to handle
    storage communications is called a fabric.
   A typical Fibre Channel SAN fabric is made up
    of a number of Fibre Channel switches.
   Today, all major SAN equipment vendors also
    offer some form of Fibre Channel routing
    solution, and these bring substantial scalability
    benefits to the SAN architecture by allowing
    data to cross between different fabrics without
    merging them.
         Fibre Channel protocol
   Fibre Channel is a layered protocol. It consists of 5 layers:
   FC0 The physical layer, which includes cables, fiber optics,
    connectors, pinouts etc.
   FC1 The data link layer, which implements the 8b/10b encoding
    and decoding of signals.
   FC2 The network layer, defined by the FC-FS (Framing and
    Signaling) standard, consists of the core of Fibre Channel,
    and defines the main protocols.
   FC3 The common services layer, a thin layer that could
    eventually implement functions like encryption or RAID.
   FC4 The Protocol Mapping layer. Layer in which other
    protocols, such as SCSI, are encapsulated into an information
    unit for delivery to FC2.
          IP Storage Networking
   FCIP (Fibre Channel over IP)
       It is a method for allowing the transmission of Fibre
        Channel information to be tunneled through the IP
        network.
   iFCP (Internet Fibre Channel Protocol)
       It is a mechanism for transmitting data to and from
        Fibre Channel storage devices in a SAN, or on the
        Internet, using TCP/IP.
   Internet SCSI (iSCSI)
       It is a transport protocol that carries SCSI
        commands from an initiator to a target.
         FCIP (Fibre Channel over IP)
   FCIP encapsulates FC frames within TCP/IP, allowing
    islands of FC SANs to be interconnected over an
    IP-based network
   TCP/IP is used as the underlying transport to provide
    congestion control and in-order delivery of FC frames
   All classes of FC frames are treated the same as
    datagrams
   End-station addressing, address resolution, message
    routing, and other elements of the FC network
    architecture remain unchanged
   iFCP is a gateway-to-gateway protocol for
    implementing a Fibre Channel fabric over a TCP/IP
    network
   Traffic between Fibre Channel devices is routed and
    switched by the TCP/IP network
   The iFCP layer maps Fibre Channel frames to a
    predetermined TCP connection for transport
   FC messaging and routing services are terminated at
    the gateways so the fabrics are not merged with one
    another
   iSCSI is a SCSI transport protocol for mapping of
    block-oriented storage data over TCP/IP networks

   The iSCSI protocol enables universal access to
    storage devices and Storage Area Networks (SANs)
    over standard TCP/IP networks
        Backup site
   A backup site is a location where a business can
    easily relocate following a disaster, such as fire,
    flood, or terrorist threat. This is an integral part of
    the disaster recovery plan of a business.
   A backup site can be another location operated by
    the business, or contracted via a company that
    specializes in disaster recovery services.
   In some cases, a business will have an agreement
    with a second business to operate a joint disaster
    recovery facility.
         Cold Sites
   A cold site is the most inexpensive type of backup
    site for a business to operate.
   It provides office space in which to operate.
   It does not include backed up copies of data and
    information from the original location of the
    business, nor does it include hardware already set
    up.
   The lack of hardware contributes to the minimal
    startup costs of the cold site, but requires additional
    time following the disaster to have the operation
    running at a capacity close to that prior to the
    disaster.
       Warm Sites
   A warm site is a location to which the business
    can relocate after the disaster that is
    already stocked with computer hardware
    similar to that of the original site, but does
    not contain backed up copies of data and
    information.
         Hot Sites
   A hot site is a duplicate of the original site of the
    business, with full computer systems as well as near-
    complete backups of user data.
   Ideally, a hot site will be up and running within a
    matter of hours. This type of backup site is the most
    expensive to operate.
   Hot sites are popular with stock exchanges and other
    financial institutions who may need to evacuate due
    to potential bomb threats and must resume normal
    operations as soon as possible.
        How to choose
   Choosing the type is mainly decided by a
    company's cost vs. benefit strategy.
   Hot sites are traditionally more expensive than
    cold sites since much of the equipment the
    company needs has already been purchased
    and thus the operational costs are higher.
   However, if the same company loses a
    substantial amount of revenue for each day
    it is inactive, then it may be worth the
    cost.
   The advantage of a cold site is simple: cost.
    It requires far fewer resources to operate a
    cold site because no equipment has been
    bought prior to the disaster.
   The downside with a cold site is the potential
    cost that must be incurred in order to make
    the cold site effective.
   The costs of purchasing equipment on very
    short notice may be higher and the disaster
    may make the equipment difficult to obtain.
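   The cost vs. benefit trade-off can be made concrete with a simple
    expected-cost calculation. Everything in this sketch is an
    assumption for illustration: the function name, the model (yearly
    site cost plus the probability-weighted cost of one disaster per
    year), and all of the figures are hypothetical and not drawn from
    any real company.

```python
def expected_annual_cost(site_cost, disaster_probability, downtime_days,
                         revenue_loss_per_day, emergency_purchase_cost=0.0):
    """Yearly site cost plus the probability-weighted cost of one disaster."""
    disaster_cost = downtime_days * revenue_loss_per_day + emergency_purchase_cost
    return site_cost + disaster_probability * disaster_cost


# Hypothetical figures, chosen only to illustrate the comparison.
hot = expected_annual_cost(site_cost=500_000, disaster_probability=0.05,
                           downtime_days=0.5, revenue_loss_per_day=2_000_000)
cold = expected_annual_cost(site_cost=50_000, disaster_probability=0.05,
                            downtime_days=14, revenue_loss_per_day=2_000_000,
                            emergency_purchase_cost=1_000_000)
cheaper = "hot site" if hot < cold else "cold site"
```

   With these made-up numbers the hot site comes out cheaper, because
    the daily revenue loss dominates. Lowering revenue_loss_per_day far
    enough reverses the result, which is the trade-off described above.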
          Disaster Recovery Planning steps (1/3)
   Assess business impact and risk.
       This should include an assessment of the business unit's
        function and, preferably, a business impact analysis
       The purpose of the assessment is to determine the
        business unit's relative contribution to the larger
        organization (monetary and functional).
       The greater the potential impact, the more money a
        company should spend to restore a system or process
       For instance, a stock trading company may decide to pay
        for completely redundant IT systems that would allow it
        to immediately start processing trades at another
        location.
          Disaster Recovery Planning steps (2/3)
   Develop a Disaster Recovery framework.
       Data should be categorized by importance. Two
        measures of importance are used, RTO and RPO.
       Recovery Time Objective (RTO) is the acceptable
        amount of time between the disaster and the post-
        disaster resumption of function (how long can we
        wait to restore data?).
       Recovery Point Objective (RPO) is the acceptable
        data roll-back (how current does the data have to
        be?).
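   One way to put RTO and RPO to work is to record them per system and
    check them against the backup schedule and the estimated restore
    time. The sketch below is a minimal illustration; the class and
    function names, the use of hours as the unit, and the rule that
    worst-case data loss equals the backup interval are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTarget:
    name: str
    rto_hours: float   # acceptable time from disaster to resumed function
    rpo_hours: float   # acceptable age of the restored data


def meets_rpo(target: RecoveryTarget, backup_interval_hours: float) -> bool:
    """Worst-case data loss equals the backup interval, so it must not exceed the RPO."""
    return backup_interval_hours <= target.rpo_hours


def meets_rto(target: RecoveryTarget, estimated_restore_hours: float) -> bool:
    """The estimated restore time must fit inside the RTO."""
    return estimated_restore_hours <= target.rto_hours


def by_importance(targets):
    """Order systems so the tightest objectives come first."""
    return sorted(targets, key=lambda t: (t.rto_hours, t.rpo_hours))
```

   For example, a nightly backup (a 24-hour interval) satisfies an
    archive system with a 24-hour RPO but not a trading system with a
    15-minute RPO, which would need continuous replication instead.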
          Disaster Recovery Planning steps (3/3)
   Develop a recovery strategy and then a
    written Disaster Recovery Plan.
       At a minimum, the written plan should address
        response, recovery, and resumption of services,
        with detailed tasks for each.
   Adjust information systems to make Disaster
    Recovery easier.
       This includes consolidating servers and data,
        perhaps with a Storage Area Network or other
        archival storage method.
          Important factors (1/3)
   Communication
       Personnel — notify all key personnel of the
        problem and assign them tasks focused toward
        the recovery plan.
       Customers — notifying clients about the problem
        minimizes panic.
   Recall backups
       If backup tapes are taken offsite, these need to be
        recalled. If using remote backup services, a
        network connection to the remote backup location
        (or the Internet) will be required.
         Important factors (2/3)
   Facilities
       Larger companies should have backup hot sites or
        cold sites. Mobile recovery facilities are also
        available from many suppliers.
   Prepare your employees
       During a disaster, employees are required to work
        longer, more stressful hours, and a support
        system should be in place to alleviate some of
        the stress. Prepare them ahead of time to ensure
        that work runs smoothly.
          Important factors (3/3)
   Business information
       Backups should be stored in a completely separate
        location from the company.

   Testing the plan
       Provisions, directions, and frequency for testing the
        plan should be stipulated.
         Things to do in DRP (1/4)
   Here are 10 absolute basics your plan should cover:
    1. Develop and practice a contingency plan that
    includes a succession plan for your CEO.

    2. Train backup employees to perform emergency
    tasks. The employees you count on to lead in an
    emergency will not always be available.

     3. Determine offsite crisis meeting places for top
     executives.
     Things to do in DRP (2/4)
4. Make sure that all employees, as well as
executives, are involved in the exercises so that they
get practice in responding to an emergency.

5. Make exercises realistic enough to tap into
 employees' emotions so that you can see how they'll
 react when the situation gets stressful.

6. Practice crisis communication with employees,
 customers and the outside world.
     Things to do in DRP (3/4)
7. Invest in an alternate means of communication in
  case the phone networks go down.

8. Form partnerships with local emergency response
  groups (firefighters, police) to establish a good
  working relationship. Let them become familiar with
  your company and site.
     Things to do in DRP (4/4)
9. Evaluate your company's performance during
 each test, and work toward constant
 improvement. Continuity exercises should
 reveal weaknesses.

10. Test your continuity plan regularly to reveal
 and accommodate changes. Technology,
 personnel and facilities are in a constant state
 of flux at any company.
        Top mistakes in disaster
        recovery (1/3)
1. Inadequate planning:
     Have you identified all critical systems, and
     do you have detailed plans to recover them to the current
      day?
     Everybody thinks they know what they have on their
      networks, but most people don't really know how many
      servers they have,
     how they're configured, or what applications reside on
      them: what services were running,
     what version of software or operating systems they were
      using.
          Top mistakes in disaster
          recovery (2/3)
 2. Failure to bring the business into the planning and
testing of your recovery efforts.

 3. Failure to gain support from senior-level managers.
The largest problems here are:
       Not demonstrating the level of effort required for full
        recovery.
       Not conducting a business impact analysis and addressing all
        gaps in your recovery model.
    Top mistakes in disaster
    recovery (3/3)
   Not building adequate recovery plans that outline
    your recovery time objective, critical systems and
    applications, vital documents needed by the
    business, and business functions by building
    plans for operational activities to be continued
    after a disaster.

   Not having proper funding that will allow for a
    minimum of semiannual testing.
