High Availability with Microsoft SQL Server

					 MICROSOFT SQL SERVER
      HIGH AVAILABILITY
AND DISASTER RECOVERY
 Michael Poremba // October 2008
    Database HA & DR Experience…




       Work with business to determine HA or DR
        requirements for applications and data?

       Design HA or DR solutions?

       Administer HA or DR process?

       Still learning MS SQL Server HA & DR capabilities?
    Scope of this Presentation

    Presentation Focus
       Data availability
         Data recovery
         High availability
         Disaster recovery
       Technology focus
         MS SQL Server
         Physical servers
         SANs

    Beyond Scope of Presentation
       In-depth how-to (available elsewhere)
       Partitioned views (federated)
       Advanced DBA techniques
       Custom application logic
       3rd-party software solutions
       Alternate DBMS engines (e.g. Oracle; DB2)
       HA on virtual machines
       Complex scenarios & solutions
       Load balancing
    Introduction to Data Availability
    So, you need to make your
    production database bulletproof…
    Data Availability Continuum


    Degrees of protection for information systems:
                        Business Risk         Solution
    Data Recovery       Data loss             Redundant data
    High Availability   Downtime of           Redundant system
                        database service      components
    Disaster Recovery   Downtime of           Redundant systems
                        business operations   and facilities
    Business Case for Availability

    High Availability
       Keep business-critical applications available
       Secondary:
         Server maintenance

    Disaster Recovery
       Protect against loss of data center
       Secondary:
         Application upgrades
         Infrastructure upgrades
    Service Level Agreement (SLA)

       Permitted downtime (planned vs. unplanned?)
         Uptime SLA             Downtime             Downtime
                                 per Year            per Month
         99.9%                  8.76 hours       43.8 minutes
         99.99%               52.6 minutes       4.38 minutes
         99.999%              5.26 minutes     0.438 minutes
       Acceptable data/transaction loss
       Application response times
       Mean time to recovery

    Note: Database uptime is not equivalent to application availability
           Failures of other application services
           Network outages
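
The downtime figures in the table above follow directly from the uptime percentage. A minimal sketch of that arithmetic, assuming a 365-day year and twelve equal months (the function name `downtime_budget` is only illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored


def downtime_budget(uptime_percent: float) -> dict:
    """Return the downtime permitted by an uptime SLA, in several units."""
    down_fraction = 1.0 - uptime_percent / 100.0
    hours_per_year = HOURS_PER_YEAR * down_fraction
    return {
        "hours_per_year": hours_per_year,
        "minutes_per_year": hours_per_year * 60,
        "minutes_per_month": hours_per_year * 60 / 12,
    }
```

For example, `downtime_budget(99.9)["hours_per_year"]` comes to roughly 8.76, matching the first row of the table. Whether planned maintenance counts against this budget is an SLA decision, not something the arithmetic settles.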
    Protect What?


       Application data stores
         Databases
         Files
         Other data repositories
       Database services
           DBMS availability for applications
       Application services
           Application availability for users and external systems

    Databases are the heart of most information systems;
      they deserve the highest affordable protection.
    Database Failure Scenarios



    Physical Infrastructure Failures   Logical Data Failures

       Storage subsystem                 Operator errors
         Disk                              DBMS  interruption
         Controller                        Drops / deletes

       Network                           Application defects
       Server                            DBMS defects
       Power                             Data corruption
     Service Recovery Strategies

     Standby Mode   Failover Behavior                SQL Server Feature
     Cold standby   Manual intervention required     Backup and restore
                    to restore offline data copy
     Warm standby   Data copy online and ready;      Transaction log shipping;
                    manual failover required         database mirroring
     Hot standby    Automatic failover               Database mirroring;
                                                     failover clustering
     Data Recovery—Terminology


     Terminology varies for source vs. copy
     High Availability Strategy   Data Source   Data Copy
     Backup and Restore           Database      Backup
     Log Shipping                 Primary       Secondary
                                                Standby
     Database Mirroring           Principal     Mirror
     Failover Clustering          Primary       Secondary
                                  Active        Passive
                                                Standby
                                                Inactive
     Data Recovery
     [Briefly…]
     Database Backups


        Traditional backup types
          Full backup
          Differential backup
          Transaction log backup

        Disk is better than tape
          Back up to disk first (separate physical disk volume)
          Detect exceptions encountered during backup
          Verify backup files
          Copy backup files to tape or remote disk

        Data retention policy for backup files
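
The "verify, then copy" steps above can be partly automated. The sketch below covers only the copy step: it copies a backup file to a second location and compares SHA-256 checksums of source and copy. The function names and the idea of deleting a bad copy are assumptions for illustration. A matching checksum shows the copy is intact, not that SQL Server can restore from it, so a restore test is still needed.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backup files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(backup_file: Path, target_dir: Path) -> Path:
    """Copy a backup file and fail loudly if the copy differs from the source."""
    target_dir.mkdir(parents=True, exist_ok=True)
    copy = target_dir / backup_file.name
    shutil.copy2(backup_file, copy)
    if sha256_of(backup_file) != sha256_of(copy):
        copy.unlink()  # do not leave a corrupt copy behind
        raise IOError(f"checksum mismatch copying {backup_file}")
    return copy
```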
     Database Backup Strategy

     Backup of user databases is not sufficient for recovery
      System databases
        master database
        msdb database
        model database
      External data stores…
     Synch with External Data Stores


     Synchronize recovered database with external data
     stores:
      Identity column seeds

      Full-text indexes
         (SQL Server 2000)
        LDAP entries
        File system objects
        Other databases
     Backup Retention Policy


        Location of backup files
        Duration of retention
        Protection of sensitive data
          Sarbanes-Oxley (SOX)
          HIPAA
          Internal policies for data management and protection
        Access to backups from offsite data storage
     Data Recovery Process

        Backup file sets
            Full baseline, differential, and transaction logs
        Retrieving backup files
            Offsite storage
            Tape
            Network copy
            Dependency on multiple people to get access to backup files
        Recovery strategy depends on failure scenario
            Create comprehensive failure matrix
            Devise recovery strategy for each scenario
            Does worst-case recovery scenario fit within SLA parameters?
        Recovery time; SLA
        Include future data growth in recovery plan
        Fully test recovery strategies; practice is essential
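
A backup file set is restored in a fixed order: the latest full baseline, then the latest differential taken after it, then the transaction logs that follow, up to the recovery point. The sketch below picks that sequence from a catalog of backups. The `Backup` record and `restore_sequence` function are hypothetical, and the selection uses finish times only; a real restore also depends on an unbroken log chain, which this does not check.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    kind: str           # "full", "diff", or "log"
    finished: datetime  # when the backup completed
    path: str


def restore_sequence(backups: List[Backup], target: datetime) -> List[Backup]:
    """Return the backups to restore, in order, to recover up to `target`."""
    fulls = [b for b in backups if b.kind == "full" and b.finished <= target]
    if not fulls:
        raise ValueError("no full backup precedes the target time")
    base = max(fulls, key=lambda b: b.finished)
    chain = [base]
    start = base.finished

    # At most one differential is needed: the newest one after the baseline.
    diffs = [b for b in backups
             if b.kind == "diff" and base.finished < b.finished <= target]
    if diffs:
        latest_diff = max(diffs, key=lambda b: b.finished)
        chain.append(latest_diff)
        start = latest_diff.finished

    # Apply every later log in order, stopping at the one that covers the target.
    logs = sorted((b for b in backups if b.kind == "log" and b.finished > start),
                  key=lambda b: b.finished)
    for log in logs:
        chain.append(log)
        if log.finished >= target:
            break
    return chain
```

Running this against the worst-case scenario in the failure matrix gives the list of files that must be retrieved, which is a useful input when estimating recovery time against the SLA.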
     High Availability
     High Availability

        Minimize or avoid service downtime
          Whether planned or unplanned
        When components fail, service interruption is brief or non-existent
          Automatic failover
        Eliminate single points of failure (as affordable)
          Redundant components
          Fault-tolerant servers
     Redundant Components

     Objective: Avoid single points of failure (where affordable)
     Approach: Use redundant components for database service
      Database server nodes

      Server components

            ECC RAM; failure-tolerant HW & OS
        DBMS instance
        User databases
        Storage devices
        Storage unit components
            MPIO: Interfaces; paths; switches; controllers
            RAID: Disks
        Networking
            MPIO: Interfaces; paths; switches
        Data copies
            E.g. Recovering torn page from mirror in SQL Server 2008
     Transaction Log Shipping


        Warm standby solution
        Duplicate user database
          Copy transaction logs to standby server & restore
        Database available for read-only access
          Users must disconnect for logs to be applied
          Two database licenses required if querying standby

        Manual application failover
        Supported on standard hardware
        Possible data loss (unapplied transactions)
     Database Mirroring

        Redundancy at user database level
            Duplicate copy of user database
            Independent storage devices
            Multiple copies of instance databases
        Mirrored over private network channel
            Mirror always redoing transactions from principal
            Negligible impact on transaction throughput
        Multiple mirroring modes:
            High-availability: commit @ log on mirror; automatic failover
            High-protection: commit @ log on mirror; manual failover
            High-performance: commit when logged on principal
        Very fast automatic failover (seconds)
            Requires witness server
        Mirror-aware application client connection
            Provided by client library
            Database connection string must specify both servers
        Mirror may be available for read-only access (snapshots)
        Works with standard hardware

        [Diagram: node A and node B, with an optional witness]
            node A local storage: local sys DBs; source user DB
            node B local storage: local sys DBs; mirror user DB
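
The three mirroring modes differ in two respects: whether a commit waits for the mirror, and how failover is triggered. The sketch below simply encodes the slide's description as data so the trade-off can be queried. The class and function names are illustrative, and the "manual" failover label for high-performance mode follows this deck's comparison table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MirroringMode:
    name: str
    synchronous: bool    # commit waits until the mirror has logged the transaction
    needs_witness: bool
    failover: str        # "automatic" or "manual"


MODES = [
    MirroringMode("High-availability", True, True, "automatic"),
    MirroringMode("High-protection", True, False, "manual"),
    MirroringMode("High-performance", False, False, "manual"),
]


def modes_for(zero_data_loss: bool, automatic_failover: bool) -> List[str]:
    """List the modes that satisfy the stated requirements."""
    return [m.name for m in MODES
            if (m.synchronous or not zero_data_loss)
            and (m.failover == "automatic" or not automatic_failover)]
```

For instance, `modes_for(zero_data_loss=True, automatic_failover=True)` leaves only the high-availability mode, which is also the only one that needs a witness.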
     Mirror Witness


        With mirroring, more than one server is required to
         decide on failover
        Witness automates failover from principal to mirror
          Watches database availability
          Reports observations back to principal and mirror

        Runs in separate SQL Server instance (Express is OK)
        Prevents “split brain” scenario
        Very low resource consumption
            Can be witness for multiple databases
        Not a single point of failure
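
The way a witness prevents "split brain" can be pictured as a quorum rule among three participants: a server may serve the database only while it can see at least one of the other two. The following is a deliberately simplified model of that idea, not SQL Server's actual implementation; both function names and all parameters are assumptions for illustration.

```python
def principal_may_serve(sees_mirror: bool, sees_witness: bool) -> bool:
    """The principal keeps serving only while it can see one other member."""
    return sees_mirror or sees_witness


def mirror_may_take_over(sees_principal: bool, sees_witness: bool,
                         witness_sees_principal: bool, synchronized: bool) -> bool:
    """The mirror takes over only if it and the witness agree the principal is gone."""
    return (not sees_principal
            and sees_witness
            and not witness_sees_principal
            and synchronized)
```

If the principal is cut off from both partners, `principal_may_serve(False, False)` is false while the mirror, still in contact with the witness, may take over, so only one copy serves clients. If only the principal-to-mirror link fails, the witness still sees the principal and no failover happens. Losing the witness alone leaves the principal serving, which is why the witness is not a single point of failure.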
     SQL Server Failover Clustering

        Two clustered nodes
          Active/Passive config
        MS SQL services
          Running on virtual server
        Shared storage device
          User databases
          System databases
          Quorum drive
          Redundant internal components

        [Diagram: node A and node B attached to one shared storage device
         holding system DBs, user DBs, and quorum]
     Active/Passive Failover Clustering

        Redundancy at database instance level
            All databases fail over together
            Shared copy of system databases
        Single data copy on shared storage device
            No I/O overhead reducing throughput
            Storage unit is single point of failure for cluster
        All database services are clustered
            SQL Agent; Analysis Services; Full-Text engine; MS DTC
        Automatic failover (up to minutes)
        DBMS accessed over virtual IP
        Database not available from inactive node for DB client connections
            Storage is controlled by one cluster node at a time
        Requires hardware certified by Microsoft for Microsoft Cluster Service

        [Diagram: node A and node B attached to one shared storage device
         holding system DBs, user DBs, and quorum]
     HA Comparison



     Database Mirroring                  Failover Clustering
        Scope: user DB                     Scope: DBMS instance
        Standard hardware                  Certified hardware
        One SQL license                    One SQL license
         (unless querying snapshots on       (only one node can access
         mirror)                             database)
        Very fast failover (seconds)       Automatic failover (up to minutes)
        OS flexible (e.g. 32/64)           Enterprise OS
        Independent storage                Shared storage
        Independent services               Clustered services
        Reporting on mirror                Standby not available
        Geographic separation OK           Servers are usually co-located
     Considerations for HA


        HA complements backup and recovery strategy
            Does not replace data recovery plan
        Application service availability is often determined by
         a network of interdependent services
          Availability can be difficult to define (e.g. partial failures)
          Failure probability difficult to measure or compute
        Increased system complexity could lead to lower service
         availability!
          Operator error a leading cause of availability issues
          Increased number/types of system components
          More complex to configure and administer
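
Although real failure probabilities are hard to measure, a rough model shows why a chain of interdependent services is less available than any one of them, and why redundancy helps. The sketch assumes component failures are independent, which rarely holds exactly; the function names are illustrative.

```python
from math import prod


def serial_availability(availabilities):
    """All components must be up: the availabilities multiply."""
    return prod(availabilities)


def redundant_availability(availability, copies=2):
    """At least one of several identical copies must be up."""
    return 1 - (1 - availability) ** copies
```

Three services at 99.9% each, all required, give `serial_availability([0.999, 0.999, 0.999])`, about 99.7%, which is worse than any single service. Two redundant copies at 99% give about 99.99%, but only if failover itself works, which is where the added complexity noted above comes in.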
        Data Recovery Requirements

     Requirements                Backup and   Log Shipping   DB Mirroring –     DB Mirroring –    DB Mirroring –      Failover Clustering
                                 Recovery                    High-Performance   High-Protection   High-Availability

     Cost                        Low          Low/Med        Medium             Medium            Medium              High
     Relative complexity         Low          Low            Medium             Medium            High                High
     Data loss                   Possible     Latest log     Possible           None              None                None
     Scope of duplication        Database     Database       Database           Database          Database            DBMS
     Failover                    Downtime     Downtime       Manual             Manual            Seconds             Up to minutes
     Client redirect             Manual       Manual         Automatic          Automatic         Automatic           Automatic
     Rolling upgrades & maint.   No           No             OS & DB            OS & DB           OS & DB             OS
     Access data on secondary    Restore      Read-only      Snapshot           Snapshot          Snapshot            No
     Geographic separation       OK           OK             OK                 Latency?          Latency?            Latency?
     Disaster Recovery
     Disaster Recovery

        Minimize downtime of business operations
          Redundant systems and facilities
        SQL Server features:
          Transaction log shipping
          Database mirroring
          Failover clustering
        Other technologies
          Storage-based mirroring
     Disaster Recovery Planning


        Data security requirements
        Clarify SLA, data loss allowance
        Evaluate system cost vs. data protection
        Failure analysis
        System redundancy
        Process validation
        Training for personnel
          Prevention practices
          Executing disaster recovery and business continuity

        Practice, practice, practice
     Business Continuity Facility

        System redundancy
          Systems: Web servers; app servers; database, etc.
          Data: Databases; data files on OS; security info, etc.
          Networking: Domain, routing, subnet, VIPs, etc.
        Alternate facilities
          Network bandwidth
          Physical or network access by operations staff
        Failover
          Often a deliberate decision, using manual failover
     Data Redundancy


        Synchronous redundancy
          Network bandwidth cost
          Network latency and application performance
          Network reliability
        Asynchronous redundancy
          Risk of data loss
          More cost-effective
          Resilient to network latency issues
        Candidate Technologies
          SQL Server database mirroring
          Failover clustering with SAN-based mirroring
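
The performance cost of synchronous redundancy comes mainly from the network round trip added to every commit. A rough model, assuming the local log write and the shipment to the remote site proceed in parallel (the function name and all parameters are illustrative, not measured values):

```python
def commit_latency_ms(local_log_write_ms: float, remote_log_write_ms: float,
                      round_trip_ms: float, synchronous: bool) -> float:
    """Estimate how long a commit waits before it is acknowledged."""
    if not synchronous:
        # Asynchronous: acknowledge once the local log write completes.
        return local_log_write_ms
    # Synchronous: wait for both the local write and the remote acknowledgement.
    return max(local_log_write_ms, round_trip_ms + remote_log_write_ms)
```

With a 1 ms log write at each site and a 20 ms round trip between sites, a synchronous commit takes about 21 ms against 1 ms asynchronously. That difference is the price of avoiding the data-loss risk listed under asynchronous redundancy.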
        DR Using Database Mirroring


            Two sites: Primary and DR location
            Separate failover clusters at each site
            SQL Server database mirroring between sites
                                                             witness
                                                             (optional)

     failover cluster at site A                                                  failover cluster at site B


       node A1                         node A2   database    node B1                           node B2
                                                 mirroring


                    Shared Storage A                                      Shared Storage B
                    · local sys DBs                                       · local sys DBs
                    · local quorum                                        · local quorum
                    · source user DB                                      · mirror user DB
        DR Using SAN-Based Mirroring


           Two sites: Primary and DR location
           Four-node failover cluster; one virtual IP address
           SAN-based mirroring between sites
           Manual cluster failover
     failover cluster nodes at site A                                    failover cluster nodes at site B


       node A1                          node A2               node B1                        node B2

                                                  storage-
                                                   based
                   Shared Storage A               mirroring             Shared Storage B
                   · system DBs                                         · system DBs
                   · quorum                                             · quorum
                   · user DBs                                           · user DBs
     Complementary Technologies
     [Skip if time is running short.]
     SAN-Based Data Mirroring


        Data blocks duplicated at storage level
            Similar to transaction log shipping
        Copy performed in sequence and coordinated with
         database checkpoint
            Ensures consistency of mirrored data files
        Synchronous or asynchronous mirroring
        Co-located or geographically dispersed—both are OK
            SAN link bandwidth must support database I/O rate
        May require extra feature support from SAN vendor
        Could rely on Failover Clustering for HA
     SQL Server Database Snapshots


        Read-only point-in-time database snapshot
        No data is copied—instantaneous
          Historical snapshot pages tracked separately from
           changing pages
        Snapshots can be maintained indefinitely
          Limited only by available storage
        Snapshot copy can be used for reporting
          Read-only, so no locking issues
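
A snapshot can be created instantly because nothing is copied up front: the original version of a page is saved only when that page is first changed afterwards. The toy model below illustrates this copy-on-write idea with an in-memory dictionary of pages. It is a conceptual sketch with made-up class names, not a description of SQL Server's on-disk format.

```python
class Database:
    def __init__(self, pages):
        self.pages = dict(pages)   # page number -> current contents
        self.snapshots = []

    def create_snapshot(self):
        snap = Snapshot(self)      # no page data is copied here
        self.snapshots.append(snap)
        return snap

    def write(self, page_no, data):
        # Before overwriting, let each snapshot keep the old version.
        for snap in self.snapshots:
            snap.preserve(page_no)
        self.pages[page_no] = data


class Snapshot:
    def __init__(self, source):
        self._source = source
        self._known = frozenset(source.pages)  # page numbers only, not contents
        self._saved = {}                       # pages changed since creation

    def preserve(self, page_no):
        if page_no in self._known and page_no not in self._saved:
            self._saved[page_no] = self._source.pages[page_no]

    def read(self, page_no):
        if page_no not in self._known:
            raise KeyError(page_no)            # page did not exist at snapshot time
        if page_no in self._saved:
            return self._saved[page_no]        # historical version
        return self._source.pages[page_no]     # unchanged, read from the source
```

The storage a snapshot needs grows with the number of pages changed since it was taken, which is why snapshots are limited only by available storage.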
     SQL Server Replication

        Transactional replication
            High transaction volume
            Low data latency required
            Mixed technologies: integrates with other DBMS
        Merge replication
            Bi-directional data changes
            Typically server-to-client
        Snapshot replication
            Large, infrequent data changes
            Data change latency OK
            Best for smaller data sets
        Subscriber databases available for reporting
        Replicate data subsets
        Some data loss is possible
        Periodically validate replicated data
     App Development and Admin
     Considerations for App Developers

        App services tolerant to database service interruptions
        Application transactions must be handled in code—data consistency
        Exception handling for transaction retry, connection recovery
        Requires coding standards, code reviews, and testing
        Bulk data operations
        Transaction volume impacts rollback time during failover
        Batch jobs must be run on alternate nodes
        Don’t bypass transaction logging
        Synchronization with external data sources?
        Be aware of database recovery model
        Mirroring uses FailoverPartner in connection string
        Use TCP/IP as client protocol
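
Two of the points above lend themselves to a short sketch: naming the failover partner in the connection string, and retrying a whole transaction after a connection failure. The connection string uses the ODBC-style `Failover_Partner` keyword and the "SQL Server Native Client 10.0" driver name as assumptions; the exact keyword spelling depends on the client library. The server names, the helper names, and the choice of which exceptions count as retryable are illustrative.

```python
import time


def mirrored_connection_string(principal, mirror, database,
                               driver="SQL Server Native Client 10.0"):
    """Build an ODBC-style connection string that names both mirroring partners."""
    return (f"Driver={{{driver}}};Server={principal};"
            f"Failover_Partner={mirror};Database={database};"
            f"Trusted_Connection=yes;")


def run_with_retry(transaction, attempts=3, base_delay_s=0.5,
                   retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run `transaction`, retrying with exponential backoff on connection errors.

    `transaction` should open its own connection and perform one complete unit
    of work, because a transaction in flight during failover is rolled back.
    """
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except retryable:
            if attempt == attempts:
                raise  # give up and let the caller handle the outage
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Retrying is only safe when the unit of work can be repeated without side effects, which is one reason the slide calls for coding standards, code reviews, and testing.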
     Considerations for Admins

        Use identical server hardware, when possible
        Design network redundancies, when feasible
            Consider network latency for geographic separation
        Always manage through virtual cluster, not individual cluster nodes
        Retest failover/failback after HA maintenance
        Diagnose after failover
            Repair alternate node
            Resynchronize data, as necessary
            Be aware of primary/secondary locations
            Ensure application services are connected and functioning properly
        Keep server node configurations synchronized:
            Service pack and patch levels
            Duplicate non-redundant resources
            Jobs; logins and permissions; OS & sys objects
     HA Risks


        System performance degradation
        HA system complexity leads to availability issues
        Some system failures not planned for
        Backup and recovery planning incomplete
        Administrators not fully trained or informed
        User databases not synchronized with other data
         sources
     Common Admin Use Cases

        Maintain HA nodes
          Hardware maintenance
          Rolling upgrades and software patches
        Resynchronize the redundant copy
          Re-synch mirror
          Restart log shipping
        Diagnose and repair
          Diagnose cause of failover
          Repair failed node and restore failover capabilities
          Test failover and failback
     Common Admin Actions

     Train administrators, and have them practice, how to:
      Initiate a database mirror
      Manually fail over a mirror database or cluster node
      Add/remove a passive node from a mirror or cluster
      Upgrade/patch server nodes
      Restart or redirect application services
      Restart or redirect application services