Clustering by wuzhenguang


									A Tutorial on
Microsoft Cluster Server™




   Outline
 Cluster Abstractions
 Cluster Architecture
 Cluster Implementation
 Application Support
 Q&A




       Cluster Goals
 Manageability
  ◦ Manage nodes as a single system
  ◦ Perform server maintenance without affecting
    users
  ◦ Mask faults, so repair is non-disruptive
 Availability
  ◦ Restart failed applications & servers
    Unavailability ≈ MTTR / MTBF, so quick repair matters.
  ◦ Detect/warn administrators of failures
 Scalability
  ◦ Add nodes for incremental
    processing
    storage
    bandwidth
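The rule of thumb in the Availability bullet can be checked numerically. This is a minimal sketch, assuming steady-state unavailability is MTTR / (MTBF + MTTR), which approaches the slide's MTTR / MTBF when repair is much faster than failure. The function names and the sample figures are illustrative and do not come from the slides.

```python
def unavailability(mttr_hours, mtbf_hours):
    """Fraction of time the service is down, assuming steady state."""
    return mttr_hours / (mtbf_hours + mttr_hours)


def approx_unavailability(mttr_hours, mtbf_hours):
    """The slide's approximation, reasonable when MTTR is much smaller than MTBF."""
    return mttr_hours / mtbf_hours


# Hypothetical figures: roughly one failure per 1,000 hours of operation.
slow = unavailability(mttr_hours=4.0, mtbf_hours=1000.0)   # manual repair
fast = unavailability(mttr_hours=0.01, mtbf_hours=1000.0)  # automatic restart

# Shortening the repair time lowers unavailability without changing MTBF.
print(f"manual repair:     {slow:.6f}")
print(f"automatic restart: {fast:.6f}")
```

With MTBF held fixed, shrinking the repair time is the lever the cluster pulls: restart and failover reduce MTTR rather than preventing failures.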
        Fault Model
 Failures are independent
 So, single-fault tolerance is a big win
 Hardware fails fast (blue screen)
 Software fails fast (or goes to sleep)
 Software is often repaired by reboot:
  ◦ Heisenbugs
 Operations tasks: major source of outage
  ◦ Utility operations
  ◦ Software upgrades
Cluster: Servers Combined to Improve
Availability & Scalability
    Cluster: A group of independent systems working
     together as a single system.
     Clients see scalable & fault-tolerant (FT) services (single system image).
    Node: A server in a cluster. May be an SMP server.
    Interconnect: Communications link used for intra-cluster
     status info such as “heartbeats”. Can be Ethernet.
    [Figure: Client PCs and Printers; Server A and Server B joined by the Interconnect; Disk array A and Disk array B]
      Microsoft Cluster Server™
 2-node availability Summer 97 (20,000 beta testers now)
  ◦ Commoditize fault-tolerance (high availability)
  ◦ Commodity hardware (no special hardware)
  ◦ Easy to set up and manage
  ◦ Lots of applications work out of the box.
 16-node scalability later (next year?)
    Failover Example
    [Figure: Browser; Server 1 and Server 2, each with a Web site and a Database; Web site files; Database files]
MS Press Failover Demo
   Client/Server
   Software failure
   Admin shutdown
   Server failure

   Resource States
    - Pending
    - Partial
    - Failed
    - Offline
                    Demo Configuration
Server “Alice” and Server “Betty” (same configuration on each)
    SMP Pentium® Pro Processors
    Windows NT Server with Wolfpack
    Microsoft Internet Information Server
    Microsoft SQL Server
    Local Disks

Interconnect: standard Ethernet
SCSI Disk Cabinet: Shared Disks
Windows NT Server Cluster: Alice, Betty, and the Shared Disks

Administrator
    Windows NT Workstation
    Cluster Admin
    SQL Enterprise Mgr
Client
    Windows NT Workstation
    Internet Explorer
    MS Press OLTP app
                       Demo Administration
Server “Alice”
    Runs SQL Trace
    Runs Globe
Server “Betty”
    Runs SQL Trace
[Figure: Alice and Betty, each with Local Disks; SCSI Disk Cabinet with Shared Disks; Windows NT Server Cluster; Client]

Cluster Admin Console
    Windows GUI
    Shows cluster resource status
    Replicates status to all servers
    Define apps & related resources
    Define resource dependencies
    Orchestrates recovery order
SQL Enterprise Mgr
    Windows GUI
    Shows server status
    Manages many servers
    Start, stop, manage DBs
    Generic Stateless Application
    Rotating Globe
 Mplay32 is a generic app.
 Registered with MSCS
 MSCS restarts it on failure
 Move/restart ~ 2 seconds
 Fail-over if
    ◦ 4 failures (= process exits)
    ◦ in 3 minutes
    ◦ settable default
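The "4 failures in 3 minutes" rule is a sliding-window threshold. The sketch below is one way to model it, assuming the fourth failure inside the window is the one that triggers failover and that earlier failures simply cause a local restart. The class and method names are hypothetical and are not MSCS API names.

```python
from collections import deque


class FailureWindow:
    """Counts failures inside a sliding time window (illustrative model only)."""

    def __init__(self, threshold=4, period_s=180.0):
        self.threshold = threshold  # failures that trigger failover
        self.period_s = period_s    # window length in seconds
        self._times = deque()

    def record_failure(self, now_s):
        """Record a failure at time now_s and return 'restart' or 'failover'."""
        self._times.append(now_s)
        # Forget failures that have aged out of the window.
        while now_s - self._times[0] > self.period_s:
            self._times.popleft()
        if len(self._times) >= self.threshold:
            self._times.clear()  # assumption: the count starts over after failover
            return "failover"
        return "restart"


window = FailureWindow()
decisions = [window.record_failure(t) for t in (0, 50, 100, 150)]
```

Four exits within 150 seconds end in a failover, while the same four exits spread 100 seconds apart never accumulate enough failures inside any 3-minute window.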
   Demo Moving or Failing Over
   An Application
   [Figure: AVI Application shown on both servers of the Windows NT Server Cluster, each server with Local Disks and both attached to the Shared Disks in the SCSI Disk Cabinet. Caption: Alice fails or operator requests move.]
Generic Stateful Application
NotePad
 Notepad saves state on shared disk
 Failure before save => lost changes
 Failover or move (disk & state move)
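Demo Step 1: Alice Delivering Service
[Figure: Alice shows SQL activity; Betty shows no SQL activity. Labels on Alice: SQL, ODBC, IIS, IP. Labels on Betty: SQL, ODBC, IIS. HTTP from the client. Each server has Local Disks; the SCSI Disk Cabinet holds the Shared Disks of the Windows NT Server Cluster.]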
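Demo Step 2: Request Move to Betty
[Figure: Alice shows no SQL activity; Betty shows SQL activity. Labels on Alice: SQL, ODBC, IIS, IP. Labels on Betty: SQL, ODBC, IIS, IP. HTTP from the client. Each server has Local Disks; the SCSI Disk Cabinet holds the Shared Disks of the Windows NT Server Cluster.]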
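Demo Step 3: Betty Delivering Service
[Figure: Alice shows no SQL activity; Betty shows SQL activity. Labels on Alice: SQL, ODBC, IIS. Labels on Betty: SQL, ODBC, IIS, IP. Each server has Local Disks; the SCSI Disk Cabinet holds the Shared Disks of the Windows NT Server Cluster.]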
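Demo Step 4: Power Fail Betty, Alice Takeover
[Figure: Alice is labeled No SQL Activity; Betty is labeled SQL Activity. Labels on Alice: SQL, ODBC, IIS, IP. Labels on Betty: SQL, ODBC, IIS, IP. Each server has Local Disks; the SCSI Disk Cabinet holds the Shared Disks of the Windows NT Server Cluster.]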
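Demo Step 5: Alice Delivering Service
[Figure: Alice shows SQL activity; Betty shows no SQL activity. Labels on Alice: SQL, ODBC, IIS, IP. No service labels on Betty. HTTP from the client. Each server has Local Disks; the SCSI Disk Cabinet holds the Shared Disks of the Windows NT Server Cluster.]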
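Demo Step 6: Reboot Betty, Now It Can Take Over
[Figure: Alice shows SQL activity; Betty shows no SQL activity. Labels on Alice: SQL, ODBC, IIS, IP. Labels on Betty: SQL, ODBC, IIS. HTTP from the client. Each server has Local Disks; the SCSI Disk Cabinet holds the Shared Disks of the Windows NT Server Cluster.]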
   Outline
 Cluster Abstractions
 Cluster Architecture
 Cluster Implementation
 Application Support
 Q&A




    Cluster and NT Abstractions
Cluster Abstractions:   Cluster        Group        Resource
NT Abstractions:        Domain         Node         Service
            Basic NT Abstractions
    Domain                       Node                              Service
   Service: program or device managed by a node
        e.g., file service, print service, database server
        can depend on other services (startup ordering)
        can be started, stopped, paused, failed

   Node: a single (tightly-coupled) NT system
        hosts services; belongs to a domain
        services on node always remain co-located
        unit of service co-location; involved in naming services

   Domain: a collection of nodes
        cooperation for authentication, administration, naming


         Cluster Abstractions
    Cluster                      Resource Group                    Resource

   Resource: program or device managed by a cluster
       e.g., file service, print service, database server
       can depend on other resources (startup ordering)
       can be online, offline, paused, failed

   Resource Group: a collection of related resources
       hosts resources; belongs to a cluster
       unit of co-location; involved in naming resources

   Cluster: a collection of nodes, resources, and groups
       cooperation for authentication, administration, naming



       Resources
 Cluster         Group               Resource
Resources have...
 Type: what it does (file, DB, print, web…)
 An operational state (online/offline/failed)
 Current and possible nodes
 Containing Resource Group
 Dependencies on other resources
 Restart parameters (in case of resource failure)


          Resource Types
   Built-in types
    ◦ Generic Application
    ◦ Generic Service
    ◦ Internet Information Server (IIS) Virtual Root
    ◦ Network Name
    ◦ TCP/IP Address
    ◦ Physical Disk
    ◦ FT Disk (Software RAID)
    ◦ Print Spooler
    ◦ File Share
   Added by others
    ◦ Microsoft SQL Server
    ◦ Message Queues
    ◦ Exchange Mail Server
    ◦ Oracle
    ◦ SAP R/3
    ◦ Your application? (use developer kit wizard)
[Screenshots of resource property pages: Physical Disk, TCP/IP Address, Network Name, File Share, IIS (WWW/FTP) Server, Print Spooler]
             Resource States
   Resource states:
    ◦ Offline: exists, not offering service
    ◦ Online: offering service
    ◦ Failed: not able to offer service
   [Figure: state diagram with the states Offline, Online Pending, Online, Offline Pending, and Failed; message labels “Go Online!”, “I’m Online!”, “Go Off-line!”, “I’m Off-line!”, and “I’m here!”]
   Resource failure may cause:
    ◦ local restart
    ◦ other resources to go offline
    ◦ resource group to move
    ◦ (all subject to group and resource parameters)
   Resource failure detected by:
    ◦ Polling failure
     Resource Dependencies
   Similar to NT Service Dependencies
   Orderly startup & shutdown
    ◦ A resource is brought online after any
      resources it depends on are online.
    ◦ A resource is taken offline before any
      resources it depends on.
   Interdependent resources
    ◦ form dependency trees
    ◦ move among nodes together
    ◦ fail over together
    ◦ as per resource group
   [Figure: dependency tree with File Share, IIS Virtual Root, Network Name, IP Address, Resource DLL]
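The startup and shutdown ordering rules amount to a topological sort of the dependency tree. The sketch below uses Python's standard `graphlib` module. The particular edges (File Share and IIS Virtual Root depending on Network Name, which depends on IP Address) are an assumption made for illustration, loosely based on the names in the slide's figure.

```python
from graphlib import TopologicalSorter

# Maps each resource to the set of resources it depends on.
# These edges are assumed for illustration only.
DEPENDS_ON = {
    "File Share": {"Network Name"},
    "IIS Virtual Root": {"Network Name"},
    "Network Name": {"IP Address"},
    "IP Address": set(),
}


def online_order(depends_on):
    """Dependencies first: a resource comes online after what it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def offline_order(depends_on):
    """Reverse order: a resource goes offline before what it depends on."""
    return list(reversed(online_order(depends_on)))


up = online_order(DEPENDS_ON)
down = offline_order(DEPENDS_ON)
```

`TopologicalSorter` raises `CycleError` on a circular dependency, which matches the requirement that dependencies form trees.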
[Screenshot: Dependencies Tab]
         NT Registry
   Stores all configuration information
    ◦ Software
    ◦ Hardware
 Hierarchical (name, value) map
 Has an open, documented interface
 Is secure
 Is visible across the net (RPC interface)
 Typical entry:
    \Software\Microsoft\MSSQLServer\MSSQLServer\
        DefaultLogin = “GUEST”
        DefaultDomain = “REDMOND”
          Cluster Registry
 Separate from local NT Registry
 Replicated at each node
    ◦ Algorithms explained later
   Maintains configuration information:
    ◦ Cluster members
    ◦ Cluster resources
    ◦ Resource and group parameters (e.g. restart)
 Stable storage
 Refreshed from “master” copy when node joins
  cluster
        Other Resource Properties
 Name
 Restart policy (restart N times, failover…)
 Startup parameters
 Private configuration info (resource type specific)
  ◦ Per-node as well, if necessary
 Poll Intervals (LooksAlive, IsAlive, Timeout)
 These properties are all kept in Cluster Registry




[Screenshots: General Resource Tab, Advanced Resource Tab]
          Resource Groups
    Cluster          Group               Resource
   Every resource belongs to a resource group.
   Resource groups move (failover) as a unit.
   Dependencies NEVER cross groups. (Dependency trees
    are contained within groups.)
   A group may contain a forest of dependency trees.
   [Figure: Payroll Group containing Web Server, SQL Server, IP Address, Drive E:, Drive F:]
[Screenshot: Moving a Resource Group]
             Group Properties
   CurrentState: Online, Partially Online, Offline
   Members: resources that belong to group
       members determine which nodes can host group.
   Preferred Owners: ordered list of host nodes
   FailoverThreshold: How many faults cause failover
   FailoverPeriod: Time window for failover threshold
   FailbackWindowStart: Earliest time failback can happen
   FailbackWindowEnd: Latest time failback can happen
   Everything (except CurrentState) is stored in registry
            Failover and Failback
   Failover parameters
    ◦ timeout on LooksAlive, IsAlive
    ◦ # local restarts in failure window;
      after this, offline.
   Failback to preferred node
    ◦ (during failback window)
   Do resource failures affect group?
   [Figure: Node \\Alice and Node \\Betty, each running a Cluster Service; arrows labeled Failover and Failback between them; resources labeled “name” and “IPaddr”]
Cluster Concepts
Clusters
   [Figure: one Cluster containing four Groups, each Group containing a Resource]
          Cluster Properties
   Defined Members: nodes that can join the cluster
   Active Members: nodes currently joined to cluster
   Resource Groups: groups in a cluster
   Quorum Resource:
    ◦ Stores copy of cluster registry.
    ◦ Used to form quorum.
   Network: Which network is used for communication
   All properties kept in Cluster Registry
Cluster API Functions
(operations on nodes & groups)

     Find and communicate with Cluster
     Query/Set Cluster properties
     Enumerate Cluster objects
       ◦ Nodes
       ◦ Groups
       ◦ Resources and Resource Types
     Cluster Event Notifications
       ◦ Node state and property changes
       ◦ Group state and property changes
       ◦ Resource state and property changes
[Screenshot: Cluster Management]
   Outline
 Cluster Abstractions
 Cluster Architecture
 Cluster Implementation
 Application Support
 Q&A




       Architecture
   Top tier provides cluster abstractions
    ◦ Failover Manager
    ◦ Resource Monitor
    ◦ Cluster Registry
   Middle tier provides distributed operations
    ◦ Global Update
    ◦ Quorum
    ◦ Membership
   Bottom tier is NT and drivers
    ◦ Windows NT Server
    ◦ Cluster Disk Driver
    ◦ Cluster Net Drivers
Membership and Regroup
 Membership:
  ◦ Used for orderly addition and removal from
    { active nodes }
 Regroup:
  ◦ Used for failure detection
    (via heartbeat messages)
  ◦ Forceful eviction from
    { active nodes }
Membership
 Defined cluster = all nodes
 Active cluster:
    ◦ Subset of defined cluster
    ◦ Includes Quorum Resource
    ◦ Stable (no regroup in progress)
         Quorum Resource
 Usually (but not necessarily) a SCSI disk
 Requirements:
    ◦ Arbitrates for a resource by supporting the
      challenge/defense protocol
    ◦ Capable of storing cluster registry and logs
   Configuration Change Logs
    ◦ Tracks changes to configuration database when
      any defined member missing (not active)
    ◦ Prevents configuration partitions in time


      Challenge/Defense Protocol
   SCSI-2 has reserve/release verbs
    ◦ Semaphore on disk controller
 Owner gets lease on semaphore
 Renews lease once every 3 seconds
 To preempt ownership:
    ◦ Challenger clears semaphore (SCSI bus reset)
    ◦ Waits 10 seconds:
      (3 seconds for renewal + 2 seconds bus settle time)
      x 2, to give owner two chances to renew
    ◦ If still clear, then former owner loses lease
    ◦ Challenger issues reserve to acquire semaphore
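The timing of a challenge can be walked through with a small simulation. This is a simplified model, not the driver's implementation: it assumes the defender renews at whole multiples of 3 seconds and that any renewal landing inside the 10-second wait defeats the challenge. The function name and the `defender_fails_at_s` parameter are hypothetical.

```python
RENEW_INTERVAL_S = 3  # owner renews its reservation this often
BUS_SETTLE_S = 2      # time for the SCSI bus to settle after a reset
CHANCES = 2           # owner gets two chances to renew
CHALLENGE_WAIT_S = (RENEW_INTERVAL_S + BUS_SETTLE_S) * CHANCES  # 10 seconds


def challenge(reset_time_s, defender_fails_at_s=None):
    """Return (winner, time_s) for a bus reset issued at reset_time_s.

    If defender_fails_at_s is given, the defender stops renewing from then on.
    """
    deadline = reset_time_s + CHALLENGE_WAIT_S
    # First renewal slot strictly after the bus reset.
    t = (reset_time_s // RENEW_INTERVAL_S + 1) * RENEW_INTERVAL_S
    while t <= deadline:
        defender_alive = defender_fails_at_s is None or t < defender_fails_at_s
        if defender_alive:
            return ("defender", t)  # reservation is back; challenge fails
        t += RENEW_INTERVAL_S
    return ("challenger", deadline)  # still clear, so challenger reserves


healthy = challenge(reset_time_s=5)
dead = challenge(reset_time_s=5, defender_fails_at_s=2)
```

A healthy defender re-reserves at its next renewal slot, long before the challenger looks again. A defender that has already failed never renews, so the challenger takes the reservation once the wait expires.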
      Challenge/Defense Protocol:
      Successful Defense
 [Figure: timeline from 0 to 16 seconds. Defender Node: Reserve issued repeatedly along the timeline. Challenger Node: Bus Reset, followed later by “Reservation detected”.]
           Challenge/Defense Protocol:
           Successful Challenge
 [Figure: timeline from 0 to 16 seconds. Defender Node: a single Reserve at the start. Challenger Node: Bus Reset, followed later by “No reservation detected”, then Reserve.]
Regroup
   Invariant:
    All members agree on { members }
   Regroup re-computes { members }
   Each node sends heartbeat message
    to a peer (default is one per second)
   Regroup if two lost heartbeat messages
    ◦ suspicion that sender is dead
    ◦ failure detection in bounded time
   Uses a 5-round protocol to agree.
    ◦ Checks communication among nodes.
    ◦ Suspected missing node may survive.
   Upper levels (global update, etc.)
    informed of regroup event.
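The "two lost heartbeats" trigger is what bounds detection time. A minimal sketch, assuming a heartbeat counts as lost once its full interval has elapsed without a message; the function names are illustrative and the 5-round agreement protocol itself is not modeled.

```python
HEARTBEAT_INTERVAL_S = 1.0  # default from the slide: one heartbeat per second
MISSED_LIMIT = 2            # regroup after two lost heartbeats


def missed_heartbeats(last_heard_s, now_s, interval_s=HEARTBEAT_INTERVAL_S):
    """Number of whole heartbeat intervals that passed with no message."""
    return max(0, int((now_s - last_heard_s) // interval_s))


def should_regroup(last_heard_s, now_s):
    """True once enough heartbeats are missing to suspect the sender is dead."""
    return missed_heartbeats(last_heard_s, now_s) >= MISSED_LIMIT


early = should_regroup(last_heard_s=10.0, now_s=11.5)  # one heartbeat missing
late = should_regroup(last_heard_s=10.0, now_s=12.5)   # two heartbeats missing
```

Under these assumptions a silent peer is suspected within about two seconds, after which regroup decides whether it has really left.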
Membership State Machine
 [Figure: state diagram. States: Initialize, Sleeping, Member Search, Quorum Disk Search, Joining, Forming, Regroup, Online. Transition labels: Start Cluster; Search Fails; Search or Reserve Fails; Found Online Member; Acquire (reserve) Quorum Disk; Join Succeeds; Synchronize Succeeds; Lost Heartbeat; Minority or no Quorum; Non-Minority and Quorum.]
Joining a Cluster
  When a node starts up, it mounts and configures
   only local, non-cluster devices
  Starts Cluster Service, which
     ◦ looks in local (stale) registry for members
     ◦ asks each member in turn to sponsor new node’s
       membership (stops when a sponsor is found)
  Sponsor (any active member)
     ◦ authenticates applicant
     ◦ broadcasts applicant to cluster members
     ◦ sends updated registry to applicant
  Applicant becomes a cluster member
         Forming a Cluster
         (when Joining fails)
 Use registry to find quorum resource
 Attach to (arbitrate for) quorum resource
 Update cluster registry from quorum resource
    ◦ e.g. if we were down when it was in use
 Form new one-node cluster
 Bring other cluster resources online
 Let others join your cluster



Leaving a Cluster (Gracefully)
    Pause:
     ◦ Move all groups off this member.
     ◦ Change to paused state (remains a cluster member)
    Offline:
     ◦ Move all groups off this member.
     ◦ Send ClusterExit message to all cluster members
       Prevents regroup
       Prevents stalls during departure transitions
     ◦ Close Cluster connections
       (now not an active cluster member)
     ◦ Cluster service stops on node
    Evict: remove node from defined member list
Leaving a Cluster (Node Failure)
 Node (or communication) failure triggers Regroup
 If after regroup:
    ◦ Minority group OR no quorum device:
       group does NOT survive
    ◦ Non-minority group AND quorum device:
       group DOES survive
   Non-Minority rule:
    ◦ Number of new members >= 1/2 old active cluster
    ◦ Prevents minority from seizing quorum device at the expense of a
      larger potentially surviving cluster
   Quorum guarantees correctness
    ◦ Prevents “split-brain”,
      e.g. with newly forming cluster containing a single node
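The survival rule after a regroup combines the two conditions on the slide. A minimal sketch with hypothetical function and parameter names:

```python
def group_survives(new_members, old_active_members, owns_quorum):
    """Apply the slide's rule: non-minority AND quorum device.

    new_members:        nodes in this group after regroup
    old_active_members: size of the active cluster before the failure
    owns_quorum:        whether this group holds the quorum device
    """
    # "Number of new members >= 1/2 old active cluster", kept in integers.
    non_minority = 2 * new_members >= old_active_members
    return non_minority and owns_quorum


# In a 2-node cluster, one survivor is exactly half, so it is non-minority:
# the quorum device then decides which half survives.
two_node_with_quorum = group_survives(1, 2, owns_quorum=True)
two_node_without_quorum = group_survives(1, 2, owns_quorum=False)
```

In the 2-node case both halves pass the non-minority test, so only the quorum device separates them, and only one node can hold it. That is how split-brain is prevented.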
            Global Update
   Propagates updates to all nodes in cluster
   Used to maintain replicated cluster registry
   Updates are atomic and totally ordered
   Tolerates all benign failures.
   Depends on membership
    ◦ all are up
    ◦ all can communicate
   R. Carr, Tandem Systems Review, V1.2, 1985,
    sketches regroup and global update protocol.
              Global Update Algorithm
   Cluster has locker node that regulates updates.
    ◦ Oldest active node in cluster
   Send update to locker node
   Update other (active) nodes
    ◦ in seniority order (e.g. locker first)
    ◦ this includes the updating node
   Failure of all updated nodes:
    ◦ Update never happened
    ◦ Updated nodes will roll back on recovery
   Survival of any updated nodes:
    ◦ New locker is oldest and so has update if any do.
    ◦ New locker restarts update
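The reason seniority order matters can be seen in a toy model. The sketch below is not the MSCS protocol: it keeps one value per node, ignores locking and messaging, and uses a hypothetical `sender_fails_after` hook to simulate a sender that crashes partway through. It illustrates only the invariant that the oldest survivor has the update if any survivor does.

```python
def send_update(copies, seniority, value, sender_fails_after=None):
    """Write value to each active node's copy, oldest (locker) first.

    sender_fails_after is a hypothetical test hook: the update stops after
    that many nodes have been written, as if the sender had crashed.
    """
    for written, node in enumerate(seniority):
        if sender_fails_after is not None and written >= sender_fails_after:
            break
        copies[node] = value


def regroup_and_repair(copies, seniority, failed):
    """The oldest survivor becomes locker and re-sends its own copy."""
    survivors = [node for node in seniority if node not in failed]
    if survivors:
        locker_value = copies[survivors[0]]
        for node in survivors:
            copies[node] = locker_value
    return {node: copies[node] for node in survivors}


order = ["A", "B", "C"]  # A is the oldest node, so A is the locker

# Case 1: A and B were updated, then A fails. B is the new locker and has v2.
copies = {"A": "v1", "B": "v1", "C": "v1"}
send_update(copies, order, "v2", sender_fails_after=2)
completed = regroup_and_repair(copies, order, failed={"A"})

# Case 2: only A was updated, then A fails. The update never happened.
copies = {"A": "v1", "B": "v1", "C": "v1"}
send_update(copies, order, "v2", sender_fails_after=1)
abandoned = regroup_and_repair(copies, order, failed={"A"})
```

Either way the survivors end up agreeing, which is what makes the update atomic.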
            Cluster Registry
   Separate from local NT Registry
   Maintains cluster configuration
    ◦ members, resources, restart parameters, etc.
   Stable storage
   Replicated at each member
    ◦ Global Update protocol
    ◦ NT Registry keeps local copy
          Cluster Registry Bootstrapping
   Membership uses Cluster Registry for list of nodes
    ◦ …Circular dependency
   Solution:
    ◦ Membership uses stale local cluster registry
    ◦ Refresh after joining or forming cluster
    ◦ Master is either quorum device, or active members
           Resource Monitor
   Polls resources:
    ◦ IsAlive and LooksAlive
   Detects failures
    ◦ polling failure
    ◦ failure event from resource
   Higher levels tell it
    ◦ Online, Offline
    ◦ Restart
           Failover Manager
   Assigns groups to nodes based on
    ◦ Failover parameters
    ◦ Possible nodes for each resource in group
    ◦ Preferred nodes for resource group
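The placement inputs listed above can be combined in a simple selection function. This is a sketch under stated assumptions: a group can run only on an active node that every one of its resources lists as possible; preferred owners are tried in order; any other eligible node is used as a fallback. The function name, the fallback rule, and the resource names are illustrative and not taken from the slides.

```python
def choose_owner(preferred_owners, possible_nodes_by_resource, active_nodes):
    """Pick a node to host a group, or None if no node is eligible."""
    # A node is eligible only if it is active and possible for every resource.
    eligible = set(active_nodes)
    for nodes in possible_nodes_by_resource.values():
        eligible &= set(nodes)
    # Honor the ordered preferred-owner list first.
    for node in preferred_owners:
        if node in eligible:
            return node
    # Assumed fallback: any eligible node, chosen deterministically.
    return min(eligible) if eligible else None


possible = {"Drive E:": ["Alice", "Betty"], "IP Address": ["Alice", "Betty"]}
normal = choose_owner(["Alice", "Betty"], possible, active_nodes=["Alice", "Betty"])
after_failure = choose_owner(["Alice", "Betty"], possible, active_nodes=["Betty"])
```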
Failover
(Resource Goes Offline)
                                     Notify Failover Manager.
          Resource Manager
          Detects resource error.
                                    Failover Manager checks:
                                    Failover Window and
                                    Failover Threshold
           Attempt to
           restart resource.
                                                                 Wait for
                                                                 Failback Window
                                           Are Failover
                                           conditions                No
            Has the                        within
     No     Resource                       Constraints?
            Retry limit
            been exceeded?
                                            Yes                             Leave Group in
                                                                            partially Online
                                                                            state.
               Yes
                                         Can another            No
           Switch resource               owner be found?
           (and Dependants)              (Arbitration)
           Offline.                                                   Notify Failover Manager
                                                                      on the new system to
                                             Yes                      bring resource Online.
                                                            68
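
The check of Failover Window and Failover Threshold can be sketched as a sliding-window counter. This is a minimal sketch, assuming the threshold is a maximum number of failovers and the window is the period over which they are counted; the slide does not define either term, and the class and method names are hypothetical.

```python
from collections import deque


class FailoverPolicy:
    """Hypothetical policy: allow at most `threshold` failovers per `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float):
        self.threshold = threshold
        self.window_s = window_s
        self._history = deque()  # timestamps of recent failovers, oldest first

    def may_fail_over(self, now: float) -> bool:
        # Forget failovers that have aged out of the window.
        while self._history and now - self._history[0] >= self.window_s:
            self._history.popleft()
        return len(self._history) < self.threshold

    def record_failover(self, now: float) -> None:
        self._history.append(now)
```

A caller would ask `may_fail_over(now)` before looking for a new owner and call `record_failover(now)` once the group has moved.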
Pushing a Group
(Resource Failure)
                             Resource Monitor
                             notifies Resource Manager
                             of resource failure.



                           Resource Manager                       Resource Manager notifies
                           enumerates all objects in the          Failover Manager that the
                           Dependency Tree of the failed          Dependency Tree is Offline
                           resource.                              and needs to fail over.




                             Resource Manager takes              Failover Manager performs
                             each depending resource             Arbitration to locate a new
                             Offline.                            owner for the group.




                                 Any
                                                                 Failover Manager on the
   Leave Group in     No                                   Yes   new owner node brings the
                                 resource has
   partially Online                                              resources Online.
                                 “Affect the Group”
   state.                        True


                                                                    69
Pulling a Group
(Node Failure)
    Cluster Service
    notifies Failover Manager
    of node failure.



    Failover Manager            Failover Manager performs
    determines which groups     Arbitration to locate a new
    were owned by the failed    owner for the groups.
    node.


                                  Failover Manager on the
   Resource Manager notifies      new owner(s) bring the
   Failover Manager that the      resources Online
   node is Offline                in dependency order.
   and the groups it owned
   need to fail over.




                                      70
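
The pull sequence above (find the groups of the failed node, pick a new owner, bring resources Online in dependency order) can be sketched as follows. This is an illustration only: the dictionary layout, the function name, and the "first live preferred node, then first live possible node" rule standing in for Arbitration are all assumptions, not the MSCS algorithm.

```python
from graphlib import TopologicalSorter


def pull_groups(failed_node, groups, live_nodes):
    """Return {group: (new_owner, online_order)} for groups the failed node owned.

    groups maps a group name to a dict with keys:
      "owner"      current owner node
      "preferred"  preferred owner nodes, best first
      "possible"   every node able to host the group
      "depends_on" {resource: [resources that must be Online first]}
    """
    plan = {}
    for name, group in groups.items():
        if group["owner"] != failed_node:
            continue
        # Stand-in for Arbitration (assumption): preferred nodes first.
        candidates = [n for n in group["preferred"] + group["possible"]
                      if n in live_nodes]
        if not candidates:
            plan[name] = (None, [])  # no surviving node can host this group
            continue
        # Dependencies come before the resources that need them.
        order = list(TopologicalSorter(group["depends_on"]).static_order())
        plan[name] = (candidates[0], order)
    return plan
```

For a group whose SAP resource depends on SQL, which in turn depends on a disk and a network name, the returned order starts with the disk and the IP address and ends with SAP.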
Failback to Preferred Owner Node
  Group may have a Preferred Owner
  Preferred Owner comes back online
  Will only occur during the Failback Window
            (time slot, e.g. at night)

                             Resource Manager takes
     Preferred owner
                             each resource on the
     comes back Online.
                             current owner Offline.
                                                                 Failover Manager performs
                                                                 Arbitration to locate the
                                                                 Preferred Owner of
     Is the time within                                          the group.
                            Resource Manager notifies
     the Failback Window?   Failover Manager that the
                            Group is Offline
                            and needs to fail over to the          Failover Manager on the
                            Preferred Owner.                       Preferred Owner brings
                                                                   the resources Online.



                                                            71
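
The "Is the time within the Failback Window?" test needs care when the time slot runs past midnight, as a night-time window usually does. A minimal sketch, assuming the window is given as a start and end time of day (the function name and the half-open interval are assumptions):

```python
from datetime import time


def in_failback_window(now: time, start: time, end: time) -> bool:
    """True if `now` lies in [start, end), allowing the window to wrap past midnight."""
    if start <= end:
        return start <= now < end
    # Window wraps, e.g. 22:00 to 04:00: either late evening or early morning.
    return now >= start or now < end
```

With a window of 22:00 to 04:00, both 23:30 and 03:00 are inside it and 12:00 is not.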
   Outline
 Cluster Abstractions
 Cluster Architecture
 Cluster Implementation
 Application Support
 Q&A




                         72
             Process Structure
 Cluster Service
    ◦   Failover Manager
    ◦   Cluster Registry
    ◦   Global Update
    ◦   Quorum
    ◦   Membership
 Resource Monitor
    ◦ Resource Monitor
    ◦ Resource DLLs
 Resources
    ◦ Services
    ◦ Applications
 [Figure: a node. The Cluster Service connects to two Resource Monitors; a DLL inside a Resource Monitor reaches its Resource through private calls.]
                                              73
             Resource Control
 Commands
    ◦   CreateResource()
    ◦   OnlineResource()
    ◦   OfflineResource()
    ◦   TerminateResource()
    ◦   CloseResource()
    ◦   ShutdownProcess()
 And resource events
 [Figure: a node. The Cluster Service connects to two Resource Monitors; a DLL inside a Resource Monitor reaches its Resource through private calls.]
                                                 74
Resource DLLs
 Calls to Resource DLL
    ◦ Open: get handle
    ◦ Online: start offering service
    ◦ Offline: stop offering service
         as a standby or
         pair-is offline
    ◦   LooksAlive: Quick check
    ◦   IsAlive: Thorough check
    ◦   Terminate: Forceful Offline
    ◦   Close: release handle
 [Figure: resource states Offline, Online Pending, Online, Offline Pending and Failed, with messages “Go Online!”, “I’m Online!”, “Go Off-line!”, “I’m Off-line!” and “I’m here!”]
 [Figure: Resource Monitor to DLL via Std calls; DLL to Resource via Private calls]



                                                 75
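
The entry points and states on the Resource DLLs slide can be sketched as a small state machine. This is a toy in-memory model, not the real DLL interface: the class, the `healthy` flag and the `poll` helper are hypothetical, Open and Close are omitted, and the pending states are passed through instantly.

```python
from enum import Enum


class State(Enum):
    OFFLINE = "Offline"
    ONLINE_PENDING = "Online Pending"
    ONLINE = "Online"
    OFFLINE_PENDING = "Offline Pending"
    FAILED = "Failed"


class ToyResource:
    """Hypothetical resource exposing the calls named on the slide."""

    def __init__(self):
        self.state = State.OFFLINE
        self.healthy = True  # stands in for the real service's health

    def online(self):
        self.state = State.ONLINE_PENDING  # "Go Online!"
        self.state = State.ONLINE if self.healthy else State.FAILED

    def offline(self):
        self.state = State.OFFLINE_PENDING  # "Go Off-line!"
        self.state = State.OFFLINE

    def looks_alive(self):
        # Quick check: only consults the recorded state.
        return self.state is State.ONLINE

    def is_alive(self):
        # Thorough check: also probes the service itself.
        return self.state is State.ONLINE and self.healthy

    def terminate(self):
        # Forceful Offline: no pending state.
        self.state = State.OFFLINE


def poll(resource, thorough: bool) -> State:
    """One polling step of a Resource Monitor (sketch)."""
    if resource.state is State.ONLINE:
        ok = resource.is_alive() if thorough else resource.looks_alive()
        if not ok:
            resource.state = State.FAILED
    return resource.state
```

The split shows why both checks exist: a cheap LooksAlive can miss a fault that the more expensive IsAlive catches.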
    Cluster Communications
 Most communication via DCOM / RPC
 UDP used for membership heartbeat messages
 Standard (e.g. Ethernet) interconnects

               Management
               apps
               DCOM                            DCOM
    Cluster          DCOM / RPC: admin        Cluster
    Service          UDP: Heartbeat           Service
    DCOM / RPC                                DCOM / RPC
    Resource                                  Resource
    Monitors                                  Monitors

    Resource                                  Resource
    Monitors                                  Monitors

                                         76
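
Heartbeat-based membership can be sketched as a table of last-seen times. The UDP transport is left out here, and the heartbeat interval and the number of missed heartbeats tolerated are assumptions; the slides give neither value.

```python
class HeartbeatTracker:
    """Hypothetical tracker: a node is suspect after `max_missed` silent intervals."""

    def __init__(self, interval_s: float, max_missed: int):
        self.deadline_s = interval_s * max_missed
        self.last_seen = {}  # node name -> time of last heartbeat

    def record(self, node: str, now: float) -> None:
        # Called whenever a heartbeat datagram arrives from `node`.
        self.last_seen[node] = now

    def suspects(self, now: float) -> list:
        # Nodes whose last heartbeat is older than the deadline.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.deadline_s)
```

A membership layer would run `suspects(now)` periodically and start a regroup when the list is not empty.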
   Outline
 Cluster Abstractions
 Cluster Architecture
 Cluster Implementation
 Application Support
 Q&A




                          77
     Application Support
 Virtual Servers
 Generic Resource DLLs
 Resource DLL VC++ Wizard
 Cluster API




                             78
            Virtual Servers
 Problem:
    ◦ Client and Server Applications
      do not want node name to change
      when server app moves to another node.
 A Virtual Server simulates an NT Node
    ◦ Resource Group (name, disks, databases,…)
    ◦ NetName and IP address
      (node: \\a keeps name and IP address as it moves)
    ◦ Virtual Registry (registry “moves” (is replicated))
    ◦ Virtual Service Control
    ◦ Virtual RPC service
 Challenges:
    ◦ Limit app to virtual server’s devices and services.
    ◦ Client reconnect on failover (easy if connectionless,
      e.g. web clients)
 [Figure: two boxes, each labeled Virtual Server \\a: 1.2.3.4]
                                                       79
           Virtual Servers (before failover)
 Nodes \\Y and \\Z
  support virtual servers
  \\A and \\B
 Things that need to fail
  over transparently
    ◦ Client connection
    ◦ Server dependencies
    ◦ Service names
    ◦ Binding to local
      resources
    ◦ Binding to local servers
 [Figure: node \\Y hosts virtual server \\A (SAP, SQL, S:\), “SAP on A”; node \\Z hosts virtual server \\B (SAP, SQL, T:\), “SAP on B”]
                                80
              Virtual Servers (just after failover)
 \\Y resources and groups
  (i.e. Virtual Server \\A)
  moved to \\Z
 \\A resources bind to each other
  and to local resources (e.g.,
  local file system)
    ◦   Registry
    ◦   Physical resource
    ◦   Security domain
    ◦   Time
 Transactions used to make DB
  state consistent.
 To “work”, local resources on
  \\Y and \\Z have to be similar
    ◦ E.g. time must remain monotonic
 [Figure: after failover, node \\Z hosts both virtual servers: \\A (SAP, SQL, S:\), “SAP on A”, and \\B (SAP, SQL, T:\), “SAP on B”]
                                81
Address Failover and
Client Reconnection
 Name and Address rebind to
  new node
    ◦ Details later
 Clients reconnect
    ◦ Failure not transparent
    ◦ Must log on again
    ◦ Client context lost
      (encourages connectionless)
    ◦ Applications could maintain
      context
 [Figure: nodes \\Y and \\Z; virtual servers \\A (SAP, SQL, S:\), “SAP on A”, and \\B (SAP, SQL, T:\), “SAP on B”]
                                82
Mapping Local References to Group-
Relative References
 Send client requests to
  correct server
    ◦ \\A\SAP refers to \\.\SQL
    ◦ \\B\SAP refers to \\.\SQL
 Must remap references:
    ◦ \\A\SAP to \\.\SQL$A
    ◦ \\B\SAP to \\.\SQL$B
 Also handles namespace
  collision
 Done via
    ◦ modifying server apps, or
 [Figure: nodes \\Y and \\Z; virtual servers \\A (SAP, SQL, S:\), “SAP on A”, and \\B (SAP, SQL, T:\), “SAP on B”]
                                         83
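
The remapping on the slide above (\\.\SQL becomes \\.\SQL$A for virtual server \\A) amounts to appending the virtual server's name to the local service name. A minimal sketch of that string rewrite; the function name is hypothetical and the slide does not say how MSCS performs the remapping internally.

```python
def group_relative(virtual_server: str, local_ref: str) -> str:
    r"""Rewrite \\.\<service> as \\.\<service>$<virtual server>."""
    prefix = "\\\\.\\"  # the four characters \\.\
    if not local_ref.startswith(prefix):
        raise ValueError("not a local reference: " + local_ref)
    service = local_ref[len(prefix):]
    return f"{prefix}{service}${virtual_server}"
```

Because the suffix differs per virtual server, \\A and \\B can both run on one node after failover without their SQL references colliding.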
        Naming and Binding and Failover
 Services rely on the NT node name and/or IP address
  to advertise Shares, Printers, and Services.
  ◦ Applications register names to advertise services
  ◦ Example: \\Alice\SQL (i.e. \\<node>\<service>)
  ◦ Example: 128.2.2.2:80 (=http://www.foo.com/)
 Binding
  ◦ Clients bind to an address (e.g. name->IP address)
 Thus the node name and IP address must failover along
  with the services (preserve client bindings)


                                    84
Client to Cluster Communications
IP address mobility based on MAC rebinding
 IP rebinds to failover MAC addr
 Transparent to client or server
 Low-level ARP (address
  resolution protocol) rebinds IP
  address to new MAC addr.
 Cluster Clients
    ◦ Must use IP (TCP, UDP, NBT,... )
    ◦ Must Reconnect or Retry after failure
 Cluster Servers
    ◦ All cluster nodes must be on same LAN
      segment
                                    Client
                    Alice <-> 200.110.12.4
            Virtual Alice <-> 200.110.12.5
                    Betty <-> 200.110.12.6
            Virtual Betty <-> 200.110.12.7



                                             WAN
         Alice <-> 200.110.120.4                                         Betty <-> 200.110.120.6
 Virtual Alice <-> 200.110.120.5 Router:                         Virtual Betty <-> 200.110.120.7
                                 200.110.120.4 ->AliceMAC
                                 200.110.120.5 ->AliceMAC
                                 200.110.120.6 ->BettyMAC
                                 200.110.120.7 ->BettyMAC

                         Local Network
                                                            85
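
The router table in the figure can be modeled as a mapping from IP address to MAC address, and address failover as rebinding the virtual IP to the surviving node's MAC. A minimal sketch using the addresses from the figure; the function name is hypothetical and the real rebinding happens through ARP, not a function call.

```python
def fail_over_addresses(arp_table, failed_mac, takeover_mac, virtual_ips):
    """Return a new IP -> MAC table with the failed node's virtual IPs rebound."""
    new_table = dict(arp_table)  # leave the caller's table untouched
    for ip in virtual_ips:
        if new_table.get(ip) == failed_mac:
            new_table[ip] = takeover_mac
    return new_table


ROUTER_TABLE = {
    "200.110.120.4": "AliceMAC",  # Alice
    "200.110.120.5": "AliceMAC",  # Virtual Alice
    "200.110.120.6": "BettyMAC",  # Betty
    "200.110.120.7": "BettyMAC",  # Virtual Betty
}
```

Only the virtual address moves: after Alice fails, 200.110.120.5 maps to BettyMAC while Alice's own address 200.110.120.4 keeps its binding.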
            Time

   Time must increase monotonically
    ◦ Otherwise applications get confused
    ◦ e.g. make/nmake/build
   Time is maintained within failover resolution
    ◦ Not hard, since failover on order of seconds
 Time is a resource, so one node owns time resource
 Other nodes periodically correct drift from owner’s time



                                              86
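
One way to correct drift from the owner's time without ever letting time run backward is to clamp each reading to the largest value already returned. This is a sketch of that idea only; the slides do not say how MSCS applies the correction, and the class and method names are hypothetical.

```python
class ClusterClock:
    """Hypothetical node clock that follows the time owner but never steps back."""

    def __init__(self):
        self._offset = 0.0            # owner time minus local time
        self._last = float("-inf")    # largest value handed out so far

    def correct(self, local_now: float, owner_now: float) -> None:
        # Periodic drift correction against the node that owns the time resource.
        self._offset = owner_now - local_now

    def read(self, local_now: float) -> float:
        candidate = local_now + self._offset
        # Hold the clock still rather than return an earlier time.
        self._last = max(self._last, candidate)
        return self._last
```

If the local clock has run ahead of the owner, readings stay flat until the corrected time catches up, so tools such as make never see timestamps go backward.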
           Application Local
           NT Registry Checkpointing
 Resources can request that local NT registry sub-
  trees be replicated
 Changes written out to quorum device
    ◦ Uses registry change notification interface
   Changes read and applied on fail-over
 [Figure: \\A on \\X and \\A on \\B, each with a local registry; a registry copy is kept on the Quorum Device]
                                        87
Registry Replication




                   88
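
Registry checkpointing can be sketched as "write the sub-tree to the quorum device on change, read it back on fail-over". In this illustration a dict stands in for the registry sub-tree and a directory stands in for the quorum device; the file name, the JSON format and both function names are assumptions.

```python
import json
from pathlib import Path


def checkpoint(subtree: dict, quorum_dir: Path, resource: str) -> Path:
    """Write the resource's registry sub-tree to the quorum directory."""
    path = quorum_dir / f"{resource}.checkpoint.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(subtree, sort_keys=True))
    tmp.replace(path)  # rename, so readers never see a half-written file
    return path


def restore(local_registry: dict, quorum_dir: Path, resource: str) -> None:
    """Apply the last checkpoint to the new owner's local registry."""
    path = quorum_dir / f"{resource}.checkpoint.json"
    local_registry.update(json.loads(path.read_text()))
```

The old owner would call `checkpoint` from its registry change notification handler, and the new owner would call `restore` before bringing the resource Online.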
     Application Support
 Virtual Servers
 Generic Resource DLLs
 Resource DLL VC++ Wizard
 Cluster API




                             89
           Generic Resource DLLs
   Generic Application DLL
    ◦ Simplest: just starts, stops application, and
      makes sure process is alive
   Generic Service DLL
    ◦ Translates DLL calls into equivalent NT Server
      calls
      Online => Service Start
      Offline => Service Stop
      Looks/IsAlive => Service Status
 [Figure: Resource Monitor to DLL via Std calls; DLL to Resource via Private calls]
                                            90
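
The Generic Application behavior (start the application, stop it, make sure the process is alive) can be sketched with an ordinary child process. This is an illustration in Python, not the MSCS DLL; the class name, the timeout and the sample command are assumptions.

```python
import subprocess
import sys


class GenericApplication:
    """Hypothetical wrapper: Online starts the process, Offline stops it."""

    def __init__(self, argv):
        self.argv = argv
        self.proc = None

    def online(self):
        self.proc = subprocess.Popen(self.argv)

    def is_alive(self):
        # poll() returns None while the child is still running.
        return self.proc is not None and self.proc.poll() is None

    def offline(self, timeout_s: float = 5.0):
        if self.proc is None:
            return
        self.proc.terminate()              # polite stop first
        try:
            self.proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            self.proc.kill()               # forceful stop, like Terminate
            self.proc.wait()
        self.proc = None


# Sample command line: a child Python process that just sleeps.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(30)"]
```

`GenericApplication(SLEEPER)` reports not alive before `online()`, alive after it, and not alive again after `offline()`.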
Generic Application




                  91
Generic Service




                  92
Application Support
 Virtual Servers
 Generic Resource DLLs
 Resource DLL VC++ Wizard
 Cluster API




                        93
          Resource DLL VC++ Wizard
 Asks for resource type name
 Asks for optional service to control
 Asks for other parameters (and associated types)
 Generates DLL source code
 Source can be modified as necessary
    ◦ E.g. additional checks for Looks/IsAlive




                                         94
Creating a New Workspace




                      95
Specifying Resource Type Name




                       96
Specifying Resource Parameters




                        97
Automatic Code Generation




                      98
Customizing The Code




                       99
Application Support
 Virtual Servers
 Generic Resource DLLs
 Resource DLL VC++ Wizard
 Cluster API




                     100
        Cluster API
   Allows resources to:
    ◦   Examine dependencies
    ◦   Manage per-resource data
    ◦   Change parameters (e.g. failover)
    ◦   Listen for cluster events
    ◦   etc.
 Specs & API became public Sept 1996
 On all MSDN Level 3
 On web site:
    ◦ http://www.microsoft.com/clustering.htm
                                      101
Cluster API Documentation




                       102
   Outline
 Cluster Abstractions
 Cluster Architecture
 Cluster Implementation
 Application Support
 Q&A




                           103

								