

                 HIVE: Fault Containment for Shared-Memory Multiprocessors

J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, and A. Gupta, "Hive: Fault
Containment for Shared-Memory Multiprocessors," In Proceedings of the Fifteenth ACM
Symposium on Operating Systems Principles, 1995.

                                  Jimin Kim

                RTCC Lab., Hanyang University
                    September 16th, 2010

1.   Overview
2.   Introduction
3.   Fault Containment
4.   Hive Architecture
5.   Fault Containment Implementation
6.   Memory Sharing Implementation
7.   Experimental Results

                         Overview (1/2)
 Hive
    Prototype of multicellular kernel architecture
       • IRIX 5.2 code base (a version of UNIX SVR4)
      • SimOS hardware simulator (Stanford FLASH multiprocessor)

 Multicellular kernel
    Partitions the machine and runs an internal distributed system
     of multiple independent kernels called cells
    Advantages
      • Reliability
             A fault damages only one cell and does not crash the whole machine
      • Scalability
            Few kernel resources are shared on different cells, reducing
             synchronization delays
            Increasing the number of cells improves the parallelism of the
             operating system and also increases the locality of kernel memory

                     Overview (2/2)

 The key implementation challenges
   Fault containment
     • Confining the effects of faults to the faulted cell
     • Defending each cell against wild writes caused by faults in
       other cells
   Memory sharing among cells

The Reliability Problems of SMP OS

       (Multicellular kernel architecture)
 Each cell is an SMP kernel
 Each independently manages a portion of the
  processors, memory, and I/O devices

    (The Stanford FLASH Multiprocessor)
   a shared-memory multiprocessor
   Distributed main memory across the nodes
      • A representative CC-NUMA multiprocessor (Cache Coherent
        with Non-Uniform Memory Access time)
    Nodes interconnected by a mesh network

      (Hive’s Fault Containment strategy)
 Three main components for wild write defense
   Firewall hardware
   Preemptive discard
     • Goal: To prevent bad reads
     • Method: The system discards pages writable by a failed cell
          when a software error is detected
   Aggressive failure detection
      • Goal: To reduce the delay until preemptive discard occurs
     • Method: Heuristic checks

              Fault Containment (1/2)

 Hardware Faults
   After a fault, hardware must make several guarantees
     • Accesses to unaffected memory must continue
     • Processors that try to access failed memory must not be
       stalled indefinitely
     • The set of memory lines that could be affected by a fault must
       be limited

            Fault Containment (2/2)

 Software Faults
   Two approaches to prevent wild writes
     • Special-purpose hardware
     • Virtual address translation hardware
    This paper recommends the special-purpose hardware approach
     because it provides higher reliability
     • Firewall hardware
          Write permission bit-vector
            associated with each page of memory

              Hive Architecture (1/6)
         (Fault Containment Architecture)
 There are three channels by which a fault in one cell
  can damage another cell
    A corrupt RPC request or reply
      • Method: Each cell sanity-checks all information received from
        other cells and sets timeouts whenever waiting for a reply.
    Direct remote corrupt data reads
      • Types: Invalid pointers, linked data structures that contain
        infinite loops, and data values that change in the middle of an
        operation
      • Method: Careful reference protocol
     Wild writes
      • Kernel code and data is protected by FLASH firewall
      • The problem is user-level pages
           Can be shared by processes running on different cells
              Hive Architecture (2/6)
         (Fault Containment Architecture)
 Two main issues for failure detection
    A cell that is alive but acting erratically can be difficult to
     distinguish from one that is functioning correctly.
    If one cell could declare that another had failed and cause it
     to be rebooted, a faulty cell which mistakenly concluded that
     other cells were corrupt could destroy the system.

 Hive’s two-part solution
    Cells monitor each other with heuristic checks.
      • Provides a hint alert
    Consensus among the cells is required to reboot a failed cell
      • When a hint alert is broadcast, all cells temporarily suspend
        processes running at user level and run a distributed
        agreement algorithm.
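The two-part scheme above can be modeled in a few lines. This is a hypothetical sketch, not Hive's actual interface: the `Cell` class, `heuristic_check`, and the majority-vote rule in `agree_on_failure` are all illustrative assumptions standing in for the paper's heuristic checks and distributed agreement algorithm.

```python
# Illustrative model of Hive's two-part failure detection: heuristic checks
# produce only *hints*, and rebooting a cell requires agreement among cells.
# All names here are assumptions for illustration, not Hive's real API.

class Cell:
    def __init__(self, cell_id):
        self.cell_id = cell_id
        self.suspected = set()   # cells this cell's heuristics suspect

    def heuristic_check(self, other, clock_advanced, rpc_timed_out):
        # A failed heuristic yields a hint alert, never a unilateral verdict.
        if not clock_advanced or rpc_timed_out:
            self.suspected.add(other)
            return True          # broadcast a hint alert
        return False

def agree_on_failure(cells, accused):
    # On a hint alert, every cell suspends user-level work and votes.
    # Consensus (modeled here as a simple majority) is required to reboot,
    # so a single confused cell cannot destroy the system by itself.
    votes = sum(1 for c in cells if accused in c.suspected)
    return votes > len(cells) // 2
```

The key design point is the separation of the cheap, possibly-wrong heuristic (a hint) from the expensive, safe decision (consensus).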

            Hive Architecture (3/6)
        (Resource Sharing Architecture)
 Implementation of resource sharing
    Mechanism: implemented in the kernel
    Policy: implemented in a user-level process called Wax
     • Global view of the system state

 Memory sharing supported by Hive
   Logical-level sharing
     • A cell can access the page no matter
       where it is stored.
   Physical-level sharing
     • A cell that has a free page frame can
       transfer control over that frame to
       another cell.

            Hive Architecture (4/6)
        (Resource Sharing Architecture)
 Processor sharing
   Spanning tasks
      • Hive extends the UNIX process abstraction to span cells
      • The address space map of a spanning task is kept consistent
        among cells

             Hive Architecture (5/6)
         (Resource Sharing Architecture)
 Resource allocation problem of previous distributed systems
    Each kernel makes decisions based on an incomplete global view
    Centralizing kernel resource allocation decisions creates a
     performance bottleneck

             Hive Architecture (6/6)
         (Resource Sharing Architecture)
 Advantage of Wax
   Can see the complete, up-to-date view of system state
    through spanning tasks
   The threads of Wax running on different cells can
    synchronize with each other

    Wax reads state from all cells.

    Wax provides hints that control the
    resource management policies.

      Fault Containment Implementation
        (Careful Reference Protocol)
 Careful Reference Protocol
   1. Call the careful_on function.
      • Capture the current stack frame
      • Record which remote cell the kernel intends to access
   2. Before using any remote address, check that it is aligned
    properly for the expected data structure and that it addresses
    the memory range belonging to the expected cell.
   3. Copy all data values to local memory before beginning
    sanity-checks, in order to defend against unexpected changes.
   4. Check each remote data structure by reading a structure
    type identifier.
      • The type identifier is written by the memory allocator and
        removed by the memory deallocator.
   5. Call careful_off when done.
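The five steps above can be sketched as follows. This is a minimal Python model under assumed data layouts: the `CELL_RANGES` memory map, the one-byte type identifier, and the dictionary-based context are all illustrative, though the function names `careful_on`/`careful_off` follow the slides.

```python
# Sketch of the careful reference protocol for reading remote cell memory.
# Assumed layout: each cell owns one address range, and every allocated
# structure starts with a one-byte type identifier.

CELL_RANGES = {0: (0x0000, 0x4000), 1: (0x4000, 0x8000)}  # assumed memory map

def careful_on(cell):
    # Step 1: record which remote cell the kernel intends to access
    # (Hive also captures the current stack frame for error recovery).
    return {"cell": cell}

def check_address(ctx, addr, alignment):
    # Step 2: verify alignment and that the address lies within the
    # memory range belonging to the expected cell.
    lo, hi = CELL_RANGES[ctx["cell"]]
    return addr % alignment == 0 and lo <= addr < hi

def careful_read(remote_memory, addr, size, expected_type):
    # Step 3: copy to local memory *before* sanity-checking, so the remote
    # cell cannot change the bytes mid-check.
    local_copy = bytes(remote_memory[addr:addr + size])
    # Step 4: validate the structure type identifier, which the memory
    # allocator writes and the deallocator removes.
    if local_copy[:1] != expected_type:
        raise ValueError("type identifier mismatch: stale or corrupt data")
    return local_copy

def careful_off(ctx):
    # Step 5: leave careful-reference mode.
    ctx.clear()
```

Copy-then-check (step 3 before step 4) is the essential ordering: checking data in place would leave a window for the remote cell to corrupt it after validation.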

        Fault Containment Implementation
          (Careful Reference Protocol)
 An example of the careful reference protocol
    Clock monitoring algorithm
      • The clock handler of each cell checks another cell’s clock
        value on every tick
      • This is substantially faster than sending an RPC to get the
        value: 1.16 µs vs. 7.2 µs

         Fault Containment Implementation
                (Wild Write Defense)
 Two-part strategy
    FLASH firewall
    Preemptive discard policy

 FLASH firewall
       A 64-bit vector associated with each page of memory
          Each bit grants write permission to a processor
       A remote write without write permission fails with a bus error
       Only the local processor can change the firewall bits for the memory
        of its node
       Uncached accesses to I/O devices on other cells always receive bus
        errors

 Preemptive discard
    Method to prevent applications from reading corrupt pages caused
     by wild writes
    Each cell determines which of its pages were writable by the failed cell
     and marks those pages as discarded
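A toy model of the two-part defense is below. It is an assumed simplification, not FLASH's real interface: the per-page class, `grant`/`revoke`, and the `failed_cpus` mask are illustrative names for the 64-bit write-permission vector and the discard pass described above.

```python
# Minimal model of the wild write defense: a per-page 64-bit write-permission
# vector (the firewall), plus preemptive discard of any page the failed cell
# could have written.

class FirewalledPage:
    def __init__(self):
        self.write_perm = 0      # 64-bit vector, one bit per processor
        self.discarded = False

    def grant(self, cpu):
        self.write_perm |= 1 << cpu

    def revoke(self, cpu):
        self.write_perm &= ~(1 << cpu)

    def write(self, cpu, _data):
        # A remote write without write permission fails with a bus error.
        if not (self.write_perm >> cpu) & 1:
            raise PermissionError("bus error: firewall denied write")

def preemptive_discard(pages, failed_cpus):
    # After a failure is detected, mark every page writable by the failed
    # cell as discarded, so applications never read possibly-corrupt data.
    mask = 0
    for cpu in failed_cpus:
        mask |= 1 << cpu
    for page in pages:
        if page.write_perm & mask:
            page.discarded = True
```

Note the division of labor: the firewall blocks writes *before* they land, while preemptive discard handles reads of pages the failed cell was legitimately allowed to write.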

       Fault Containment Implementation
        (Failure Detection and Recovery)
 A cell is considered potentially failed if one of the
  following conditions occurs
    An RPC sent to it times out
    A shared memory location which it updates on every clock
     interrupt fails to increment: H/W or OS errors
    Data read (from the cell’s memory or received in a message)
     fails sanity checks: S/W errors

       Fault Containment Implementation
        (Failure Detection and Recovery)
 Recovery algorithms
   Use two global barriers to synchronize the recovery
    processes of different cells
      • Each cell joins the first global barrier after it has flushed
        its processor TLBs and removed any remote mappings
      • After the first barrier completes
            Revoke any firewall write permission
            It is during this operation that the virtual memory
             subsystem detects pages that were writable by a failed cell
      • Each cell joins the second global barrier after it has finished
        virtual memory cleanup.
      • Cells that exit the second barrier can safely resume normal
        operation.
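The two-barrier sequence can be sketched with ordinary thread barriers. This is an assumed structural sketch only: the phase strings and `run_recovery` harness are illustrative, and real cells are separate kernels, not threads.

```python
# Sketch of the two-barrier recovery sequence run by each surviving cell:
# flush TLBs and drop remote mappings, barrier 1; revoke firewall write
# permission (detecting pages writable by the failed cell), barrier 2;
# then resume normal operation.
import threading

def recover(cell_id, barrier1, barrier2, log):
    log.append((cell_id, "flush TLBs, remove remote mappings"))
    barrier1.wait()                      # all cells finished phase 1
    log.append((cell_id, "revoke firewall write permission"))
    barrier2.wait()                      # all cells finished VM cleanup
    log.append((cell_id, "resume normal operation"))

def run_recovery(n_cells):
    b1 = threading.Barrier(n_cells)
    b2 = threading.Barrier(n_cells)
    log = []
    threads = [threading.Thread(target=recover, args=(i, b1, b2, log))
               for i in range(n_cells)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

The barriers enforce the invariant that matters for safety: no cell revokes firewall permissions until every cell has dropped its remote mappings, and no cell resumes until every cell has finished cleanup.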

      Memory Sharing Implementation

 Terminology
    Client cell: A cell running a process that is accessing the page
    Memory home: The cell that owns the physical storage for the page
      • Cell 1 is the memory home in both parts of the figure
    Data home: The cell that owns the data stored in the page
      • Cell 1 is the data home in figure (a), but cell 0 is the data
        home in figure (b).

          Memory Sharing Implementation
            (IRIX Page Cache design)
 Each page frame is managed by an entry in a table of
  page frame data structures (pfdats)
    The pfdats are linked into a hash table
    When a page fault occurs
      • Check the pfdat hash table → read the vnode; if the page is not
        present → allocate a page frame → fill it with data → insert it
        in the pfdat hash table
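The fault path above reduces to a cache lookup with a fill on miss. A minimal sketch, assuming a dictionary keyed by (vnode, offset) in place of IRIX's real pfdat hash chains; `read_from_file` is a hypothetical stand-in for the vnode read operation.

```python
# Assumed miniature of the IRIX page-fault path: look up (vnode, offset)
# in the pfdat hash table; on a miss, allocate a frame, fill it from the
# file, and insert the new pfdat.

pfdat_hash = {}          # (vnode_id, offset) -> pfdat

def page_fault(vnode_id, offset, read_from_file):
    key = (vnode_id, offset)
    pfdat = pfdat_hash.get(key)                    # check the pfdat hash table
    if pfdat is None:                              # miss: page not present
        frame = read_from_file(vnode_id, offset)   # allocate + fill a frame
        pfdat = {"frame": frame, "key": key}
        pfdat_hash[key] = pfdat                    # insert into the hash table
    return pfdat
```

A second fault on the same page hits in the hash table and never touches the file again, which is the property Hive's extended pfdats preserve across cells.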

           Memory Sharing Implementation
              (Logical-Level Sharing)
 Extended pfdats
    When one cell accesses a page cached by another, it allocates a
     new pfdat to record the logical page id and the physical address
     of the page frame

 Export and import
    Set up the binding between a page of one cell and an extended
     pfdat on another cell

 Shadow vnode
    Indicates that the data home of the file is a remote cell
            Memory Sharing Implementation
   (The Detailed Mechanism for Logical-Level Sharing)
 1. The virtual memory system first checks the pfdat hash table on
  the client cell.
 2. The virtual memory system invokes the read operation on the
  vnode for that file.
     The file system uses information stored in the vnode to determine the data
      home for the file
     Sends an RPC to the data home.
 3. If the page is not already cached, the server side of the file
  system issues a disk read using the data home vnode.
 4. The file system on the data home calls export on the page.
     This records the client cell in the data home's pfdat,
        • Which prevents the page from being deallocated
        • and provides information necessary for the failure recovery algorithms.
     Export also modifies the firewall state of the page if write access is
      granted.
 5. The server-side file system returns the address of the data page
  to the client cell.
 6. The client-side file system calls import,
     which allocates an extended pfdat for that page frame
     and inserts it into the client cell's pfdat hash table
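The six steps above can be traced end to end in a small model. This is a hedged sketch under simplifying assumptions: the `Cell` class and the RPC-as-method-call shortcut are illustrative, and the dictionaries stand in for pfdats, the page cache, and export records.

```python
# Sketch of the six-step logical-level sharing path between a client cell
# and a data home. Method calls stand in for RPCs; dicts stand in for
# pfdats and export records.

class Cell:
    def __init__(self, name):
        self.name = name
        self.pfdat_hash = {}   # (file, offset) -> pfdat (regular or extended)
        self.page_cache = {}   # (file, offset) -> page data

    # --- server side (data home) ---
    def serve_read(self, client, file, offset):
        if (file, offset) not in self.page_cache:        # step 3: disk read
            self.page_cache[(file, offset)] = b"disk:" + file.encode()
        pfdat = self.pfdat_hash.setdefault((file, offset),
                                           {"exported_to": set()})
        pfdat["exported_to"].add(client.name)            # step 4: export,
        # which pins the page and records the client for failure recovery
        return (file, offset)                            # step 5: page address

    # --- client side ---
    def read_page(self, data_home, file, offset):
        if (file, offset) in self.pfdat_hash:            # step 1: local lookup
            return self.pfdat_hash[(file, offset)]
        addr = data_home.serve_read(self, file, offset)  # step 2: RPC
        ext_pfdat = {"extended": True, "remote_addr": addr}
        self.pfdat_hash[(file, offset)] = ext_pfdat      # step 6: import
        return ext_pfdat
```

The export record on the data home is what makes failure recovery possible: on a crash, the data home knows exactly which clients hold extended pfdats for its pages.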

         Memory Sharing Implementation
            (Physical-Level Sharing)
 The problems of logical-level sharing
    Poor load balancing and locality on a CC-NUMA machine

 The detailed mechanism for physical-level sharing
    1. The data home sends an RPC to the memory home asking for a set
     of free page frames
    2. The memory home moves the page frame to a reserved list
     and ignores it until the data home frees it.
    3. The data home allocates an extended pfdat
      • and manages the frame as one of its own (it must send an
          RPC to the memory home when it needs to change the firewall
          state of the frame)
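The three-step frame transfer can be sketched as follows, under assumed names: `MemoryHome`, `DataHome`, `lend_frames`, and `adopt` are all illustrative, and method calls again stand in for cross-cell RPCs.

```python
# Assumed sketch of physical-level sharing: the data home borrows free page
# frames from the memory home, which parks them on a reserved list and
# ignores them until the data home frees them.

class MemoryHome:
    def __init__(self, n_frames):
        self.free_frames = list(range(n_frames))
        self.reserved = set()

    def lend_frames(self, count):           # step 1: RPC asking for frames
        frames = [self.free_frames.pop() for _ in range(count)]
        self.reserved.update(frames)        # step 2: move to reserved list
        return frames

    def reclaim(self, frame):               # called when the data home frees it
        self.reserved.remove(frame)
        self.free_frames.append(frame)

class DataHome:
    def __init__(self):
        self.extended_pfdats = {}

    def adopt(self, memory_home, count):
        for frame in memory_home.lend_frames(count):
            # step 3: allocate an extended pfdat and manage the frame as our
            # own; firewall changes still require an RPC to the memory home.
            self.extended_pfdats[frame] = {"memory_home": memory_home}
```

Batching the transfer (a *set* of frames per RPC) is what lets physical-level sharing fix the locality and load-balancing problems without paying an RPC per page.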


Experimental Results

