XP FSU Computer Science

Improving the Reliability of
Commodity Operating Systems
Introduction

Nooks
  Allows existing OS extensions to execute safely
   in commodity kernels
  Use lightweight kernel protection domains
    Restricted write access to kernel memory
    Track and validate all modifications to kernel data
     structures
Motivation

Computer reliability is an unsolved problem
  Cost of failures continues to rise
OS extensions have become prevalent
  70% of Linux kernel code
  35,000 drivers on Windows XP
  Written by people who are less experienced in
   kernel organization
Motivation

Extensions are the leading cause of failures
  In Windows XP, drivers cause 85% of failures
  In Linux, device drivers have 7x the error rate of
   the rest of the kernel
  Extended OS cannot be tested completely
Nooks Approach

Target existing extension architecture
Use conventional C instead of type-safe
 languages
Aim to reduce the number of crashes due
 to drivers and extensions
Prototype implemented in Linux
Showed graceful recovery for 99% of fault
 injections
Related Work

Hardware approaches
  Capability-based architectures
    Recovery difficult for shared resources
  Segment architectures
    Difficult to program
New OS structures
  Microkernels
    Good fault isolation
    Rebooting required to restart services
Related Work

Transaction-based systems
  Works well for file systems
Language-based approaches
  Limited applicability
Architecture

Core principles
  Design for fault resistance, not fault tolerance
     Prevent and recover from most, not all
  Design for mistakes, not abuse
     Extensions are generally well-behaved (not
      malicious)
     Can explore the design space between unprotected
      and safe
Architecture

Implications
  + Can define an architecture that supports
    existing drivers with moderate performance
    costs
  - Malicious code can bypass these mechanisms
Goals

Isolation of kernel from extension failures
  Need to detect failures before they spread
Automatic recovery from failures
Backward compatibility
Functions

Reliability layer inserted between the
 extensions and the OS kernel
  Intercepts all interactions between the
   extensions and the OS kernel
Major functions
  Isolation
  Interposition
  Object tracking
  Recovery
Isolation

Lightweight kernel protection domain
  Write access to a limited portion of the kernel’s
   address space
Major tasks
  Creation, manipulation, and maintenance of
   lightweight kernel protection domains
  Inter-domain control transfer
Isolation

Extension procedure call (XPC)
  Similar to lightweight RPC
  Assume trusted interactions
  Asymmetric relationship
    Kernel has more privileges
Interposition

The Nooks interposition mechanisms
  Make sure that
    All control flows between the kernel and extensions
     are through the XPC mechanism
    All data flows between the kernel and extensions are
     managed by Nooks’ object-tracking code
Extensions and the kernel communicate
 through wrapper stubs
Object Tracking

Maintains a list of kernel data structures
 that are manipulated by an extension
Controls all modifications to those
 structures
Provides object info for cleanup when an
 extension fails
Object Tracking

An object must be copied into an
 extension before it is modified
Object tracking code verifies the type and
 accessibility of each parameter being
 passed
Recovery

Nooks detects software faults
  When kernel services are invoked incorrectly
  When an extension consumes too many
   resources
Actions
  Return to the extension
  Generate an error code
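
The software-fault path above can be sketched in user-space C. This is a minimal model under stated assumptions, not Nooks code: `ext_account`, `ext_charge`, and the idea of a byte quota are hypothetical names chosen for illustration. A request that is malformed or that would exceed the extension's quota is rejected, and an error code goes back to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-extension resource account (not a Nooks structure). */
struct ext_account {
    size_t used;   /* bytes currently charged to the extension */
    size_t quota;  /* upper bound the extension may consume */
};

/*
 * Charge `size` bytes to the extension.
 * Returns 0 on success, -EINVAL for a malformed request, and -ENOMEM
 * when the request would push the extension past its quota.
 */
int ext_charge(struct ext_account *acct, size_t size)
{
    if (acct == NULL || size == 0)
        return -EINVAL;                  /* service invoked incorrectly */

    assert(acct->used <= acct->quota);   /* invariant kept by this function */
    if (size > acct->quota - acct->used)
        return -ENOMEM;                  /* too many resources requested */

    acct->used += size;
    return 0;
}
```

In this sketch the caller sees only the error code; the account is left unchanged when a request is refused.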
Recovery

Nooks detects hardware faults
  Processor raises an exception during extension
   execution
    Attempts to read unmapped memory
    Write memory outside of its protection domain
A user or a program can also trigger Nooks
 recovery explicitly
Recovery

Since extensions are decoupled from
 kernel, Nooks can freely release
 extension-held kernel structures, such as
 objects or locks, during the recovery
 process
Architecture
[Figure: layered diagram, top to bottom]
  Applications: Apache Web Server, Navigator Web Browser, Quake3D Video
   Game
  Operating System Kernel: Memory Management, File System, Networking
  Nooks Kernel Runtime
  Nooks, each with a per-nook runtime: Network Nook, Video Nook
  Drivers: TCP/IP Driver, Ethernet Driver, Video Driver, SCSI Driver
  Nooks Kernel Runtime
  Hardware: Ethernet Card, Video Card, SCSI Controller Card
Implementation

Linux 2.4.18
  Worst-case target
  18 months of development
  22,000 lines of Nooks code (vs. 2.4 million lines
   of Linux code and 50 million lines of Windows
   2003 code)
Isolation

Two parts
  Memory management
  Extension procedure call
Memory Management

Kernel has read-write access to the entire
 address space
Each extension is restricted to read-only
 kernel access and read-write access to its
 local domain
Nooks maintains a copy of the kernel page
 table for each domain
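
The access rules above can be modeled with a toy page table. This is a sketch only: `page_table`, `domain_table_init`, and the eight-page address space are assumptions made for illustration and do not reflect the real x86 or Linux data structures. Each domain gets a copy of the kernel table in which everything is downgraded to read-only except the pages the domain owns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES 8   /* toy address space size, chosen for illustration */

enum perm { PERM_NONE = 0, PERM_RO = 1, PERM_RW = 2 };

/* Hypothetical stand-in for a page table: one permission per page. */
struct page_table {
    enum perm perm[NPAGES];
};

/*
 * Build an extension domain's table: start from a copy of the kernel's
 * table, grant read-write on pages the domain owns, and downgrade every
 * other writable page to read-only.
 */
void domain_table_init(struct page_table *dom,
                       const struct page_table *kernel,
                       const bool owned[NPAGES])
{
    memcpy(dom, kernel, sizeof(*dom));
    for (size_t i = 0; i < NPAGES; i++) {
        if (owned[i])
            dom->perm[i] = PERM_RW;
        else if (dom->perm[i] == PERM_RW)
            dom->perm[i] = PERM_RO;
    }
}

bool can_read(const struct page_table *pt, size_t page)
{
    assert(page < NPAGES);
    return pt->perm[page] != PERM_NONE;
}

bool can_write(const struct page_table *pt, size_t page)
{
    assert(page < NPAGES);
    return pt->perm[page] == PERM_RW;
}
```

With this model, a write by the extension to a page it does not own fails the `can_write` check, which corresponds to the hardware exception described under Recovery.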
Memory Management

Changing protection domains is not as
 costly as changing processes
  Protection domains share kernel address space
Extension Procedure Call

Transparent to both the kernel and its
 extensions
Managed by two functions
  nooks_driver_call(func_ptr, arg_list, domain)
  nooks_kernel_call(func_ptr, arg_list, domain)
Deferred call mechanisms available
  Useful for network drivers to queue up packets
   and perform bulk transfers
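
A rough user-space model of a cross-domain call and of deferred calls is given here. The slides name `nooks_driver_call` and `nooks_kernel_call` but do not show their internals, so everything in this sketch (`sketch_driver_call`, `sketch_defer`, `sketch_flush`, the `current_domain` variable) is hypothetical. The point it illustrates is that a deferred queue pays for one domain switch per batch instead of one per call.

```c
#include <assert.h>
#include <stddef.h>

#define KERNEL_DOMAIN 0
#define MAX_DEFERRED 16

typedef int (*xpc_func)(void *arg);

/* Domain the simulated CPU is executing in (assumption of this model). */
static int current_domain = KERNEL_DOMAIN;

/* Model of a kernel-to-extension call: switch domain, call, switch back.
 * A real implementation would presumably also switch page tables here. */
int sketch_driver_call(xpc_func fn, void *arg, int domain)
{
    assert(fn != NULL);
    int saved = current_domain;
    current_domain = domain;
    int ret = fn(arg);
    current_domain = saved;
    return ret;
}

struct deferred_queue {
    xpc_func fn[MAX_DEFERRED];
    void *arg[MAX_DEFERRED];
    size_t count;
};

/* Queue a call for later; returns 0 on success, -1 if the queue is full. */
int sketch_defer(struct deferred_queue *q, xpc_func fn, void *arg)
{
    assert(fn != NULL);
    if (q->count == MAX_DEFERRED)
        return -1;
    q->fn[q->count] = fn;
    q->arg[q->count] = arg;
    q->count++;
    return 0;
}

/* Run every queued call under a single domain switch; returns how many ran. */
size_t sketch_flush(struct deferred_queue *q, int domain)
{
    int saved = current_domain;
    size_t n = q->count;
    current_domain = domain;
    for (size_t i = 0; i < n; i++)
        (void)q->fn[i](q->arg[i]);
    current_domain = saved;
    q->count = 0;
    return n;
}

/* Example callee: records the domain it observed while running. */
int sketch_observe_domain(void *arg)
{
    *(int *)arg = current_domain;
    return 0;
}
```

A network driver in this model would call `sketch_defer` once per packet and `sketch_flush` once per batch.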
Changes to Linux Kernel

Maintain coherency between the kernel
 and extension page tables
Detect exceptions that occur within
 Nooks’ protection domains
Locate the task structure, which is no longer
 collocated with the kernel stack due to isolation
Interposition

Provides wrapper stubs between
 extensions and the kernel
  Transparent to the kernel and drivers
Kernel modifications
  Modify the standard module loader to bind extensions
   to wrappers instead of kernel functions
  Modify kernel initialization code to interpose on
   calls into extensions
Interposition

Some data references are interposed
Certain objects are linked directly into the
 extension for reading
Kernel modification calls are wrapped
Performance-critical data structures
  Shadow objects in the extension that are
   synchronized before and after XPCs
Otherwise, just XPCs
Wrappers

Within the kernel’s protection domain
Three basic tasks
  Check parameters for validity
  Create a copy of kernel objects in the
   extension’s protection domain
    No serialization/deserialization necessary
    Synchronization code placed in wrappers
  Perform an XPC into the kernel or extension
Automatically generated
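
The three wrapper tasks can be sketched as a single C function. This is an illustrative model, not generated Nooks wrapper code: `net_stats`, `wrap_call_extension`, and the two example extension routines are hypothetical, and copying changes back only when the call succeeds is a design choice of this sketch rather than something the slides state.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical kernel object; the field names are illustrative only. */
struct net_stats {
    unsigned long rx_packets;
    unsigned long tx_packets;
};

typedef int (*ext_func)(struct net_stats *stats);

/*
 * Wrapper around a call from the kernel into an extension:
 *   1. check parameters for validity,
 *   2. copy the kernel object into extension-writable storage,
 *   3. make the call, then copy changes back if it succeeded.
 */
int wrap_call_extension(ext_func fn, struct net_stats *kernel_obj,
                        struct net_stats *ext_copy)
{
    if (fn == NULL || kernel_obj == NULL || ext_copy == NULL)
        return -EINVAL;

    *ext_copy = *kernel_obj;      /* plain struct copy, no serialization */
    int ret = fn(ext_copy);       /* stands in for the XPC */
    if (ret == 0)
        *kernel_obj = *ext_copy;  /* synchronize after the call */
    return ret;
}

/* Example extension routine that succeeds. */
int ext_count_rx(struct net_stats *s)
{
    assert(s != NULL);
    s->rx_packets++;
    return 0;
}

/* Example extension routine that scribbles on its copy and then fails. */
int ext_fail(struct net_stats *s)
{
    assert(s != NULL);
    s->rx_packets = 999;
    return -EIO;
}
```

Because the extension only ever touches its own copy, the failing routine in this model leaves the kernel's object untouched.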
Wrapper Code Sharing

50% of Nooks code base
Shared among multiple drivers
Object Tracking

Supports 43 kernel object types
Records the addresses of all objects in
 use by an extension
Records the association between the
 kernel and the extension versions of
 writable objects
Performs garbage collection
Determines whether to copy an object
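
One way to picture the bookkeeping above is a small table keyed by kernel address. The sketch below is a simplified assumption, not the Nooks data structure: `tracker`, `track_add`, `track_find`, and `track_remove` are hypothetical names, and a fixed-size array stands in for whatever lookup structure a real implementation would use.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED 32

/* One tracked object: its kernel address and, for writable objects,
 * the address of the extension's copy (NULL when there is none). */
struct tracked {
    const void *kernel_addr;
    void *ext_addr;
};

struct tracker {
    struct tracked slot[MAX_TRACKED];
    size_t count;
};

/* Return the entry for a kernel object, or NULL if it is not tracked. */
const struct tracked *track_find(const struct tracker *t, const void *kaddr)
{
    for (size_t i = 0; i < t->count; i++)
        if (t->slot[i].kernel_addr == kaddr)
            return &t->slot[i];
    return NULL;
}

/* Record an object; returns 0 on success, -1 if full or already tracked. */
int track_add(struct tracker *t, const void *kaddr, void *eaddr)
{
    assert(kaddr != NULL);
    if (t->count == MAX_TRACKED || track_find(t, kaddr) != NULL)
        return -1;
    t->slot[t->count].kernel_addr = kaddr;
    t->slot[t->count].ext_addr = eaddr;
    t->count++;
    return 0;
}

/* Stop tracking an object; returns 0 if it was tracked, -1 otherwise. */
int track_remove(struct tracker *t, const void *kaddr)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->slot[i].kernel_addr == kaddr) {
            t->slot[i] = t->slot[t->count - 1];  /* fill hole with last entry */
            t->count--;
            return 0;
        }
    }
    return -1;
}
```

A wrapper in this model would call `track_find` first and copy an object only when no extension version is recorded yet.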
Recovery

Recovery manager releases resources
  Unloading the extension
  Releasing its kernel and physical resources
  Reloading and restarting the extension
User-mode agent coordinates recovery
Each object is associated with a recovery
 function
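
The recovery sequence can be sketched as a walk over the objects an extension holds, calling each object's recovery function. All names here (`extension`, `ext_hold`, `ext_recover`, `release_lock`) are hypothetical, and releasing in reverse order of acquisition is an assumption of this sketch, not a documented Nooks policy.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJECTS 16

typedef void (*recover_fn)(void *obj);

/* An object held by the extension, paired with its recovery function. */
struct held_object {
    void *obj;
    recover_fn recover;
};

struct extension {
    struct held_object held[MAX_OBJECTS];
    size_t nheld;
    int loaded;     /* 1 while the extension is running */
    int restarts;   /* number of recoveries performed so far */
};

/* Note that the extension now holds `obj`; returns -1 if the table is full. */
int ext_hold(struct extension *e, void *obj, recover_fn fn)
{
    assert(fn != NULL);
    if (e->nheld == MAX_OBJECTS)
        return -1;
    e->held[e->nheld].obj = obj;
    e->held[e->nheld].recover = fn;
    e->nheld++;
    return 0;
}

/* Unload the extension, release what it held, then mark it restarted.
 * Returns the number of objects released. */
size_t ext_recover(struct extension *e)
{
    size_t released = 0;
    e->loaded = 0;                        /* unload the failed extension */
    while (e->nheld > 0) {
        struct held_object *h = &e->held[--e->nheld];
        h->recover(h->obj);               /* per-object recovery function */
        released++;
    }
    e->loaded = 1;                        /* reload and restart */
    e->restarts++;
    return released;
}

/* Example recovery function: marks a lock word as free. */
void release_lock(void *obj)
{
    *(int *)obj = 0;
}
```

In this model the locks held by a failed extension end up free again, so the rest of the system is not left waiting on them.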
Implementation Limitations

Nooks does not handle all possible errors
  Deliberate corruptions of system states
  Infinite loops
However, a moderate reduction of system
 crashes is a significant contribution
Achieving Transparency

Wrapper stubs for every call in the
 extension-kernel interface
Object-tracking code for every object type
 that is passed between the extension and
 the kernel
Nooks transparent to both the extension
 and the kernel
Reliability

Nooks can detect and recover from 99% of
 extension faults
Test Methodology

Synthetic fault injection
  Automatically changes single instructions in the
   extension code to emulate common errors
     Uninitialized variables
     Bad parameters
Types of Extensions Isolated

Device drivers (network, sound cards)
Optional kernel subsystems (VFAT)
Application-specific kernel extensions
 (kHTTPd)
Test Environment

VMware
  Allows automation of crash testing without
   reboots
5 extensions
  400 tests each
Test Results

Not all fault-injection trials cause faulty
 behavior
System Crashes

A system crash is easiest to detect
  OS panics
  Hangs
  Reboots
Native Linux experienced 317 crashes
Nooks eliminated 313 of them, or 99%
The remaining 4 were deadlocks
System Crashes

SoundBlaster and VFAT extensions are
 process-oriented
  Fewer crashes
kHTTPd, pcnet32, and e1000 are
 interrupt-based
  More crashes
Non-Fatal Extension Failures

Nooks cannot detect all erroneous extension
 behaviors
  Network could disappear
  Mounted file system hangs
Recovery Errors

A faulting extension is unloaded, reloaded,
 and restarted
  Works well with kHTTPd
  Not as well with VFAT
     Corruptions can propagate to disk if not detected in
      time
Summary of Reliability Experiments

Nooks eliminated 99% of the system
 crashes caused by extensions
Nooks eliminated nearly 60% of non-fatal
 extension failures
Performance

Dell 1.7 GHz Pentium 4
890 MB of RAM
SoundBlaster 16
Intel Pro/1000 Gb Ethernet Adapter
7200 RPM, 41 GB IDE HD
Linux 2.4.18
Sound Benchmark

Plays an MP3 file at 128 Kb/sec
150 XPCs/sec
Nooks imposes little overhead
Network Benchmark

netperf performance tool
A node sends/receives a stream of 32 KB
 TCP messages via a 256KB buffer
  10% overhead
Compile Benchmark

Linux kernel compilation on VFAT
25% slowdown
Web Server Benchmarks

httperf
  Repeatedly request a 1-KB file and measure
   the maximum request rate
  60% slowdown
  CPU bound
SPECweb99
  3% slowdown
Summary

If the computation is not CPU bound, the
 penalty may not be important
Conclusions

Nooks is achievable with modest
 engineering effort
Extensions such as device drivers can be
 isolated without changes to extension
 code
Isolation and recovery can dramatically
 improve the system’s ability to survive
 extension faults

				