Document Sample
Pond Powered By Docstoc
The OceanStore Prototype
   Problem: Rising cost of storage
     Universal connectivity via Internet
     $100 terabyte storage within three years
   Solution: OceanStore
  Cooperative file system
  High durability
  Universal availability
  Two-tier storage system
    Upper tier: powerful servers
    Lower tier: less powerful hosts
More on OceanStore
   Unit of storage: data object
   Applications: email, UNIX file system
   Requirements for the object interface
     Information universally accessible
     Balance between privacy and sharing
     Simple and usable consistency model
     Data integrity
OceanStore Assumptions
   Infrastructure untrusted except in
     Most nodes are not faulty and malicious
   Infrastructure constantly changing
     Resources enter and exit the network
     without prior warning
     Self-organizing, self-repairing, self-tuning
OceanStore Challenges
   Expressive storage interface
   High durability on untrusted and
   changing base
Data Model
  The view of the system that is
  presented to client applications
Storage Organization
   OceanStore data object ~= file
   Ordered sequence of read-only versions
     Every version of every object kept forever
     Can be used as backup
   An object contains metadata, data, and
   references to previous versions
Storage Organization
   A stream of objects identified by AGUID
     Active globally-unique identifier
     Cryptographically-secure hash of an
     application-specific name and the owner’s
     public key
     Prevents namespace collisions
Storage Organization
   Each version of data object stored in a
   B-tree like data structure
     Each block has a BGUID
      • Cryptographically-secure hash of the block
     Each version has a VGUID
     Two versions may share blocks
Storage Organization
Application-Specific Consistency
   An update is the operation of adding a
   new version to the head of a version
   Updates are applied atomically
     Represented as an array of potential
     Each guarded by a predicate
Application-Specific Consistency
   Example actions
     Replacing some bytes
     Appending new data to an object
     Truncating an object
   Example predicates
     Check for the latest version number
     Compare bytes
Application-Specific Consistency
   To implement ACID semantic
     Check for readers
     If none, update
   Append to a mailbox
     No checking
   No explicit locks or leases
Application-Specific Consistency
   Predicate for reads
      • Can’t read something older than 30 seconds
      • Only can read data from a specific time frame
System Architecture
   Unit of synchronization: data object
   Changes to different objects are
Virtualization through Tapestry
   Resources are virtual and not tied to
   particular hardware
   A virtual resource has a GUID, globally
   unique identifier
   Use Tapestry, a decentralized object
   location and routing system
     Scalable overlay network, built on TCP/IP
Virtualization through Tapestry
   Use GUIDs to address hosts and
   Hosts publish the GUIDs of their
   resources in Tapestry
   Hosts also can unpublish GUIDs and
   leave the network
Replication and Consistency
   A data object is a sequence of read-only
   versions, consisting of read-only blocks,
   named by BGUIDs
   No issues for replication
   The mapping from AGUID to the latest
   VGUID may change
   Use primary-copy replication
Replication and Consistency
   The primary copy
     Enforces access control
     Serializes concurrent updates
Archival Storage
   Replication: 2x storage to tolerate one
   Erasure code is much better
     A block is divided into m fragments
     m fragments encoded into n > m
     Any m fragments can restore the original
Caching of Data Objects
   Reconstructing a block from erasure
   code is an expensive process
   Need to locate m fragments from m
   Use whole-block caching for frequently-
   read objects
Caching of Data Objects
   To read a block, look for the block first
   If not available
     Find block fragments
     Decode fragments
     Publish that the host now caches the block
     Amortize the cost of erasure
Caching of Data Objects
   Updates are pushed to secondary
   replicas via application-level multicast
The Full Update Path
   Serialized updates are disseminated via
   the multicast tree for an object
   At the same time, updates are encoded
   and fragmented for long-term storage
The Full Update Path
The Primary Replica
   Primary servers run Byzantine
   agreement protocol
     Need more than 2/3 nonfaulty participants
     Messages required grow quadratic in the
     number of participants
Public-Key Cryptography
   Too expensive
   Use symmetric-key message
   authentication codes (MACs)
     Two to three orders of magnitude faster
     Downside: can’t prove the authenticity of
     a message to the third party
     Used only for the inner ring
   Public-key cryptography for outer ring
Proactive Threshold Signatures
   Byzantine agreement guarantees
   correctness if not more than 1/3 servers
   fail during the life of the system
   Not practical for a long-lived system
   Need to reboot servers at regular
   Key holders are fixed
Proactive Threshold Signatures
   Proactive threshold signatures
     More flexibility in choosing the membership
     of the inner ring
   A public key is paired with a number of
   private keys
   Each server uses its key to generate a
   signature share
Proactive Threshold Signatures
   Any k shares may be combined to
   produce a full signature
   To change membership of an inner ring
     Regenerate signature shares
     No need to change the public key
     Transparent to secondary hosts
The Responsible Party
   Who chooses the inner ring?
   Responsible party:
     A server that publishes sets of failure-
     independent nodes
      • Through offline measurement and analysis
Software Architecture
   Java atop the Staged Event Driven
   Architecture (SEDA)
     Each subsystem is implemented as a stage
     With each own state and thread pool
     Stages communicate through events
     50,000 semicolons by five graduate
     students and many undergrad interns
Software Architecture
Language Choice
  Java: speed of development
    Strongly typed
    Garbage collected
    Reduced debugging time
    Support for events
    Easy to port multithreaded code in Java
     • Ported to Windows 2000 in one week
Language Choice
  Problems with Java:
    Unpredictability introduced by garbage
    Every thread in the system is halted while
    the garbage collector runs
    Any on-going process stalls for ~100
    May add several seconds to requests travel
    cross machines
Experimental Setup
   Two test beds
     Local cluster of 42 machines at Berkeley
      • Each with 2 1.0 GHz Pentium III
      • 1.5GB PC133 SDRAM
      • 2 36GB hard drives, RAID 0
      • Gigabit Ethernet adaptor
      • Linux 2.4.18 SMP
Experimental Setup
    PlanetLab, ~100 nodes across ~40 sites
     • 1.2 GHz Pentium III, 1GB RAM
     • ~1000 virtual nodes
Storage Overhead
   For 32 choose 16 erasure encoding
     2.7x for data > 8KB
   For 64 choose 16 erasure encoding
     4.8x for data > 8KB
The Latency Benchmark
  A single client submits updates of
  various sizes to a four-node inner ring
  Metric: Time from before the request is
  signed to the signature over the result
  is checked
  Update 40 MB of data over 1000
  updates, with 100ms between updates
The Latency Benchmark

           Update Latency (ms)           Latency Breakdown
  Key     Update 5%      Median   95%     Phase      Time (ms)
  Size     Size  Time     Time    Time    Check            0.3
            4kB     39       40     41   Serialize         6.1
            2MB   1037     1086   1348    Apply            1.5
            4kB     98       99    100   Archive           4.5
            2MB   1098     1150   1448     Sign           77.8
The Throughput Microbenchmark
   A number of clients submit updates of
   various sizes to disjoint objects, to a
   four-node inner ring
   The clients
     Create their objects
     Synchronize themselves
     Update the object as many time as
     possible for 100 seconds
The Throughput Microbenchmark
Archive Retrieval Performance
   Populate the archive by submitting
   updates of various sizes to a four-node
   inner ring
   Delete all copies of the data in its
   reconstructed form
   A single client submits reads
Archive Retrieval Performance
     1.19 MB/s (Planetlab)
     2.59 MB/s (local cluster)
     ~30-70 milliseconds
The Stream Benchmark
  Ran 500 virtual nodes on PlanetLab
    Inner Ring in SF Bay Area
    Replicas clustered in 7 largest P-Lab sites
  Streams updates to all replicas
    One writer - content creator – repeatedly
    appends to data object
    Others read new versions as they arrive
    Measure network resource consumption
The Stream Benchmark
The Tag Benchmark
  Measures the latency of token passing
  OceanStore 2.2 times slower than
The Andrew Benchmark
  File system benchmark
  4.6x than NFS in read-intensive phases
  7.3x slower in write-intensive phases

Shared By: