					Distributed Systems

       Jeff Chase
     Duke University
         Challenge: Coordination
• The solution to availability and scalability is to decentralize and
  replicate functions and data…but how do we coordinate the
  nodes?
    –   data consistency
    –   update propagation
    –   mutual exclusion
    –   consistent global states
    –   group membership
    –   group communication
    –   event ordering
    –   distributed consensus
    –   quorum consensus
                   Overview
• The problem of failures
• The fundamental challenge of consensus
• General impossibility of consensus
   – in asynchronous networks (e.g., the Internet)
   – safety vs. liveness: “CAP theorem”
• Manifestations
   – NFS cache consistency
   – Transaction commit
   – Master election in Google File System
         Properties of nodes
• Essential properties typically assumed by model:
   – Private state
      • Distributed memory: model sharing as messages
   – Executes a sequence of state transitions
      • Some transitions are reactions to messages
      • May have internal concurrency, but hide that
   – Unique identity
   – Local clocks with bounded drift
               Node failures
• Fail-stop. Nodes/actors may fail by stopping.
• Byzantine. Nodes/actors may fail without stopping.
   – Arbitrary, erratic, unexpected behavior
   – May be malicious and disruptive
• Unfaithful behavior
   – Actors may behave unfaithfully from self-
     interest.
   – If it is rational, is it Byzantine?
   – If it is rational, then it is expected.
   – If it is expected, then we can control it.
   – Design in incentives for faithful behavior, or
     disincentives for unfaithful behavior.
                   Example: DNS

[Figure: the DNS name hierarchy: the DNS roots; generic TLDs (com, gov, org, net, firm, shop, arts, web) and country-code TLDs (us, fr); and example subtrees under .edu (duke, unc, washington) with zones such as cs, mc, and env and hosts such as www ("prophet") and vmm01.]

• What can go wrong if a DNS server fails?
• How much damage can result from a server failure?
• Can it compromise the availability or integrity of the entire DNS system?
                 DNS Service 101

[Figure: a client asks its local DNS server to "lookup www.nhc.noaa.gov"; the local server queries the DNS server for nhc.noaa.gov, which answers "www.nhc.noaa.gov is 140.90.176.22", the address of the WWW server.]

• client-side resolvers
   – typically in a library
   – gethostbyname, gethostbyaddr
• cooperating servers
   – query-answer-referral model
   – forward queries among servers
   – server-to-server may use TCP ("zone transfers")
              Node recovery
• Fail-stopped nodes may revive/restart.
   – Retain identity
   – Lose messages sent to them while failed
   – Arbitrary time to restart…or maybe never
• Restarted node may recover state at time of failure.
   – Lose state in volatile (primary) memory.
   – Restore state in non-volatile (secondary) memory.
   – Writes to non-volatile memory are expensive.
   – Design problem: recover complete states reliably,
     with minimal write cost.
 Distributed System Models
• Synchronous model
   – Message delay is bounded and the bound is known.
   – E.g., delivery before next tick of a global clock.
   – Simplifies distributed algorithms
      • “learn just by watching the clock”
      • absence of a message conveys information.
• Asynchronous model
   – Message delays are finite, but unbounded/unknown
   – More realistic/general than synchronous model.
       • “Beware of any model with stronger assumptions.” - Burrows
    – Strictly harder/weaker than synchronous model.
       • Consensus is not always possible
        Messaging properties
• Other possible properties of the messaging model:
   – Messages may be lost.
   – Messages may be delivered out of order.
   – Messages may be duplicated.
• Do we need to consider these in our distributed
  system model?
• Or, can we solve them within the asynchronous model,
  without affecting its foundational properties?
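Lost, duplicated, or reordered messages can be masked within the asynchronous model itself, without adding any timing assumptions, by layering sequence numbers, acknowledgments, and retransmission over the raw channel. A minimal sketch in Python (the channel simulation and all names are illustrative assumptions, not any real protocol stack):

    import random

    class UnreliableChannel:
        """Simulated channel: may drop, duplicate, and reorder messages."""
        def __init__(self):
            self.in_flight = []
        def send(self, msg):
            if random.random() < 0.8:          # sometimes drop
                self.in_flight.append(msg)
            if random.random() < 0.2:          # sometimes duplicate
                self.in_flight.append(msg)
            random.shuffle(self.in_flight)     # arbitrary reordering
        def drain(self):
            msgs, self.in_flight = self.in_flight, []
            return msgs

    class ReliableSender:
        """Retransmit each message until it is acknowledged."""
        def __init__(self, channel):
            self.channel, self.next_seq, self.unacked = channel, 0, {}
        def send(self, payload):
            self.unacked[self.next_seq] = payload
            self.next_seq += 1
        def tick(self):                        # called periodically
            for seq, payload in self.unacked.items():
                self.channel.send(("DATA", seq, payload))
        def on_ack(self, seq):
            self.unacked.pop(seq, None)

    class ReliableReceiver:
        """Deliver each sequence number exactly once, in order."""
        def __init__(self):
            self.expected, self.buffered, self.delivered = 0, {}, []
        def on_data(self, seq, payload):
            self.buffered.setdefault(seq, payload)     # duplicates are ignored
            while self.expected in self.buffered:      # deliver in order
                self.delivered.append(self.buffered.pop(self.expected))
                self.expected += 1
            return ("ACK", seq)                        # ack what we received

The point is that reliability is recoverable inside the model: retransmission handles loss, sequence numbers handle duplication and reordering, and none of it requires a bound on delay.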
        Network File System (NFS)

[Figure: NFS architecture. On the client, user programs call through the syscall layer and VFS, which dispatches either to the local UFS or to the NFS client; the NFS client talks over the network to the NFS server module, which calls through the server's VFS into UFS on the server's disk.]
               NFS Protocol
• NFS is a network protocol layered above TCP/IP.
  – Original implementations (and some today) use UDP
    datagram transport.
     • Maximum IP datagram size was increased to
       match FS block size, to allow send/receive of
       entire file blocks.
     • Newer implementations use TCP as a transport.
  – The NFS protocol is a set of message formats and
    types for request/response (RPC) messaging.
       NFS: From Concept to
         Implementation
• Now that we understand the basics, how do we make
  it work in a real system?
   – How do we make it fast?
      • Answer: caching, read-ahead, and write-behind.
   – How do we make it reliable? What if a message is
     dropped? What if the server crashes?
      • Answer: client retransmits request until it
        receives a response.
   – How do we preserve the failure/atomicity model?
      • Answer: well...
 NFS as a “Stateless” Service
• The NFS server maintains no transient information
  about its clients.
   – The only “hard state” is the FS data on disk.
   – Hard state: must be preserved for correctness.
   – Soft state: an optimization, can discard safely.
• “Statelessness makes failure recovery simple and
  efficient.”
   – If the server fails client retransmits pending
     requests until it gets a response (or gives up).
   – “Failed server is indistinguishable from a slow
     server or a slow network.”
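A minimal sketch of that client retry loop, assuming a hypothetical send_once transport call that raises a timeout exception; the real NFS client logic lives in the kernel and is more involved:

    import time

    class RpcTimeout(Exception):
        pass

    def nfs_request(send_once, request, timeout=1.0, max_tries=8):
        """Retransmit an idempotent NFS request until a response arrives.

        A failed server is indistinguishable from a slow server or a slow
        network, so the client simply retries (and eventually gives up).
        send_once is a hypothetical transport call that raises RpcTimeout
        if no reply arrives within `timeout` seconds.
        """
        for attempt in range(max_tries):
            try:
                return send_once(request, timeout)
            except RpcTimeout:
                time.sleep(min(timeout * (2 ** attempt), 30))   # back off, retry
        raise RpcTimeout("giving up after %d tries" % max_tries)

Because the server keeps no session state, this loop needs no recovery handshake: it just keeps resending the same idempotent request.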
Drawbacks of a Stateless Service
• Classical NFS has some key drawbacks:
   – ONC RPC has execute-mostly-once semantics
     (“send and pray”).
   – So operations must be designed to be idempotent.
      • Executing it again does not change the effect.
      • Does not support file appends, exclusive name
        creation, etc.
   – Server writes updates to disk synchronously.
      • Slowww…
   – Does not preserve local single-copy semantics.
      • Open files may be removed.
      • Server cannot help in client cache consistency.
        File Cache Consistency
• Caching is a key technique in distributed systems.
   – The cache consistency problem: cached data may become
     stale if cached data is updated elsewhere in the network.
• Solutions:
   – Timestamp invalidation (NFS-Classic).
      • Timestamp each cache entry, and periodically query the
        server: “has this file changed since time t?”; invalidate
        cache if stale.
   – Callback invalidation (AFS).
      • Request notification (callback) from the server if the file
        changes; invalidate cache on callback.
   – Delegation “leases” [Gray&Cheriton89, NQ-NFS, NFSv4]
 Timestamp Validation in NFS [1985]

• NFSv2/v3 uses a form of timestamp validation.
  – Timestamp cached data at file grain.
  – Maintain per-file expiration time (TTL)
  – Refresh/revalidate if TTL expires.
     • NFS Get Attributes (getattr)
     • Similar approach used in HTTP
  – Piggyback file attributes on each response.
• What happens on server failure? Client failure?
• What TTL to use?
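A sketch of TTL-based revalidation at file grain, with a hypothetical server object offering getattr and read calls (not the actual NFS client code):

    import time

    class CachedFile:
        def __init__(self, data, mtime, ttl):
            self.data, self.mtime = data, mtime
            self.expires = time.time() + ttl       # per-file expiration time

    class AttrCache:
        """Revalidate a cached file with the server when its TTL expires."""
        def __init__(self, server, ttl=3.0):
            self.server, self.ttl, self.cache = server, ttl, {}

        def read(self, path):
            entry = self.cache.get(path)
            if entry and time.time() < entry.expires:
                return entry.data                      # within TTL: trust the cache
            attrs = self.server.getattr(path)          # "has this file changed?"
            if entry and attrs["mtime"] == entry.mtime:
                entry.expires = time.time() + self.ttl # unchanged: refresh the TTL
                return entry.data
            data = self.server.read(path)              # stale or missing: refetch
            self.cache[path] = CachedFile(data, attrs["mtime"], self.ttl)
            return data

A short TTL gives tighter consistency but more getattr traffic; a long TTL does the opposite, which is exactly the tuning question above.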
      Delegations or “Leases”
• In NQ-NFS, a client obtains a lease on the file that
  permits the client’s desired read/write activity.
      • “A lease is a ticket permitting an activity; the
        lease is valid until some expiration time.”
   – A read-caching lease allows the client to cache
     clean data.
      • Guarantee: no other client is modifying the file.
   – A write-caching lease allows the client to buffer
     modified data for the file.
      • Guarantee: no other client has the file cached.
      • Allows delayed writes: client may delay issuing
        writes to improve write performance (i.e., client
        has a writeback cache).
     Using Delegations/Leases
1. Client NFS piggybacks lease requests for a given file
   on I/O operation requests (e.g., read/write).
   – NQ-NFS leases are implicit and distinct from file locking.
2. The server determines if it can safely grant the
   request, i.e., does it conflict with a lease held by
   another client.
   – read leases may be granted simultaneously to multiple clients
   – write leases are granted exclusively to a single client
3. If a conflict exists, the server may send an eviction
   notice to the holder.
   – Evicted from a write lease? Write back.
   – Grace period: server grants extensions while client writes.
   – Client sends vacated notice when all writes are complete.
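A sketch of the server-side grant decision under these rules (read leases shared, write leases exclusive), with hypothetical names and eviction reduced to a callback:

    import time

    READ, WRITE = "read", "write"

    class LeaseTable:
        """Grant per-file read/write leases; evict conflicting holders."""
        def __init__(self, term=30.0):
            self.term = term
            self.leases = {}        # path -> {client: (mode, expiration)}

        def request(self, path, client, mode, evict):
            now = time.time()
            others = {c: m for c, (m, t) in self.leases.get(path, {}).items()
                      if t > now and c != client}      # unexpired leases held by others
            read_ok = mode == READ and all(m == READ for m in others.values())
            write_ok = mode == WRITE and not others
            if not (read_ok or write_ok):
                for holder in others:
                    evict(holder, path)                # send eviction notice
                return None                            # caller retries after writeback
            self.leases.setdefault(path, {})[client] = (mode, now + self.term)
            return now + self.term                     # lease expiration time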
                     Failures
• Challenge:
   – File accesses complete if the server is up.
   – All clients “agree” on “current” file contents.
   – Leases are not broken (mutual exclusion)
• What if a client fails while holding a lease?
• What if the server fails?
• How can any node know that any other node has
  failed?
                  Consensus

[Figure: three processes P1, P2, P3. Step 1 (Propose): each Pi proposes a value vi over an unreliable multicast. Step 2 (Decide): the processes run a consensus algorithm and each Pi decides a value di.]

                            Generalizes to N nodes/processes.
   Properties for Correct Consensus
• Termination: All correct processes eventually decide.
• Agreement: All correct processes select the same di.
      • Or…(stronger) all processes that do decide
         select the same di, even if they later fail.
       • Called uniform consensus: “Uniform consensus is
          harder than consensus.”
• Integrity: All deciding processes select the “right”
  value.
   – As specified for the variants of the consensus
     problem.
Properties of Distributed Algorithms

• Agreement is a safety property.
   – Every possible state of the system has this
     property in all possible executions.
   – I.e., either they have not agreed yet, or they all
     agreed on the same value.
• Termination is a liveness property.
   – Some state of the system has this property in all
     possible executions.
   – The property is stable: once some state of an
     execution has the property, all subsequent states
     also have it.
       Variant I: Consensus (C)

Pi selects di from {v0, …, vN-1}.
All Pi select di as the same vk.
If all Pi propose the same v, then di = v, else di is arbitrary.

                                                   Coulouris and Dollimore
     Variant II: Command Consensus (BG)

[Figure: a designated leader or commander proposes vleader to the subordinates or lieutenants; each subordinate Pi decides di = vleader.]

Pi selects di = vleader proposed by designated leader node Pleader if
the leader is correct, else the selected value is arbitrary.
As used in the Byzantine generals problem.
Also called attacking armies.

                                                   Coulouris and Dollimore
  Variant III: Interactive Consistency (IC)

Pi selects di = [v0, …, vN-1], a vector reflecting the values
proposed by all correct participants.

                                                 Coulouris and Dollimore
 Equivalence of Consensus Variants
• If any of the consensus variants has a solution, then all of them
  have a solution.
• Proof is by reduction.
   – IC from BG. Run BG N times, one with each Pi as leader.
   – C from IC. Run IC, then select from the vector.
   – BG from C.
       • Step 1: leader proposes to all subordinates.
       • Step 2: subordinates run C to agree on the proposed
         value.
   – IC from C? BG from IC? Etc.
Fischer-Lynch-Paterson (1985)
• No consensus can be guaranteed in an asynchronous
  communication system in the presence of any
  failures.
• Intuition: a “failed” process may just be slow, and can
  rise from the dead at exactly the wrong time.
• Consensus may occur recognizably, rarely or often.
       • e.g., if no inconveniently delayed messages
• FLP implies that no agreement can be guaranteed in
  an asynchronous system with Byzantine failures
  either. (More on that later.)
      Consensus in Practice I
• What do these results mean in an asynchronous world?
   – Unfortunately, the Internet is asynchronous, even if we
     believe that all faults are eventually repaired.
   – Synchronized clocks and predictable execution times
     don’t change this essential fact.
• Even a single faulty process can prevent consensus.
• Consensus is a practical necessity, so what are we to do?
     Consensus in Practice II
• We can use some tricks to apply synchronous algorithms:
   – Fault masking: assume that failed processes always
     recover, and reintegrate them into the group.
       • If you haven’t heard from a process, wait longer…
       • A round terminates when every expected message is
         received.
   – Failure detectors: construct a failure detector that can
     determine if a process has failed.
       • Use a failure detector that is live but not accurate.
       • Assume bounded delay.
       • Declare slow nodes dead and fence them off.
• But: protocols may block in pathological scenarios, and they
  may misbehave if a failure detector is wrong.
                   Fault Masking with a
                     Session Verifier

[Figure: a client sends “Do A for me.” to server S, which replies “OK, my verifier is x.” The client then sends “B” and receives “x” in response. S fails and restarts as S´. The client sends “C”, and the restarted server replies “OK, my verifier is y.” Seeing a new verifier, the client re-sends “A and B”, and the server acknowledges with “y”.]
What if y == x?
How to guarantee that y != x?
What is the implication of re-executing A and B after C has already executed?
Some uses: NFS V3 write commitment, RPC sessions, NFS V4 and DAFS (client).
   Delegation/Lease Recovery
• Key point: the bounded lease term simplifies
  recovery.
   – Before a lease expires, the client must renew the lease.
      • Else client is deemed to have “failed”.
   – What if a client fails while holding a lease?
      • Server waits until the lease expires, then unilaterally
        reclaims the lease; client forgets all about it.
      • If a client fails while writing on an eviction, server waits
        for write slack time before granting conflicting lease.
   – What if the server fails while there are outstanding leases?
      • Wait for lease period + clock skew before issuing new
        leases.
   – Recovering server must absorb lease renewal requests
     and/or writes for vacated leases.
              Failure Detectors
• How to detect that a member has failed?
   – pings, timeouts, beacons, heartbeats
   – recovery notifications
       • “I was gone for awhile, but now I’m back.”
• Is the failure detector accurate?
• Is the failure detector live (complete)?
• In an asynchronous system, it is possible for a failure detector
  to be accurate or live, but not both.
   – FLP tells us that it is impossible for an asynchronous system
      to agree on anything with accuracy and liveness!
Failure Detectors in Real Systems
• Use a detector that is live but not accurate.
   – Assume bounded processing delays and delivery times.
   – Timeout with multiple retries detects failure accurately
     with high probability. Tune it to observed latencies.
   – If a “failed” site turns out to be alive, then restore it or
     kill it (fencing, fail-silent).
   – Example: leases and leased locks
• What do we assume about communication failures? How
  much pinging is enough? What about network partitions?
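A sketch of such a live-but-inaccurate detector built from heartbeats and a tuned timeout (illustrative only; real systems add retries, per-node statistics, and fencing):

    import time

    class HeartbeatDetector:
        """Declare a node failed if no heartbeat arrives within the timeout.

        Live (a real failure is eventually reported) but not accurate
        (a slow node or a slow network can be falsely declared dead)."""
        def __init__(self, timeout=5.0):
            self.timeout = timeout
            self.last_seen = {}

        def heartbeat(self, node):
            self.last_seen[node] = time.time()

        def suspects(self):
            now = time.time()
            return {n for n, t in self.last_seen.items() if now - t > self.timeout}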
              A network partition

Fox&Brewer “CAP Theorem”: consistency (C), availability (A), partition-resilience (P): choose two.
Claim: every distributed system is on one side of the triangle.

• CA: available and consistent, unless there is a partition.
• CP: always consistent, even in a partition, but a reachable replica may deny service without agreement of the others (e.g., quorum).
• AP: a reachable replica provides service even in a partition, but may be inconsistent if there is a failure.
       Two Generals in practice

[Figure: one site issues $300 to the customer while another must deduct $300 from the account; the two must act consistently despite failures.]

How do banks solve this problem?

                                            Keith Marzullo
Careful ordering is limited
• Transfer $100 from Melissa’s account to mine
   1. Deduct $100 from Melissa’s account
   2. Add $100 to my account
• Crash between 1 and 2: we lose $100
• Could reverse the ordering
   1. Add $100 to my account
   2. Deduct $100 from Melissa’s account
• Crash between 1 and 2: we gain $100
• What does this remind you of?
Transactions
• Fundamental to databases
   (except MySQL, until recently)
• Several important properties
   – “ACID” (atomic, consistent, isolated, durable)
   – We only care about atomicity (all or nothing)

                               BEGIN
                                 disk write 1
                                 …
                                 disk write n
                               END    ← ending the transaction is called “committing” it
Transactions: logging
1. Begin transaction
2. Append info about modifications to a log
3. Append “commit” to log to end x-action
4. Write new data to normal database
• Single-sector write commits x-action (3)

[Timeline: Begin → Write1 → … → WriteN → Commit → (apply to DB) → Transaction Complete]

Invariant: append new data to log before applying to DB
Called “write-ahead logging”
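A toy sketch of the write-ahead rule in Python: append every update and then a commit record to the log, force the log, and only then apply the updates to the database (here a dict standing in for data pages; the file format is an assumption for illustration):

    import json, os

    class WALStore:
        """Toy write-ahead log: log updates and a commit record, then apply."""
        def __init__(self, log_path):
            self.log_path = log_path
            self.db = {}                      # stands in for the database pages

        def commit(self, txid, updates):
            with open(self.log_path, "a") as log:
                for key, value in updates.items():
                    log.write(json.dumps({"tx": txid, "key": key, "val": value}) + "\n")
                log.write(json.dumps({"tx": txid, "commit": True}) + "\n")
                log.flush()
                os.fsync(log.fileno())        # commit point: the log is durable
            self.db.update(updates)           # only now touch the database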
Transactions: logging
(steps 1–4 as above; the single-sector write of the commit record commits the x-action)

[Timeline: Begin → Write1 → … → WriteN → Commit → crash]

What if we crash here (between 3 and 4)?
On reboot, reapply committed updates in log order.
Transactions: logging
(steps 1–4 as above)

[Timeline: Begin → Write1 → … → WriteN → crash, before the commit record]

What if we crash here?
On reboot, discard uncommitted updates.
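A matching recovery sketch for the toy log format above: replay the log once, reapplying transactions that have a commit record and dropping the rest:

    import json

    def recover(log_path):
        """Rebuild the database from the log: reapply committed transactions
        in log order, discard updates with no commit record."""
        db, pending = {}, {}
        try:
            with open(log_path) as log:
                for line in log:
                    rec = json.loads(line)
                    if rec.get("commit"):
                        db.update(pending.pop(rec["tx"], {}))   # redo committed tx
                    else:
                        pending.setdefault(rec["tx"], {})[rec["key"]] = rec["val"]
        except FileNotFoundError:
            pass                              # no log yet: nothing to recover
        # anything left in `pending` was uncommitted and is simply dropped
        return db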
Committing Distributed Transactions
• Transactions may touch data at more than one site.
• Problem: any site may fail or disconnect while a
  commit for transaction T is in progress.
   – Atomicity says that T does not “partly commit”,
     i.e., commit at some site and abort at another.
   – Individual sites cannot unilaterally choose to abort
     T without the agreement of the other sites.
   – If T holds locks at a site S, then S cannot release
     them until it knows if T committed or aborted.
   – If T has pending updates to data at a site S, then
     S cannot expose the data until T commits/aborts.
Commit is a Consensus Problem
• If there is more than one site, then the sites must
  agree to commit or abort.
• Sites (Resource Managers or RMs) manage their own
  data, but coordinate commit/abort with other sites.
   – “Log locally, commit globally.”
• We need a protocol for distributed commit.
   – It must be safe, even if FLP tells us it might not
     terminate.
• Each transaction commit is led by a coordinator
  (Transaction Manager or TM).
     Two-Phase Commit (2PC)

[Figure: the coordinator (TM/C) sends “commit or abort?” to each participant (RM/P) in the precommit/prepare phase; each participant replies “here’s my vote”; if unanimous to commit, the TM decides to commit, else it decides to abort; the TM logs the commit/abort decision (the commit point), then notifies the participants “commit/abort!”.]

RMs validate the transaction and prepare by logging their local updates and decisions.
The TM logs commit/abort (the commit point).
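A sketch of the coordinator's side of this exchange (send and log are hypothetical helpers; a real TM also logs the prepare phase, retries notifications, and handles timeouts as discussed on the next slides):

    def two_phase_commit(tx, participants, send, log):
        """Coordinator (TM) side of 2PC: prepare, collect votes, decide, notify.

        send(p, msg) is a hypothetical RPC returning the participant's reply;
        log(record) forces a record to stable storage (the commit point)."""
        votes = [send(p, ("PREPARE", tx)) for p in participants]      # phase 1
        decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
        log((decision, tx))                   # commit point: decision is durable
        for p in participants:                # phase 2: notify every participant
            send(p, (decision, tx))
        return decision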
      Handling Failures in 2PC
How to ensure consensus if a site fails during the 2PC
   protocol?
1. A participant P fails before preparing.
       • Either P recovers and votes to abort, or C times
         out and aborts.
2. Each P votes to commit, but C fails before committing.
       • Participants wait until C recovers and notifies
         them of the decision to abort. The outcome is
         uncertain until C recovers.
       Handling Failures in 2PC,
              continued
3. P or C fails during phase 2, after the outcome is determined.
       • Carry out the decision by reinitiating the protocol
         on recovery.
       • Again, if C fails, the outcome is uncertain until C
         recovers.
Fox&Brewer “CAP Theorem” (revisited): C-A-P, choose two.
Claim: every distributed system is on one side of the triangle.

• CA: available and consistent, unless there is a partition.
• CP: always consistent, even in a partition, but a reachable replica may deny service without agreement of the others (e.g., quorum).
• AP: a reachable replica provides service even in a partition, but may be inconsistent.
             Parallel File Systems 101
• Manage data sharing in large data stores

[Figure: asymmetric designs (e.g., PVFS2, Lustre, HighRoad) vs. symmetric designs (e.g., GPFS, PolyServe).]

                                                 [Renu Tewari, IBM]
           Parallel NFS (pNFS)

[Figure: pNFS clients exchange control with an NFSv4+ server and move data directly to/from the storage back end: Block (FC) / Object (OSD) / File (NFS) storage.]

                                      [David Black, SNIA]
                   pNFS architecture

[Figure: same picture as above, highlighting the client-to-NFSv4+-server path.]

• Only this path (client to NFSv4+ server) is covered by the pNFS protocol
• Client-to-storage data path and server-to-storage control path are
  specified elsewhere, e.g.
   – SCSI Block Commands (SBC) over Fibre Channel (FC)
   – SCSI Object-based Storage Device (OSD) over iSCSI
   – Network File System (NFS)

                                                              [David Black, SNIA]
            pNFS basic operation
•   Client gets a layout from the NFS Server
•   The layout maps the file onto storage devices and addresses
•   The client uses the layout to perform direct I/O to storage
•   At any time the server can recall the layout (leases/delegations)
•   Client commits changes and returns the layout when it’s done
•   pNFS is optional, the client can always use regular NFSv4 I/O


[Figure: clients fetch layouts from the NFSv4+ server and perform I/O directly to storage.]

                                                      [David Black, SNIA]
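A rough sketch of that client-side flow (all of the names here are assumptions for illustration; the actual protocol is NFSv4.1's LAYOUTGET/LAYOUTCOMMIT/LAYOUTRETURN operations with considerably more state):

    def pnfs_write(metadata_server, path, offset, data):
        """Fetch a layout, write directly to the storage devices, then commit
        and return the layout; fall back to regular NFSv4 I/O if no layout."""
        layout = metadata_server.layout_get(path, offset, len(data))
        if layout is None:                                      # server declined
            return metadata_server.write(path, offset, data)    # regular NFSv4 path
        for seg in layout.segments(offset, data):               # direct I/O to storage
            seg.device.write(seg.device_offset, seg.bytes)
        metadata_server.layout_commit(path, layout)             # make changes visible
        metadata_server.layout_return(path, layout)
        return len(data)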
    Google GFS: Assumptions
• Design a Google FS for Google’s distinct needs
• High component failure rates
   – Inexpensive commodity components fail often
• “Modest” number of HUGE files
   – Just a few million
   – Each is 100MB or larger; multi-GB files typical
• Files are write-once, mostly appended to
   – Perhaps concurrently
• Large streaming reads
• High sustained throughput favored over low
  latency
                                         [Alex Moschuk]
         GFS Design Decisions
• Files stored as chunks
   – Fixed size (64MB)
• Reliability through replication
   – Each chunk replicated across 3+ chunkservers
• Single master to coordinate access, keep metadata
   – Simple centralized management
• No data caching
   – Little benefit due to large data sets, streaming reads
• Familiar interface, but customize the API
   – Simplify the problem; focus on Google apps
   – Add snapshot and record append operations


                                               [Alex Moschuk]
           GFS Architecture
• Single master
• Multiple chunkservers

    …Can anyone see a potential weakness in this design?
                                          [Alex Moschuk]
                Single master
• From distributed systems we know this is a:
  – Single point of failure
  – Scalability bottleneck
• GFS solutions:
  – Shadow masters
  – Minimize master involvement
     • never move data through it, use only for metadata
        – and cache metadata at clients
     • large chunk size
     • master delegates authority to primary replicas in data
       mutations (chunk leases)
• Simple, and good enough!
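A sketch of how a client keeps the master off the data path: ask the master only for the (file, chunk index) to chunkserver mapping, cache it, and talk to chunkservers directly (hypothetical names, not the real GFS client library):

    CHUNK_SIZE = 64 * 1024 * 1024     # 64MB chunks keep the chunk map small

    class GFSClient:
        def __init__(self, master):
            self.master = master
            self.chunk_cache = {}     # (path, chunk index) -> chunkserver replicas

        def read(self, path, offset, length):
            index = offset // CHUNK_SIZE
            replicas = self.chunk_cache.get((path, index))
            if replicas is None:
                replicas = self.master.lookup(path, index)   # metadata-only RPC
                self.chunk_cache[(path, index)] = replicas
            # data flows directly from a chunkserver, never through the master
            return replicas[0].read_chunk(path, index, offset % CHUNK_SIZE, length)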
               Fault Tolerance
• High availability
   – fast recovery
      • master and chunkservers restartable in a few seconds
   – chunk replication
      • default: 3 replicas.
   – shadow masters

• Data integrity
   – checksum every 64KB block in each chunk

What is the consensus problem here?
           Google Ecosystem
• Google builds and runs services at massive scale.
  – More than half a million servers
• Services at massive scale must be robust and
   adaptive.
  – To complement a robust, adaptive infrastructure
• Writing robust, adaptive distributed services is hard.
• Google Labs works on tools, methodologies, and
   infrastructures to make it easier.
  – Conceive, design, build
  – Promote and transition to practice
  – Evaluate under real use
                  Google Systems
•    Google File System (GFS) [SOSP 2003]
    – Common foundational storage layer
•    MapReduce for data-intensive cluster computing [OSDI 2004]
    – Used for hundreds of google apps
    – Open-source: Hadoop (Yahoo)
•    BigTable [OSDI 2006]
    – a spreadsheet-like data/index model layered on GFS
•    Sawzall
    – Execute filter and aggregation scripts on BigTable servers
•    Chubby [OSDI 2006]
    – Foundational lock/consensus/name service for all of the above
    – Distributed locks
    – The “root” of distributed coordination in Google tool set
      What Good is “Chubby”?
• Claim: with a good lock service, lots of distributed
   system problems become “easy”.
  – Where have we seen this before?
• Chubby encapsulates the algorithms for consensus.
  – Where does consensus appear in Chubby?
• Consensus in the real world is imperfect and messy.
  – How much of the mess can Chubby hide?
  – How is “the rest of the mess” exposed?
• What new problems does such a service create?
           Chubby Structure
• Cell with multiple participants (replicas and master)
  – replicated membership list
  – common DNS name (e.g., DNS-RR)
• Replicas elect one participant to serve as Master
  – master renews its Master Lease periodically
  – elect a new master if the master fails
  – all writes propagate to secondary replicas
• Clients send “master location requests” to any replica
  – returns identity of master
• Replace replica after long-term failure (hours)
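A sketch of the master-lease idea in isolation (greatly simplified: the lease state here is a single object, whereas Chubby replicates it with a consensus protocol underneath):

    import time

    class MasterLease:
        """One replica at a time holds the master role, for a bounded term."""
        def __init__(self, term=12.0):
            self.term = term
            self.master, self.expires = None, 0.0

        def try_acquire(self, replica):
            now = time.time()
            if self.master in (None, replica) or now > self.expires:
                self.master, self.expires = replica, now + self.term
                return True               # replica is master until the lease expires
            return False                  # someone else holds a valid lease

        def renew(self, replica):
            if self.master == replica and time.time() <= self.expires:
                self.expires = time.time() + self.term
                return True
            return False                  # lease lost: stop acting as master

If the master stops renewing (crash or partition), the others can elect a new master only after the old lease expires, which is exactly the bounded-term recovery argument used for leases earlier in the deck.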
Master Election/Fail-over
Fox&Brewer “CAP Theorem” (revisited): C-A-P, choose two.
Claim: every distributed system is on one side of the triangle.

• CA: available and consistent, unless there is a partition.
• CP: always consistent, even in a partition, but a reachable replica may deny service without agreement of the others (e.g., quorum).
• AP: a reachable replica provides service even in a partition, but may be inconsistent.

				