Separating Durability from Availability by wulinqing


									                       Separating Durability from Availability
                                                     Paper #196

Abstract                                                      ures are frequent in such systems: many nodes are in-
                                                              volved and wide-area networks are prone to failure. Deal-
Wide-area data storage systems maintain redundant data        ing with failures is expensive, since new replicas must
to assure availability and durability. The amount of re-      be created over wide-area or inter-rack links with limited
dundancy is important, since it affects both the network      capacity. These operating conditions make it difficult to
bandwidth consumed for replication and the probability        manage replication and to deploy DHTs that store large
of data loss. Many existing designs expect users to spec-     amounts of data.
ify the amount of redundancy, or choose the amount based         The benefits of realizing a wide-area storage system
on predicted probability of multiple server failure.          would be significant, however. Such a system could har-
    This paper makes two contributions to the understand-     ness the aggregate bandwidth and storage of many nodes
ing of redundant data in wide-area storage systems. First,    to provide both robustness and high capacity. With ef-
it shows that the redundancy level for durability should be   ficient replica management, such a storage system could
chosen based on the rate at which servers fail, since the     facilitate the construction of new and improved distributed
main challenge to durability is the time required to move     applications.
new redundant data over the network. Second, the pa-             Replication is traditionally used to ensure that data
per analyses how additional redundancy reduces the cost       remains both available (reachable on the network) and
of availability and describes an algorithm that creates the   durable (stored intact, but perhaps not reachable). Avail-
minimum amount of additional redundancy necessary to          ability is a harder goal than durability: availability may
weather temporary failures.                                   be threatened by any network, processor, software, or disk
    The paper describes the design of the Sostenuto dis-      failure, while durability is usually only threatened by disk
tributed hash table, which incorporates the paper’s ideas.    failures. Small-scale fault-tolerant storage systems such
Simulations of Sostenuto using failure traces from Plan-      as RAID provide availability and durability relatively in-
etLab support the correctness of the redundancy model,        expensively since failures are rare and plenty of local net-
and show that Sostenuto can store data with a low over-       work or bus bandwidth is available to create new replicas.
head and high availability. Over a 4-month period the            This paper makes two contributions to replica main-
system makes fewer than 5 copies of each data item and        tenance in wide-area storage systems. First, it develops
achieves better than 99.99% availability. Data-intensive      a model for durability based on the rate of permanent
applications such as the proposed Usenet replacement          replica failure and the time required to create new replicas
DHTGroups can make use of Sostenuto to reduce band-           over a limited-capacity network. The model predicts how
width requirements. Using Sostenuto, DHTGroups could          many replicas are required to ensure durability, and guides
store the complete 1.4TB of Usenet data generated each        decisions about how to maintain the replicas and where to
day on 300 nodes and retain the data for 4 months with        place them. Second, the paper presents an algorithm that
each node using less than 3 Mb/s instead of the 100 Mb/s      implicitly creates enough extra replicas to maintain avail-
required of traditional Usenet servers.                       ability while minimizing subsequent repairs due to tempo-
                                                              rary failures. This algorithm reduces the amount of repli-
                                                              cation traffic compared to existing techniques that create
1 Introduction                                                a set amount of extra redundancy at the outset [11, 3].
                                                                 We are motivated in this work by the prospect of a
Building a robust and reliable wide-area storage system       Usenet replacement built on a DHT. DHTGroups has the
requires significant engineering discipline. Data replicas     potential to greatly reduce the amount of storage and
can be lost by network outages [1], machine crashes and       bandwidth required to host a Usenet feed [9]. The ex-
disk failures. Storage systems have traditionally coped       isting Usenet distributes every article to every Usenet
with failure by replicating data on a small, fixed set of      server. About 1.4 TB of articles are posted each day,
servers on a local area network. Wide-area systems, such      requiring each server to transfer data constantly at about
as those based on distributed hash tables (DHTs), must        100Mb/sec.
place replicas on a constantly changing set of nodes. Fail-      DHTGroups replaces the local article storage at each

                                          SOSP 20 — Paper #196 — Page 1
server with shared storage provided by a DHT. This ap-          mark, rL . If the number of reachable replicas falls below
proach saves total storage since articles are no longer mas-    rL , the system immediately makes a new replica. The pur-
sively replicated; it also saves bandwidth since servers        pose of rL is to ensure durability.
only fetch articles that their clients actually read.               In the common case of temporary server failure or un-
   DHTGroups has the potential to reduce the bandwidth          reachability, the above eager repair policy will result in
requirement for running a Usenet server from 100Mb/sec          more than rL replicas. Sostenuto keeps track of these ex-
to around 1 Mb/sec. It will also reduce storage costs from      tra replicas, and does not delete them, in order to avoid
10TB per week to 60GB per week, assuming 300 servers            repair for future temporary failures and to get a head start
participate and 1 percent of articles are read at each site     on required repair for future permanent failures.
(these numbers are taken from studies of Usenet data).              The most important aspects of Sostenuto’s design are
These savings can be realized only if DHTs can satisfy          the model for choosing r L and the effects of a potentially
the performance requirements of Usenet: low latency ar-         unlimited upper bound on the replicas created by tempo-
ticle fetch, high throughput bulk read and write and effi-       rary failure. Sections 2.1 and 2.2 consider these questions
cient storage of large amounts of data. The first two con-       in turn. We will show, first, that a small r L provides good
cerns are addressed in earlier work by the developers of        durability in practical systems and, second, that although
DHash [8].                                                      the number of replicas we create due to transient failure is
   Here we address the final concern: if the cost of main-       potentially unlimited, the expected number is small.
taining data in a DHT is large, it could outweigh the po-
tential savings of a system like DHTGroups. We show
that, to the contrary, by building DHTGroups on top of          2.1 Eager Repair and the Low-Water Mark
Sostenuto, it is possible to store a complete archive of        Setting rL to 2 would ensure durability if replicas failed
Usenet data for 4 months, with a storage overhead factor        one at a time, with enough time between each failure to
of less than 5 and an additional bandwidth requirement of       make a new replica. After a server fails, there must be
approximately 2.5 Mb/s. We assume that DHTGroups and            enough time to make a new copy of each block it stored
Sostenuto will be deployed as a long-running service on         for which there are now fewer than rL replicas. If the server
unmanaged, but reliable nodes such as those in the Plan-        holds many blocks, or the servers’ access links are slow,
etLab testbed.                                                  then there may be a significant probability that another
   The roadmap of the paper is as follows. Section 2            server will fail before all the replicas are created.
discusses our model for replica failure and regeneration           The goal of setting r L is to ensure that no data will be
and the design of an algorithm (Sostenuto) based on that        lost if there are multiple failures. Increasing r L causes the
model. Section 3 evaluates the performance of the algo-         copying process to start earlier, so that there is more time
rithm on failure traces taken from the PlanetLab testbed.       to finish the copying before the last copy fails. We are
We discuss additional details related to the issue of replica   interested in understanding the relationship between r L ,
maintenance in Section 4, compare our work to prior work        the system environment (consisting of, for example, data
in Section 5, and conclude.                                     size, number of nodes, link bandwidth) and the reliability
                                                                of data. Since the purpose of rL is to ensure durability,
                                                                for this discussion we will assume that failures are perma-
2 Design                                                        nent. The existence of transient failures makes the sys-
In order to support applications like DHTGroups,                tem more durable for the particular r L we select: transient
Sostenuto must provide these properties:                        failures cause the system to make additional copies (this
                                                                mechanism will be discussed in detail in Section 2.2).
 1. It must ensure durability, essentially by copying data         This line of reasoning differs from prior work, which
    to new servers at the rate at which existing servers        analyzed a system’s availability and durability using the
    (or disks) suffer permanent failures.                       probability of a failure of a fraction of nodes in the sys-
                                                                tem [21, 7]. Instead, we analyze the system in terms of
 2. It must ensure availability, by maintaining more            its ability to create new copies more quickly than they are
    replicas of each piece of data than the likely number       lost.
    of concurrent transient replica failures.
                                                                   We consider a model in which node failures occur ran-
  In practice Sostenuto can only observe whether a              domly with rate λ f and the contents of a failed replica
replica is available (reachable over the network); it can-      server can be recreated at rate µ . The parameter λ f is a
not directly observe durability, because durable replicas       property of node reliabilities; µ depends on the amount of
might be on temporarily unavailable servers. For this rea-      data stored at each node and the bandwidth available to
son Sostenuto’s actions are driven by comparing the num-        create new replicas. This model is intended to capture the
ber of reachable replicas of a datum against a low-water        continuing process of replica loss and repair that occurs

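The eager repair policy described above (make a new replica whenever fewer than rL are reachable, and never forget a replica that has merely become unreachable) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Sostenuto implementation: the names BlockState, maintain, reachable, and candidates are hypothetical, new replicas are placed uniformly at random, and copying is treated as instantaneous.

```python
import random
from dataclasses import dataclass, field


@dataclass
class BlockState:
    """Every node ever given a copy of one block (hypothetical structure)."""
    replicas: set = field(default_factory=set)


def maintain(block, reachable, candidates, r_low, rng):
    """Run one maintenance pass for a block.

    reachable  -- nodes that currently answer over the network
    candidates -- nodes allowed to hold a new replica
    Returns the list of nodes that received a new copy in this pass.
    """
    created = []
    live = block.replicas & reachable      # copies we can see right now
    while len(live) < r_low:
        spare = sorted((candidates & reachable) - block.replicas)
        if not spare:
            break                          # nowhere to place another copy
        node = rng.choice(spare)
        block.replicas.add(node)           # remembered even if it later goes down
        live.add(node)
        created.append(node)
    return created


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = set(range(6))
    block = BlockState(replicas={0, 1})
    print(maintain(block, nodes - {1}, nodes, 2, rng))  # node 1 is down: one new copy
    print(maintain(block, nodes - {0}, nodes, 2, rng))  # node 1 is back, node 0 is down
```

Because the replica set is never pruned, the copy made while node 1 was unreachable still counts after node 1 returns, so the later outage of node 0 triggers no further copying.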
[Diagram: a chain of states ..., 3, 2, 1, 0. Each state moves   [Plot: Pr[block loss] on a logarithmic axis (ticks from
to the next higher state at rate µ; states 3, 2, and 1 move     1.00e-02 down to 1.00e-05) against a horizontal axis with
to the next lower state at rates 3λ, 2λ, and λ.]                ticks 2, 4, 6; one curve each for ρ = 5, ρ = 10, ρ = 50.]

Figure 1: The birth-death process for replicas. Each state      Figure 2: The relationship between ρ , r, and block relia-
above corresponds to the number of replicas of a given          bility (p0 )
block. Replicas are created at rate µ (governed by fac-
tors such as link capacity and the amount of data in the
system). Nodes fail with rate λ f . The rate of replica loss
grows as more replicas are added since nodes are failing
in parallel.

in a long-running system, rather than the instantaneous
impact of a number of simultaneous failures in a system         number of replicas from accumulating. If ρ is large, the
without repair. For this analysis we assume that failure        system will operate with more replicas (farther to the left
and repair times are exponentially distributed, with rates      of the chain) and is less likely to experience enough fail-
λ f and µ respectively.                                         ures to eliminate all extant copies. When ρ is smaller,
   We analyze the system using a birth-death Markov             fewer copies are likely to exist and the system is more
chain, shown in Figure 1. Each state in the chain corre-        likely to experience a string of failures that lead to data
sponds to the replication level of a single block. The sys-     loss.
tem moves to higher replication levels at the rate µ . The         From this analysis, we glean the intuition that the over-
rate at which transitions to lower levels are made increases    riding factor driving data durability is the ratio between
as the number of replicas increases: when n replicas are        the rate at which replicas can be created and the rate at
present, the likelihood of any server failing increases by a    which they are lost. The number of replicas that exist at
factor of n.                                                    any one time is a consequence of this ratio.
   We first consider a simplified system in which nodes
continually create new replicas; this corresponds to r L =         Unfortunately, ρ is largely out of the control of the sys-
∞. We examine this system because we are able to derive         tem: it depends on factors such as link bandwidth, data
closed forms for the properties we are interested in. We        scale, and failure rate. The system can control r L , how-
will adjust this model to correspond more closely to the        ever, to adjust how aggressively it replicates blocks. By
system we have deployed; that model will be an approxi-         modifying the above analysis, we can evaluate the rela-
mation of the infinite model.                                    tionship between different values of r L , ρ and block dura-
   We are interested in the probability that the system is in   bility. While it might be possible to design an adaptive
state zero: this corresponds to the probability no replica      algorithm which adjusts r L based on an estimate of ρ , we
for a block remains. If this probability (P0) is low, we        only intend to present an analysis that allows system de-
expect data to be stored durably. By an analysis of the         signers to make an informed engineering decision.
chain we can conclude that P0 = e^{−ρ} , where ρ = µ/λ f . A       To integrate rL into our model, we modify the Markov
derivation of this result is available in Appendix A. This      chain to have only r L states. At state rL the creation prob-
result tells us that the probability of data loss decreases     ability (µi ) is zero. We are interested in the value of P0
exponentially with ρ , independent of the number of repli-      in this system. The analysis proceeds as in the Appendix
cas the system creates.                                         except that infinite sums have been replaced by bounded
                                                                sums (up to rL ). We find that P0 = (∑_{r=0}^{rL} ρ^r / r!)^{−1} .
   To understand this result consider what ρ represents:
it is the ratio between the rate at which replicas can be          Figure 2 shows the relationship between r, ρ , and the
created and the rate at which they fail. For the system to      probability that the block will be lost. The value of ρ
reach a steady state we must have ρ > 1 or all replicas         has a large impact on the probability of block failure: at
will be lost before they can be replaced. The value of ρ        small values of ρ (below 5), it is impossible to maintain
dictates the expected number of replicas, r, in the system:     extremely low block failure rates even with an aggressive
E[r] = ρ (derivation in appendix). The expected number          low-water mark. In this regime, the system always oper-
of replicas does not diverge: even though the system can        ates with a handful of replicas (regardless of the value of
create replicas faster than a single one is lost (ρ > 1), the   rL ) and is vulnerable to a series of failures, spaced closely
effect of many replicas failing in parallel prevents a large    in time, causing data loss. When ρ is larger, increasing r L

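The closed forms of Section 2.1 are easy to evaluate numerically. The sketch below is a worked example only: it assumes the steady-state birth-death model with ρ taken as the ratio of the replica creation rate to the per-replica failure rate, and the function names (p0_unbounded, p0_low_water, expected_replicas) are hypothetical.

```python
import math


def p0_unbounded(rho):
    """P0 for the simplified system (rL = infinity): P0 = e^(-rho)."""
    return math.exp(-rho)


def p0_low_water(rho, r_low):
    """P0 when creation stops at state rL: the inverse of sum_{r=0}^{rL} rho^r / r!."""
    total = sum(rho ** r / math.factorial(r) for r in range(r_low + 1))
    return 1.0 / total


def expected_replicas(rho, r_low):
    """Mean replication level under the same truncated stationary distribution."""
    weights = [rho ** r / math.factorial(r) for r in range(r_low + 1)]
    return sum(r * w for r, w in enumerate(weights)) / sum(weights)


if __name__ == "__main__":
    for rho in (5, 10, 50):
        row = ", ".join(f"rL={r}: {p0_low_water(rho, r):.2e}" for r in (2, 4, 6))
        print(f"rho={rho:>2}  {row}")
```

With a large bound the truncated sum approaches e^ρ, so p0_low_water converges to p0_unbounded and expected_replicas converges to ρ, matching the two results quoted for the unbounded chain.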
SOSTENUTO (key k)                                               B, of copies beyond rL ; these copies make it extremely
       n = available_copies (k, replicas[k])                    unlikely that a transient failure will cause a new copy to
       if (n < rL )                                             be created. In fact, a new copy could be created only
          newnode = create_new_copy (k)                         when B nodes are simultaneously unavailable. The ef-
          replicas[k] += newnode                                fect of these additional replicas is similar to Total Recall’s
       else                                                     lazy repair [3], but does not require initial creation of a set
          // do nothing                                         number of extra replicas. The best number of extra repli-
                                                                cas is difficult to determine, so Sostenuto’s on-demand
                                                                creation of extra replicas eliminates an area of uncertainty
            Figure 3: The Sostenuto algorithm
                                                                in configuring the replication system.
                                                                   To understand the number of blocks that Sostenuto is
rapidly reduces the probability that the block will be lost.    likely to create, we calculate the probability that, for a
   Fortunately, the probability of loss is not sensitive to     given existing number of copies, a new one must be cre-
small changes in the value of r L in practical systems. An      ated. This probability decreases exponentially as blocks
analysis of the failure characteristics of nodes on the Plan-   are created. While the probability of creating a new block
etLab testbed shows that ρ > 50. We can safely set r L to       is always non-zero, after about 2r L /p blocks exist it is
two when running on a system such as PlanetLab. Be-             very unlikely that a new block will be created; p is the av-
cause a small, eagerly maintained, and fixed r L is suffi-        erage availability of nodes (e.g. if nodes are available as
cient to ensure durability on most systems, we can con-         often as they are unavailable, p = 0.5).
sider a separate mechanism to obtain availability cheaply.         To see this, note that as B increases, the probability that
                                                                B − rL nodes fail simultaneously falls, making it less and
                                                                less likely that additional replicas need to be created. As-
2.2 Coping with Transient Failure                               sume that each node in the system is independently avail-
Existing approaches have observed that creating addi-           able with probability p (e.g., nodes that are alive as often
tional redundancy up front [3, 11] can reduce the cost of       as they are down have p = 0.5). To determine whether a
maintaining data availability. Additional redundancy re-        new replica must be created we perform a simple exper-
lieves the system of the need to create a new replica in        iment: flip B coins biased to heads with probability p; if
response to failure. However, it can be difficult to deter-      fewer than rL come up heads, a new replica must be created.
mine the correct amount of additional redundancy: the           This experiment should be performed at periodic intervals
number of replicas needed depends on the number of com-         separated by the average node lifetime. This is a Bernoulli
mon concurrent failures. An ideal system would produce          process, and the number of successes is given by the bino-
enough replicas to tolerate that number of failures without     mial distribution. The probability of needing a new block,
needing repair.                                                 given r existing copies, is the probability of fewer than rL
   Sostenuto dynamically and eagerly creates additional         successes:
replicas to meet this goal. Observe that both durability
and availability will be achieved by eagerly creating a new       Pr[B < rL | r extant copies] = ∑_{a=0}^{rL−1} \binom{r}{a} p^a (1 − p)^{r−a}.
replica whenever the number of reachable replicas falls
below rL , for a suitable choice of r L . However, this ea-       Figure 4 shows the probability of creating an additional
ger policy will create new replicas faster than required —      block after various numbers of copies already exist, based
faster than the rate at which servers or disks fail perma-      on rL = 2. As the number of blocks created grows, the
nently — due to temporary server failure or unreachabil-        probability of another block creation drops exponentially.
ity. However, as long as the system can keep track of the       After about 2r L /p blocks have been created the time be-
extra replica, it will not need to generate a new replica in    tween block creations becomes large: an application of
response to a future single temporary failure. This holds       the Chernoff bound shows that the probability that fewer
when a storage system ensures that returning replicas re-       than rL of the 2rL /p total replicas are available is less than
join the appropriate replica sets. This is the approach that    e^{−rL p} . As a result, we would expect the required mainte-
Sostenuto takes to avoiding unnecessary copies.                 nance bandwidth for a block to decrease over time.
   When Sostenuto first creates a block, it makes r L repli-
cas. Sostenuto monitors the number of reachable replicas
of each block. If the number of reachable replicas falls        3 Evaluation
below rL , Sostenuto eagerly creates new copies to bring
the reachable number up to r L and begins to monitor that       This section evaluates the performance of Sostenuto and
node as well.                                                   compares it to an existing system, Total Recall [3]. We
   After some time Sostenuto will have created a number,        characterize the performance of an algorithm with two

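The coin-flipping experiment of Section 2.2 translates directly into code. The following is a minimal sketch, assuming (as the text does) that each replica is reachable independently with probability p; the function name repair_probability is hypothetical.

```python
import math


def repair_probability(copies, p, r_low=2):
    """Probability that fewer than r_low of `copies` replicas are reachable.

    This is the binomial tail sum_{a=0}^{r_low-1} C(copies, a) p^a (1-p)^(copies-a),
    i.e. the chance that a maintenance pass has to create another replica.
    """
    return sum(
        math.comb(copies, a) * p ** a * (1 - p) ** (copies - a)
        for a in range(r_low)
    )


if __name__ == "__main__":
    for p in (0.5, 0.7, 0.9):
        curve = [f"{repair_probability(c, p):.4f}" for c in range(2, 13, 2)]
        print(f"p={p}: {curve}")
```

Each printed row corresponds to one availability value p with rL = 2 and 2, 4, ..., 12 existing copies; the probabilities fall quickly as copies accumulate, which is the effect the section relies on.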
[Plot: Pr [repair action] (vertical axis, ticks 0.0 to 0.4)                 keeps track of this state internally and does not actually
against the number of replicas (horizontal axis, ticks 2 to                 maintain the inode structures. A low water mark rL is
12), with one curve each for p = 0.5, p = 0.7, and p = 0.9.]                used to determine when repair actions are taken: so long
                                                                            as the total redundancy for a given block is not less than
                                                                            rL , no action is taken. When the redundancy falls below
                                                                            rL , the master node selects replacement nodes at random
                                                                            for the failed nodes and re-replicates back to a level of rH
                                                                            on the other nodes.
                                                                                Our input to all simulations is as follows. All nodes
                                                                            are given a bandwidth budget of 900 bytes/second. Early
                                                                            in each trace, we insert 2500 256K blocks. With 2500
                                                                            blocks, and approximately 300 nodes, each node is re-
Figure 4: Additional redundancy must be created when                        sponsible for about 5 blocks, though the actual number of
the amount of live redundancy drops below the desired                       blocks stored on a node is dependent on r L (and will vary
amount. The probability of this happening depends solely                    with rH ) and its random placement in the ring. The band-
on the probability of node failure p and the amount of                      width limit was chosen so that it would take about 5 sim-
durable redundancy. This graph shows the probability of                     ulated minutes for a single block transfer in the network.
a repair action as a function of the amount of durable re-                  This was a simulation trade-off: in a real DHT implemen-
dundancy in the system, with p = 0.5, 0.7, and 0.9. For                     tation, blocks would be smaller and link capacities would
all three curves, rL = 2.                                                   be higher, but there would be many more blocks. Five
                                                                            minutes was chosen as an arbitrary granularity for copies.
                                                                                After selecting these parameters, we can estimate ρ for
metrics: bandwidth used to create replicas, and the to-                     the system to help select a reasonable value for r L to sim-
tal duration of block unavailability. An algorithm should                   ulate. The failure rate N λ f is 1/16670s, where N is the
minimize both metrics. We primarily use trace-driven                        number of nodes in the system. After a node failure,
simulation to evaluate the algorithms. We describe briefly                   the system loses approximately (256K)r L B/N ≈ 4 MB of
a prototype implementation based on MIT’s Chord imple-                      data. Because of the random placement of replicas (see
mentation as the underlying routing and storage layer.                      discussion in Section 4.2), this load is spread over differ-
                                                                            ent nodes in the system; in this case, each affected node
                                                                            will have to move one block, for 1/µ ≈ 300 s. This gives
3.1 Experimental Setup                                                      a ρ ≈ 56, suggesting that r L = 2 is more than sufficient to
We have implemented a custom event-driven simulator in                      maintain durability.
Python. The simulator models link bandwidth, but as-                            To calculate an optimal rL for Total Recall, we assume
sumes that all network paths are independent. Bandwidth                     a desired availability of A = 0.99. With the trace average
is limited at the sender to limit the rate at which copies                  host unavailability (µH ) of ≈ 0.11, this gives rL = ln(1 −
are initiated. The simulator also assumes perfect network                   A)/ ln(µH ) ≈ 2.05. Because Total Recall does not
connectivity and that the only outages are due to reboots                   specify how to measure a long-term decay rate and calcu-
or disk failures rather than network failures. In reality,                  late rH , we mimic their evaluation by running with several
there are numerous and frequent network outages which                       different values for r H .
may not be symmetric [1]. Each node additionally has
unlimited disk capacity.                                                    3.2 Trace overview
   The simulator takes as input a trace of activity consist-
ing of node and disk failures, repairs and block insertions.                The most important aspect of our simulation is a realistic
It simulates the behavior of nodes under different proto-                   trace of host outages and repairs. We also require a real-
cols and produces a trace of the availability of blocks at all              istic model of when computers lose the contents of their
times and amount of data sent and stored by each node.                      drives in a typical network environment.
   In this framework, we implemented Sostenuto and To-                         Existing traces, such as those measuring OverNet [2]
tal Recall. Our implementation of Total Recall follows                      or corporate networks [5] do not meet these requirements:
that of [3], focusing on replication. On a block insert, r H                they update node liveness information less than once per
copies are stored on a set of random nodes in the system.                   hour, do not include information about data loss events
In the real implementation, a master node for the block                     and do not cover very long time periods.
generates these copies and records their locations in an                       We use a detailed trace from historical data collected
inode object that is eagerly replicated on the first five suc-                by the CoMon project [14] on PlanetLab [15]. CoMon
cessors of the master node. For simplicity, the simulator                   monitors all PlanetLab hosts every five minutes, allowing

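The estimate of ρ in Section 3.1 is simple arithmetic, reproduced here as a hedged sketch. It assumes that ρ is the mean time between node failures divided by the time to re-create one block, and that "256K" means 256 × 1024 bytes; the function names are hypothetical.

```python
def block_transfer_time_s(block_bytes, bandwidth_bytes_per_s):
    """Seconds needed to copy one block at the per-node bandwidth budget."""
    return block_bytes / bandwidth_bytes_per_s


def bytes_lost_per_failure(block_bytes, r_low, num_blocks, num_nodes):
    """Approximate data held by one node: block size * rL * B / N."""
    return block_bytes * r_low * num_blocks / num_nodes


def estimate_rho(mean_failure_interarrival_s, repair_time_s):
    """rho as the ratio of the repair rate (1 / repair time) to the failure rate."""
    return mean_failure_interarrival_s / repair_time_s


if __name__ == "__main__":
    block = 256 * 1024
    print(block_transfer_time_s(block, 900))           # close to five minutes
    print(bytes_lost_per_failure(block, 2, 2500, 300))  # roughly 4 MB
    print(estimate_rho(16670, 300))                     # roughly 56
```

Using the rounded 300 s repair time from the text gives ρ of about 56; using the unrounded transfer time instead changes the estimate only slightly.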
  Dates                           12 Aug 2004 – 6 Jan 2005
  Number of hosts                 409
  Number of failures (reboots)    13356
  Number of data-loss failures    645
  Average host downtime (s)       1337078
  Failure interarrival time (s)   143, 1486, 488600
  Crash interarrival time (s)     555, 16670, 488600
    (Median/Mean/Max)

       Table 1: CoMon+PLC trace characteristics

[Figure 6 plot: total disk bytes stored (bytes stored on disk versus time in weeks) for Sostenuto (rl=2) and Total Recall (rl=2; rh=4, 6, 8, 10), with the number of live nodes.]

failures to be detected quickly. Further, CoMon reports
the actual uptime counter from each machine, allowing
the time at which each machine reboots to be determined
very precisely.
   In order to identify disk failures (resulting typically      Figure 6: The total amount of disk bytes stored on all
from operator actions such as operating system rein-            participating nodes over time.
stallations or disk upgrades), the CoMon measurements
were supplemented with event logs from PlanetLab Cen-
tral [16]. Disk failure information was available only after    system during the simulation, so all additional bytes sent
25 October 2004. Table 1 summarizes the statistics of this      are due entirely to maintenance traffic.
trace.                                                             Consider the curve corresponding to Sostenuto (solid
   Even though PlanetLab nodes are maintained by host           line): in the period immediately after data is inserted, ad-
institutions, the trace includes a large number of node and     ditional copies are quickly generated to maintain avail-
disk failures. Many of the failures are due to periodic up-     ability. However, by week 4, the number of such copies
grades of the nodes’ operating systems. Some of these           necessary has stopped increasing rapidly. At the end of
events cause a large number of correlated failures, a circum-   the trace, a total of approximately 3.3 GB has been sent,
stance for which our algorithm is not designed. This trace      corresponding to approximately 5.4 copies of the data.
ends prior to a particularly large upgrade in which 200         This is in line with our prediction that about 2rL/p = 4.8
machines were failed in a short time period.                    replicas would be created.
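
As a rough check on the figures quoted above, the following sketch evaluates the 2rL/p estimate and converts bytes sent into copies of the data. It assumes that p is the average node availability (about 0.85 for the generated traces) and that the 2500 blocks total 625 MB, the size quoted for the PlanetLab run; the function names are illustrative and do not come from the simulator.

```python
# Illustrative check of the replica estimate; names and inputs are assumptions.

def expected_replicas(r_low: int, p: float) -> float:
    """Estimated replicas created to mask transient failures: 2 * rL / p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 2 * r_low / p


def copies_sent(bytes_sent: float, original_bytes: float) -> float:
    """Express network traffic as a multiple of the original data size."""
    return bytes_sent / original_bytes


GIB = 2**30
MIB = 2**20

estimate = expected_replicas(r_low=2, p=0.85)   # assumed p = 0.85
observed = copies_sent(3.3 * GIB, 625 * MIB)    # 3.3 GB sent, assumed 625 MB of data
print(f"estimated replicas: {estimate:.1f}, observed copies: {observed:.1f}")
```

With p = 0.85 the estimate comes out near 4.7; the value of 4.8 in the text corresponds to a slightly lower p of roughly 0.83.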
   Before discussing the results from simulating against this      Total Recall behaves differently: it creates an amount
trace, we consider a second set of traces where node and        of fixed replication, r H , at the start of the trace. We plot
disk failures are generated by sampling random distribu-        values of rH = 4, 6, 8, 10. By storing large amounts of re-
tions based on the characteristics of the PlanetLab trace.      dundancy, Total Recall essentially does not have to create
We use the generated traces for illustrative purposes: the      any additional data blocks for r H ≥ 6. These values of
absence of noise and coupled events makes it much easier        rH are all greater than our prediction about the expected
to understand the behavior of the system. The generated         number of replicas required to avoid unavailability caused
traces operate with a 300 node population, with an ap-          by transient failures (4.8).
proximately 85% average availability. For the trace with           For rH = 4 Total Recall creates a large number of new
failures, 5% of outages are disk failures.                      replicas and continues to create them throughout the trace.
                                                                Because it “forgets” about unavailable replicas when cre-
                                                                ating a new replica set, Total Recall fails to take advantage
3.3 Generated traces                                            of returning replicas and sends more data than Sostenuto.
No disk failures Figure 5(a) shows the results of our           We can see the magnitude of this extra data by examin-
simulation for rL = 2 running on a generated trace with         ing the right side of the graph: Sostenuto sends a total of
no disk failures. By not generating any disk failures in the    3.3 GB; Total Recall with rH = 4 sends an additional 3.9
trace, we are guaranteed that all blocks are durable and        GB. This additional traffic demonstrates the importance
that the results will only reflect work that is performed to     of setting rH correctly.
maintain availability. The left graph shows the cumula-
tive number of bytes sent over time for all nodes in the        Considering disk failures Figure 5(b) shows the same
system. One curve is plotted for each of the systems con-       simulation when disks fail periodically: failures are
sidered: Sostenuto and Total Recall with different high         marked by vertical lines in the curve at the bottom of the
water marks. In the simulation all 2500 blocks are in-          graph. When disk failures are present Sostenuto again
serted at the beginning of the trace, causing a sharp rise in   makes copies quickly at first to deal with transient fail-
total bytes sent. No additional blocks are inserted into the    ures, but now continues to rise at a steady rate throughout

[Figure 5 plots: cumulative bytes sent over the 147 day trace versus time (weeks), for Sostenuto (rl=2) and Total Recall (rl=2; rh=4, 6, 8, 10), with the number of live nodes shown for reference. Panels: (a) No disk failures; (b) With disk failures.]

Figure 5: Comparing Sostenuto against Total Recall for r L = 2 on a simulated trace, with 5% of failures resulting in
data loss.

the trace. This continuing repair cost is caused by disk                                                           of the trace because it restores the number of replicas
failures. The curve here is the sum of the network usage                                                           only to the low-water mark while Total Recall recreates
when no disks fail, and a linear usage caused by periodic                                                          enough replicas to again reach the high-water mark. Fur-
disk failures.                                                                                                     ther, Sostenuto tracks essentially all copies of the block.
   Total Recall’s behaviour with r H = 4 is identical to the                                                          The net result is that Sostenuto causes about 2GB less
previous trace. With larger r H in the presence of disk fail-                                                      data to be sent over the network when compared to the
ures, Total Recall can initially ignore both transient and                                                         best performing Total Recall configuration. Sostenuto
permanent failures. This can be seen in Figure 6: whereas                                                          caused about 2.3GB of maintenance traffic for 1.2GB of
Sostenuto’s disk usage increases at the beginning, usage                                                           replicated data (625 MB of original data): this translates
for Total Recall decreases as disks fail until late in the trace                                                   to an additional 4.74 bytes of redundant data sent per byte
(week 12 for r H = 6, 8 and week 18 for r H = 10). Eventu-                                                         of original data over the 21 week trace (including the orig-
ally, disk failures erode Total Recall’s high initial replica-                                                     inal redundancy inserted into the system). The overhead
tion levels until it is forced to begin making repairs. Note                                                       of the best performing Total Recall (r H = 4) configuration
that in comparison, Sostenuto builds up disk usage to a                                                            is 8.8 bytes of redundant data per byte of original data.
certain level and maintains this level stably.                                                                        Figure 7(b) shows the cumulative total block unavail-
                                                                                                                   ability, an aggregate measure of the amount of time that
3.4 PlanetLab traces                                                                                               data blocks are unavailable. The value is calculated by
                                                                                                                   summing the number of seconds each block is unavail-
We now move to a consideration of the two algorithms                                                               able: that is, two blocks unavailable for one second each
running on a trace of PlanetLab node failures. Figure 7(a)                                                         would be equivalent to one block unavailable for two sec-
shows the number of bytes sent.                                                                                    onds. Regions of the graph with zero slope correspond
   The initial weeks of the trace behave in much the same                                                          to a state where all blocks initially inserted are avail-
way as the generated trace without failures. Sostenuto cre-                                                        able; positive slopes indicate periods where blocks are
ates about two additional replicas by week 12. Total Re-                                                           temporarily unavailable. We can see that both Total Re-
call with a high water mark of 4 does not create a large                                                           call with rH = 4 and Sostenuto experience short periods
excess of data because, in this part of the trace, the ex-                                                         where blocks are unavailable, whereas higher settings of
pected number of replicas necessary to weather transient                                                           rH result in complete block availability. The availability
failure is less than 4.                                                                                            achieved by all algorithms is well above the base 0.99 de-
   During weeks 12 to 19, disk failures (denoted by dot-                                                           sired: for 2500 blocks, there are a total of almost 3 × 10^10
ted lines at the bottom of the graph) begin to drive the                                                           block-seconds in the trace and the worst case configu-
bandwidth usage. All schemes are forced to make addi-                                                              ration (Total Recall with rH = 4) experiences 3 × 10^5
tional copies in response to disk failures. Sostenuto cre-                                                         block-seconds of unavailability or 0.001 percent of the to-
ates fewer copies than Total Recall during this portion                                                            tal block-seconds.
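
The block-seconds metric described above can be sketched as follows. This is a minimal illustration, not the simulator's code: the outage intervals are made up, and the function names are hypothetical. Only the block count (2500), the trace length (147 days), and the worst-case figure of 3 × 10^5 block-seconds are taken from the text.

```python
# Illustrative computation of the block-seconds unavailability metric.
# The interval data below is invented for the example.

SECONDS_PER_DAY = 86_400


def total_block_seconds(num_blocks: int, trace_days: int) -> int:
    """Block-seconds available in the whole trace if nothing ever fails."""
    return num_blocks * trace_days * SECONDS_PER_DAY


def cumulative_unavailability(outages: dict) -> int:
    """Sum, over all blocks, the seconds each block was unavailable.

    outages maps a block id to a list of (start, end) intervals in seconds.
    """
    return sum(end - start for spans in outages.values() for start, end in spans)


# Two blocks down for one second each count the same as one block down for two.
two_blocks = cumulative_unavailability({"a": [(0, 1)], "b": [(5, 6)]})
one_block = cumulative_unavailability({"c": [(0, 2)]})

total = total_block_seconds(num_blocks=2500, trace_days=147)
worst_case_fraction = 3e5 / total   # worst case quoted in the text
print(two_blocks == one_block, total, f"{worst_case_fraction:.4%}")
```

The exact total is slightly above 3 × 10^10 block-seconds, so the worst-case fraction lands just under 0.001 percent, consistent with the rounded figure in the text.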

[Figure 7 plots, both over the 147 day trace versus time (weeks), for Sostenuto (rl=2) and Total Recall (rl=2; rh=4, 6, 8, 10), with the number of live nodes. Panels: (a) Bytes sent over time (cumulative bytes sent); (b) Total unavailability, the sum over blocks b of t_unavailable (cumulative block-seconds of unavailability).]

Figure 7: Comparing Sostenuto against Total Recall for r L = 2. For reference, the number of monitored nodes is
shown at the bottom of both graphs; vertical bars indicate disk failures, omitted for clarity in the unavailability graph.
Blocks are inserted “instantaneously,” after the system stabilizes (and after an outage in CoMon).

3.5 Scaling up to DHTGroups                                                                                        requires servers to continually transfer data at about 100
We have implemented Sostenuto using the MIT DHash
implementation as the underlying routing and storage
layer. We modified the DHash implementation to main-                                                                4 Design Alternatives
tain additional replicas beyond r L rather than deleting
them: the successor list placement used by DHash already                                                           Previous sections have described and evaluated the basic
guarantees that returning replicas will rejoin the appro-                                                          design of Sostenuto. This section examines the trade-offs
priate replica set. Some additional implementation de-                                                             which must be confronted when deploying Sostenuto in
tails are discussed in Section 4.3. We additionally built                                                          practice.
a DHTGroups NNTP gateway to take a standard Usenet
feed provided by our university and store the articles into
the Sostenuto DHT running on PlanetLab.
                                                                                                                   4.1 Interaction with Erasure Codes
   Our DHTGroups prototype has been running for sev-                                                               We discussed replica maintenance in terms of whole repli-
eral months on PlanetLab. It was able to successfully                                                              cas of blocks in Section 2. Many of the results of that sec-
store 173,935 articles of Usenet news from a feed of the                                                           tion can be improved by the use of erasure codes. Erasure
standard Big-8 groups in a ring of 34 nodes in one 3 day                                                           codes fragment data so that once a number of unique bits
period. To understand the long term behavior of the sys-                                                           of data roughly corresponding to the size of the original
tem we can use our simulations to estimate the amount of                                                           block are collected, the block can be reassembled.
bandwidth required to maintain the data that DHTGroups                                                                Erasure codes improve both durability and availabil-
stores in the DHT, a consideration neglected by the au-                                                            ity compared to storing the same number of redundant
thors of [9].                                                                                                      bytes as full replicas. Coding is beneficial because it
   Consider storing 1.4 TB of data in the system for a 4                                                           increases the number of simultaneous failure events re-
month period, a period comparable to what is simulated                                                             quired to cause data loss without increasing the number
above. Four months is longer than what is provided by                                                              of bytes stored in the system. However, there are draw-
typical news providers today for binary articles, which                                                            backs to using erasure codes [17]. Coding requires ad-
constitute the bulk of Usenet data. Given the overhead                                                             ditional complexity and does not allow users to fetch a
predicted by our simulations this amount of data would                                                             range of bytes from a block efficiently. It also interferes
require the DHT to absorb 1.4 × 5.74 ≈ 8 TB of data                                                                with latency optimizations that attempt to fetch data from
for each day of Usenet traffic over the four month reten-                                                           servers near the requester [8].
tion period. This translates to approximately 2.5 Mbps                                                                We will compare coding and replication systems where
per node over the same period. A traditional Usenet feed                                                           the amount of redundant data is the same. The redun-

[Figure 8 plot: Pr[block loss] (log scale, 1.00e-10 to 1.00e-01) versus redundancy factor, for ρ = 10 with replicas, 2 fragments, and 4 fragments.]

 Figure 8: The probability of block loss for coding and
 replication when ρ = 10. Coding reduces the probabil-
 ity of block loss compared to replication with the same
 redundancy factor. Increasing the number of fragments
 increases the improvement in reliability.

[Figure 9 plot: Pr[repair action] (0.0 to 0.6) versus number of replicas, for r = 2, p = 0.7 and F = 7, p = 0.7.]

 Figure 9: Probability of block creation for availability us-
 ing erasure coding. Erasure coding decreases the amount
 of data created in response to transient failures.

                                                                            in the replica case.

 dancy factor for systems using whole replicas is r L . Cod-                4.2 Data Placement
 ing systems have an extra parameter, F, the number of                      One practical aspect of constructing systems is to decide
 fragments required for reconstruction. If the redundancy                   where to place extra redundancy. There are two main op-
 factor is two and F = 4, a coding system will maintain at                  tions: redundancy can be placed on random nodes, as in
 least 8 fragments, 4 of which are necessary to reconstruct                 Total Recall, or placed on nodes in the successor lists of
 the data.                                                                  the block. Choosing a placement strategy requires balanc-
    Erasure codes have two advantages compared to repli-                    ing several competing interests: random placement allows
 cas when the redundancy factor is the same. First, cod-                    for increased reconstruction parallelism which benefits
 ing places data on more nodes, meaning that more failure                   durability by increasing ρ . However, it creates the practi-
 events are required to cause data loss. Coding also in-                    cal problem of monitoring data availability on a large set
 creases ρ since creating a new fragment is F times faster                  of nodes. The characteristics of data availability imme-
 than creating a whole replica (the fragment is F times                     diately following a massive failure also differ for the two
 smaller). One disadvantage of coding is that more than                     placement schemes.
 one machine must be maintained to provide durability:
 the failure rate of any machine in a group of F machines                   Effective Durability and Recovery Parallelism When
 is proportionally more than the failure rate for a single                  a server fails, the amount of time required to create new
 machine.                                                                   replicas of lost blocks depends on how the cost of copying
    Figure 8 shows the probability of block loss for two                    is spread over the servers. This time influences µ and
 coding systems and replication when ρ = 10. The repli-                     thus ρ and we would like to minimize the time required
 cation line is identical to Figure 2 in Section 2.1. For a                 to make a copy to maximize reliability.
 given amount of redundancy, coding reduces the proba-                         Suppose that replicas are placed randomly on servers.
 bility of failure.                                                         If a server fails, the remaining copies of the blocks it held
    Coding also reduces the amount of redundant data re-                    are uniformly spread over the remaining servers. Further,
 quired to maintain availability during transient failures.                 the new replicas of those blocks should also be uniformly
 The use of erasure codes increases the number of nodes                     spread over the servers. Thus the work of copying the
 that must fail before data is lost; it also increases the num-             blocks can be spread uniformly over the servers, and can
 ber of nodes that must be alive for data to be found. The                  proceed in parallel (assuming a network with internal par-
 net result is a decrease in the number of additional bytes                 allel paths). The copies will take time proportional to
 necessary to lower the probability of creating a new block.                rL B/N^2 where B is the number of unique blocks in the
 Figure 9 shows the probability of creating a new fragment                  system: each node stores rL B/N blocks, and all N nodes
 (solid line) given the redundancy in created fragments on                  participate in the copying.
 the x-axis. As redundancy grows, the number of extant                         The other common replica placement strategy is to put a
 fragments grows (faster than the number of replicas would                  block’s replicas on successive nodes in a consistent hash-
 have), and the probability of loss drops more quickly than                 ing ring. In this case, when a server fails, its r L − 1 pre-

decessors must each copy B/N blocks to different one of
the failed server’s successors, and the failed server’s suc-                              0.8

                                                                 cumulative probability
cessor must copy B/N blocks to its r L th successor. In
this case, rL copies can proceed in parallel, so copying                                                                                random
                                                                                                                                        succ. list
requires B/N time, as long as r L    N.                                                   0.4

   Random placement recovers from a node failure in
about 1/N-th the time of successor placement. This is                                     0.2

advantageous from a reliability standpoint.                                               0.0
                                                                                                0.92                 0.94        0.96
                                                                                                       fraction of blocks lost
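As a rough illustration of the two recovery-time estimates above, the following Python sketch evaluates r_L B/N^2 for random placement and B/N for successor placement. The function names and the example values of N, B, and r_L are illustrative assumptions, not figures from this paper; times are in units of one block copy per node, and constant factors are ignored as in the text.

```python
def recovery_time_random(r_l: int, blocks: int, nodes: int) -> float:
    """Time proportional to r_L * B / N^2: the failed node's r_L * B / N
    blocks are re-copied with all N nodes sharing the work in parallel."""
    return r_l * blocks / nodes ** 2


def recovery_time_successor(blocks: int, nodes: int) -> float:
    """Time proportional to B / N: r_L transfers of B / N blocks each
    proceed in parallel, so the elapsed time is that of one transfer."""
    return blocks / nodes


if __name__ == "__main__":
    # Hypothetical system size, chosen only for illustration.
    n, b, r_l = 1000, 10_000_000, 4
    t_random = recovery_time_random(r_l, b, n)
    t_successor = recovery_time_successor(b, n)
    print(f"random placement:    {t_random:.1f} block-copy times")
    print(f"successor placement: {t_successor:.1f} block-copy times")
    print(f"ratio (r_L / N):     {t_random / t_successor:.4f}")
```

Under these assumptions the ratio depends only on r_L and N, so the advantage of random placement grows with the size of the system.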

[Figure 10: plot of cumulative probability versus fraction of blocks lost, with one curve for random placement and one for successor-list placement]

Figure 10: The effect of replica placement on block availability following a massive failure. The plot shows the cumulative distribution of the fraction of block failures following the simultaneous failure of 500 nodes in a 1000 node system. Each block was replicated 4 times. For each of the two placement policies, 1000 random trials were made. Both placement policies result in the same expected fraction of blocks lost, but the variance of the successor list distribution is larger.

Monitoring Availability. Implementing Sostenuto in practice requires finding a usable mechanism for monitoring the level of availability of each block. This is necessary in order to know when to initiate a repair. The placement policy significantly affects the kind of monitoring that happens. In the case of random placement, each node will have stored a small number of blocks on each of a large number of nodes. As a result, each node must eventually monitor all N − 1 other nodes in the system. Requiring each node to maintain contact with every other node in the system presents scalability problems in DHTs deployed on large numbers of nodes.

   In contrast, if redundancy is placed on successive nodes in the ring, each node only needs to monitor its successor list. Further, each node in the successor list will have a significant number of blocks in common with the monitoring node. Previous work showed how synchronization can be accomplished without exchanging a large amount of information when nodes have the same set of keys in common [6].

   While synchronization with successors limits the number of remote nodes that a given node must communicate with, the implicit tracking of blocks via the successor list may lose track of redundancy if the population of the system has changed significantly during the period in which a node was temporarily failed. Explicitly tracking block positions (as a random placement system must) could prevent this loss. Another option, as noted in [6], is for each node to periodically examine the blocks it is storing locally for which it is not a replica (i.e., it is not in the successor list). It could then perform a lookup and inform the successor that it has some redundancy for the block. The successor can then decide what to do with that distant replica based on its own knowledge about replication levels in the active replica set.

Static Availability. Massive concurrent failures affect the availability of blocks differently depending on the placement scheme. The expected number of lost blocks following a failure event is the same; however, the distribution of the number of blocks lost differs. Successor list placement produces a much wider distribution in the number of blocks lost.

   In successor placement, nodes form N overlapping replica sets. Nodes and data are organized into a circular identifier space and data is stored on the nodes immediately following the data's identifier. For an entire replica set to fail, at least r_L nodes with consecutive identifiers must fail. For a given set of r_L failures, it is unlikely that they will all be in a single replica set. Thus, most concurrent failures do not cause any blocks to be lost under this placement model. When a rare failure event that affects an entire successor list occurs, however, 1/N of the blocks become unavailable simultaneously.

   In contrast, under random placement any concurrent failure of r_L nodes affects at least one replica set, resulting in the unavailability of $\binom{N}{r_L}^{-1}$ of the blocks. Far fewer blocks are lost in this event, but it is much more likely to occur.

   The expected number of blocks lost is the same in both cases, but random placement reduces the variance in the number of blocks lost. Users might prefer a loss distribution that favors losing no blocks over losing a few (even if rare events cause a large number of blocks to be lost). On the other hand, a policy which causes a small amount of repair work over a long period of time (random placement) is likely to better utilize network capacity than a policy that periodically requires a large amount of work in a short amount of time (successor list).

   While random and successor list placement are two popular strategies, it is possible to choose a middle ground by, for example, using virtual nodes [18]. By running a number of virtual nodes on each physical node, the number of replica sets a physical node participates in can be increased. The number of virtual nodes used per physical node allows the user to tune the system's performance anywhere between the two extremes presented here.
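The claim that the two policies lose the same expected fraction of blocks but differ in variance can be explored with a small Monte Carlo sketch. This is not the simulation used to produce Figure 10: the function names, the number of blocks, the number of trials, and the choice to hold each placement fixed while randomizing the set of failed nodes are all assumptions made for illustration.

```python
import random
from statistics import mean, pvariance


def place_random(num_blocks, num_nodes, replicas, rng):
    """Put each block's replicas on distinct, randomly chosen nodes."""
    return [rng.sample(range(num_nodes), replicas) for _ in range(num_blocks)]


def place_successor(num_blocks, num_nodes, replicas, rng):
    """Put each block's replicas on consecutive nodes of a ring,
    starting at a random position (a stand-in for the block's hashed ID)."""
    placement = []
    for _ in range(num_blocks):
        start = rng.randrange(num_nodes)
        placement.append([(start + i) % num_nodes for i in range(replicas)])
    return placement


def fraction_lost(placement, failed):
    """A block is lost only if every node holding a replica has failed."""
    lost = sum(1 for nodes in placement if all(n in failed for n in nodes))
    return lost / len(placement)


def simulate(place, num_nodes=1000, num_failed=500, replicas=4,
             num_blocks=2000, trials=200, seed=0):
    """Return (mean, variance) of the fraction of blocks lost over
    `trials` independent simultaneous failures of `num_failed` nodes."""
    rng = random.Random(seed)
    placement = place(num_blocks, num_nodes, replicas, rng)
    results = []
    for _ in range(trials):
        failed = set(rng.sample(range(num_nodes), num_failed))
        results.append(fraction_lost(placement, failed))
    return mean(results), pvariance(results)


if __name__ == "__main__":
    for name, place in (("random", place_random), ("successor", place_successor)):
        m, v = simulate(place)
        print(f"{name:9s} mean={m:.4f} variance={v:.2e}")
```

The default parameters mirror the 500-of-1000 failure scenario and 4 replicas described in the Figure 10 caption; the other defaults are kept small so the sketch runs quickly.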

4.3 Synchronization Cost

While we have focused on the cost of making new copies due to failure, the costs associated with monitoring data between failures cannot be ignored. A simple approach to monitoring would be to assume that a node's availability implies availability of any data it has previously been known to store. This approach is inexpensive (the cost grows only with the number of nodes that must be probed) but incomplete: database corruption or configuration errors could result in continued uptime of a node while some or all stored data blocks are lost.

   We consider a system in which the node responsible for a given key wishes to make an end-to-end check that data is stored on disk. This master node wishes to determine which replica nodes are not holding copies of the data item and which data items are held on other nodes that the master should be responsible for. The latter case occurs when failure causes a node to become master for a new set of data.

   DHash, the DHT used in CFS [7], uses a synchronization protocol based on Merkle trees to efficiently take advantage of the fact that most blocks are typically in the right place [6]. This scheme is much more efficient than simply exchanging lists of 20-byte keys; in the common case where a replica node holds copies of all desired keys, the protocol can verify that all keys are stored with a single RPC. This protocol depends on the successor list placement scheme.

   However, the Merkle synchronization protocol assumed that the r_L replicas of a block were rigidly placed on the r_L nodes immediately following the key in ID space. Sostenuto only requires that the r_L replicas be located in the successor list of the node responsible for the key. DHash is typically configured to maintain a successor list roughly twice as large as r_L. This flexibility allows nodes to join the system without transferring large amounts of data, as they would if the placement of replicas were rigidly enforced.

   Unfortunately, this flexibility causes the synchronization protocol to operate outside of its common case: adjacent nodes are no longer likely to store nearly identical sets of blocks. The result is that each time the synchronizer runs, it is forced to transfer a list of keys. This transfer can be costly. If the synchronization protocol runs once a minute, the cost of repeatedly transferring the 20-byte name of an 8KB data block will exceed the cost of fixing the "hole" in the replica set by transferring the block in about 8 hours. This problem is more severe when erasure coding is used since blocks are smaller: the cost of transferring associated keys more quickly outstrips the size of the data.

   To avoid this problem we enhanced the synchronization protocol to efficiently deal with nodes missing blocks. We require the master node to track the identity of nodes missing copies of a given block. Using this information about missing replicas, the master can modify the synchronization protocol to not transfer redundant information. For instance, when synchronizing with a replica node n that is known to be missing a key k, the master leaves k out of the tree used for synchronization: this prevents n from reporting what the master already knew, that k is missing on n. This scheme requires the master node to keep additional state about where keys are missing within the successor list. However, this amount of state is small relative to the size of the block itself, and can be maintained lazily, unlike the hard state that the node stores for each key (namely the block).


5 Related Work

Designers of small-scale distributed systems have long understood the need for replication to provide data durability and availability. For example, Harp [13] distributes data across a small cluster of machines. Unlike in DHTs, conserving bandwidth is not a priority for systems like Harp since copies are made over a high-speed LAN.

   The applicability of the optimizations presented here to these systems is limited by the consistency guarantees they provide on read/write data. For instance, Harp has little or no choice about where to place data and benefits less from waiting for the return of a long-failed replica since its data is likely out of date. Distributed disk systems [10, 12] can choose how to distribute replicas among a potentially large set of servers, but, like Harp, must provide strong consistency guarantees.

   Previous work on DHTs has focused on data availability following the simultaneous failure of some large fraction of servers. CFS [7] analyzed the required level of replication in terms of simultaneous server failures, modeling the probability of data unavailability as p^r, where p is the probability of any one server being unavailable and r is the number of replicas [7]. Relatively small values of r provide high availability. CFS eagerly creates new replicas whenever the number reachable falls below r, and deletes replicas when it exceeds r. This approach confuses availability and durability: eager repair is appropriate when durability is endangered, but the model for r ignores the main danger to durability (repair time).

   The use of Merkle trees to track replicas allows DHash nodes to robustly maintain replication levels [6]. This algorithm minimizes the costs of synchronization in the common case that adjacent nodes store identical keys by eliminating the need to send long lists of keys between nodes. However, it aggressively deletes additional data replicas, causing the system to make additional, unnecessary copies of data in response to transient failure.
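The master-side bookkeeping described in Section 4.3 (tracking which replica nodes are known to be missing which keys, and leaving those keys out of the synchronization) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and a plain set difference stands in for the Merkle-tree exchange that DHash actually uses, so it illustrates only the state the master keeps, not the wire protocol.

```python
class MasterSyncState:
    """Hypothetical per-master state for synchronizing with replica nodes."""

    def __init__(self, keys):
        # Keys this master is responsible for.
        self.keys = set(keys)
        # Soft state: replica id -> keys already known to be absent there.
        self.known_missing = {}

    def keys_to_compare(self, replica_id):
        """Keys included in the comparison with this replica. Keys already
        known to be missing there are left out, so the replica does not
        report again what the master already knows."""
        return self.keys - self.known_missing.get(replica_id, set())

    def synchronize(self, replica_id, replica_keys):
        """Compare against the replica's key set (a stand-in for the tree
        exchange) and return only the newly discovered missing keys."""
        expected = self.keys_to_compare(replica_id)
        newly_missing = expected - set(replica_keys)
        self.known_missing.setdefault(replica_id, set()).update(newly_missing)
        return newly_missing

    def record_repair(self, replica_id, key):
        """Forget that `key` is missing once a copy has been placed."""
        self.known_missing.get(replica_id, set()).discard(key)


if __name__ == "__main__":
    master = MasterSyncState(["k1", "k2", "k3"])
    print(master.synchronize("n", {"k1", "k2"}))  # first round finds the hole
    print(master.synchronize("n", {"k1", "k2"}))  # later rounds stay quiet
```

Because the missing-key sets can be rebuilt by a full comparison, they can be kept lazily, in line with the observation in Section 4.3 that this state is small and soft compared with the blocks themselves.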

   Total Recall [3] proposes lazy repair, with low (r_L) and high (r_H) water marks. It initially creates r_H replicas. If the number of reachable replicas falls below r_L, Total Recall eagerly creates replicas until they number r_H. The extra r_H − r_L replicas let Total Recall avoid eager repair in the common case of temporary server failure. Total Recall chooses r_L based on the predicted probability of simultaneous server failure, and provides no guidance for how the administrator should set r_H. Sostenuto, in contrast, automatically maintains the high water mark, and Section 2.1 uses a more accurate rate-based model for r_L.

   Glacier [11] is a system for providing durable storage despite potential massively correlated failures. It uses a single fixed system-wide parameter to determine the amount of redundancy to place in the system; no additional storage space is ever used. Glacier uses aggregation to minimize the repair overhead for any individual block.

   van Renesse and Schneider [20] noted that random placement of replicas provides increased parallelism for replica recreation in the context of chain replication. We show that decreasing the time to create replicas improves durability.

   Blake and Rodrigues [4] argued that a DHT cannot provide high availability and scalable storage if node churn is high. To reduce overhead, they propose delaying the creation of new replicas in response to node failure until after a timeout has expired. Additional replicas are required to maintain availability while dead nodes time out. Sostenuto can be thought of as creating this level of replication adaptively while using, essentially, an infinite timeout following node failure. Blake and Rodrigues also assume a high churn rate and mostly focus on the case of permanent departures from the system, while we assume a system with more-reliable server-class machines, with few permanent departures.

   This work was motivated by making DHTGroups practical and reducing the cost of hosting a Usenet feed. DHTGroups did not consider the cost of data maintenance: minimizing that cost is our primary goal.


6 Conclusions

The cost of data maintenance is a road-block to the deployment of large-scale distributed systems that store data. This paper has developed a model of block durability in large-scale distributed systems and described Sostenuto, a DHT that reduces the cost of data maintenance.

   By reducing the cost of data maintenance on DHTs, we show that data-intensive distributed applications, such as DHTGroups, can be deployed with a net savings in storage and network resources. Other applications will benefit from Sostenuto as well: OverCite [19] plans to distribute the load of serving the popular Citeseer document index over a DHT. OverCite requires the permanent storage of 1TB of documents in a DHT.

   Systems like Citeseer consume so many resources (35.4 GB of traffic per day, over 1TB of disk) that it is difficult for a single volunteer organization to host them. By reducing the cost of distributing data in a wide-area system, Sostenuto makes it practical for a group of cooperative individuals or institutions to deploy such data-heavy systems by pooling network and storage resources. We hope that this deployment strategy will allow the development of new applications beyond those mentioned here.


References

 [1] ANDERSEN, D. Improving End-to-End Availability Using Overlay Networks. PhD thesis, Massachusetts Institute of Technology, 2004.
 [2] BHAGWAN, R., SAVAGE, S., AND VOELKER, G. Understanding availability. In Proc. of the 2nd International Workshop on Peer-to-Peer Systems (Feb. 2003).
 [3] BHAGWAN, R., TATI, K., CHENG, Y.-C., SAVAGE, S., AND VOELKER, G. M. Total Recall: System support for automated availability management. In Proc. of the 1st Symposium on Networked Systems Design and Implementation (Mar. 2004).
 [4] BLAKE, C., AND RODRIGUES, R. High availability, scalable storage, dynamic peer networks: Pick two. In Proc. of the 9th Workshop on Hot Topics in Operating Systems (May 2003).
 [5] BOLOSKY, W. J., DOUCEUR, J. R., ELY, D., AND THEIMER, M. Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. In Proc. of the 2000 SIGMETRICS (June 2000).
 [6] CATES, J. Robust and efficient data management for a distributed hash table. Master's thesis, Massachusetts Institute of Technology, May 2003.
 [7] DABEK, F., KAASHOEK, M. F., KARGER, D., MORRIS, R., AND STOICA, I. Wide-area cooperative storage with CFS. In Proc. of the 18th ACM Symposium on Operating System Principles (Oct. 2001).
 [8] DABEK, F., LI, J., SIT, E., ROBERTSON, J., KAASHOEK, M. F., AND MORRIS, R. Designing a DHT for low latency and high throughput. In Proc. of the 1st Symposium on Networked Systems Design and Implementation (Mar. 2004).
 [9] ELIDED FOR ANONYMITY, A. DHTGroups: A low overhead Usenet server. Position paper about DHT-based Usenet and available on request.
[10] GRIBBLE, S., BREWER, E., HELLERSTEIN, J., AND CULLER, D. Scalable, distributed data structures for internet service construction. In Proceedings of the 4th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2000) (October 2000).

[11] HAEBERLEN, A., MISLOVE, A., AND DRUSCHEL, P. Glacier: Highly durable, decentralized storage despite massive correlated failures. In Proc. of the 2nd Symposium on Networked Systems Design and Implementation (May 2005).
[12] LEE, E. K., AND THEKKATH, C. A. Petal: Distributed virtual disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (Cambridge, MA, 1996), pp. 84–92.
[13] LISKOV, B., GHEMAWAT, S., GRUBER, R., JOHNSON, P., SHRIRA, L., AND WILLIAMS, M. Replication in the Harp file system. In SOSP '91 (1991), pp. 226–38.
[14] PARK, K. S., AND PAI, V. CoMon: A monitoring infrastructure for PlanetLab. http://comon.cs.
[15] PETERSON, L., ANDERSON, T., CULLER, D., AND ROSCOE, T. A blueprint for introducing disruptive technology into the Internet. In Proc. of HotNets-I (October 2002).
[16] PlanetLab: An open platform for developing, deploying and accessing planetary-scale services. http://www.
[17] RODRIGUES, R., AND LISKOV, B. High availability in DHTs: Erasure coding vs. replication. In Proc. of the 4th International Workshop on Peer-to-Peer Systems (Feb. 2005).
[18] STOICA, I., MORRIS, R., LIBEN-NOWELL, D., KARGER, D., KAASHOEK, M. F., DABEK, F., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking (2002), 149–160.
[19] STRIBLING, J., COUNCILL, I. G., LI, J., KAASHOEK, M. F., KARGER, D. R., MORRIS, R., AND SHENKER, S. OverCite: A cooperative digital research library. In Proc. of the 4th International Workshop on Peer-to-Peer Systems (Feb. 2005).
[20] VAN RENESSE, R., AND SCHNEIDER, F. B. Chain replication for supporting high throughput and availability. In Proc. of the 6th Symposium on Operating Systems Design and Implementation (Dec. 2004).
[21] WEATHERSPOON, H., AND KUBIATOWICZ, J. D. Erasure coding vs. replication: A quantitative comparison. In Proc. of the 1st International Workshop on Peer-to-Peer Systems (Mar. 2002).


A Analysis of the Birth-Death Chain

This appendix contains derivations for the properties of the Markov model used to analyze the relationship between r_L, ρ, and block durability. Recall that we are analyzing a Markov chain in which each state corresponds to a number of live block replicas. The node failure rate is λ_f and the replica creation rate is µ.

   Before considering the full model, we begin with a special case that illustrates the impact of the number of replicas r_L on reliability. We take µ = 0, which corresponds to a system with no repairs, and analyze the expected time to failure E[T]. When the system begins running the failure rate is r_L λ_f; the expected time to the first failure is 1/(r_L λ_f). After the first failure, r_L − 1 processes are running and the expected time to the next failure is 1/((r_L − 1) λ_f). Generalizing, where T is the time until the last failure:

\[
E[T] = \sum_{i=1}^{r_L} \frac{1}{i \lambda_f} = \frac{1}{\lambda_f} \sum_{i=1}^{r_L} \frac{1}{i}
\]

   This sum is the harmonic series, which can be approximated by the natural log to produce E[T] ≈ (1/λ_f) log r_L. Intuitively, adding more replicas helps very little since all of the replicas fail independently. The logarithmic gain comes from increasing the chance that one of the replicas will be "lucky" and last a long time. In other words, to produce a linear increase in time till failure, we must exponentially increase r_L.

   Returning to the full model, we are interested in the probability of reaching state 0, where the block has no remaining replicas. Since this model is a birth-death process, we can express P_r, the probability that r replicas exist, in terms of P_0:

\[
P_r = \frac{\mu^r}{\lambda_f^r \, r!} P_0 .
\]

The factorial comes from the coefficients on the λ_f terms. Let ρ = µ/λ_f > 1. Using the law of total probability,

\[
1 = \sum_{r=0}^{\infty} P_r = P_0 \sum_{r=0}^{\infty} \frac{\rho^r}{r!} = P_0 e^{\rho} .
\]

This gives P_0 = e^{−ρ} and P_r = e^{−ρ} ρ^r / r!.

   We are also interested in the expected value of r, the current state of the system. This corresponds to the number of replicas we expect the system to maintain in the steady state. Summing r P_r (and noting that the first term

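(r = 0) is zero):

\[
\begin{aligned}
E[r] &= 0 + \sum_{r=1}^{\infty} r P_r \\
     &= \sum_{r=1}^{\infty} r \, \frac{e^{-\rho} \rho^r}{r!} \\
     &= e^{-\rho} \rho \sum_{r=1}^{\infty} \frac{\rho^{r-1}}{(r-1)!} \\
     &= e^{-\rho} \cdot \rho \cdot e^{\rho} \\
     &= \rho .
\end{aligned}
\]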
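
The two results of this appendix can be checked numerically. The Python sketch below sums the harmonic series for the no-repair case and builds the steady-state distribution P_r = e^{−ρ} ρ^r / r! term by term. The function names, the truncation point max_r, and the example rates are assumptions chosen for illustration, not values from the paper.

```python
import math


def expected_time_to_loss(r_l: int, failure_rate: float) -> float:
    """E[T] with no repair (mu = 0): sum over i = 1..r_L of 1 / (i * lambda_f)."""
    return sum(1.0 / (i * failure_rate) for i in range(1, r_l + 1))


def stationary_distribution(rho: float, max_r: int = 100) -> list:
    """P_r = e^{-rho} * rho^r / r! for r = 0..max_r, built iteratively so
    that no large power or factorial is ever formed. Truncating at max_r
    is an approximation that assumes max_r is much larger than rho."""
    probs = [math.exp(-rho)]
    for r in range(1, max_r + 1):
        probs.append(probs[-1] * rho / r)
    return probs


if __name__ == "__main__":
    lam = 0.01  # hypothetical failure rate per replica per unit time
    for r_l in (1, 2, 4, 8, 16):
        exact = expected_time_to_loss(r_l, lam)
        print(f"r_L={r_l:2d}  E[T]={exact:7.1f}  (1/lam) ln r_L={math.log(r_l) / lam:7.1f}")

    rho = 3.0  # hypothetical ratio mu / lambda_f
    dist = stationary_distribution(rho)
    print("sum of P_r:", sum(dist))
    print("E[r]:      ", sum(r * p for r, p in enumerate(dist)))
```

Each doubling of r_L adds a roughly constant amount to E[T], which is the logarithmic growth noted above, and the computed mean of the distribution agrees with E[r] = ρ up to truncation error.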
