Separating Durability from Availability
Abstract

Wide-area data storage systems maintain redundant data to assure availability and durability. The amount of redundancy is important, since it affects both the network bandwidth consumed for replication and the probability of data loss. Many existing designs expect users to specify the amount of redundancy, or choose the amount based on the predicted probability of multiple server failures.

This paper makes two contributions to the understanding of redundant data in wide-area storage systems. First, it shows that the redundancy level for durability should be chosen based on the rate at which servers fail, since the main challenge to durability is the time required to move new redundant data over the network. Second, the paper analyses how additional redundancy reduces the cost of availability and describes an algorithm that creates the minimum amount of additional redundancy necessary to weather temporary failures.

The paper describes the design of the Sostenuto distributed hash table, which incorporates the paper's ideas. Simulations of Sostenuto using failure traces from PlanetLab support the correctness of the redundancy model, and show that Sostenuto can store data with low overhead and high availability. Over a 4 month period the system makes less than 5 copies of each data item and achieves better than 99.99% availability. Data-intensive applications such as the proposed Usenet replacement DHTGroups can make use of Sostenuto to reduce bandwidth requirements. Using Sostenuto, DHTGroups could store the complete 1.4TB of Usenet data generated each day on 300 nodes and retain the data for 4 months, with each node using less than 3 Mb/s instead of the 100 Mb/s required of traditional Usenet servers.

1 Introduction

Building a robust and reliable wide-area storage system requires significant engineering discipline. Data replicas can be lost to network outages, machine crashes, and disk failures. Storage systems have traditionally coped with failure by replicating data on a small, fixed set of servers on a local area network. Wide-area systems, such as those based on distributed hash tables (DHTs), must place replicas on a constantly changing set of nodes. Failures are frequent in such systems: many nodes are involved and wide-area networks are prone to failure. Dealing with failures is expensive, since new replicas must be created over wide-area or inter-rack links with limited capacity. These operating conditions make it difficult to manage replication and to deploy DHTs that store large amounts of data.

The benefits of realizing a wide-area storage system would be significant, however. Such a system could harness the aggregate bandwidth and storage of many nodes to provide both robustness and high capacity. With efficient replica management, such a storage system could facilitate the construction of new and improved distributed applications.

Replication is traditionally used to ensure that data remains both available (reachable on the network) and durable (stored intact, but perhaps not reachable). Availability is a harder goal than durability: availability may be threatened by any network, processor, software, or disk failure, while durability is usually only threatened by disk failures. Small-scale fault-tolerant storage systems such as RAID provide availability and durability relatively inexpensively, since failures are rare and plenty of local network or bus bandwidth is available to create new replicas.

This paper makes two contributions to replica maintenance in wide-area storage systems. First, it develops a model for durability based on the rate of permanent replica failure and the time required to create new replicas over a limited-capacity network. The model predicts how many replicas are required to ensure durability, and guides decisions about how to maintain the replicas and where to place them. Second, the paper presents an algorithm that implicitly creates enough extra replicas to maintain availability while minimizing subsequent repairs due to temporary failures. This algorithm reduces the amount of replication traffic compared to existing techniques that create a set amount of extra redundancy at the outset [11, 3].

We are motivated in this work by the prospect of a Usenet replacement built on a DHT. DHTGroups has the potential to greatly reduce the amount of storage and bandwidth required to host a Usenet feed. The existing Usenet distributes every article to every Usenet server. About 1.4 TB of articles are posted each day, requiring each server to transfer data constantly at about 100 Mb/sec.
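The quoted feed rate is simple arithmetic; a quick check (assuming decimal terabytes, an assumption not stated in the text) gives roughly 130 Mb/s, the same order as the 100 Mb/s figure:

```python
# Average bit rate needed to carry a full Usenet feed of 1.4 TB/day.
TB = 10**12
SECONDS_PER_DAY = 86400

def feed_rate_mbps(bytes_per_day: float) -> float:
    """Bytes per day converted to an average rate in megabits per second."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 10**6

print(f"{feed_rate_mbps(1.4 * TB):.0f} Mb/s")
```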
SOSP 20 — Paper #196 — Page 1
DHTGroups replaces the local article storage at each server with shared storage provided by a DHT. This approach saves total storage since articles are no longer massively replicated; it also saves bandwidth since servers only fetch articles that their clients actually read. DHTGroups has the potential to reduce the bandwidth requirement for running a Usenet server from 100 Mb/sec to around 1 Mb/sec. It will also reduce storage costs from 10TB per week to 60GB per week, assuming 300 servers participate and 1 percent of articles are read at each site (these numbers are taken from studies of Usenet data).

These savings can be realized only if DHTs can satisfy the performance requirements of Usenet: low latency article fetch, high throughput bulk read and write, and efficient storage of large amounts of data. The first two concerns are addressed in earlier work by the developers of DHash.

Here we address the final concern: if the cost of maintaining data in a DHT is large, it could outweigh the potential savings of a system like DHTGroups. We show that, to the contrary, by building DHTGroups on top of Sostenuto, it is possible to store a complete archive of Usenet data for 4 months, with a storage overhead factor of less than 5 and an additional bandwidth requirement of approximately 2.5 Mb/s. We assume that DHTGroups and Sostenuto will be deployed as a long-running service on unmanaged, but reliable, nodes such as those in the PlanetLab testbed.

The roadmap of the paper is as follows. Section 2 discusses our model for replica failure and regeneration and the design of an algorithm (Sostenuto) based on that model. Section 3 evaluates the performance of the algorithm on failure traces taken from the PlanetLab testbed. We discuss additional details related to the issue of replica maintenance in Section 4, compare our work to prior work in Section 5, and conclude.

2 Design

In order to support applications like DHTGroups, Sostenuto must provide these properties:

1. It must ensure durability, essentially by copying data to new servers at the rate at which existing servers (or disks) suffer permanent failures.

2. It must ensure availability, by maintaining more replicas of each piece of data than the likely number of concurrent transient replica failures.

In practice Sostenuto can only observe whether a replica is available (reachable over the network); it cannot directly observe durability, because durable replicas might be on temporarily unavailable servers. For this reason Sostenuto's actions are driven by comparing the number of reachable replicas of a datum against a low-water mark, rL. If the number of reachable replicas falls below rL, the system immediately makes a new replica. The purpose of rL is to ensure durability.

In the common case of temporary server failure or unreachability, the above eager repair policy will result in more than rL replicas. Sostenuto keeps track of these extra replicas, and does not delete them, in order to avoid repair for future temporary failures and to get a head start on required repair for future permanent failures.

The most important aspects of Sostenuto's design are the model for choosing rL and the effects of a potentially unlimited upper bound on the replicas created by temporary failure. Sections 2.1 and 2.2 consider these questions in turn. We will show, first, that a small rL provides good durability in practical systems and, second, that although the number of replicas we create due to transient failure is potentially unlimited, the expected number is small.

2.1 Eager Repair and the Low-Water Mark

Setting rL to 2 would ensure durability if replicas failed one at a time, with enough time between each failure to make a new replica. After a server fails, there must be enough time to make a new copy of each block it stored for which there are now fewer than rL replicas. If the server holds many blocks, or the servers' access links are slow, then there may be a significant probability that another server will fail before all the replicas are created.

The goal of setting rL is to ensure that no data will be lost if there are multiple failures. Increasing rL causes the copying process to start earlier, so that there is more time to finish the copying before the last copy fails. We are interested in understanding the relationship between rL, the system environment (consisting of, for example, data size, number of nodes, and link bandwidth), and the reliability of data. Since the purpose of rL is to ensure durability, for this discussion we will assume that failures are permanent. The existence of transient failures makes the system more durable for the particular rL we select: transient failures cause the system to make additional copies (this mechanism will be discussed in detail in Section 2.2).

This line of reasoning differs from prior work, which analyzed a system's availability and durability using the probability of a failure of a fraction of nodes in the system [21, 7]. Instead, we analyze the system in terms of its ability to create new copies more quickly than they are lost.

We consider a model in which node failures occur randomly with rate λf and the contents of a failed replica server can be recreated at rate µ. The parameter λf is a property of node reliabilities; µ depends on the amount of data stored at each node and the bandwidth available to create new replicas.
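The low-water-mark rule just described can be sketched in a few lines of Python. This is a minimal illustration, not the Sostenuto implementation; the helpers `reachable` and `create_new_copy` are hypothetical stand-ins for the DHT's liveness check and replication primitive:

```python
# Eagerly repair whenever fewer than R_L replicas are reachable, and keep
# (rather than delete) extra replicas so returning servers rejoin the
# replica set and avert future repairs.
R_L = 2  # low-water mark

def maintain(key, replicas, reachable, create_new_copy):
    """Bring the number of reachable replicas of `key` back up to R_L.

    `replicas[key]` retains unreachable nodes: they may hold durable
    copies that come back after a transient failure.
    """
    while sum(1 for node in replicas[key] if reachable(node)) < R_L:
        replicas[key].append(create_new_copy(key))
```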
Figure 1: The birth-death process for replicas. Each state corresponds to the number of replicas of a given block. Replicas are created at rate µ (governed by factors such as link capacity and the amount of data in the system). Nodes fail with rate λf. The rate of replica loss grows as more replicas are added since nodes are failing in parallel.

Figure 2: The relationship between ρ, rL, and block reliability (P0).

This model is intended to capture the continuing process of replica loss and repair that occurs in a long-running system, rather than the instantaneous impact of a number of simultaneous failures in a system without repair. For this analysis we assume that λf and µ are exponentially distributed.

We analyze the system using a birth-death Markov chain, shown in Figure 1. Each state in the chain corresponds to the replication level of a single block. The system moves to higher replication levels at the rate µ. The rate at which transitions to lower levels are made increases as the number of replicas increases: when n replicas are present, the likelihood of any server failing increases by a factor of n.

We first consider a simplified system in which nodes continually create new replicas; this corresponds to rL = ∞. We examine this system because we are able to derive closed forms for the properties we are interested in. We will adjust this model to correspond more closely to the system we have deployed; that model will be an approximation of the infinite model.

We are interested in the probability that the system is in state zero: this corresponds to the probability that no replica for a block remains. If this probability (P0) is low, we expect data to be stored durably. By an analysis of the chain we can conclude that P0 = e^(-ρ), where ρ = µ/λf. A derivation of this result is available in Appendix A. This result tells us that the probability of data loss decreases exponentially with ρ, independent of the number of replicas the system creates.

To understand this result, consider what ρ represents: it is the ratio between the rate at which replicas can be created and the rate at which they fail. For the system to reach a steady state we must have ρ > 1, or all replicas will be lost before they can be replaced. The value of ρ dictates the expected number of replicas, r, in the system: E[r] = ρ (derivation in the appendix). The expected number of replicas does not diverge: even though the system can create replicas faster than a single one is lost (ρ > 1), the effect of many replicas failing in parallel prevents a large number of replicas from accumulating. If ρ is large, the system will operate with more replicas (farther to the left of the chain in Figure 1) and is less likely to experience enough failures to eliminate all extant copies. When ρ is smaller, fewer copies are likely to exist and the system is more likely to experience a string of failures that lead to data loss.

From this analysis, we glean the intuition that the overriding factor driving data durability is the ratio between the rate at which replicas can be created and the rate at which they are lost. The number of replicas that exist at any one time is a consequence of this ratio.

Unfortunately, ρ is largely out of the control of the system: it depends on factors such as link bandwidth, data scale, and failure rate. The system can control rL, however, to adjust how aggressively it replicates blocks. By modifying the above analysis, we can evaluate the relationship between different values of rL, ρ, and block durability. While it might be possible to design an adaptive algorithm which adjusts rL based on an estimate of ρ, we only intend to present an analysis that allows system designers to make an informed engineering decision.

To integrate rL into our model, we modify the Markov chain to have only the states 0 through rL. At state rL the creation rate µ is zero. We are interested in the value of P0 in this system. The analysis proceeds as in the Appendix except that the infinite sums are replaced by bounded sums (up to rL). We find that P0 = (∑_{r=0}^{rL} ρ^r / r!)^(-1).

Figure 2 shows the relationship between rL, ρ, and the probability that the block will be lost. The value of ρ has a large impact on the probability of block failure: at small values of ρ (below 5), it is impossible to maintain extremely low block failure rates even with an aggressive low-water mark. In this regime, the system always operates with a handful of replicas (regardless of the value of rL) and is vulnerable to a series of failures, spaced closely in time, causing data loss.
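Both closed forms are easy to check numerically. A small sketch, using only the formulas above:

```python
import math

def p0_infinite(rho: float) -> float:
    """Loss probability for the uncapped chain (r_L = infinity): e^-rho."""
    return math.exp(-rho)

def p0_truncated(r_l: int, rho: float) -> float:
    """Loss probability when the chain is capped at state r_L:
    (sum_{r=0}^{r_L} rho^r / r!)^-1."""
    return 1.0 / sum(rho**r / math.factorial(r) for r in range(r_l + 1))

# Raising rho or r_L drives the loss probability down, and the capped
# chain approaches the infinite-chain result as r_L grows.
print(p0_truncated(2, 5.0), p0_truncated(2, 50.0), p0_infinite(50.0))
```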
SOSTENUTO(key k)
    n = available_copies(k, replicas[k])
    if (n < rL)
        newnode = create_new_copy(k)
        replicas[k] += newnode
    else
        // do nothing

Figure 3: The Sostenuto algorithm

When ρ is larger, increasing rL rapidly reduces the probability that the block will be lost. Fortunately, the probability of loss is not sensitive to small changes in the value of rL in practical systems. An analysis of the failure characteristics of nodes on the PlanetLab testbed shows that ρ > 50. We can safely set rL to two when running on a system such as PlanetLab. Because a small, eagerly maintained, and fixed rL is sufficient to ensure durability on most systems, we can consider a separate mechanism to obtain availability cheaply.

2.2 Coping with Transient Failure

Existing approaches have observed that creating additional redundancy up front [3, 11] can reduce the cost of maintaining data availability. Additional redundancy relieves the system of the need to create a new replica in response to failure. However, it can be difficult to determine the correct amount of additional redundancy: the number of replicas needed depends on the number of common concurrent failures. An ideal system would produce enough replicas to tolerate that number of failures without needing repair.

Sostenuto dynamically and eagerly creates additional replicas to meet this goal. Observe that both durability and availability will be achieved by eagerly creating a new replica whenever the number of reachable replicas falls below rL, for a suitable choice of rL. This eager policy will create new replicas faster than required (faster than the rate at which servers or disks fail permanently) due to temporary server failure or unreachability. However, as long as the system can keep track of the extra replicas, it will not need to generate a new replica in response to a future single temporary failure. This occurs when a storage system ensures that returning replicas rejoin the appropriate replica sets. This is the approach that Sostenuto takes to avoid unnecessary copies.

When Sostenuto first creates a block, it makes rL replicas. Sostenuto monitors the number of reachable replicas of each block. If the number of reachable replicas falls below rL, Sostenuto eagerly creates new copies to bring the reachable number up to rL and begins to monitor those nodes as well.

After some time Sostenuto will have created a number, B, of copies beyond rL; these copies make it extremely unlikely that a transient failure will cause a new copy to be created. In fact, a new copy could be created only when B nodes are simultaneously unavailable. The effect of these additional replicas is similar to Total Recall's lazy repair, but does not require initial creation of a set number of extra replicas. The best number of extra replicas is difficult to determine, so Sostenuto's on-demand creation of extra replicas eliminates an area of uncertainty in configuring the replication system.

To understand the number of copies that Sostenuto is likely to create, we calculate the probability that, for a given existing number of copies, a new one must be created. This probability decreases exponentially as copies are created. While the probability of creating a new copy is always non-zero, after about 2rL/p copies exist it is very unlikely that a new copy will be created; p is the average availability of nodes (e.g. if nodes are available as often as they are unavailable, p = 0.5).

To see this, note that as B increases, the probability that B − rL nodes fail simultaneously falls, making it less and less likely that additional replicas need to be created. Assume that each node in the system is independently available with probability p. To determine whether a new replica must be created we perform a simple experiment: flip B coins biased to heads with probability p; if fewer than rL come up heads, a new replica must be created. This experiment should be performed at periodic intervals separated by the average node lifetime. This is a Bernoulli process, and the number of successes is given by the binomial distribution. The probability of needing a new copy, given c existing copies, is the probability of fewer than rL successes:

    Pr[fewer than rL available | c extant copies] = ∑_{a=0}^{rL−1} C(c, a) p^a (1 − p)^(c−a).

Figure 4 shows the probability of creating an additional copy after various numbers of copies already exist, based on rL = 2. As the number of copies grows, the probability of another copy creation drops exponentially. After about 2rL/p copies have been created the time between copy creations becomes large: an application of the Chernoff bound shows that the probability that fewer than rL of the 2rL/p total replicas are available is less than e^(−rL p). As a result, we would expect the required maintenance bandwidth for a block to decrease over time.
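The coin-flip experiment reduces to a binomial tail, and a few lines of Python reproduce the trend of Figure 4 directly from the formula above:

```python
import math

def repair_probability(c: int, p: float, r_l: int = 2) -> float:
    """Pr[fewer than r_l of c independent replicas are reachable]:
    the chance that one round of the coin-flip experiment triggers a repair."""
    return sum(math.comb(c, a) * p**a * (1 - p)**(c - a) for a in range(r_l))

# The repair probability falls off exponentially as copies accumulate.
for c in (2, 4, 8, 12):
    print(c, repair_probability(c, p=0.5))
```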
Figure 4: Additional redundancy must be created when the amount of live redundancy drops below the desired amount. The probability of this happening depends solely on the node availability p and the amount of durable redundancy. This graph shows the probability of a repair action as a function of the amount of durable redundancy in the system, with p = 0.5, 0.7, and 0.9. For all three curves, rL = 2.

3 Evaluation

This section evaluates the performance of Sostenuto and compares it to an existing system, Total Recall. We characterize the performance of an algorithm with two metrics: the bandwidth used to create replicas, and the total duration of block unavailability. An algorithm should minimize both metrics. We primarily use trace-driven simulation to evaluate the algorithms. We also briefly describe a prototype implementation that uses MIT's Chord implementation as the underlying routing and storage layer.

3.1 Experimental Setup

We have implemented a custom event-driven simulator in Python. The simulator models link bandwidth, but assumes that all network paths are independent. Bandwidth is limited at the sender to limit the rate at which copies are initiated. The simulator also assumes perfect network connectivity, so that the only outages are due to reboots or disk failures rather than network failures. In reality, there are numerous and frequent network outages, which may not be symmetric. Each node additionally has unlimited disk capacity.

The simulator takes as input a trace of activity consisting of node and disk failures, repairs, and block insertions. It simulates the behavior of nodes under different protocols and produces a trace of the availability of blocks at all times and the amount of data sent and stored by each node.

In this framework, we implemented Sostenuto and Total Recall. Our implementation of Total Recall follows the original system's design, focusing on replication. On a block insert, rH copies are stored on a set of random nodes in the system. In the real implementation, a master node for the block generates these copies and records their locations in an inode object that is eagerly replicated on the first five successors of the master node. For simplicity, the simulator keeps track of this state internally and does not actually maintain the inode structures. A low-water mark rL is used to determine when repair actions are taken: so long as the total redundancy for a given block is not less than rL, no action is taken. When the redundancy falls below rL, the master node selects replacement nodes at random for the failed nodes and re-replicates back to a level of rH on the other nodes.

Our input to all simulations is as follows. All nodes are given a bandwidth budget of 900 bytes/second. Early in each trace, we insert 2500 256K blocks. With 2500 blocks and approximately 300 nodes, each node is responsible for about 5 blocks, though the actual number of blocks stored on a node depends on rL (and will vary with rH) and its random placement in the ring. The bandwidth limit was chosen so that a single block transfer in the network would take about 5 simulated minutes. This was a simulation trade-off: in a real DHT implementation, blocks would be smaller and link capacities would be higher, but there would be many more blocks. Five minutes was chosen as an arbitrary granularity for copies.

After selecting these parameters, we can estimate ρ for the system to help select a reasonable value of rL to simulate. The aggregate failure rate Nλf is 1/16670 per second, where N is the number of nodes in the system. After a node failure, the system loses approximately (256K)·rL·B/N ≈ 4 MB of data, where B is the total number of blocks. Because of the random placement of replicas (see the discussion in Section 4.2), this load is spread over different nodes in the system; in this case, each affected node will have to move one block, giving a repair time of 1/µ ≈ 300s. This gives ρ ≈ 56, suggesting that rL = 2 is more than sufficient to maintain durability.

To calculate an optimal rL for Total Recall, assuming a desired availability A = 0.99 and the trace's average host unavailability µH ≈ 0.11 produces rL = ln(1 − A)/ln(µH) ≈ 2.05. Because Total Recall does not specify how to measure a long-term decay rate and calculate rH, we mimic their evaluation by running with several different values of rH.

3.2 Trace overview

The most important aspect of our simulation is a realistic trace of host outages and repairs. We also require a realistic model of when computers lose the contents of their drives in a typical network environment.

Existing traces, such as those measuring OverNet or corporate networks, do not meet these requirements: they update node liveness information less than once per hour, do not include information about data loss events, and do not cover very long time periods.

We use a detailed trace from historical data collected by the CoMon project on PlanetLab.
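The two back-of-envelope estimates in Section 3.1 can be reproduced directly; the numbers below simply restate the simulation parameters given above, with nothing new assumed:

```python
import math

# rho for the simulated system: mean time between permanent (crash)
# failures anywhere in the system, divided by the per-failure repair time.
crash_interarrival_s = 16670        # (N * lambda_f)^-1 from the trace
repair_time_s = 300                 # ~5 simulated minutes per block transfer
rho = crash_interarrival_s / repair_time_s
print(f"rho ~= {rho:.0f}")          # ~56

# Total Recall's low-water mark for target availability A, given mean
# host unavailability u_H: the smallest r_L with u_H**r_L <= 1 - A.
A, u_H = 0.99, 0.11
r_l = math.log(1 - A) / math.log(u_H)
print(f"r_L ~= {r_l:.2f}")          # ~2
```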
Dates                           12 Aug 2004 – 6 Jan 2005
Number of hosts                 409
Number of failures (reboots)    13356
Number of data-loss failures    645
Average host downtime (s)       1337078
Failure interarrival time (s)   143, 1486, 488600
Crash interarrival time (s)     555, 16670, 488600

Table 1: CoMon+PLC trace characteristics

Figure 6: The total amount of disk bytes stored on all participating nodes over time, for Sostenuto (rL = 2) and Total Recall (rL = 2; rH = 4, 6, 8, 10), along with the number of live nodes.

CoMon monitors all PlanetLab hosts every five minutes, allowing failures to be detected quickly. Further, CoMon reports the actual uptime counter from each machine, allowing the time at which each machine reboots to be determined.

In order to identify disk failures (resulting typically from operator actions such as operating system reinstallations or disk upgrades), the CoMon measurements were supplemented with event logs from PlanetLab Central. Disk failure information was available only after 25 October 2004. Table 1 summarizes the statistics of this trace.

Even though PlanetLab nodes are maintained by host institutions, the trace includes a large number of node and disk failures. Many of the failures are due to periodic upgrades of the nodes' operating systems. Some of these events cause large numbers of correlated failures, a circumstance for which our algorithm is not designed. This trace ends prior to a particularly large upgrade in which 200 machines were failed in a short time period.

Before discussing the results from simulating against this trace, we consider a second set of traces in which node and disk failures are generated by sampling random distributions based on the characteristics of the PlanetLab trace. We use the generated traces for illustrative purposes: the absence of noise and coupled events makes it much easier to understand the behavior of the system. The generated traces operate with a 300 node population, with approximately 85% average availability. For the trace with failures, 5% of outages are disk failures.

3.3 Generated traces

No disk failures. Figure 5(a) shows the results of our simulation for rL = 2 running on a generated trace with no disk failures. By not generating any disk failures in the trace, we are guaranteed that all blocks are durable and that the results will only reflect work that is performed to maintain availability. The left graph shows the cumulative number of bytes sent over time for all nodes in the system. One curve is plotted for each of the systems considered: Sostenuto, and Total Recall with different high-water marks. In the simulation all 2500 blocks are inserted at the beginning of the trace, causing a sharp rise in total bytes sent. No additional blocks are inserted into the system during the simulation, so all additional bytes sent are due entirely to maintenance traffic.

Consider the curve corresponding to Sostenuto (solid line): in the period immediately after data is inserted, additional copies are quickly generated to maintain availability. However, by week 4, the number of such copies necessary has stopped increasing rapidly. At the end of the trace, a total of approximately 3.3 GB has been sent, corresponding to approximately 5.4 copies of the data. This is in line with our prediction that about 2rL/p = 4.8 replicas would be created.

Total Recall behaves differently: it creates a fixed amount of replication, rH, at the start of the trace. We plot values of rH = 4, 6, 8, 10. By storing large amounts of redundancy, Total Recall essentially does not have to create any additional data blocks for rH ≥ 6. These values of rH are all greater than our prediction of the expected number of replicas required to avoid unavailability caused by transient failures (4.8).

For rH = 4, Total Recall creates a large number of new replicas and continues to create them throughout the trace. Because it "forgets" about unavailable replicas when creating a new replica set, Total Recall fails to take advantage of returning replicas and sends more data than Sostenuto. We can see the magnitude of this extra data by examining the right side of the graph: Sostenuto sends a total of 3.3 GB; Total Recall with rH = 4 sends an additional 3.9 GB. This additional traffic demonstrates the importance of setting rH correctly.

Considering disk failures. Figure 5(b) shows the same simulation when disks fail periodically: failures are marked by vertical lines in the curve at the bottom of the graph.
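The "about 5.4 copies" figure above can be checked with simple arithmetic. Binary units are assumed for the block size and the GB totals, which is consistent with the 625 MB of original data quoted in Section 3.4:

```python
# 2500 blocks of 256 KiB each is 625 MiB of original data.
block_bytes = 256 * 1024
original_bytes = 2500 * block_bytes
assert original_bytes == 625 * 2**20

# ~3.3 GiB sent by Sostenuto over the trace => ~5.4 copies of the data.
sent_bytes = 3.3 * 2**30
copies = sent_bytes / original_bytes
print(f"copies ~= {copies:.1f}")

# Predicted replication needed to ride out transient failures: about
# 2 * r_L / p, with r_L = 2 and ~85% average availability in the trace.
r_l, p = 2, 0.85
print(f"predicted ~= {2 * r_l / p:.1f}")  # ~4.7, close to the 4.8 quoted
```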
Figure 5: Comparing Sostenuto against Total Recall for rL = 2 on a simulated trace, with 5% of failures resulting in disk failures. Each panel plots cumulative bytes sent over the 147 day trace against time in weeks, together with the number of live nodes, for Sostenuto (rL = 2) and Total Recall (rL = 2; rH = 4, 6, 8, 10). (a) No disk failures. (b) With disk failures.
the trace. This continuing repair cost is caused by disk failures. The curve here is the sum of the network usage when no disks fail and a linear usage caused by periodic disk failures.

Total Recall's behaviour with r_H = 4 is identical to the previous trace. With larger r_H in the presence of disk failures, Total Recall can initially ignore both transient and permanent failures. This can be seen in Figure 6: whereas Sostenuto's disk usage increases at the beginning, usage for Total Recall decreases as disks fail until late in the trace (week 12 for r_H = 6, 8 and week 18 for r_H = 10). Eventually, disk failures erode Total Recall's high initial replication levels until it is forced to begin making repairs. Note that, in comparison, Sostenuto builds disk usage up to a certain level and maintains that level stably.

3.4 PlanetLab traces

We now consider the two algorithms running on a trace of PlanetLab node failures. Figure 7(a) shows the number of bytes sent.

The initial weeks of the trace behave in much the same way as the generated trace without failures. Sostenuto creates about two additional replicas by week 12. Total Recall with a high water mark of 4 does not create a large excess of data because, in this part of the trace, the expected number of replicas necessary to weather transient failures is less than 4.

During weeks 12 to 19, disk failures (denoted by dotted lines at the bottom of the graph) begin to drive the bandwidth usage. All schemes are forced to make additional copies in response to disk failures. Sostenuto creates fewer copies than Total Recall during this portion of the trace because it restores the number of replicas only to the low-water mark, while Total Recall recreates enough replicas to again reach the high-water mark. Further, Sostenuto tracks essentially all copies of the block. The net result is that Sostenuto causes about 2 GB less data to be sent over the network when compared to the best performing Total Recall configuration. Sostenuto caused about 2.3 GB of maintenance traffic for 1.2 GB of replicated data (625 MB of original data): this translates to an additional 4.74 bytes of redundant data sent per byte of original data over the 21-week trace (including the original redundancy inserted into the system). The overhead of the best performing Total Recall configuration (r_H = 4) is 8.8 bytes of redundant data per byte of original data.

Figure 7(b) shows the cumulative total block unavailability, an aggregate measure of the amount of time that data blocks are unavailable. The value is calculated by summing the number of seconds each block is unavailable: that is, two blocks unavailable for one second each would be equivalent to one block unavailable for two seconds. Regions of the graph with zero slope correspond to a state where all blocks initially inserted are available; positive slopes indicate periods where blocks are temporarily unavailable. We can see that both Total Recall with r_H = 4 and Sostenuto experience short periods where blocks are unavailable, whereas higher settings of r_H result in complete block availability. The availability achieved by all algorithms is well above the base 0.99 desired: for 2500 blocks, there are a total of almost 3 × 10^10 block-seconds in the trace, and the worst-case configuration (Total Recall with r_H = 4) experiences 3 × 10^5 block-seconds of unavailability, or 0.001 percent of the total block-seconds.
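The cumulative unavailability metric in Figure 7(b) is simple to compute from a trace of per-block down intervals. A minimal sketch follows; the trace format and the helper name are ours for illustration, not part of the Sostenuto implementation:

```python
def cumulative_block_seconds(down_intervals):
    """Sum unavailability across blocks: two blocks down for one
    second each count the same as one block down for two seconds.

    down_intervals: (block_id, start_sec, end_sec) tuples, one per
    interval during which a block had no reachable replica.
    """
    return sum(end - start for _, start, end in down_intervals)

# Hypothetical trace: block "a" down for 30 s, block "b" down twice.
trace = [("a", 100, 130), ("b", 400, 405), ("b", 900, 925)]
print(cumulative_block_seconds(trace))  # 60
```

Regions of zero slope in Figure 7(b) correspond to trace intervals that contribute no tuples to this sum.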
SOSP 20 — Paper #196 — Page 7
[Plots omitted: (a) bytes sent over the 147-day trace and (b) total block-seconds of unavailability, each for Sostenuto (r_L = 2) and Total Recall (r_L = 2; r_H = 4, 6, 8, 10), with the number of live nodes plotted for reference.]
(a) Bytes sent over time (b) Total unavailability: Σ_b t_unavailable
Figure 7: Comparing Sostenuto against Total Recall for r_L = 2. For reference, the number of monitored nodes is shown at the bottom of both graphs; vertical bars indicate disk failures, omitted for clarity in the unavailability graph. Blocks are inserted "instantaneously," after the system stabilizes (and after an outage in CoMon).
3.5 Scaling up to DHTGroups

We have implemented Sostenuto using the MIT DHash implementation as the underlying routing and storage layer. We modified the DHash implementation to maintain additional replicas beyond r_L rather than deleting them: the successor-list placement used by DHash already guarantees that returning replicas will rejoin the appropriate replica set. Some additional implementation details are discussed in Section 4.3. We additionally built a DHTGroups NNTP gateway that takes a standard Usenet feed provided by our university and stores the articles in the Sostenuto DHT running on PlanetLab.

Our DHTGroups prototype has been running for several months on PlanetLab. It successfully stored 173,935 articles of Usenet news from a feed of the standard Big-8 groups in a ring of 34 nodes in one 3-day period. To understand the long-term behavior of the system, we can use our simulations to estimate the amount of bandwidth required to maintain the data that DHTGroups stores in the DHT, a consideration neglected by the authors of DHTGroups.

Consider storing 1.4 TB of data in the system for a 4-month period, comparable to what is simulated above. Four months is longer than the retention typical news providers offer today for binary articles, which constitute the bulk of Usenet data. Given the overhead predicted by our simulations, this amount of data would require the DHT to absorb 1.4 × 5.74 ≈ 8 TB of data for each day of Usenet traffic over the four-month retention period. This translates to approximately 2.5 Mbps per node over the same period. A traditional Usenet feed requires servers to continually transfer data at about 100 Mbps.

4 Design Alternatives

Previous sections have described and evaluated the basic design of Sostenuto. This section examines the trade-offs that must be confronted when deploying Sostenuto in practice.

4.1 Interaction with Erasure Codes

We discussed replica maintenance in terms of whole replicas of blocks in Section 2. Many of the results of that section can be improved by the use of erasure codes. Erasure codes fragment data so that once a number of unique bits of data roughly corresponding to the size of the original block are collected, the block can be reassembled.

Erasure codes improve both durability and availability compared to storing the same number of redundant bytes as full replicas. Coding is beneficial because it increases the number of simultaneous failure events required to cause data loss without increasing the number of bytes stored in the system. However, there are drawbacks to using erasure codes: coding requires additional complexity and does not allow users to fetch a range of bytes from a block efficiently. It also interferes with latency optimizations that attempt to fetch data from servers near the requester.

We will compare coding and replication systems where the amount of redundant data is the same. The redundancy factor for systems using whole replicas is r_L. Coding systems have an extra parameter, F, the number of fragments required for reconstruction. If the redundancy factor is two and F = 4, a coding system will maintain at least 8 fragments, 4 of which are necessary to reconstruct the data.

Erasure codes have two advantages over replicas when the redundancy factor is the same. First, coding places data on more nodes, meaning that more failure events are required to cause data loss. Coding also increases ρ, since creating a new fragment is F times faster than creating a whole replica (the fragment is F times smaller). One disadvantage of coding is that more than one machine must be maintained to provide durability: the failure rate of any machine in a group of F machines is proportionally higher than the failure rate of a single machine.

[Plot omitted: probability of block loss vs. number of replicas for ρ = 10, comparing whole replicas with 2-fragment and 4-fragment coding.]
Figure 8: The probability of block loss for coding and replication when ρ = 10. Coding reduces the probability of block loss compared to replication with the same redundancy factor. Increasing the number of fragments increases the improvement in reliability.

Figure 8 shows the probability of block loss for two coding systems and replication when ρ = 10. The replication line is identical to Figure 2 in Section 2.1. For a given amount of redundancy, coding reduces the probability of failure.

Coding also reduces the amount of redundant data required to maintain availability during transient failures. The use of erasure codes increases the number of nodes that must fail before data is lost; it also increases the number of nodes that must be alive for data to be found. The net result is a decrease in the number of additional bytes necessary to lower the probability of creating a new block. Figure 9 shows the probability of creating a new fragment (solid line) given the redundancy in created fragments on the x-axis. As redundancy grows, the number of extant fragments grows (faster than the number of replicas would have), and the probability of loss drops more quickly than in the replica case.

[Plot omitted: Pr[repair action] vs. redundancy, for r = 2, p = 0.7 (replicas) and F = 7, p = 0.7 (fragments).]
Figure 9: Probability of block creation for availability using erasure coding. Erasure coding decreases the amount of data created in response to transient failures.

4.2 Data Placement

One practical aspect of constructing systems is deciding where to place extra redundancy. There are two main options: redundancy can be placed on random nodes, as in Total Recall, or on nodes in the successor list of the block. Choosing a placement strategy requires balancing several competing interests: random placement allows increased reconstruction parallelism, which benefits durability by increasing ρ. However, it creates the practical problem of monitoring data availability on a large set of nodes. The characteristics of data availability immediately following a massive failure also differ for the two placement schemes.

Effective Durability and Recovery Parallelism  When a server fails, the amount of time required to create new replicas of the lost blocks depends on how the cost of copying is spread over the servers. This time influences µ and thus ρ, and we would like to minimize the time required to make a copy in order to maximize reliability.

Suppose that replicas are placed randomly on servers. If a server fails, the remaining copies of the blocks it held are uniformly spread over the remaining servers. Further, the new replicas of those blocks should also be uniformly spread over the servers. Thus the work of copying the blocks can be spread uniformly over the servers and can proceed in parallel (assuming a network with internal parallel paths). The copies will take time proportional to r_L·B/N^2, where B is the number of unique blocks in the system: each node stores r_L·B/N blocks, and all N nodes participate in the copying.

The other common replica placement strategy is to put a block's replicas on successive nodes in a consistent hashing ring. In this case, when a server fails, its r_L − 1 predecessors must each copy B/N blocks to a different one of the failed server's successors, and the failed server's successor must copy B/N blocks to its r_L-th successor. In this case, r_L copies can proceed in parallel, so copying requires time proportional to B/N, as long as r_L ≪ N.

Random placement recovers from a node failure in about 1/N-th the time of successor placement. This is advantageous from a reliability standpoint.

Monitoring Availability  Implementing Sostenuto in practice requires finding a usable mechanism for monitoring the level of availability of each block. This is necessary in order to know when to initiate a repair. The placement policy significantly affects the kind of monitoring required: in the case of random placement, each node will have stored a small number of blocks on each of a large number of nodes. As a result, each node must eventually monitor all N − 1 other nodes in the system. Requiring each node to maintain contact with every other node in the system presents scalability problems in DHTs deployed on large numbers of nodes.

In contrast, if redundancy is placed on successive nodes in the ring, each node only needs to monitor its successor list. Further, each node in the successor list will have a significant number of blocks in common with the monitoring node. Previous work showed how synchronization can be accomplished without exchanging a large amount of information when nodes have the same set of keys in common.

While synchronization with successors limits the number of remote nodes that a given node must communicate with, the implicit tracking of blocks via the successor list may lose track of redundancy if the population of the system has changed significantly during the period a node was temporarily failed. Explicitly tracking block positions (as a random placement system must) could prevent this loss. Another option, noted in previous work, is for each node to periodically examine the blocks it is storing locally for which it is not a replica (i.e., it is not in the successor list). It could then perform a lookup and inform the successor that it has some redundancy for the block. The successor can then decide what to do with that distant replica based on its own knowledge about replication levels in the active replica set.

Static Availability  Massive concurrent failures affect the availability of blocks differently depending on the placement scheme. The expected number of lost blocks following a failure event is the same; however, the distribution of the number of blocks lost differs. Successor list placement produces a much wider distribution in the number of blocks lost.

[Plot omitted: cumulative distribution of the fraction of blocks lost for random and successor-list placement.]
Figure 10: The effect of replica placement on block availability following a massive failure. The plot shows the cumulative distribution of the fraction of block failures following the simultaneous failure of 500 nodes in a 1000-node system. Each block was replicated 4 times. For each of the two placement policies, 1000 random trials were made. Both placement policies result in the same expected fraction of blocks lost, but the variance of the successor list distribution is larger.

In successor placement, nodes form N overlapping replica sets. Nodes and data are organized into a circular identifier space, and data is stored on the nodes immediately following the data's identifier. For an entire replica set to fail, at least r_L nodes with consecutive identifiers must fail. For a given set of r_L failures, it is unlikely that they will all be in a single replica set. Thus, most concurrent failures do not cause any blocks to be lost under this placement model. When a rare failure event that affects an entire successor list does occur, however, 1/N of the blocks become unavailable simultaneously.

In contrast, with random placement any concurrent failure of r_L nodes affects at least one replica set, resulting in the unavailability of 1/(N choose r_L) of the blocks. Far fewer blocks are lost in this event, but it is much more likely to occur.

The expected number of blocks lost is the same in both cases, but random placement reduces the variance in the number of blocks lost. Users might prefer a loss distribution that favors losing no blocks over losing a few (even if rare events cause a large number of blocks to be lost). On the other hand, a policy that causes a small amount of repair work over a long period of time (random placement) is likely to utilize network capacity better than a policy that periodically requires a large amount of work in a short amount of time (successor list).

While random and successor list placement are two popular strategies, it is possible to choose a middle ground by, for example, using virtual nodes. By running a number of virtual nodes on each physical node, the number of replica sets a physical node participates in can be increased. The number of virtual nodes used per physical node allows the user to tune the system's performance anywhere between the two extremes presented here.
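The variance argument above can be checked with a small Monte Carlo experiment in the spirit of Figure 10's setup; this sketch uses our own simplified model (a uniform random failure set and one home position per block), not the exact simulator behind the figure:

```python
import random

def fraction_lost(n_nodes, n_fail, r, n_blocks, placement, rng):
    """Fraction of blocks whose entire replica set is inside a
    simultaneous failure of n_fail out of n_nodes servers."""
    failed = set(rng.sample(range(n_nodes), n_fail))
    lost = 0
    for _ in range(n_blocks):
        if placement == "successor":
            # Replicas on r consecutive ring positions after the block's home.
            home = rng.randrange(n_nodes)
            replicas = {(home + i) % n_nodes for i in range(r)}
        else:  # random placement: r nodes chosen uniformly
            replicas = set(rng.sample(range(n_nodes), r))
        if replicas <= failed:
            lost += 1
    return lost / n_blocks

rng = random.Random(0)
for policy in ("successor", "random"):
    trials = [fraction_lost(1000, 500, 4, 1000, policy, rng)
              for _ in range(200)]
    mean = sum(trials) / len(trials)
    var = sum((t - mean) ** 2 for t in trials) / len(trials)
    print(policy, round(mean, 4), round(var, 8))
```

Both policies should report similar mean loss fractions, with a visibly larger trial-to-trial variance under successor placement, matching the distribution shown in Figure 10.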
4.3 Synchronization Cost

While we have focused on the cost of making new copies after failures, the costs associated with monitoring data between failures cannot be ignored. A simple approach to monitoring would be to assume that a node's availability implies the availability of any data it has previously been known to store. This approach is inexpensive (the cost grows only with the number of nodes that must be probed) but incomplete: database corruption or configuration errors could result in continued uptime of a node while some or all stored data blocks are lost.

We consider a system in which the node responsible for a given key wishes to make an end-to-end check that data is stored on disk. This master node wishes to determine which replica nodes are not holding copies of the data item and which data items held on other nodes the master should be responsible for. The latter case occurs when a failure causes a node to become master for a new set of data.

DHash, the DHT used in CFS, uses a synchronization protocol based on Merkle trees to efficiently take advantage of the fact that most blocks are typically in the right place. This scheme is much more efficient than simply exchanging lists of 20-byte keys; in the common case where a replica node holds copies of all desired keys, the protocol can verify that all keys are stored with a single RPC. This protocol depends on the successor list placement scheme.

However, the Merkle synchronization protocol assumed that the r_L replicas of a block were rigidly placed on the r_L nodes immediately following the key in ID space. Sostenuto only requires that the r_L replicas be located in the successor list of the node responsible for the key. DHash is typically configured to maintain a successor list roughly twice as large as r_L. This flexibility allows nodes to join the system without transferring large amounts of data, as they would if the placement of replicas were rigidly enforced.

Unfortunately, this flexibility causes the synchronization protocol to operate outside of its common case: adjacent nodes are no longer likely to store nearly identical sets of blocks. The result is that each time the synchronizer runs, it is forced to transfer a list of keys. This transfer can be costly. If the synchronization protocol runs once a minute, the cost of repeatedly transferring the 20-byte name of an 8 KB data block will exceed the cost of fixing the "hole" in the replica set by transferring the block in about 8 hours. This problem is more severe when erasure coding is used, since blocks are smaller: the cost of transferring the associated keys more quickly outstrips the size of the data.

To avoid this problem we enhanced the synchronization protocol to deal efficiently with nodes missing blocks. We require the master node to track the identity of nodes missing copies of a given block. Using this information about missing replicas, the master can modify the synchronization protocol to avoid transferring redundant information. For instance, when synchronizing with a replica node n that is known to be missing a key k, the master leaves k out of the tree used for synchronization: this prevents n from reporting what the master already knew, that k is missing on n. This scheme requires the master node to keep additional state about where keys are missing within the successor list. However, this state is small relative to the size of the block itself, and it can be maintained lazily, unlike the hard state that the node stores for each key (namely the block itself).
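The break-even point quoted above follows from simple arithmetic on the paper's numbers (20-byte keys, 8 KB blocks, one synchronization per minute); the variable names below are ours:

```python
KEY_BYTES = 20           # size of a block name sent during synchronization
BLOCK_BYTES = 8 * 1024   # size of the data block itself
SYNCS_PER_HOUR = 60      # the protocol runs once a minute

# Hours until repeatedly naming a missing block costs more than
# simply transferring the block once to fill the "hole".
break_even_hours = BLOCK_BYTES / (KEY_BYTES * SYNCS_PER_HOUR)
print(break_even_hours)
```

This yields roughly 7 hours, consistent with the "about 8 hours" figure above; with smaller erasure-coded fragments, BLOCK_BYTES shrinks and the break-even point arrives correspondingly sooner.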
Total Recall proposes lazy repair, with low (r_L) and high (r_H) water marks. It initially creates r_H replicas. If the number of reachable replicas falls below r_L, Total Recall eagerly creates replicas until they number r_H. The extra r_H − r_L replicas let Total Recall avoid eager repair in the common case of temporary server failure. Total Recall chooses r_L based on the predicted probability of simultaneous server failure, and provides no guidance for how the administrator should set r_H. Sostenuto, in contrast, automatically maintains the high water mark, and Section 2.1 uses a more accurate rate-based model for r_L.

Glacier is a system for providing durable storage despite potential massively correlated failures. It uses a single fixed system-wide parameter to determine the amount of redundancy to place in the system; no additional storage space is ever used. Glacier uses aggregation to minimize the repair overhead for any individual block.

van Renesse and Schneider noted that random placement of replicas provides increased parallelism for replica recreation in the context of chain replication. We show that decreasing the time to create replicas improves durability.

Blake and Rodrigues argued that a DHT cannot provide high availability and scalable storage if node churn is high. To reduce overhead, they propose delaying the creation of new replicas in response to node failure until after a timeout has expired. Additional replicas are required to maintain availability while dead nodes time out. Sostenuto can be thought of as creating this level of replication adaptively while using, essentially, an infinite timeout following node failure. Blake et al. also assume a high churn rate and mostly focus on the case of permanent departures from the system, while we assume a system of more reliable, server-class machines with few permanent departures.

This work was motivated by making DHTGroups practical and reducing the cost of hosting a Usenet feed. DHTGroups did not consider the cost of data maintenance; minimizing that cost is our primary goal.

6 Conclusions

The cost of data maintenance is a roadblock to the deployment of large-scale distributed systems that store data. This paper has developed a model of block durability in large-scale distributed systems and described Sostenuto, a DHT that reduces the cost of data maintenance.

By reducing the cost of data maintenance in DHTs, we show that data-intensive distributed applications, such as DHTGroups, can be deployed with a net savings in storage and network resources. Other applications will benefit from Sostenuto as well: Overcite plans to distribute the load of serving the popular Citeseer document index over a DHT. Overcite requires the permanent storage of 1 TB of documents in a DHT.

Systems like Citeseer consume so many resources (35.4 GB of traffic per day, over 1 TB of disk) that it is difficult for a single volunteer organization to host them. By reducing the cost of distributing data in a wide-area system, Sostenuto makes it practical for a group of cooperative individuals or institutions to deploy such data-heavy systems by pooling network and storage resources. Hopefully, this deployment strategy will allow the development of new applications beyond those mentioned here.

References

Andersen, D. Improving End-to-End Availability Using Overlay Networks. PhD thesis, Massachusetts Institute of Technology, 2004.

Bhagwan, R., Savage, S., and Voelker, G. Understanding availability. In Proc. of the 2nd International Workshop on Peer-to-Peer Systems (Feb. 2003).

Bhagwan, R., Tati, K., Cheng, Y.-C., Savage, S., and Voelker, G. M. Total Recall: System support for automated availability management. In Proc. of the 1st Symposium on Networked Systems Design and Implementation (Mar. 2004).

Blake, C., and Rodrigues, R. High availability, scalable storage, dynamic peer networks: Pick two. In Proc. of the 9th Workshop on Hot Topics in Operating Systems (May 2003).

Bolosky, W. J., Douceur, J. R., Ely, D., and Theimer, M. Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. In Proc. of the 2000 SIGMETRICS (June 2000).

Cates, J. Robust and efficient data management for a distributed hash table. Master's thesis, Massachusetts Institute of Technology, May 2003.

Dabek, F., Kaashoek, M. F., Karger, D., Morris, R., and Stoica, I. Wide-area cooperative storage with CFS. In Proc. of the 18th ACM Symposium on Operating System Principles (Oct. 2001).

Dabek, F., Li, J., Sit, E., Robertson, J., Kaashoek, M. F., and Morris, R. Designing a DHT for low latency and high throughput. In Proc. of the 1st Symposium on Networked Systems Design and Implementation (Mar. 2004).

Elided for anonymity, A. DHTGroups: A low overhead Usenet server. Position paper about DHT-based Usenet, available on request.

Gribble, S., Brewer, E., Hellerstein, J., and Culler, D. Scalable, distributed data structures for internet service construction. In Proc. of the 4th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2000) (Oct. 2000).
Haeberlen, A., Mislove, A., and Druschel, P. Glacier: Highly durable, decentralized storage despite massive correlated failures. In Proc. of the 2nd Symposium on Networked Systems Design and Implementation (May 2005).

Lee, E. K., and Thekkath, C. A. Petal: Distributed virtual disks. In Proc. of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (Cambridge, MA, 1996), pp. 84-92.

Liskov, B., Ghemawat, S., Gruber, R., Johnson, P., Shrira, L., and Williams, M. Replication in the Harp file system. In SOSP '91 (1991), pp. 226-38.

Park, K. S., and Pai, V. CoMon: A monitoring infrastructure for PlanetLab. http://comon.cs.princeton.edu/.

Peterson, L., Anderson, T., Culler, D., and Roscoe, T. A blueprint for introducing disruptive technology into the Internet. In Proc. of HotNets-I (Oct. 2002). http://www.planet-lab.org.

PlanetLab: An open platform for developing, deploying and accessing planetary-scale services. http://www.planet-lab.org.

Rodrigues, R., and Liskov, B. High availability in DHTs: Erasure coding vs. replication. In Proc. of the 4th International Workshop on Peer-to-Peer Systems (Feb. 2005).

Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., Kaashoek, M. F., Dabek, F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking (2002), 149-160.

Stribling, J., Councill, I. G., Li, J., Kaashoek, M. F., Karger, D. R., Morris, R., and Shenker, S. OverCite: A cooperative digital research library. In Proc. of the 4th International Workshop on Peer-to-Peer Systems (Feb. 2005).

van Renesse, R., and Schneider, F. B. Chain replication for supporting high throughput and availability. In Proc. of the 6th Symposium on Operating Systems Design and Implementation (Dec. 2004).

Weatherspoon, H., and Kubiatowicz, J. D. Erasure coding vs. replication: A quantitative comparison. In Proc. of the 1st International Workshop on Peer-to-Peer Systems (Mar. 2002).

A Analysis of the Birth-Death Chain

This appendix contains derivations for the properties of the Markov model used to analyze the relationship between r_L, ρ, and block durability. Recall that we are analyzing a Markov chain in which each state corresponds to a number of live block replicas. The node failure rate is λ_f and the replica creation rate is µ.

Before considering the full model, we begin with a special case that illustrates the impact of the number of replicas r_L on reliability. We take µ = 0, which corresponds to a system with no repairs, and analyze the expected time to failure E[T]. When the system begins running, the failure rate is r_L·λ_f; the expected time to the first failure is 1/(r_L·λ_f). After the first failure, r_L − 1 replicas are running and the expected time to the next failure is 1/((r_L − 1)·λ_f). Generalizing, where T is the time until the last failure:

    E[T] = Σ_{i=1}^{r_L} 1/(i·λ_f) = (1/λ_f) Σ_{i=1}^{r_L} 1/i

This sum is the harmonic series, which can be approximated by the natural log to produce E[T] ≈ (1/λ_f)·log r_L. Intuitively, adding more replicas helps very little, since all of the replicas fail independently. The logarithmic gain comes from increasing the chance that one of the replicas will be "lucky" and last a long time. In other words, to produce a linear increase in time to failure, we must increase r_L exponentially.

Returning to the full model, we are interested in the probability of reaching state 0, where the block has no remaining replicas. Since this model is a birth-death process, we can write P_r, the probability that r replicas exist, in terms of P_0:

    P_r = P_0 · µ^r / (λ_f^r · r!)

The factorial comes from the coefficients on the λ_f terms. Let ρ = µ/λ_f > 1. Using the law of total probability,

    1 = Σ_{r=0}^{∞} P_r = P_0 Σ_{r=0}^{∞} ρ^r/r! = P_0 · e^ρ.

This gives P_0 = e^{−ρ} and P_r = e^{−ρ}·ρ^r/r!.

We are also interested in the expected value of r, the current state of the system. This corresponds to the number of replicas we expect the system to maintain in the steady state. Summing r·P_r (and noting that the first term (r = 0) is zero):

    E[r] = 0 + Σ_{r=1}^{∞} r·P_r
         = Σ_{r=1}^{∞} r·e^{−ρ}·ρ^r/r!
         = e^{−ρ}·ρ · Σ_{r=1}^{∞} ρ^{r−1}/(r − 1)!
         = e^{−ρ} · ρ · e^ρ
         = ρ.
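The stationary distribution derived above is Poisson with mean ρ, so the results P_0 = e^{−ρ} and E[r] = ρ can be verified numerically by truncating the series; the helper below is our own sketch:

```python
import math

def stationary(rho, rmax=100):
    """P_r = e^{-rho} * rho^r / r! for r = 0..rmax, computed with the
    recurrence P_r = P_{r-1} * rho / r to avoid large factorials."""
    p = math.exp(-rho)
    out = [p]
    for r in range(1, rmax + 1):
        p *= rho / r
        out.append(p)
    return out

rho = 10.0
P = stationary(rho)
total = sum(P)                               # should be ~1 (normalization)
expected_r = sum(r * p for r, p in enumerate(P))  # should be ~rho
print(round(total, 6), round(expected_r, 6))
print(P[0] == math.exp(-rho))                # P_0 = e^{-rho} exactly here
```

The truncation at rmax = 100 is harmless for ρ = 10: the omitted tail of the Poisson series is astronomically small.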