The TickerTAIP Parallel RAID Architecture
PEI CAO, SWEE BOON LIM, SHIVAKUMAR VENKATARAMAN, and JOHN WILKES
Hewlett-Packard Laboratories
Traditional disk arrays have a centralized architecture, with a single controller through which all requests flow. Such a controller is a single point of failure, and its performance limits the maximum number of disks to which the array can scale. We describe TickerTAIP, a parallel architecture for disk arrays that distributes the controller functions across several loosely coupled processors. The result is better scalability, fault tolerance, and flexibility.
This article presents the TickerTAIP architecture and an evaluation of its behavior. We demonstrate the feasibility by a working example, describe a family of distributed algorithms for calculating RAID parity, discuss techniques for establishing request atomicity, sequencing, and recovery, and evaluate the performance of the TickerTAIP design in both absolute terms and by comparison to a centralized RAID implementation. We also analyze the effects of including disk-level request-scheduling algorithms inside the array. We conclude that the TickerTAIP architectural approach is feasible, useful, and effective.
Categories and Subject Descriptors: B.4.2 [Input/Output and Data Communications]: Input/Output Devices—channels and controllers; D.1.3 [Programming Techniques]: Concurrent Programming—parallel programming; D.4.2 [Operating Systems]: Storage Management—secondary storage; D.4.7 [Operating Systems]: Organization and Design—distributed systems
General Terms: Algorithms, Design, Performance, Reliability
Additional Key Words and Phrases: Decentralized parity calculation, disk scheduling, distributed controller, fault tolerance, parallel controller, performance simulation, RAID disk array
1. INTRODUCTION
A disk array is a structure that connects several disks together to extend the cost, power, and space advantages of small disks to higher-capacity configurations. By providing partial redundancy such as parity, availability can be
An earlier version of this article was presented at the 1993 International Symposium on Computer Architecture.
Authors' addresses: P. Cao, Princeton University, Department of Computer Science, Princeton, NJ 08540; email: pc@cs.princeton.edu; S. B. Lim, University of Illinois, Department of Computer Science, 1304 W. Springfield Avenue, Urbana, IL 61801; email: sblim@cs.uiuc.edu; S. Venkataraman, University of Wisconsin, Department of Computer Science, 1210 West Dayton Street, Madison, WI 53706; email: venkatar@cs.wisc.edu; J. Wilkes, Hewlett-Packard Laboratories, P.O. Box 10490, 1U13, Palo Alto, CA 94304-0969; email: wilkes@hpl.hp.com.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1994 ACM 0734-2071/94/0800-0236 $03.50
ACM Transactions on Computer Systems, Vol. 12, No. 3, August 1994, Pages 236-269.
Table I. Some Common RAID Levels

Level  Redundancy technique             Placement of redundant data
0      none (striping)                  none
1      mirroring                        complete disks
3      parity across a stripe of data   one disk dedicated to parity
5      parity across a stripe of data   parity rotates round-robin across all disks

Fig. 1. Traditional RAID array architecture.
increased as well. Such RAIDs (for Redundant Arrays of Inexpensive Disks) were first described in the early 1980s [Lawlor 1981; Park and Balasubramanian 1986], and popularized by the work of a group at UC Berkeley [Patterson et al. 1988; 1989]. The RAID terminology encompasses a number of different levels, corresponding to different amounts of redundancy and placement of the redundant data. The most commonly encountered of these are summarized in Table I.
To implement a disk array that provides one or more of the RAID levels, the traditional RAID array architecture, shown in Figure 1, has a central controller, one or more disks, and multiple head-of-string disk interfaces. The RAID controller interfaces to the host, processes read and write requests, and
Fig. 2. TickerTAIP array architecture.
carries out parity calculations, block placement, and data recovery after a failure. The disk interfaces pass on the commands from the controller to the disks via a disk interconnect of some sort—these days, most often a variety of the Small Computer System Interface, or SCSI bus [SCSI 1991].
Obviously, the capabilities of the RAID controller are crucial to the performance and availability of the system. If the controller's bandwidth, processing power, or capacity are inadequate, the performance of the array as a whole will suffer. (This is increasingly likely to happen: for example, parity calculation is memory bound, and memory speeds have not kept pace with recent CPU performance improvements [Ousterhout 1990].) A high latency through the controller can reduce the performance of small requests. The single point of failure that the controller represents can also be a concern: failure rates for disk drives and packaged electronics are now similar, and one of the primary motivations for RAID arrays is to survive the failure rates that result from having many disks in a system.
Although some commercial RAID array products include spare RAID controllers, they are not normally simultaneously active: one typically acts as a backup for the other, and is held in reserve until it is needed because of failure of the primary. (For example, this technique was suggested in Gray et al. [1990].) This is expensive: the backup has to have all the capacity of the primary controller, but it provides no useful services in normal operation. Alternatively, both controllers are active simultaneously, but over disjoint sets of disks. This limits the performance available from the array, even though the controllers can be fully utilized.
To address these concerns, we have developed the TickerTAIP architecture for parallel RAIDs. In this architecture (Figure 2), there is no central controller: it has been replaced by a cooperating set of array controller nodes that together provide all the functions needed by operating in parallel. The TickerTAIP architecture offers several benefits, including: fault tolerance (no central controller to break), performance scalability (no central bottleneck), smooth incremental growth (by simply adding another node), and flexibility (it is easy to mix and match components).
This article provides an evaluation of the TickerTAIP architecture. Its main emphasis is on the techniques used to provide parallel, distributed controller functions and their effectiveness.
1.1 Outline
We begin this article by presenting an overview of the TickerTAIP architecture and related work, and follow it with a detailed description of several design issues, including descriptions and evaluations of algorithms for parity calculation, recovery from controller failure, and extensions to provide sequencing of concurrent requests from multiple hosts.
To evaluate TickerTAIP we constructed a working prototype as a functional testbed, and then built a detailed event-based simulation that we calibrated against this prototype. These tools are presented as background material for the simulation-based performance analysis of TickerTAIP that follows, with particular emphasis on comparing it against a centralized RAID implementation. We conclude with a summary of our results.
2. THE TICKERTAIP ARCHITECTURE
A TickerTAIP array is composed of a number of worker nodes, which are nodes with one or more local disks connected through a bus. Originator nodes provide connections to host computer clients. The nodes are connected to one another by a high-performance, small-area network with sufficient internal redundancy to survive single failures. Mesh-based switching fabrics can achieve the bandwidth, latency, and availability needs with reasonable costs and complexity across a reasonable scale of array sizes. A design that would meet the performance, scalability, and fault tolerance needs of a TickerTAIP array is described in Wilkes [1991]. A similar scheme has been described in Shin [1991]. For smaller arrays, the interconnect could even be a pair of backplanes. (For example, PCI is capable of running at well over 100MB/s [PCI 1994], which would support 15–20 disks for bandwidth-limited applications, or many hundred if the workload was small, random I/Os.) Multiple, independent arrays will become more cost effective at some sufficiently large scale: no interconnect scales perfectly. However, TickerTAIP's requirements on the switching fabric are relatively light, and this point will probably only be reached with arrays that are so large that such a split will probably be desirable for other reasons.
In Figure 2, the nodes are shown as being both workers and originators: that is, they have both host and disk connections. As a result, designing a node requires that both the host interface and the disk interface be designed together, and adding a disk node requires paying for another host connection. A second design that avoids these problems is shown in Figure 3. It uses separate disk-controller (worker) nodes and host-interface (originator) nodes. This allows arbitrary mixing and matching of node types (such as SCSI-originator, FDDI-originator, IPI-worker, SCSI-worker), which makes building a TickerTAIP array with several different kinds of host interface simply a configuration-time question, not a design-time one. Since each node is plug compatible from the point of view of the internal interconnect, it is easy to configure an array with any desired ratio of worker and originator nodes, a flexibility less easily achieved in the traditional centralized architecture.

Fig. 3. TickerTAIP system environment.

Figure 3 shows the environment in which we envision a TickerTAIP array operating. The array provides disk services to one or more host computers through the originator nodes. There may be several originator nodes, each connected to a different host; alternatively, a single host can be connected to multiple originators for higher performance and greater failure resilience. For simplicity, we require that all data for a request be returned to the host along the path used to issue the request.
In the context of this model, a traditional RAID array looks like a TickerTAIP array with several unintelligent worker nodes, a single originator node on which all the parity calculations take place, and shared-memory communication between the components.
One assumption we made in this study is that parity calculation is a driving factor in determining the performance of a RAID array. The TickerTAIP architecture does these calculations in a decentralized fashion; other high-speed array controller designs (e.g., RAID-II [Drapeau et al. 1994]) use a central parity calculation engine. Our approach is predicated on two beliefs: (1) that processors are cost-effective engines for calculating parity and (2) that memory bandwidth, rather than processor cycles, is the determining cost factor in providing this functionality. (By way of example, the bandwidth and functionality requirements of the RAID-II engine required a controller card nearly two feet on a side.) The TickerTAIP architecture reduces the per-processor parity calculation requirements sufficiently far that the cheap commodity microprocessors it uses for the control functions can also be used as the parity calculation engines. At this point, the diseconomies of scale associated with providing high-bandwidth data paths to a hardware parity calculation engine will overwhelm any intrinsic simplicity in the use of specialized logic to perform the exclusive-OR calculation. As the performance of commodity microprocessors continues to improve at its current rate, the range of array sizes over which this argument holds will only increase.
2.1 Related Work
Many papers have been published on RAID reliability, performance, and on design variations for parity placement and recovery schemes [Clark et al. 1988; Dunphy et al. 1990; Gibson et al. 1989; Gray et al. 1990; Lee 1990; Muntz and Lui 1990; Menon and Kasson 1992; Schulze et al. 1989; Holland and Gibson 1992]. Our work builds on these studies: we concentrate here on the architectural issues of parallelizing the techniques used in a centralized RAID array, so we take such work as a given—and assume familiarity with basic RAID concepts in the following discussion.
The HP7937 family of disks realizes a physical architecture similar to that of TickerTAIP [Hewlett-Packard 1988]. These disks can be connected together by a 10MB/s bus, which allows access to "remote" disks as well as fast switch-over between attached hosts in the event of system failure. No multidisk functions (such as a disk array) were provided, however.
Several database systems use a hardware "shared-nothing" architecture similar to that adopted for TickerTAIP, including Bubba [Boral 1988; Copeland et al. 1988], Gamma [DeWitt et al. 1986; 1988], Teradata [Neches 1984; Sloan 1992], and Tandem [Bartlett et al. 1990; Siewiorek and Swarz 1992]. However, none appears to use a distributed RAID implementation across multiple nodes, and all are intended as database engines rather than parallel implementations of RAID. On the other hand, TickerTAIP makes extensive use of well-known techniques such as two-phase commit and partial-write ordering from the database community [Gray 1978].
A proposal was made to connect networks of processors to form a widely distributed RAID controller in Stonebraker [1989]. This approach was called RADD—Redundant Arrays of Distributed Disks. It proposed using disks spread across a wide-area network to improve availability in the face of a site failure. In contrast to the RADD study, we emphasize the use of parallelism inside a single RAID server; we assume the kind of fast, reliable interconnect that is easily constructed inside a single-server cabinet; we couple processors and disks closely, so that a node failure is treated as (one or more) disk failures; and we provide much improved performance analyses—Stonebraker used "all disk operations take 30 ms." The result is a new, detailed characterization of the parallel RAID design approach in a significantly different environment.
3. DESIGN ISSUES
This section describes the TickerTAIP design issues in some detail. It begins with an examination of normal mode operation (i.e., in the absence of faults) and then examines the support needed to cope with failures. Table II may prove helpful in understanding the data layout used for the RAID 5 array we are describing.
Table II. Data Layout for a 5-Disk Left-Symmetric RAID 5 Array [Lee 1990]
[Table body: grid of logical block numbers, one row per stripe and one column per disk.]
Each column represents a disk. The shaded, outlined area represents one possible request that spans most of stripe 1 (a "large stripe"), all of stripes 2 and 3 ("full stripes"), and a small amount of stripe 4 (a "small stripe"). Parity blocks have darker shading and are marked with a P.
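As an illustration only, a left-symmetric RAID 5 layout of the kind Table II shows can be generated with a short rule. The sketch below follows the commonly used definition of left-symmetric placement (parity moves one disk to the left on each successive stripe, and data blocks continue on the disk after the parity block, wrapping around). Numbering blocks, stripes, and disks from 0 is an assumption, and the function names `left_symmetric_layout` and `locate` are hypothetical.

```python
def left_symmetric_layout(num_disks, num_stripes):
    """Return one row per stripe; each row lists what every disk holds.

    A cell is either the string "P" (parity) or a logical block number.
    Numbering from 0 is an assumption made for this sketch.
    """
    data_per_stripe = num_disks - 1
    rows = []
    for stripe in range(num_stripes):
        # Parity starts on the last disk and rotates left by one per stripe.
        parity_disk = (num_disks - 1 - stripe) % num_disks
        row = [None] * num_disks
        row[parity_disk] = "P"
        for i in range(data_per_stripe):
            # Data continues on the disk after the parity block, wrapping.
            disk = (parity_disk + 1 + i) % num_disks
            row[disk] = stripe * data_per_stripe + i
        rows.append(row)
    return rows


def locate(block, num_disks):
    """Map a logical block number to its (stripe, disk) position."""
    stripe, i = divmod(block, num_disks - 1)
    parity_disk = (num_disks - 1 - stripe) % num_disks
    return stripe, (parity_disk + 1 + i) % num_disks


if __name__ == "__main__":
    for row in left_symmetric_layout(5, 5):
        print(" ".join(f"{cell:>3}" for cell in row))
```

One consequence of this placement is that consecutive logical blocks fall on consecutive disks, so a long sequential request touches every disk in turn.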
3.1 Normal-Mode Reads
In normal mode, no parity computation is required for reads, so they are quite straightforward. All the necessary data is read at the workers and forwarded to the originator, where it is assembled, and transmitted to the host in the correct order. The main performance issue that arises has to do with skipping over the parity blocks: we found it beneficial to perform sequential reads of both data and parity, and then to discard the parity blocks inside the worker nodes, rather than to generate separate requests that omitted reading the parity blocks.
3.2 Normal-Mode Writes
In a RAID array, writes require calculation or modification of stored parity to maintain the partial data redundancy. Each stripe is considered separately in determining the method and site for parity computation, since this is the unit across which the partial redundancy is maintained. The discussion that follows describes the algorithms executed on each of the stripes that a single request spans.
3.2.1 How to Calculate New Parity. The first design choice is how to calculate the new parity. There are three alternatives, depending upon how much of the stripe is being updated (Figure 4):
—full stripe: all of the data blocks in the stripes have to be written, and parity can be calculated entirely from the new data;
—small stripe: less than half of the data blocks in a stripe are to be written, and parity is calculated by first reading the old data of the blocks that will be written, XORing them with the new data, and then XORing the results with the old parity block data;
—large stripe: more than half of the data blocks in the stripe are to be written; the new parity block can be computed by reading the data blocks in the stripe that are not being written and XORing them with the new data (i.e., reducing this to the full-stripe case) [Chen et al. 1990].
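The three alternatives above can be sketched in a few lines of Python. This is an illustrative sketch rather than the TickerTAIP code: the names `xor_blocks`, `choose_strategy`, and `new_parity` are hypothetical, blocks are modeled as equal-length byte strings, and treating a write of exactly half the stripe as a large-stripe update is an assumption, since the text only covers "less than half" and "more than half".

```python
from functools import reduce


def xor_blocks(*blocks):
    """Bytewise XOR of one or more equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def choose_strategy(num_written, num_data_blocks):
    """Pick the parity-update policy for one stripe."""
    if num_written == num_data_blocks:
        return "full"
    if 2 * num_written < num_data_blocks:
        return "small"
    return "large"  # assumption: exactly half is handled as large-stripe


def new_parity(old_data, old_parity, updates):
    """Compute the new parity block for a stripe.

    old_data:   list of the stripe's current data blocks
    old_parity: the stripe's current parity block
    updates:    dict mapping data-block index -> new block contents
    """
    n = len(old_data)
    strategy = choose_strategy(len(updates), n)
    if strategy == "full":
        # Parity comes entirely from the new data; nothing is read.
        parity = xor_blocks(*updates.values())
    elif strategy == "small":
        # Read-modify-write: XOR old and new data, then fold into old parity.
        deltas = [xor_blocks(old_data[i], new) for i, new in updates.items()]
        parity = xor_blocks(old_parity, *deltas)
    else:
        # Read the blocks that are not being written and treat the
        # result as a full-stripe calculation.
        untouched = [old_data[i] for i in range(n) if i not in updates]
        parity = xor_blocks(*untouched, *updates.values())
    return strategy, parity
```

All three branches produce the same parity as recomputing it from the final stripe contents; they differ only in which blocks must be read first.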
Fig. 4. Three different stripe update size policies: (a) full stripe; (b) small stripe; (c) large stripe. x indicates where parity calculations occur.
Notice that a single request might span all three kinds of stripes, although all but the first and last stripe will always be full ones. The large-stripe mode is just a (potential) performance optimization, since the right behavior can be obtained from using just the small- and full-stripe cases. We discuss whether it is beneficial in practice later.
3.2.2 Where to Calculate New Parity. The second design consideration is where the parity is to be calculated. Traditional centralized RAID architectures calculate all parity at the originator node, since only it has the necessary processing capability. In TickerTAIP, every node has a processor, so there are several choices. The key design goal is to load balance the work among the nodes—in particular, to spread out the parity calculations over as many nodes as possible. Here are three possibilities (shown in Figure 5):
—at originator: all parity calculations are done at the originator;
—solely-parity: all parity calculations for a stripe take place at the parity node for that stripe;
—at-parity: same as for solely-parity, except that partial results during a small-stripe write are calculated at the worker nodes and shipped to the parity node.
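To make the difference between the solely-parity and at-parity options concrete, the following sketch models a small-stripe write under both. It is a simplified illustration, not the TickerTAIP protocol: the `Worker` class and both functions are hypothetical, and the accounting (one block shipped to the parity node per written block for at-parity, two for solely-parity) is an assumption about what has to reach the parity node in each case.

```python
def xor(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


class Worker:
    """A worker node holding the old contents of one data block."""

    def __init__(self, block):
        self.block = block

    def partial_result(self, new_block):
        # At-parity: the worker XORs old and new data locally.
        return xor(self.block, new_block)


def at_parity_small_write(workers, old_parity, updates):
    """Workers ship one precomputed partial result each to the parity node."""
    shipped = [workers[i].partial_result(new) for i, new in updates.items()]
    parity = old_parity
    for delta in shipped:
        parity = xor(parity, delta)
    return parity, len(shipped)


def solely_parity_small_write(workers, old_parity, updates):
    """The parity node receives old and new data and does all the XORs."""
    shipped = []
    for i, new in updates.items():
        shipped.append(workers[i].block)  # old data from the worker
        shipped.append(new)               # new data for the same block
    parity = old_parity
    for block in shipped:
        parity = xor(parity, block)
    return parity, len(shipped)
```

Both functions return the same parity; under the stated accounting, at-parity moves fewer blocks to the parity node and spreads part of the XOR work across the workers.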
Fig. 5. Three different places to calculate parity: (a) at originator; (b) solely parity; (c) at parity. @ indicates the originator node; X indicates the node where parity calculations occur.
The solely-parity scheme always uses more messages than the at-parity one, so we did not pursue it further. We provide performance comparisons between the other two later in the article.
3.3 Single Failures—Request Atomicity
We begin with a discussion of single-point failures. Notice that a primary goal of a regular RAID array is to survive single-disk failures.¹ The TickerTAIP architecture extends this to include failure of a part of its distributed controller: we do not make the simplifying assumption that the controller is not a possible failure point. The way in which TickerTAIP is intended to be used provides duplex paths to its host (see Figure 3), and since there are several techniques for doing so, we have legislated that the internal interconnect fabric is itself single-fault resilient. As a result the overall architecture is
¹There are variants on the parity calculation scheme that can compensate for multiple disk failures [Gibson et al. 1989]. The extension of the TickerTAIP architecture to cover these cases is straightforward, and not discussed further here.
Table III. Algorithms used to perform a write in failure mode, as a function of the kind of block being written to, and the amount of the stripe being updated

                             physical stripe size
block type on failed disk    small                   large                   full
parity                       none                    none                    none
updated                      large-stripe strategy   large-stripe strategy   full-stripe strategy
not updated                  small-stripe strategy   small-stripe strategy   —
capable of surviving a fault in any single system component. However, there are certain requirements on the software algorithms used at the nodes in a TickerTAIP system to ensure correct operation in the presence of faults. This section discusses the first of them: the need to provide request atomicity.
Just as with a regular RAID array, packaging and power-supply issues are very important if the system availability is to be maximized. Some of these decisions are discussed in Schulze [1988] and Schulze et al. [1989]; the design approach used for these questions in a TickerTAIP-based array is identical to that used for a regular disk array.
3.3.1 Disk Failure. In TickerTAIP, a disk failure is treated in just the same way as in a traditional RAID array: the array continues operation in degraded (failed) mode until the disk is repaired or replaced; the contents of the new disk are reconstructed; and execution resumes in normal mode. From the outside of the array, the effect is as if nothing has happened. Inside, appropriate data reconstructions occur on reads, and I/O operations to the failed disk are suppressed. Exactly the same algorithm is used if an entire disk string goes bad for some reason. The algorithms are summarized in Table III.
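The read-side reconstruction mentioned above relies on the standard RAID parity property: any one block of a stripe equals the XOR of all the others. A minimal sketch, with hypothetical names (`xor_all`, `degraded_read`) and one stripe modeled as a list of byte strings whose last entry is the parity block:

```python
def xor_all(blocks):
    """Bytewise XOR of a non-empty list of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def degraded_read(stripe, wanted, failed):
    """Read block `wanted` from a stripe whose disk `failed` is unavailable.

    stripe: list of blocks (data blocks plus the parity block); the
            entry at index `failed` is never touched.
    """
    if wanted != failed:
        return stripe[wanted]  # the block is on a healthy disk
    # Reconstruct the lost block from every surviving block in the stripe.
    survivors = [b for i, b in enumerate(stripe) if i != failed]
    return xor_all(survivors)
```

The same function serves a failed data disk and a failed parity disk, because parity is simply one more member of the XOR group.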
3.3.2 Worker Failure. A TickerTAIP worker failure is treated just like a disk failure, and is masked in just the same way. (Just as with a regular RAID controller with multiple disks per head-of-string controller, a failing worker means that an entire column of disks is lost at once, but the same recovery algorithms apply.) We assume fail-silent nodes so that we can significantly simplify the fault-isolation and normal-case protocols we use between the nodes. In practice, the isolation offered by the networking protocols used to communicate between nodes is likely to make this assumption realistic in practice for all but the most extreme cases—for which RAID arrays are probably not appropriate choices. (In support of our position, Gray [1988] explains that the complexities of handling the more complicated Byzantine failure modes are rarely deemed worthwhile in practice.)
A node is suspected to have failed if it does not respond to an "are you alive" request within a reasonable time. (This is the only place that such time-outs occur, to simplify the maintenance of other portions of the system.) The node that detects a failure of another node initiates a distributed consensus protocol much like two-phase commit, taking the role of coordinator of the consensus protocol. All the remaining nodes reach agreement by this means on the number and identity of the failed node(s). This protocol ensures that all the remaining nodes enter failure mode at the same time. Multiple failures cause the array to shut itself down safely to prevent possible data corruption.
3.3.3 Originator Failure and Request Atomicity. Failure of a node with an originator on it brings new concerns: a channel to a host is lost; any worker on the same node will be lost as well; and the fate of requests that arrived through this node needs to be determined since the failed originator was responsible for coordinating their execution.
Originator failures during reads are fairly simple: the read operation is aborted since there is no longer a route to communicate its results back to the host. Failures during write operations are more complicated, because different portions of the write could be at different stages unless extra steps are taken to avoid compromising the consistency of the stripes involved in the write.
Worst is failure of a node that is both a worker and an originator, since it will be the only one with a copy of the data destined for its own disk. (For example, if such a node fails during a partial-stripe write after some of the blocks in the stripe have been written, it may not be possible to reconstruct the state of the entire stripe, violating the rule that a single failure must be masked.)
Our solution to both these concerns is to ensure write atomicity: that is, either a write operation completes successfully, or it makes no changes. Notice that this is a much stronger guarantee than provided by single disk drives or non-parity-protected disk arrays. With these, the content of a range of logical blocks being written to is indeterminate until the write completes successfully. If a write request is aborted or fails, the contents of the targeted range will be in an indeterminate state.
To achieve this guarantee, we added a two-phase commit protocol to write operations. Before a write can proceed, sufficient data must be replicated in more than one node's memory to let the operation restart and complete—even if an arbitrary node fails. If this cannot be achieved, the request is aborted before it can make any changes. (A similar problem occurs in disk controllers that have two-part nonvolatile write caches that must tolerate failure of either cache half, possibly in conjunction with concurrent RAID disk failures. This issue is discussed in Menon and Courtney [1993]; similar solutions to the one we adopted serve there as well.)
We identified two approaches to implementing the two-phase commit: early commit tries to make the decision as quickly as possible; late commit delays its commit decision until all that is left to do are the writes. We describe them in reverse order, since late commit is the simpler of the two.
In late commit, the commit point (when it decides whether to continue or not) is reached only after the parity has been computed. The reason for this choice is that the computed parity data, suitably distributed, provides exactly the partial redundancy needed. In late commit, all that remains to be done after the commit decision is to perform the writes.
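The late-commit idea can be illustrated with a toy model of a full-stripe write. This is a sketch under simplifying assumptions, not the TickerTAIP protocol: the `Node` class, `NodeFailed` exception, and `late_commit_full_stripe_write` function are hypothetical, only failures before the commit point are modeled, and message passing is replaced by direct method calls. New data and the computed parity are first staged in node memory; the commit point is reached once the parity has been computed and distributed; only then does any node write to disk.

```python
from functools import reduce


class NodeFailed(Exception):
    """Raised when a (hypothetical) node cannot accept staged data."""


class Node:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.staged = None  # block held in memory, not yet on disk
        self.disk = None    # block currently on disk

    def stage(self, block):
        if not self.alive:
            raise NodeFailed(self.name)
        self.staged = block

    def discard(self):
        self.staged = None

    def write_staged(self):
        self.disk, self.staged = self.staged, None


def xor_all(blocks):
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def late_commit_full_stripe_write(data_nodes, parity_node, new_blocks):
    """Write one full stripe atomically; return "committed" or "aborted"."""
    touched = []
    try:
        # Phase 1: distribute the new data, then compute and place parity.
        for node, block in zip(data_nodes, new_blocks):
            node.stage(block)
            touched.append(node)
        parity_node.stage(xor_all(new_blocks))
        touched.append(parity_node)
    except NodeFailed:
        # Commit point not reached: undo staging, no disk was modified.
        for node in touched:
            node.discard()
        return "aborted"
    # Commit point: parity is computed and distributed, so any single
    # staged block could be rebuilt from the others. Only writes remain.
    for node in touched:
        node.write_staged()
    return "committed"
```

Because nothing reaches a disk before the commit point, an abort in this model leaves every disk exactly as it was, which is the atomicity property described above.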
Table IV. Data needed for recovering a stripe during a write, and the stripe-size strategy used to do so

                                           stripe size
failed node  block type at failed node     small stripe                                 large stripe                                 full stripe
originator   updated                       parity node has copy; large-stripe strategy  parity node has copy; large-stripe strategy  updated and parity nodes have copy; full-stripe strategy
originator   parity                        parity not computed                          parity not computed                          parity not computed
originator   not updated                   —                                            parity node has copy; large-stripe strategy  —
worker       updated                       originator has copy; large-stripe strategy   originator has copy; large-stripe strategy   originator has copy; full-stripe strategy
worker       parity                        parity not computed                          parity not computed                          parity not computed
worker       not updated                   —                                            parity node has copy; large-stripe strategy  —
In early commit, the goal is for the array to get to its commit point as quickly as possible during the execution of the request. This requires that the new data destined for the originator/worker node has to be replicated elsewhere, in case the originator fails after commit. The same must be done for old data being read as part of a large-stripe write, in case the reading node fails before it can provide the data. We duplicate this data on the parity nodes of the affected stripes—this involves sending no additional data in the case of parity calculations at the parity node (which we will see below is the preferred policy). The commit point is reached as soon as the necessary redundancy has been achieved.
Late commit is much easier to implement, but has somewhat lower concurrency and higher request latency. We explore the magnitude of this cost later.
A write operation is aborted if any involved worker does not reach its commit point. When a worker node fails, the originator is responsible for restarting the operation. In the event of an originator failure, a temporary originator, chosen from among those nodes that were already participating in the request, is elected to complete or abort its processing. Choosing one of the nodes that was already participating in the request minimizes the data and control traffic interchanges, since these nodes already have the necessary information about the request itself.
Table IV summarizes the different cases that need to be accommodated. For each combination of node role and block type that has been lost, it shows which node has a copy of the data required for recovery, and the write policy that must be applied to the stripe.
3.4 Multiple Failures—Request Sequences
This section discusses measures designed to help limit the effects of multiple concurrent failures. The RAID architecture tolerates any single disk failure. However, it provides no behavior guarantees in the event of multiple failures (especially power-fail), and it does not ensure the independence of overlapping requests that are executing simultaneously. In the terminology of Lampson and Sturgis [1981], multiple failures are disasters: events outside the covered fault set for RAID. TickerTAIP introduces coverage for partial controller failures; and it goes beyond this by using request sequencing to limit the effects of multiple failures in a way that is useful to file system designers.
As with a regular RAID, a power-fail during a write can corrupt the stripe being written to unless more extensive recovery techniques (such as intentions logging) are used—in this respect, TickerTAIP is exactly emulating the RAID failure model. Power failures can be handled by the use of an uninterruptible power supply for both TickerTAIP and a regular RAID array, but TickerTAIP's request sequencing also provides improved performance to hosts wishing to tolerate crashes and other failures. Strengthening the regular RAID failure guarantees in the controller follows naturally as a consequence of wanting to maximize performance in the array; in turn, doing so at the lower level allows the host to simplify its own failure recovery mechanisms.
3.4.1 Requirements. File system designers typically rely on the presence of ordering invariants to allow them to recover from crashes or power failure. For example, in 4.2 BSD-based file systems, metadata (inode and directory)
writes must occur before the data to which they refer is allowed to reach the disk [McKusick et al. 1984]. The simplest way to achieve this is to defer queueing the data write until the metadata write has completed. Unfortunately, this can severely limit concurrency: for example, parity calculations can no longer be overlapped with the execution of the previous request. This is unfortunate, and becoming more so, as the technology of disk drives improves to include command queueing, immediate reporting, and more nearly optimal request sequencing that exploits position information available only at the disk itself [Seltzer et al. 1990; Jacobson and Wilkes 1991; Ruemmler and Wilkes 1993].
A better way to achieve the desired invariant is to provide—and preserve—partial write orderings in the I/O subsystem. This technique can significantly improve file system performance. From our perspective as RAID array designers, it also allows the RAID array to make more intelligent decisions about request scheduling. We discuss the effects of some of these scheduling decisions later in the article.
A TickerTAIP array can be configured to support multiple hosts. As a result, some mechanism needs to be provided to let requests from different hosts be serialized without recourse to either sending all requests through a single host or requiring one request to complete before the next can be issued.
Finally, multiple overlapping requests from a single or multiple hosts can be in flight simultaneously. This could lead to parts of one write replacing parts of another in a nonserializable fashion, which clearly should be prevented. (Our write commit protocols provide atomicity for each request, but no serializability guarantees.)
3.4.2 Request Sequencing. To address these requirements, we introduced a request-sequencing mechanism using partial orderings for both reads and writes.
Internally, these are represented in the form of directed acyclic graphs (DAGs): each request is represented by a node in the DAG, while the edges of the DAG represent dependencies between requests. To express the DAG, each request is given a unique identifier. A request is allowed to list one or more requests on which it depends explicitly; TickerTAIP guarantees that the effect is as if no request begins until the requests on which it depends complete (this allows the implementation the freedom to perform eager evaluation, some of which we exploited in our testbed prototype). If a request is aborted, all requests that depend explicitly on it are also aborted (and so on, transitively). If a host later wishes to reissue any of the aborted dependent requests, it is free to do so, of course.

Having TickerTAIP itself propagate the abort to dependent requests preserves sequencing guarantees without requiring a handshake with the host on every operation in the normal (error-free) case. An alternative design² would have TickerTAIP enter a special mode once it detects an abort, during which it would execute no requests until all the hosts had acknowledged that they had aborted any and all requests that depended on the failed one. This would push the dependency-handling back up into the hosts, at the cost of a more complicated and fragile recovery protocol. Additionally, giving the dependency data to TickerTAIP allows it to determine which requests can be executed in parallel, thereby improving performance in the normal case.

Also, TickerTAIP will arbitrarily assign sufficient implicit dependencies to prevent overlapping requests from executing concurrently. Aborts are not propagated across implicit dependencies. In the absence of explicit dependencies, the order in which requests are serviced is some arbitrary serializable schedule.

3.4.3 Sequencer States. The management of request sequencing is performed through a high-level state table, with the following states (diagramed, with their transitions, in Figure 6):
- NotIssued: the request itself has not yet reached TickerTAIP, but another request has referred to this request in its dependency list.
- Unresolved: the request has been issued, but it depends on a request that has not yet reached the TickerTAIP array.
- Resolved: all of the requests that this one depends on have arrived at the array, but at least one has yet to complete.
- InProgress: all of a request's dependencies have been satisfied, so it has begun executing.
- Completed: a request has successfully finished.
- Aborted: a request was aborted, or a request on which this request explicitly depended has been aborted.
² Due to one of the anonymous reviewers.

Fig. 6. States of a request. An "antidependent" is a request that this request is waiting for. [Transition labels in the diagram: referenced by another request; issued by a host; antidependents resolved; antidependents completed.]
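The state table and the transitive abort rule can be illustrated with a small sketch. This is not the prototype's code (which was written in C); it is a minimal single-node model, assuming hypothetical names (`Sequencer`, `issue`, `complete`, `abort`) and ignoring implicit dependencies between overlapping requests, the backup sequencer, and garbage collection of old state.

```python
from enum import Enum, auto


class State(Enum):
    NOT_ISSUED = auto()   # referenced by another request, but not yet received
    UNRESOLVED = auto()   # issued; some dependency has not reached the array
    RESOLVED = auto()     # all dependencies have arrived; some are incomplete
    IN_PROGRESS = auto()  # all dependencies satisfied; executing
    COMPLETED = auto()
    ABORTED = auto()


class Sequencer:
    """Toy single-node request sequencer (illustrative only)."""

    def __init__(self):
        self.state = {}       # request id -> State
        self.deps = {}        # request id -> ids it explicitly depends on
        self.dependents = {}  # request id -> ids that explicitly depend on it

    def _known(self, rid):
        # A request first seen in a dependency list starts out as NotIssued.
        self.state.setdefault(rid, State.NOT_ISSUED)
        self.deps.setdefault(rid, set())
        self.dependents.setdefault(rid, set())

    def issue(self, rid, depends_on=()):
        self._known(rid)
        if self.state[rid] is not State.NOT_ISSUED:
            raise ValueError(f"request {rid!r} was already issued or aborted")
        for dep in depends_on:
            self._known(dep)
            self.deps[rid].add(dep)
            self.dependents[dep].add(rid)
        self.state[rid] = State.UNRESOLVED
        self._advance(rid)
        for waiter in list(self.dependents[rid]):
            self._advance(waiter)  # waiters may move Unresolved -> Resolved

    def _advance(self, rid):
        if self.state[rid] not in (State.UNRESOLVED, State.RESOLVED):
            return
        dep_states = [self.state[d] for d in self.deps[rid]]
        if State.ABORTED in dep_states:
            self.abort(rid)  # depending on an aborted request aborts this one
        elif all(s is State.COMPLETED for s in dep_states):
            self.state[rid] = State.IN_PROGRESS
        elif State.NOT_ISSUED in dep_states:
            self.state[rid] = State.UNRESOLVED
        else:
            self.state[rid] = State.RESOLVED

    def complete(self, rid):
        if self.state[rid] is not State.IN_PROGRESS:
            raise ValueError(f"request {rid!r} is not in progress")
        self.state[rid] = State.COMPLETED
        for waiter in list(self.dependents[rid]):
            self._advance(waiter)

    def abort(self, rid):
        if self.state[rid] in (State.ABORTED, State.COMPLETED):
            return
        self.state[rid] = State.ABORTED
        for waiter in list(self.dependents[rid]):
            self.abort(waiter)  # propagate transitively to explicit dependents
```

For instance, issuing a data write that names a not-yet-seen metadata write leaves the data write Unresolved; it becomes Resolved when the metadata write arrives and InProgress when that write completes. A request issued after one of its dependencies was aborted is itself aborted on arrival, which is the case the abort-state retention described in the text is meant to cover.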
Old request state has to be garbage-collected. We do this by requiring that the hosts number their requests sequentially and by keeping track of the oldest outstanding incomplete request from each host. When this request completes, the state of any older request can be deleted. Any request that depends on a request older than the oldest recorded one can consider that dependency immediately satisfied.

Aborts are an important exception to this mechanism, since a request that depends on an aborted request should itself be aborted, whenever the original request was aborted, even if this was some considerable time in the past. The simplest solution is to require that a host never issue a request that depends on one that has been aborted, but this would require an unnecessary serialization at the host. As a result, we decided to propagate aborts to other requests already in the TickerTAIP array. Unfortunately, this is not enough: there is a potential race condition between the request being aborted and the host being told about it, and the host ceasing to emit further requests that may depend on the aborted request. Our solution is to maintain state about aborted requests for a guaranteed minimum time (10 seconds in our prototype). (This is not ideal: in the presence of a large number of cascaded aborts, we may have to delay accepting new commands until the 10 seconds are up. However, we believe this situation is likely to be extremely rare in practice.) Similarly, a time-out on the NotIssued state can be used to detect errors such as a host that never issues a request for which other requests are waiting.

3.4.4 Sequencer Design Alternatives. We considered four designs for the sequencer mechanism:
(1) Fully centralized: a single, central sequencer manages the state table and its transitions. (A primary and a backup are used to eliminate a single point of failure.) In the absence of contention, each request suffers two additional round-trip message latency times: between the originator and the sequencer, and between the sequencer and its backup. One of these is not needed if the originator is co-located with the sequencer.
(2) Partially centralized: a centralized sequencer handles the state table until all the dependencies have been resolved, at which point the responsibility is shifted to the worker nodes involved in the request. This requires that the status of every request be sent to all the workers, to allow them to do the resolution of subsequent transitions. This has more concurrency, but requires a broadcast on every request completion.
(3) Originator driven: in place of a central sequencer, the originator nodes (since there will typically be fewer of these than the workers) conduct a distributed-consensus protocol to determine overlaps and sequence constraints, after which the partially centralized approach is used. This always generates more messages than the centralized schemes.
(4) Worker driven: the workers are responsible for all the states and their transitions. This widens the distributed-consensus protocol to every node in the array, and still requires the end-of-request broadcast.

Although the higher-numbered of the above designs may increase concurrency, they do so at the cost of increased message traffic and complexity.

We chose to implement the fully centralized model in our prototype, largely because of the complexity of the failure recovery protocols required for the alternatives. As expected, we measured the resulting latency to be that of two round-trip messages (i.e., 440 µs in our prototype) plus a few tens of microseconds of state table management. We believe this additional overhead acceptable for the benefits that request sequencing provides. Nonetheless, we made sequencing optional for those cases where it is not needed.

3.5 The RAIDmap
Previous sections have presented the policy issues; this one discusses an implementation technique we found useful. Our first design for sequencing and coordinating requests retained a great deal of centralized authority: the originator tried to coordinate the actions taking place at each of the different nodes (reads and writes, parity calculations). We soon found ourselves faced with the messy problem of coping with a complex set of interdependent actions taking place on multiple remote nodes, and coordinating these proved exceedingly complex, especially so when potential failure modes were taken into consideration.

Stripe   node 0       node 1       node 2       node 3       Type
0        unused       unused       2,0,3 data   -,0,parity   small stripe
1        4,1,2 data   5,1,2 data   -,1,parity   3,1,2 data   full stripe
2        unused       -,2,parity   6,2,1 data   7,2,1 data   large stripe

Fig. 7. RAIDmap example for a write request spanning logical blocks 2 through 7. Each cell in the figure represents a physical block on a disk, and contains a four-part tuple: the logical block number in the array; the physical block number on this disk (which equates to the stripe number if there is only one disk on each node); the node number to send parity data to; and a block type. The rightmost column is discussed in Section 3.2.1.
To avoid this complexity, we developed a new approach: rather than having the originator tell each worker what it had to do, and coordinate the stream of asynchronous events that resulted, we delegated management of its own work to each worker, and then coded everything to assume that all the nodes were doing what they were supposed to without any further prompting. So, once the workers are told about the original request (its starting point and length, and whether it is a read or a write) and given any data they need, they can proceed on their own. For example, if node A needs data from node B, A can rely on B to generate and ship the data to A with no further prompting. We call this approach collaborative execution; it is characterized by each node assuming that other nodes are already doing their part of the request. It proved to be an enormous simplification.
To orchestrate all the work, we developed a structure known as a RAIDmap: a two-dimensional array with an entry for each column (worker) and each stripe.³ Each worker builds its column of the RAIDmap, filling in the blanks as a function of the operation (read or write), the layout policy, and the execution policy. The layout policy determines where data and parity blocks are placed in the RAID array (e.g., mirroring, or RAID 4 or 5). The execution policy determines the algorithm used to service the request (e.g., where parity is to be calculated). A simplified RAIDmap is shown in Figure 7.

One component of the RAIDmap is a state table for each block (the states are described in Table V). It is this table that the layout and execution policies fill out. A request enters the states in order, leaving each one when the associated function has been completed, or immediately if the state is marked as "not needed" for this request. For example, a read request will enter state 1 (to read the data), skip through state 2 to state 3 (to send it to the originator), and then skip through the remaining states.

³ Although the idea of the RAIDmap is more simply described as if the full array was present, in practice there is no need to actually generate all the rows of the array for a very long request, since the inner full-stripe descriptions all look pretty much alike.

Table V. State-Space for each Block at a Worker Node

State   Function                                       Other information
1       Read old data                                  disk address
...
5       XOR incoming data with local old data/parity
6       Write new data or parity                       disk address

The RAIDmap proved to be a flexible mechanism, allowing us to test out several different policy alternatives (e.g., whether to calculate partial parity results locally, or whether to send the data to the originator or parity nodes).
Additionally, the same technique is used in failure mode: the RAIDmap indicates to each node how it is to behave, but now it is filled out in such a way as to reflect the failure mode operations. Finally, the RAIDmap simplified the configuring of a centralized implementation, using the same policies and assumptions as in the distributed case.

The goal of any RAID implementation is to maximize disk utilization and minimize request latency. To help achieve this, the RAIDmap computation is overlapped with other operations, such as moving data or accessing disks. For the same reason, workers send data needed elsewhere before servicing their own parity computations or local disk transfers.

It also proved important to optimize the disk accesses themselves. When we delayed writes in our prototype implementation until the parity data was available, throughput increased by 25–30% because the data and parity writes were coalesced together, reducing the number of disk seeks needed.

In order to implement the two-phase commit protocols described in Section 3.3, additional states were added to the worker and originator node state tables. The placement of these additional states determines the commit policy: for early commit, as soon as possible; for late commit, just before the writes.
3.6 Scheduling Disk Accesses

Unlike traditional centralized RAID designs, TickerTAIP provides request atomicity and sequencing to support multiple outstanding requests. As a result, more than one disk access request can be queued at a worker node at one time, which means that it is beneficial to consider more sophisticated request-scheduling policies inside the array (preserving the write-order invariant determined by the sequencing algorithms, of course). In theory, a worker node could use any of the algorithms proposed in the (fairly extensive) literature on disk scheduling. In practice, we are mostly interested in those that are inexpensive and yet give good performance. We report here on our experiments with four such algorithms:
- first come first served (FCFS): that is, no request reordering; this is what is implemented in the working prototype described below, and is used for the majority of the results we present;
- shortest seek time first (SSTF): the request that has the shortest seek time from the current disk head position is served first;
- shortest access time first (SATF): the request that has the shortest access time (seek time + rotation time) from the current disk head position is served first [Seltzer et al. 1990; Jacobson and Wilkes 1991];
- batched nearest neighbor (BNN): like SATF, except that requests are batched: each time it runs, the scheduler takes all the requests currently in the queue as a batch, and runs the SATF algorithm over them; it does not attempt to serve any new requests until the current batch is finished.

Among these algorithms, SATF generally gives the best throughput when applied to Unix system-like workloads, but can potentially starve requests. BNN remedies this at the cost of a small reduction in throughput. We found that scheduling improved both the throughput and average response time of requests. The improvement depended on the workload and load condition of the array, and (as expected) was largest under heavy loads. The results are reported in Section 4.7.

3.7 Memory Management

The main functionality issue we did not address explicitly in this work is memory buffer management at the originator nodes. In a real system, memory limitations would complicate some of the algorithms presented here. For example, additional flow control might be needed to ensure that the originator would not get swamped if the array was presented with many large requests. However, these costs will be small: by definition, they only show up if the requests are larger than would fit comfortably into the memory of an originator node, so the cost of the flow control will be largely hidden by the cost of moving the data.

Alternatively, the originator node might choose to break up very large requests into chunks with some maximum size. This approach is commonly used in disk drive controllers today; the main difference would be the use of much larger chunk sizes to ensure that the array could deliver data at close to its full potential bandwidth.
4. EVALUATING TICKERTAIP
This section presents the vehicles we used to evaluate the design choices and performance of the TickerTAIP architecture.
4.1 The Prototype

We first constructed a working prototype implementation of the TickerTAIP design, including all the fault tolerance features described above. The intent of this implementation was a functional testbed, to help ensure that we had made our design complete. (For example, we were able to test our fault recovery code by telling nodes to "die.") The prototype also let us measure path lengths and obtain early performance data.

We implemented the design on an array of seven storage nodes, each comprising a Parsytec MSC card with a T800 transputer, 4MB of RAM, and a local SCSI interface, connected to a local SCSI disk drive. The disks were spin-synchronized for these experiments. A stripe unit (the block-level interleave unit) was 4KB. Each node had a local HP97560 SCSI disk [Hewlett-Packard 1991] with the properties shown in Table VI.

Table VI. Characteristics of the HP97560 Disk Drive

property                  value
diameter                  5.25"
surfaces                  19 data, 1 servo
capacity                  1.3GB
track size                72 sectors
sector size               512 bytes
rotation speed            4002 RPM
disk transfer rate        2.2MB/s
SCSI bus transfer rate    5MB/s
controller overhead       1ms
track switch time         1.67ms
seek time (short seeks)   1.28 + 1.15√d ms
seek time (long seeks)    4.84 + 0.193√d + 0.00494d ms

The prototype was built in C to run on Helios [Perihelion 1991]: a small, lightweight operating system nucleus. We measured the one-way message latency for short messages between directly connected nodes to be 110 µs, and the peak internode bandwidth to be 1.6MB/s. Performance was limited by the relatively slow processors, and because the design of the Parsytec cards means that they cannot overlap computation and data transfer across their SCSI bus. Nevertheless, the prototype provided useful comparative performance data for design choices, and served as the calibration point of our simulator.

Our prototype comprised a total of 13.3k lines of code, including comments and test routines. About 12k lines of this was directly associated with the RAID functionality.

4.2 The Simulator

We also built a detailed event-driven simulator using the AT&T C++ tasking library [AT&T 1989]. This enabled us to explore the effects of changing link and processor speeds, and to experiment with larger configurations than our prototype. Our model encompassed the following components:
- Workloads: both fixed (all requests of a single type) and imitative (patterns that simulate existing workload patterns); we used a closed queueing model, and the method of independent replications [Pawlikowski 1990] to obtain steady-state measurements.
- Host: a collection of workloads sharing an access port to the TickerTAIP array; disk driver path lengths were estimated from measurements made on our local HP-UX systems.
- TickerTAIP nodes (workers and originators): code path lengths were derived from measurements of the algorithms running on the working prototype and our HP-UX workstations (we assumed that the Parsytec MSC limitations would not occur in a real design).
- Disk: we modeled the HP97560 disks as used on the prototype implementation, using data taken from measurements of the real hardware. The disk model was fairly detailed, and included:
  - the seek time profile from Table VI;
  - longer settling times for writes than reads (the disk can afford to be optimistic about head settling for reads, but not for writes);
  - track- and cylinder-skews, including track- and cylinder-switch times incurred during a data transfer;
  - rotation position; and
  - SCSI bus and controller overheads, including overlapped data transfers from the mechanism into a disk track buffer and transmissions across the SCSI bus (the granularity used was 4KB).
- Links: represent communication channels such as the small-area network and the SCSI buses. We report here data from a complete point-to-point interconnect design with a DMA engine per link, since this is both the simplest topology and the one from which it is easiest to extrapolate to the effects of other designs. Our preliminary studies suggest that similar results would be obtained from mesh-based switching fabrics. We did not assume multicast capabilities.
Under the same design choice and performance parameters, our simulation results agreed with the prototype (real) implementation within 3% most of the time, and always within 6%. This gave us confidence in the predictive abilities of the simulator.

The system we evaluate here is a RAID5 disk array with left-symmetric parity [Lee 1990] (the same data layout shown in Table II and Figure 7), stripes composed from a 4KB block on each disk, spin-synchronized disks, FIFO disk scheduling (except where noted), and without any data replication, spare blocks, floating parity, or indirect-writes for data or parity [English and Stepanov 1992; Menon and Kasson 1992]. The configuration simulated had 4 hosts and 11 worker nodes, with each worker node having a single HP97560 disk attached to it via a 5MB/s SCSI bus. Four of the nodes were both originators and workers; for simplicity, and since we were most concerned about exploring the effects of the internal design choices, we used only a single infinite-speed connection between each host and the array.

Except for the results in Section 4.7, the throughput numbers reported here were obtained when the system was driven to saturation; response times with only one request in the system at a time. For the throughput measurements we timed 10,000 requests in each run; for latency we timed 2000 requests. Each data point on a graph represents the average of two independent runs, with all relative standard deviations less than 1.59%, and most less than 0.5%. Each value in a table is the average of five such runs; the relative standard deviations are reported with each table. In Section 4.7, our throughput and response time numbers are means of 5 simulations, each consisting of 10,000 requests. Nearly all relative standard deviations for the data points in Section 4.7 are less than 1.0%, although a few (on the OLTP workload) were as high as 6.0%. In all cases, 100 requests were run to completion through the simulator before any measurements were taken, to minimize startup effects.

4.3 Read Performance

Table VII shows the performance of our simulated 11-disk array for random read requests across a range of link speeds. The data show no significant difference in throughput for any link speed above 1MB/s, but 10MB/s or more are needed to minimize request latencies for the larger transfers.

Table VII. Read performance for fixed-size workloads, with varying link speeds (all relative standard deviations were less than 2%)

Request size   throughput (MB/s)   latency (in ms) at link speed
                                   1MB/s    10MB/s        100MB/s
4KB            0.94                33       31            30
40KB           1.79                38       34            33
1MB            15.2                178      [illegible]   76
10MB           21.1                1520     610           520

4.4 Write Performance: An Exploration of the Design Alternatives

We first consider the effect of the large-stripe policy. Figure 8 shows the result: the difference is small, but enabling the large-write policy always resulted in a slight improvement in throughput at the expense of a slight increase in latency. We chose to enable the large-stripe mode for the remainder of the experiments.

Next, we compared the at-originator and at-parity policies for parity calculation. Figure 9 gives the results: at-parity is significantly better than at-originator, with the differences largest (as expected) at larger write sizes and with lower processor speeds. This is due to the at-parity algorithm spreading
the parity calculation across several processors more evenly, so we used it for the remainder of our experiments.

The effect of the late-commit protocol on performance is shown in Figure 10: the effect on throughput is small (< 2%), but the effect on response time is more marked, with the late commit point increasing response time by up to 20%. This is because the commit point is acting as a synchronization barrier, which prevents some of the usual overlaps between disk accesses and other operations. For example, a disk that is only doing writes for a request will not start seeking until the commit message is given. The delay that results could presumably be reduced by sending the disk a seek command to position its head while the parity computation was occurring, although we have not performed this experiment because the effect will only show up on an otherwise-idle disk array.

Although not shown, the performance of early commit is slightly better than that of late commit, but not as good as no commit. As a result, we recommend late commit as the preferred design choice: its throughput is almost as good as no commit protocol at all, and it is much easier to implement than early commit.

Fig. 8. Effect of enabling the large-stripe parity computation policy for writes larger than one half the stripe. [Panels: throughput and response time vs. request size (KB, log scale), at 10 MIPS and 10 MB/s; curves: with, without.]

Fig. 9. Effect of parity calculation policy on throughput and response times for 1MB random writes. [Panels: throughput and response time vs. link speed (MB/s) and CPU speed (MIPS); curves: at originator, at parity.]

4.5 Comparison with a Centralized RAID Array
How would a TickerTAIP array compare with a traditional centralized RAID array? This section answers that question. We simulated both the same 11-node TickerTAIP system as before and an 11-disk centralized RAID. The simulation components and algorithms used in the two cases were the same: our goal was to provide a direct comparison of the two architectures, uncontaminated by other factors. The centralized array was modeled as a single, dedicated originator node, together with a set of worker nodes that did read and write operations only when directed to do so. For amusement, we also provide data for a 10-disk nonfault-tolerant striping array implemented using the TickerTAIP architecture.

The results for 10MIPS processors with 10MB/s links are shown in Figure 11: clearly a nondisk bottleneck is limiting the throughput of the centralized system for request sizes larger than 256KB, and its response time for requests larger than 32KB. The obvious candidate is a CPU bottleneck from parity calculations, and this is indeed what we found. To show this, we plot performance as a function of CPU and link speed (Figure 12), and both varying together (Figure 13). These graphs show that changing the CPU speed has a marked effect on the performance of the centralized case for 1MB writes, but a much smaller effect on TickerTAIP. They also show that the TickerTAIP architecture is successfully exploiting load balancing to achieve similar (or better) throughput and response times with less powerful processors than the centralized architecture. For 1MB write requests, TickerTAIP's 5MIPS processors and 2–5MB/s links give comparable throughput to 25MIPS and 5MB/s for the centralized array. The centralized array needs a 50MIPS processor to get similar response times as TickerTAIP.

Finally, we looked at the effect of scaling the number of workers in the array, with both constant request size (400KB) and a varying one with a fixed amount of data per disk (ten full stripes, however large a stripe becomes). In these experiments, four of the worker nodes were also originators. The results are seen in Figure 14. With constant request size, the performance grows only slightly with larger number of disks. This is exactly as expected: as the number of disks increases, the fixed-size 400KB request touches a smaller fraction of the stripe size, so the disks get to do less useful work. On the other hand, the performance improvement shown as the request size is scaled up with the number of disks shows almost perfect linearity. (In practice, at some point the host links would become a bottleneck.) We believe these data are a strong vindication of our scalability claims for TickerTAIP.

Fig. 10. Effect of the late-commit protocol on write throughput and response time. [Panels: throughput and response time vs. request size (KB, log scale), at 10 MIPS and 10 MB/s; curves: commit, no commit.]

Fig. 11. Write throughput and response time for three different array architectures. [Panels: throughput and response time vs. request size (KB, log scale), CPU 10 MIPS, link 10 MB/s; curves: centralized, TickerTAIP, striping.]

Fig. 12. Throughput and response time as a function of both CPU and link speed. 1MB random writes. [Panels: throughput vs. CPU speed (link 100 MB/s); throughput and response time vs. link speed (CPU 100 MIPS); curves: centralized, TickerTAIP.]

4.6 Synthetic Workloads

The results reported so far have been from fixed, constant-sized workloads. To test our hypothesis that TickerTAIP performance would scale as well over some other workloads, we tested a number of additional workload mixtures, designed to model "real-world" applications:
- OLTP: based on the TPC-A database benchmark [Dietrich et al. 1992];
- timeshare: based on measurements of a local Unix timesharing system [Ruemmler and Wilkes 1993];
- scientific: based on measurements taken from supercomputer applications running on a Cray [Miller and Katz 1991]; "large" has a mean request size of about 0.3MB; "small" has a mean around 30KB.

Table VIII gives the throughputs of the disk arrays under these workloads for a range of processor and link speeds. As expected, TickerTAIP outperforms the centralized architecture at lower CPU speeds, although both are eventually able to drive the disks to saturation, mostly because the request sizes are quite small. TickerTAIP's availability is still higher, of course.

4.7 The Effect of Scheduling Individual Disk Accesses
Our previous results used the simplest possible request-scheduling algorithm, FCFS, at the disk device drivers in the worker nodes. In this section we explore the effects of changing this scheduling algorithm. Clearly, this will have little effect when the queue sizes seen at the disk are small, but our early experiments led us to believe that they can sometimes get quite large
(especially when operating near saturation), at which point a better scheduling algorithm is quite likely to produce a marked improvement in performance. This is indeed what we found.

To demonstrate this, Figure 15 shows the results of applying the SSTF, SATF, and BNN scheduling algorithms on workloads comprised of fixed-sized 40KB writes and 1MB random writes, as well as the synthetic OLTP workload. The graphs show that both SATF and BNN scheduling can nearly double the throughput for 40KB writes. The improvement is smaller where the gaps between the individual I/Os are larger, since the effect of improving the scheduling algorithm is then less noticeable. Similar effects are seen on the response time graphs. Our initial results suggest that both SATF and BNN are good candidates for scheduling algorithms. Currently, we prefer BNN because of its inherent starvation-resistant properties.

Fig. 13. Throughput and response time as a function of both CPU and link speeds. 1MB random writes.
Fig. 14. Effect of TickerTAIP array size on performance. Panels: throughput vs. array size with scaled write sizes (request size (n-1) x 40KB), and with 400KB writes; x-axis: number of nodes (n).
Table VIII. Throughputs, in MB/s, of the three array architectures under different workloads. Rows: OLTP, timeshare, small scientific, large scientific; columns: link speed (MB/s) and CPU speed (MIPS). The shading highlights comparable total-MIPS configurations. (Relative standard deviations are shown in parentheses.)
Fig. 15. Effect of different disk-level request-scheduling algorithms on TickerTAIP performance. The left-hand graphs display throughput as a function of load and scheduling policy, the right-hand ones the response time. Panels: (a), (b) fixed-size 40KB writes; (c), (d) fixed-size 1MB writes; (e), (f) synthetic OLTP workload. Policies plotted: SATF, BNN, SSTF, FCFS; x-axis: number of loads per host.
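The policies compared in this section differ in which queued request a disk driver services next. The sketch below contrasts only the two simplest ones, assuming FCFS means first-come-first-served and SSTF means shortest-seek-time-first, and using the distance in cylinders as a stand-in for seek time. SATF and BNN are not modeled, because they also depend on rotational position. The function names and the cylinder numbers are hypothetical and are not taken from the TickerTAIP prototype.

```python
def fcfs_order(head, queue):
    """FCFS: service requests in arrival order (head position is ignored)."""
    return list(queue)


def sstf_order(head, queue):
    """SSTF: repeatedly pick the pending request closest to the current head."""
    pending = list(queue)
    order = []
    position = head
    while pending:
        # Ties are broken in favor of the request that arrived first.
        nearest = min(pending, key=lambda cyl: abs(cyl - position))
        pending.remove(nearest)
        order.append(nearest)
        position = nearest
    return order


def total_seek_distance(head, order):
    """Sum of cylinder distances travelled when servicing `order` from `head`."""
    distance = 0
    position = head
    for cyl in order:
        distance += abs(cyl - position)
        position = cyl
    return distance


queue = [10, 90, 20, 80]  # hypothetical pending cylinder numbers
head = 50                 # hypothetical current head position
print(total_seek_distance(head, fcfs_order(head, queue)))  # 250
print(total_seek_distance(head, sstf_order(head, queue)))  # 120
```

With only one or two requests queued, both policies pick nearly the same order, which is consistent with the observation that scheduling matters most when the disk queues grow long. Plain SSTF can also postpone a distant request indefinitely under sustained load, which is the kind of starvation a starvation-resistant policy is meant to avoid.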
5. CONCLUSIONS

TickerTAIP is a new parallel architecture for RAID arrays. Our experience is that it is eminently practical (for example, our prototype implementation took only 12k lines of commented code). The TickerTAIP architecture exploits its physical redundancy to tolerate any single point of failure, including single failures in its distributed controller; it is scalable across a range of system sizes with smooth incremental growth; and its worker/originator node model provides considerable configuration flexibility. Further, we showed how to provide (and prototyped) partial-write ordering to support clean semantics in the face of multiple outstanding requests and multiple faults, such as power failures. We have also demonstrated that, at least in this application, eleven 5MIPS processors are just as good as a single 50MIPS one, and provided quantitative data on how central and parallel RAID array implementations compare. Finally, we have demonstrated the performance improvements available from more sophisticated request-scheduling algorithms.

Most of the performance differences between the TickerTAIP and centralized designs result from the cost of doing parity calculations, and this turns out to be the main thing that changes with the processor speed: most of the other CPU-intensive work is hidden by disk delays. One suggestion that has been made is to improve the performance of these XOR calculations in the centralized case by constructing dedicated XOR engines. Unfortunately, it seems that the resulting systems can be unwieldy and their performance lackluster, in part because of the high memory speeds that they have to operate at. Because the cost of a processor system increases faster than linearly with its performance (much of the cost is in the memory system rather than the processor itself), tackling the XOR-speed problem this way is unproductive. It is our contention that off-the-shelf microprocessors are, in fact, cost-effective XOR engines, and they bring with them all their advantages of economies of scale in manufacturing cost, design time, and reliability. Thus it is better, we believe, to use an approach like TickerTAIP to divide up and parallelize the work to the point where such microprocessors can be used.

With small request sizes, it is easy for either architecture to saturate the disks. With larger requests, the difference becomes more marked as parity calculations become more significant. The difference is also likely to increase as multiple disks are added to each worker, and as the cost of performing smarter disk-scheduling algorithms is included. Both improvements are obvious upgrade paths for TickerTAIP (indeed, we have demonstrated one of them here); both will make the TickerTAIP architecture even more attractive than the centralized model.

We recommend the TickerTAIP parallel RAID architecture to future disk array implementers. Additionally, the TickerTAIP architecture is well suited for use in multicomputers with locally attached disks. In this case, it can provide RAID resilience without any dedicated or specialized hardware beyond that already provided for the multicomputer itself.
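The parity work discussed above is a bytewise XOR over the data blocks of a stripe, and XOR is associative and commutative, so the work can be divided among several processors and the partial results combined afterwards. The sketch below illustrates only that property. It is not the TickerTAIP parity algorithm; the function names, the round-robin split, and the tiny 4-byte blocks are assumptions made for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a non-empty list of equal-length blocks."""
    def xor_pair(a, b):
        return bytes(x ^ y for x, y in zip(a, b))
    return reduce(xor_pair, blocks)


def parity_split_across_workers(data_blocks, n_workers):
    """Compute stripe parity by XORing per-worker partial parities.

    Each hypothetical worker XORs its own share of the data blocks; the
    partial results are then XORed together to give the stripe parity.
    """
    shares = [data_blocks[i::n_workers] for i in range(n_workers)]
    partials = [xor_blocks(share) for share in shares if share]
    return xor_blocks(partials)


# Hypothetical 4-byte data blocks standing in for the blocks of one stripe.
data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]

parity = xor_blocks(data)                                  # single processor
split_parity = parity_split_across_workers(data, n_workers=2)

# A lost block is rebuilt by XORing the parity with the surviving blocks.
recovered = xor_blocks([parity, data[0], data[2]])
print(parity == split_parity, recovered == data[1])
```

Because the combined result matches the single-processor one, several slow processors can share the parity work that would otherwise need one fast processor or a dedicated XOR engine.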
ACKNOWLEDGMENTS

The TickerTAIP work was done as part of the DataMesh research project at Hewlett-Packard Laboratories [Wilkes 1992]. The prototype implementation is loosely based on a centralized version written by Janet Wiener, and uses the SCSI disk driver developed by Chia Chao. David Jacobson improved the AT&T tasking library to use a double for its time value. Federico Malucelli provided significant input into our understanding of the sequencing and request-scheduling options. Chris Ruemmler helped us improve our disk models.

We also thank the IEEE for allowing us permission to publish this and the ACM anonymous reviewers for helping us to improve it. Finally, whence the name? Because tickerTAIP is used in all the best pa(rallel)RAIDs!
REFERENCES

AT&T. 1989. UNIX System V AT&T C++ language system release 2.0 selected readings. Select Code 307-144. AT&T, Indianapolis, Ind.
BARTLETT, J., BARTLETT, W., CARR, R., GARCIA, D., GRAY, J., HORST, R., JARDINE, R., LENOSKI, D., AND MCGUIRE, D. 1990. Fault tolerance in Tandem computer systems. Tech. Rep. 90.5, Tandem Computers, Cupertino, Calif.
BORAL, H. 1988. Parallelism and data management. Tech. Rep. ACA-ST-156-88, Microelectronics and Computer Technology Corporation, Austin, Tex.
CHEN, P. M., GIBSON, G. A., KATZ, R. H., AND PATTERSON, D. A. 1990. An evaluation of redundant arrays of disks using an Amdahl 5890. In Proceedings of ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. ACM, New York, 74-85.
CLARK, B. E., LAWLOR, F. D., SCHMIDT-STUMPF, W. E., STEWART, T. J., AND TIMMS, G. D. JR. 1988. Parity spreading to enhance storage access. U.S. Patent 4,761,785; filed 12 June 1986; granted 2 August 1988.
COPELAND, G., ALEXANDER, W., BOUGHTER, E., AND KELLER, T. 1988. Data placement in Bubba. In Proceedings of 1988 SIGMOD International Conference on Management of Data. ACM, New York, 99-108.
DEWITT, D. J., GERBER, R. H., GRAEFE, G., HEYTENS, M. L., KUMAR, K. B., AND MURALIKRISHNA, M. 1986. GAMMA-a high performance dataflow database machine. In Proceedings of the 12th International Conference on Very Large Data Bases. VLDB Endowment, 228-237.
DEWITT, D. J., GHANDEHARIZADEH, S., AND SCHNEIDER, D. 1988. A performance analysis of the Gamma database machine. In Proceedings of the 1988 SIGMOD International Conference on Management of Data. ACM, New York, 350-360.
DIETRICH, S. W., BROWN, M., CORTES-RELLO, E., AND WUNDERLIN, S. 1992. A practitioner’s introduction to database performance benchmarks and measurements. Comput. J. 35, 4 (Aug.), 322-331.
DRAPEAU, A. L., SHIRRIFF, K. W., HARTMAN, J. H., MILLER, E. L., SESHAN, S., KATZ, R. H., LUTZ, K., PATTERSON, D. A., LEE, E. K., CHEN, P. M., AND GIBSON, G. A. 1994. RAID-II: A high-bandwidth network file server. In Proceedings of 21st International Symposium on Computer Architecture. IEEE, New York, 234-244.
DUNPHY, R. H. JR., WALSH, R., AND BOWERS, J. H. 1990. Disk drive memory. U.S. Patent 4,914,656; filed 28 June 1988; granted 3 April 1990.
ENGLISH, R. M. AND STEPANOV, A. A. 1992. Loge: A self-organizing storage device. In Proceedings of USENIX Winter’92 Technical Conference. USENIX Assoc., Berkeley, Calif., 237-251.
GIBSON, G. A., HELLERSTEIN, L., KARP, R. M., KATZ, R. H., AND PATTERSON, D. A. 1989. Failure correction techniques for large disk arrays. In Proceedings of 3rd International Conference on Architectural Support for Programming Languages and Operating Systems. Oper. Syst. Rev. 23 (Apr.), 123-132.
GRAY, J. 1988. A comparison of the Byzantine agreement problem and the transaction commit problem. Tech. Rep. 88.6, Tandem Computers, Cupertino, Calif.
GRAY, J. N. 1978. Notes on data base operating systems. In Operating Systems: An Advanced Course. Lecture Notes in Computer Science, vol. 60. Springer-Verlag, Berlin, 393-481.
GRAY, J., HORST, B., AND WALKER, M. 1990. Parity striping of disc arrays: Low-cost reliable storage with acceptable throughput. In Proceedings of 16th International Conference on Very Large Data Bases. VLDB Endowment, 148-159.
HEWLETT-PACKARD. 1991. HP 97556, 97558, and 97560 5.25-inch SCSI Disk Drives: Technical Reference Manual. Part No. 5960-0115. Hewlett-Packard Company, Boise, Idaho.
HEWLETT-PACKARD. 1988b. HP 7936 and HP 7937 Disc Drives Operating and Installation Manual. Part No. 07937-90902. Hewlett-Packard Company, Boise, Idaho.
HOLLAND, M. AND GIBSON, G. A. 1992. Parity declustering for continuous operation in redundant disk arrays. In Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems. Comput. Arch. News 20, 23-35.
JACOBSON, D. M. AND WILKES, J. 1991. Disk scheduling algorithms based on rotational position. Tech. Rep. HPL-CSP-91-7, Hewlett-Packard Laboratories, Palo Alto, Calif.
LAMPSON, B. W. AND STURGIS, H. E. 1981. Atomic transactions. In Distributed Systems - Architecture and Implementation: An Advanced Course. Lecture Notes in Computer Science, vol. 105. Springer-Verlag, New York, 246-265.
LAWLOR, F. D. 1981. Efficient mass storage parity recovery mechanism. IBM Tech. Disclos. Bull. 24, 2 (July), 986-987.
LEE, E. K. 1990. Software and performance issues in the implementation of a RAID prototype. Tech. Rep. UCB/CSD 90/573. Computer Science Div., Dept. of Electrical Engineering and Computer Science, Univ. of California, Berkeley, Calif.
MCKUSICK, M. K., JOY, W. N., LEFFLER, S. J., AND FABRY, R. S. 1984. A fast file system for UNIX. ACM Trans. Comput. Syst. 2, 3 (Aug.), 181-197.
MENON, J. AND COURTNEY, J. 1993. The architecture of a fault-tolerant cached RAID controller. In Proceedings of 20th International Symposium on Computer Architecture. IEEE, New York, 76-86.
MENON, J. AND KASSON, J. 1992. Methods for improved update performance of disk arrays. In Proceedings of 25th International Conference on System Sciences. Vol. 1. IEEE, New York, 74-83.
MILLER, E. L. AND KATZ, R. H. 1991. Analyzing the I/O behavior of supercomputer applications. In Digest of Papers, 11th IEEE Symposium on Mass Storage Systems. IEEE, New York, 51-59.
MUNTZ, R. R. AND LUI, J. C. S. 1990. Performance analysis of disk arrays under failure. In Proceedings of 16th International Conference on Very Large Data Bases. VLDB Endowment, 162-173.
PATTERSON, D. A., CHEN, P., GIBSON, G., AND KATZ, R. H. 1989. Introduction to redundant arrays of inexpensive disks (RAID). In Spring COMPCON’89. IEEE, New York, 112-117.
PATTERSON, D. A., GIBSON, G., AND KATZ, R. H. 1988. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of 1988 SIGMOD International Conference on Management of Data. ACM, New York.
PAWLIKOWSKI, K. 1990. Steady-state simulation of queueing processes: A survey of problems and solutions. ACM Comput. Surv. 22, 2 (June), 123-170.
PCI. 1994. PCI Specification. Intel Corporation, Hillsboro, Or.
PERIHELION SOFTWARE. 1991. The Helios Parallel Operating System. Prentice-Hall International, London.
RUEMMLER, C. AND WILKES, J. 1993. UNIX disk access patterns. In Proceedings of Winter 1993 USENIX. USENIX Assoc., Berkeley, Calif., 405-420.
SCHULZE, M. E. 1988. Considerations in the design of a RAID prototype. Tech. Rep. UCB/CSD 88/448. Computer Science Div., Dept. of Electrical Engineering and Computer Science, Univ. of California, Berkeley, Calif.
SCHULZE, M., GIBSON, G., KATZ, R., AND PATTERSON, D. 1989. How reliable is a RAID? In Spring COMPCON’89. IEEE, New York, 118-123.
SCSI. 1991. Draft proposed American National Standard for information systems - Small Computer System Interface-2 (SCSI-2). Draft ANSI standard X3T9.2/86-109, 2 February 1991 (revision 10d). Secretariat, Computer and Business Equipment Manufacturers Association.
SELTZER, M., CHEN, P., AND OUSTERHOUT, J. 1990. Disk scheduling revisited. In Proceedings of Winter 1990 USENIX Conference. USENIX Assoc., Berkeley, Calif., 313-323.
SHIN, K. G. 1991. HARTS: A distributed real-time architecture. IEEE Comput. 24, 5 (May), 25-35.
SIEWIOREK, D. P. AND SWARZ, R. S. 1992. Reliable Computer Systems: Design and Evaluation. 2nd ed. Digital Press, Bedford, Mass.
SLOAN, R. D. 1992. A practical implementation of the database machine - Teradata DBC/1012. In Proceedings of 25th International Conference on System Sciences. Vol. 1. IEEE, New York, 320-327.
STONEBRAKER, M. 1989. Distributed RAID - a new multiple copy algorithm. Tech. Rep. UCB/ERL M89/56, Electronics Research Lab., Univ. of California, Berkeley, Calif.
WILKES, J. 1991. The DataMesh research project. In Transputing’91, Vol. 2. IOS Press, Amsterdam, 547-553.
WILKES, J. 1992. DataMesh research project, phase 1. In USENIX Workshop on File Systems. USENIX Assoc., Berkeley, Calif., 63-69.
Received October 1993; revised May 1994; accepted June 1994

ACM Transactions on Computer Systems, Vol. 12, No. 3, August 1994.