The TickerTAIP Parallel RAID Architecture

PEI CAO, SWEE BOON LIM, SHIVAKUMAR VENKATARAMAN, and JOHN WILKES
Hewlett-Packard Laboratories

Traditional disk arrays have a centralized architecture, with a single controller through which all requests flow. Such a controller is a single point of failure, and its performance limits the maximum number of disks to which the array can scale. We describe TickerTAIP, a parallel architecture for disk arrays that distributes the controller functions across several loosely coupled processors. The result is better scalability, fault tolerance, and flexibility.

This article presents the TickerTAIP architecture and an evaluation of its behavior. We demonstrate the feasibility by a working example, describe a family of distributed algorithms for calculating RAID parity, discuss techniques for establishing request atomicity, sequencing, and recovery, and evaluate the performance of the TickerTAIP design in both absolute terms and by comparison to a centralized RAID implementation. We also analyze the effects of including disk-level request-scheduling algorithms inside the array. We conclude that the TickerTAIP architectural approach is feasible, useful, and effective.

Categories and Subject Descriptors: B.4.2 [Input/Output and Data Communications]: Input/Output Devices - channels and controllers; D.1.3 [Programming Techniques]: Concurrent Programming - parallel programming; D.4.2 [Operating Systems]: Storage Management - secondary storage; D.4.7 [Operating Systems]: Organization and Design - distributed systems

General Terms: Algorithms, Design, Performance, Reliability

Additional Key Words and Phrases: Decentralized parity calculation, disk scheduling, distributed controller, fault tolerance, parallel controller, performance simulation, RAID disk array

1. INTRODUCTION

A disk array is a structure that connects several disks together to extend the cost, power, and space advantages of small disks to higher-capacity configurations.
By providing partial redundancy such as parity, availability can be increased as well.

Table I. Some Common RAID Levels

Level  Redundancy technique             Placement of redundant data
0      striping (no redundancy)         none
1      mirroring                        complete disks (primary and secondary copies of the data blocks)
3      parity across a stripe of data   one disk dedicated to parity
5      parity across a stripe of data   parity rotates round-robin across all disks

Fig. 1. Traditional RAID array architecture.

An earlier version of this article was presented at the 1993 International Symposium on Computer Architecture.
Authors' addresses: P. Cao, Princeton University, Department of Computer Science, Princeton, NJ 08540; email: pc@cs.princeton.edu; S. B. Lim, University of Illinois, Department of Computer Science, 1304 W. Springfield Avenue, Urbana, IL 61801; email: sblim@cs.uiuc.edu; S. Venkataraman, University of Wisconsin, Department of Computer Science, 1210 West Dayton Street, Madison, WI 53706; email: venkatar@cs.wisc.edu; J. Wilkes, Hewlett-Packard Laboratories, PO Box 10490, 1U13, Palo Alto, CA 94304-0969; email: wilkes@hpl.hp.com.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
(c) 1994 ACM 0734-2071/94/0800-0236 $03.50
ACM Transactions on Computer Systems, Vol. 12, No. 3, August 1994, Pages 236-269.
Such RAIDs (for Redundant Arrays of Inexpensive Disks) were first described in the early 1980s [Lawlor 1981; Park and Balasubramanian 1986], and popularized by the work of a group at UC Berkeley [Patterson et al. 1988; 1989]. The RAID terminology encompasses a number of different levels, corresponding to different amounts of redundancy and placement of the redundant data. The most commonly encountered of these are summarized in Table I.

To implement a disk array that provides one or more of the RAID levels, the traditional RAID array architecture, shown in Figure 1, has a central controller, one or more disks, and multiple head-of-string disk interfaces. The RAID controller interfaces to the host, processes read and write requests, and carries out parity calculations, block placement, and data recovery after a failure. The disk interfaces pass on the commands from the controller to the disks via a disk interconnect of some sort: these days, most often a variety of the Small Computer System Interface, or SCSI bus [SCSI 1991].

Fig. 2. TickerTAIP array architecture.

Obviously, the capabilities of the RAID controller are crucial to the performance and availability of the system. If the controller's bandwidth, processing power, and capacity are inadequate, the performance of the array as a whole will suffer. (This is increasingly likely to happen: for example, parity calculation is memory bound, and memory speeds have not kept pace with recent CPU performance improvements [Ousterhout 1990].) A high latency through the controller can reduce the performance of small requests. The single point of failure that the controller represents can also be a concern: failure rates for disk drives and packaged electronics are now similar, and one of the primary motivations for RAID arrays is to survive the failure rates that result from having many disks in a system.

Although some commercial RAID array products include spare RAID controllers, they are not normally simultaneously active: one typically acts as a backup for the other, and is held in reserve until it is needed because of failure of the primary. (For example, this technique was suggested in Gray et al. [1990].) This is expensive: the backup has to have all the capacity of the primary controller, but it provides no useful services in normal operation. Alternatively, both controllers are active simultaneously, but over disjoint sets of disks. This limits the performance available from the array, even though the controllers can be fully utilized.

To address these concerns, we have developed the TickerTAIP architecture for parallel RAIDs. In this architecture (Figure 2), there is no central controller: it has been replaced by a cooperating set of array controller nodes that together provide all the functions needed, operating in parallel. The TickerTAIP architecture offers several benefits, including: fault tolerance (no central controller to break), performance scalability (no central bottleneck), smooth incremental growth (by simply adding another node), and flexibility (it is easy to mix and match components).

This article provides an evaluation of the TickerTAIP architecture. Its main emphasis is on the techniques used to provide parallel, distributed controller functions and their effectiveness.
1.1 Outline

We begin this article by presenting an overview of the TickerTAIP architecture and related work, and follow it with a detailed description of several design issues, including descriptions and evaluations of algorithms for parity calculation, recovery from controller failure, and extensions to provide sequencing of concurrent requests from multiple hosts.

To evaluate TickerTAIP we constructed a working prototype as a functional testbed, and then built a detailed event-based simulation that we calibrated against this prototype. These tools are presented as background material for the simulation-based performance analysis of TickerTAIP that follows, with particular emphasis on comparing it against a centralized RAID implementation. We conclude with a summary of our results.

2. THE TICKERTAIP ARCHITECTURE

A TickerTAIP array is composed of a number of worker nodes, which are nodes with one or more local disks connected through a bus. Originator nodes provide connections to host computer clients. The nodes are connected to one another by a high-performance, small-area network with sufficient internal redundancy to survive single failures. Mesh-based switching fabrics can meet the bandwidth, latency, and availability needs with reasonable costs and complexity, and scale across a reasonable range of array sizes. A design that would achieve the performance, scalability, and fault-tolerance needs of a TickerTAIP array is described in Wilkes [1991]. A similar scheme has been described in Shin [1991]. For smaller arrays, the interconnect could even be a pair of backplanes. (For example, PCI is capable of running at well over 100MB/s [PCI 1994], which would support 15-20 disks for bandwidth-limited applications, or many hundred if the workload was small, random I/Os.) Multiple, independent arrays will become more cost effective at some sufficiently large scale: no interconnect scales perfectly. However, TickerTAIP's requirements on the switching fabric are relatively light, and this point will probably only be reached with arrays that are so large that such a split will probably be desirable for other reasons.

In Figure 2, the nodes are shown as being both workers and originators: that is, they have both host and disk connections. As a result, designing a node requires that both the host interface and the disk interface be designed together, and adding a disk node requires paying for another host connection. A second design that avoids these problems is shown in Figure 3. It uses separate disk-controller (worker) nodes and host-interface (originator) nodes. This allows arbitrary mixing and matching of node types (such as SCSI-originator, FDDI-originator, IPI-worker, SCSI-worker), which makes building a TickerTAIP array with several different kinds of host interface simply a configuration-time question, not a design-time one. Since each node is plug compatible from the point of view of the internal interconnect, it is easy to configure an array with any desired ratio of worker and originator nodes, a flexibility less easily achieved in the traditional centralized architecture.

Fig. 3. TickerTAIP system environment.

Figure 3 shows the environment in which we envision a TickerTAIP array operating. The array provides disk services to one or more host computers through the originator nodes. There may be several originator nodes, each connected to a different host; alternatively, a single host can be connected to multiple originators for higher performance and greater failure resilience. For simplicity, we require that all data for a request be returned to the host along the path used to issue the request.

In the context of this model, a traditional RAID array looks like a TickerTAIP array with several unintelligent worker nodes, a single originator node on which all the parity calculations take place, and shared-memory communication between the components.

One assumption we made in this study is that parity calculation is a driving factor in determining the performance of a RAID array. The TickerTAIP architecture does these calculations in a decentralized fashion; other high-speed array controller designs (e.g., RAID-II [Drapeau et al. 1994]) use a central parity calculation engine. Our approach is predicated on two beliefs: (1) that processors are cost-effective engines for calculating parity and (2) that memory bandwidth, rather than processor cycles, is the determining cost factor in providing this functionality. (By way of example, the bandwidth and functionality requirements of the RAID-II engine required a controller card nearly two feet on a side.) The TickerTAIP architecture reduces the per-processor parity calculation requirements sufficiently far that the cheap commodity microprocessors it uses for the control functions can also be used as the parity calculation engines. At this point, the diseconomies of scale associated with providing high-bandwidth data paths to a hardware parity calculation engine will overwhelm any intrinsic simplicity in the use of specialized logic to perform the exclusive-OR calculation. As the performance of commodity microprocessors continues to improve at its current rate, the range of array sizes over which this argument holds will only increase.

2.1 Related Work

Many papers have been published on RAID reliability, performance, and on design variations for parity placement and recovery schemes [Clark et al. 1988; Dunphy et al. 1990; Gibson et al. 1989; Gray et al. 1990; Holland and Gibson 1992; Lee 1990; Menon and Kasson 1992; Muntz and Lui 1990; Schulze et al. 1989].
Our work builds on these studies: we concentrate here on the architectural issues of parallelizing the techniques used in a centralized RAID array, so we take such work as a given, and assume familiarity with basic RAID concepts in the following discussion.

The HP7937 family of disks realizes a physical architecture similar to that of TickerTAIP [Hewlett-Packard 1988]. These disks can be connected together by a 10MB/s bus, which allows access to "remote" disks as well as fast switch-over between attached hosts in the event of system failure. No multidisk functions (such as a disk array) were provided, however.

Several "shared-nothing" database systems use a hardware architecture similar to that adopted by TickerTAIP, including Bubba [Boral 1988; Copeland et al. 1988], Gamma [DeWitt et al. 1986; 1988], Teradata [Neches 1984; Sloan 1992], and Tandem [Bartlett et al. 1990; Siewiorek and Swarz 1992]. However, none appears to use a distributed RAID implementation across multiple nodes, and all are intended as database engines rather than parallel implementations of RAID. On the other hand, TickerTAIP makes extensive use of well-known techniques such as two-phase commit and partial-write ordering from the database community [Gray 1978].

A proposal was made to connect networks of processors to form a widely distributed RAID controller in Stonebraker [1989]. This approach was called RADD: Redundant Arrays of Distributed Disks. It proposed using disks spread across a wide-area network to improve availability in the face of a site failure. In contrast to the RADD study, we emphasize the use of parallelism inside a single RAID server; we assume the kind of fast, reliable interconnect that is easily constructed inside a single-server cabinet; we couple processors and disks closely, so that a node failure is treated as (one or more) disk failures; and we provide much improved performance analyses (Stonebraker used "all disk operations take 30 ms"). The result is a new, detailed characterization of the parallel RAID design approach in a significantly different environment.

3. DESIGN ISSUES

This section describes the TickerTAIP design issues in some detail. It begins with an examination of normal mode operation (i.e., in the absence of faults) and then examines the support needed to cope with failures. Table II may prove helpful in understanding the data layout used for the RAID 5 array we are describing.

Table II. Data Layout for a 5-Disk Left-Symmetric RAID 5 Array [Lee 1990]

[Table body: logical block numbers laid out by stripe across the five disks.]

Each column represents a disk. The shaded, outlined area represents one possible request that spans most of stripe 1 (a "large stripe"), all of stripes 2 and 3 ("full stripes"), and a small amount of stripe 4 (a "small stripe"). Parity blocks have darker shading and are marked with a P.

3.1 Normal-Mode Reads

In normal mode, no parity computation is required for reads, so they are quite straightforward. All the necessary data is read at the workers and forwarded to the originator, where it is assembled, and transmitted to the host in the correct order.
The main performance issue that arises has to do with skipping over the parity blocks: we found it beneficial to perform sequential reads of both data and parity, and then to discard the parity blocks inside the worker nodes, rather than to generate separate requests that omitted reading the parity blocks.

3.2 Normal-Mode Writes

In a RAID array, writes require calculation or modification of stored parity to maintain the partial data redundancy. Each stripe is considered separately in determining the method and site for parity computation, since this is the unit across which the redundancy is maintained. The discussion that follows describes the algorithms executed on each of the stripes that a single request spans.

3.2.1 How to Calculate New Parity. The first design choice is how to calculate the new parity. There are three alternatives, depending upon how much of the stripe is being updated (Figure 4):

- full stripe: all of the data blocks in the stripe have to be written, and parity can be calculated entirely from the new data;
- small stripe: less than half of the data blocks in a stripe are to be written, and parity is calculated by first reading the old data of the blocks that will be written, XORing them with the new data, and then XORing the results with the old parity block data;
- large stripe: more than half of the data blocks in the stripe are to be written; the new parity block can be computed by reading the data blocks in the stripe that are not being written and XORing them with the new data (i.e., reducing this to the full-stripe case) [Chen et al. 1990].

Fig. 4. Three different stripe update size policies: (a) full stripe; (b) small stripe; (c) large stripe. X indicates where parity calculations occur.
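The three update policies above can be sketched in a few lines. This is only an illustration of the XOR arithmetic, not the authors' implementation: the function names are hypothetical, blocks are modeled as equal-length `bytes` objects, and the handling of a write that touches exactly half of the data blocks (a case the text leaves open) is an assumption made here.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Bytewise XOR of one or more equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_parity(new_blocks):
    """Full stripe: parity comes entirely from the new data."""
    return xor_blocks(*new_blocks)


def small_stripe_parity(old_parity, old_blocks, new_blocks):
    """Small stripe: XOR old data with new data, then fold into the old parity."""
    parity = old_parity
    for old, new in zip(old_blocks, new_blocks):
        parity = xor_blocks(parity, old, new)
    return parity


def large_stripe_parity(unwritten_blocks, new_blocks):
    """Large stripe: read the blocks NOT being written, then treat as full stripe."""
    return xor_blocks(*unwritten_blocks, *new_blocks)


def choose_policy(blocks_written, data_blocks_per_stripe):
    """Pick a policy from the fraction of the stripe being written."""
    if blocks_written == data_blocks_per_stripe:
        return "full"
    if 2 * blocks_written < data_blocks_per_stripe:
        return "small"
    # Assumption: exactly half is treated as "large"; the text only
    # defines "less than half" and "more than half".
    return "large"
```

Because XOR is associative and self-inverse, the small-stripe and large-stripe computations produce the same parity block for the same write; they differ only in which old blocks have to be read. For example, updating three of four data blocks gives the same result from `small_stripe_parity` (three old blocks plus the old parity read) as from `large_stripe_parity` (one unwritten block read).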
Notice that a single request might span all three kinds of stripes, although all but the first and last stripe will always be full ones. The large-stripe mode is just a (potential) performance optimization, since the right behavior can be obtained from using just the small- and full-stripe cases. We discuss whether it is beneficial in practice later.

3.2.2 Where to Calculate New Parity. The second design consideration is where the parity is to be calculated. Traditional centralized RAID architectures calculate all parity at the originator node, since only it has the necessary processing capability. In TickerTAIP, every node has a processor, so there are several choices. The key design goal is to load balance the work among the nodes, in particular, to spread out the parity calculations over as many nodes as possible. Here are three possibilities (shown in Figure 5):

- at-originator: all parity calculations are done at the originator;
- solely-parity: all parity calculations for a stripe take place at the parity node for that stripe;
- at-parity: same as for solely-parity, except that partial results during a small-stripe write are calculated at the worker nodes and shipped to the parity node.

Fig. 5. Three different places to calculate parity: (a) at originator; (b) solely parity; (c) at parity. @ indicates the originator node; X indicates the node where parity calculations occur.

The solely-parity scheme always uses more messages than the at-parity one, so we did not pursue it further. We provide performance comparisons between the other two later in the article.

3.3 Single Failures: Request Atomicity

We begin with a discussion of single-point failures.
Notice that a primary goal of a regular RAID array is to survive single-disk failures.[1] The TickerTAIP architecture extends this to include failure of a part of its distributed controller: we do not make the simplifying assumption that the controller is not a possible failure point. The way in which TickerTAIP is intended to be used provides duplex paths to its host (see Figure 3), and, since there are several techniques for doing so, we have legislated that the internal interconnect fabric is itself single-fault resilient. As a result the overall architecture is capable of surviving a fault in any single system component.

[1] There are variants on the parity calculation scheme that can compensate for multiple disk failures [Gibson et al. 1989]. The extension of the TickerTAIP architecture to cover these cases is straightforward, and not discussed further here.

However, there are certain requirements on the software algorithms used at the nodes in a TickerTAIP system to ensure correct operation in the presence of faults. This section discusses the first of them: the need to provide request atomicity.

Just as with a regular RAID array, packaging and power-supply issues are very important if the system availability is to be maximized. Some of these decisions are discussed in Schulze [1988] and Schulze et al. [1989]; the design approach used for these questions in a TickerTAIP-based array is identical to that used for a regular disk array.

3.3.1 Disk Failure. In TickerTAIP, a disk failure is treated in just the same way as in a traditional RAID array: the array continues operation in degraded (failed) mode until the disk is repaired or replaced; the contents of the new disk are reconstructed; and execution resumes in normal mode. From the outside of the array, the effect is as if nothing has happened. Inside, appropriate data reconstructions occur on reads, and I/O operations to the failed disk are suppressed. Exactly the same algorithm is used if an entire disk string goes bad for some reason. The algorithms are summarized in Table III.

Table III. Algorithms used to perform a write in failure mode, as a function of the kind of block being written to, and the amount of the stripe being updated

                     Block type on failed disk
Physical stripe size parity   not updated            updated
small                none     small-stripe strategy  large-stripe strategy
large                none     small-stripe strategy  large-stripe strategy
full                 none     -                      full-stripe strategy

3.3.2 Worker Failure. A TickerTAIP worker failure is treated just like a disk failure, and is masked in just the same way. (Just as with a regular RAID controller with multiple disks per head-of-string controller, a failing worker means that an entire column of disks is lost at once, but the same recovery algorithms apply.) We assume fail-silent nodes so that we can significantly simplify the fault-isolation and normal-case protocols we use. In practice, the isolation offered by the networking protocols used to communicate between the nodes is likely to make this assumption realistic for all but the most extreme cases, for which RAID arrays are probably not appropriate choices. (In support of our position, Gray [1988] explains that the complexities of handling the more complicated Byzantine failure modes are rarely deemed worthwhile in practice.)

A node is suspected to have failed if it does not respond to an "are you alive" request within a reasonable time. (This is the only place that such time-outs occur, to simplify the maintenance of other portions of the system.) The node that detects a failure of another node initiates a distributed consensus protocol much like two-phase commit, taking the role of coordinator of the consensus protocol. All the remaining nodes reach agreement by this means on the number and identity of the failed node(s). This protocol ensures that all the remaining nodes enter failure mode at the same time. Multiple failures cause the array to shut itself down safely to prevent possible data corruption.

3.3.3 Originator Failure and Request Atomicity. Failure of a node with an originator on it brings new concerns: a channel to a host is lost; any worker on the same node will be lost as well; and the fate of requests that arrived through this node needs to be determined since the failed originator was responsible for coordinating their execution.

Originator failures during reads are fairly simple: the read operation is aborted since there is no longer a route to communicate its results back to the host. Failures during write operations are more complicated, because different portions of the write could be at different stages unless extra steps are taken to avoid compromising the consistency of the stripes involved in the write. Worst is failure of a node that is both a worker and an originator, since it will be the only one with a copy of the data destined for its own disk. (For example, if such a node fails during a partial-stripe write after some of the blocks in the stripe have been written, it may not be possible to reconstruct the state of the entire stripe, violating the rule that a single failure must be masked.)

Our solution to both these concerns is to ensure write atomicity: that is, either a write operation completes successfully, or it makes no changes. Notice that this is a much stronger guarantee than provided by single disk drives or non-parity-protected disk arrays. With these, the content of a range of logical blocks being written to is indeterminate until the write completes successfully. If a write request is aborted or fails, the contents of the targeted range will be in an indeterminate state.
To achieve this guarantee, we added a two-phase commit protocol to write operations. Before a write can proceed, sufficient data must be replicated in more than one node's memory to let the operation restart and complete, even if an arbitrary node fails. If this cannot be achieved, the request is aborted before it can make any changes. (A similar problem occurs in disk controllers that have two-part nonvolatile write caches that must tolerate failure of either half, possibly in conjunction with concurrent RAID disk failures. This issue is discussed in Menon and Courtney [1993]; solutions similar to the one we adopted serve there as well.)

We identified two approaches to implementing the two-phase commit: early commit tries to make the decision as quickly as possible; late commit delays its commit decision until all that is left to do are the writes. We describe them in reverse order, since late commit is the simpler of the two.

In late commit, the commit point (when it decides whether to continue or not) is reached only after the parity has been computed. The reason for this choice is that the computed parity data, suitably distributed, provides exactly the partial redundancy needed. In late commit, all that remains to be done after the commit decision is to perform the writes.

In early commit, the goal is for the array to get to its commit point as quickly as possible during the execution of the request. This requires that the new data destined for the originator/worker node has to be replicated elsewhere, in case the originator fails after commit. The same must be done for old data being read as part of a large-stripe write, in case the reading node fails before it can provide the data. We duplicate this data on the parity nodes of the affected stripes; this involves sending no additional data in the case of parity calculations at the parity node (which we will see below is the preferred policy). The commit point is reached as soon as the necessary redundancy has been achieved.

Late commit is much easier to implement, but has somewhat lower concurrency and higher latency. We explore the magnitude of this cost later.

A write operation is aborted if any node involved does not reach its commit point. When a worker fails, the originator is responsible for restarting the operation. In the event of an originator failure, a temporary originator, chosen from among those nodes that were already participating in the request, is elected to complete or abort processing it. Choosing one of the nodes that was already participating in the request minimizes data and control traffic interchanges, since these nodes already have the necessary information about the request itself.

Table IV. Data needed for recovering a stripe during a write, and the stripe-size strategy used to do so

[Table body: for each failed-node role (such as originator or worker), block type at the failed node, and stripe size (small, large, full), the entry names the node that has a copy of the lost data (originator, parity node, or both) and the strategy to apply (large-stripe or full-stripe), or reads "parity not computed".]

Table IV summarizes the different cases that need to be accommodated. For each combination of node role and block type that has been lost, it shows which node has a copy of the data required for recovery, and the write policy that must be applied to the stripe.

3.4 Multiple Failures: Request Sequences

This section discusses measures designed to help limit the effects of multiple concurrent failures. The RAID architecture tolerates any single disk failure. However, it provides no behavior guarantees in the event of multiple failures (especially power-fail), and it does not ensure the independence of overlapping requests that are executing simultaneously. In the terminology of Lampson and Sturgis [1981], multiple failures are disasters: events outside the covered fault set for RAID. TickerTAIP introduces coverage for partial controller failures; and it goes beyond this by using request sequencing to limit the effects of multiple failures in a way that is useful to file system designers.

As with a regular RAID, a power-fail during a write can corrupt the stripe being written to unless more extensive recovery techniques (such as intentions logging) are used; in this respect, TickerTAIP is exactly emulating the RAID failure model. Power failures can be handled by the use of an uninterruptible power supply for both TickerTAIP and a regular RAID array, but TickerTAIP's request sequencing also provides improved performance to hosts wishing to tolerate crashes and other failures.
Strengthening the regular RAID failure guarantees in the controller follows naturally as a consequence of wanting to maximize performance in the array; in turn, doing so at the lower level allows the host to simplify its own failure recovery mechanisms.

3.4.1 Requirements. File system designers typically rely on the presence of ordering invariants to allow them to recover from crashes or power failure. For example, in 4.2 BSD-based file systems, metadata (inode and directory) writes must occur before the data to which they refer is allowed to reach the disk [McKusick et al. 1984]. The simplest way to achieve this is to defer queueing the data write until the metadata write has completed. Unfortunately, this can severely limit concurrency: for example, parity calculations can no longer be overlapped with the execution of the previous request. This is unfortunate, and becoming more so, as the technology of disk drives improves to include command queueing, immediate reporting, and more nearly optimal request sequencing that exploits position information available only at the disk itself [Seltzer et al. 1990; Jacobson and Wilkes 1991; Ruemmler and Wilkes 1993].

A better way to achieve the desired invariant is to provide, and preserve, partial write orderings in the I/O subsystem. This technique can significantly improve file system performance. From our perspective as RAID array designers, it also allows the RAID array to make more intelligent decisions about request scheduling. We discuss the effects of some of these scheduling decisions later in the article.

A TickerTAIP array can be configured to support multiple hosts. As a result, some mechanism needs to be provided to let requests from different hosts be serialized without recourse to either sending all requests through a single host or requiring one request to complete before the next can be issued.

Finally, multiple overlapping requests from a single or multiple hosts can be in flight simultaneously.
This could lead to parts of one write replacing parts of another in a nonserializable fashion, which clearly should be prevented. (Our write commit protocols provide atomicity for each request, but no serializability guarantees.)

3.4.2 Request Sequencing. To address these requirements, we introduced a request-sequencing mechanism using partial orderings for both reads and writes. Internally, these are represented in the form of directed acyclic graphs (DAGs): each request is represented by a node in the DAG, while the edges of the DAG represent dependencies between requests.

To express the DAG, each request is given a unique identifier. A request is allowed to list one or more requests on which it depends explicitly; TickerTAIP guarantees that the effect is as if no request begins until the requests on which it depends complete (this allows the implementation the freedom to perform some eager evaluation, some of which we exploited in our testbed prototype). If a request is aborted, all requests that depend explicitly on it are also aborted (and so on, transitively).

If a host later wishes to reissue any of the aborted dependent requests, it is free to do so, of course. Having TickerTAIP itself propagate the abort to dependent requests preserves sequencing guarantees without requiring a handshake with the host on every operation in the normal (error-free) case.

An alternative design² would have TickerTAIP enter a special recovery mode once it detects an abort, during which it would execute no requests until all the hosts had acknowledged that they had aborted any and all requests that depended on the failed one. This would push the dependency handling back up into the hosts, but at the cost of a more complicated and fragile protocol. Additionally, giving the dependency data to TickerTAIP allows it to determine which requests can be executed in parallel, thereby improving performance in the normal case.

² Due to one of the anonymous reviewers.

Also, TickerTAIP will arbitrarily assign sufficient implicit dependencies to prevent overlapping requests from executing concurrently. Aborts are not propagated across implicit dependencies. In the absence of explicit dependencies, the order in which requests are serviced is some arbitrary serializable schedule.

3.4.3 Sequencer States. The management of sequencing is performed through a high-level state table, with the following states (diagramed, with their transitions, in Figure 6):

- NotIssued: the request itself has not yet reached TickerTAIP, but another request has referred to this request in its dependency list.
- Unresolved: the request has been issued, but it depends on a request that has not yet reached the TickerTAIP array.
- Resolved: all of the requests that this one depends on have arrived at the array, but at least one has yet to complete.
- InProgress: all of a request's dependencies have been satisfied, so it has begun executing.
- Completed: a request has successfully finished.
- Aborted: a request was aborted, or a request on which this request explicitly depended has been aborted.

[Fig. 6. States of a request. An "antidependent" is a request that this request is waiting for. Transition labels: referenced by another request; issued by a host; antidependents resolved; antidependents completed.]

Old request state has to be garbage-collected. We do this by requiring the hosts to issue their requests sequentially and by keeping track of the number of the oldest incomplete request from each host. When this request completes, any older completed request state can be deleted. Any request that depends on an older request than the oldest recorded one can consider the dependency immediately satisfied.

Aborts are an important exception to this mechanism, since a request that depends on an aborted request should itself be aborted, whenever the original request was aborted, even if this was some considerable time in the past. The simplest solution is to require that a host never issue a request that depends on one that has been aborted, but this would require an unnecessary serialization at the host. As a result, we decided to propagate aborts to other requests already in the TickerTAIP array. Unfortunately, this is not enough: there is a potential race condition between the request being aborted and the host being told about it, and the host ceasing to emit further requests that may depend on the aborted request. Our solution is to maintain state about aborted requests for a guaranteed minimum time: 10 seconds in our prototype. (This is not ideal: in the presence of a large number of cascaded aborts, we may have to delay accepting new commands until the 10 seconds are up. However, we believe this situation is likely to be extremely rare in practice.) Similarly, a time-out on the NotIssued state can be used to detect errors such as a host that never issues a request for which other requests are waiting.

3.4.4 Sequencer Design Alternatives. We considered four designs for the sequencer mechanism:

(1) Fully centralized: a single, central sequencer manages the state table and its transitions. (A primary and a backup are used to eliminate a single point of failure.) In the absence of contention, each request suffers two additional round-trip message latency times: between the originator and the sequencer, and between the sequencer and its backup. One of these is not needed if the originator is co-located with the sequencer.

(2) Partially centralized: a centralized sequencer handles the state table until all the dependencies have been resolved, at which point the responsibility is shifted to the worker nodes involved in the request. This requires that the status of every request be sent to all the workers, to allow them to do the resolution of subsequent transitions. This has more concurrency, but requires a broadcast on every request completion.

(3) Originator driven: in place of a central sequencer, the originator nodes (since there will typically be fewer of these than the workers) conduct a distributed-consensus protocol to determine the overlaps and sequence constraints, after which the partially centralized approach is used. This always generates more messages than the centralized schemes.

(4) Worker driven: the workers are responsible for all the states and their transitions. This widens the distributed-consensus protocol to every node in the array, and still requires the end-of-request broadcast.

Although the higher-numbered of the above designs may increase concurrency, they do so at the cost of increased message traffic and complexity.

We chose to implement the fully centralized model in our prototype, largely because of the complexity of the failure recovery protocols required for the alternatives. As expected, we measured the resulting latency to be that of two round-trip messages (i.e., 440 µs in our prototype) plus a few tens of microseconds of state table management. We believe this additional overhead acceptable for the benefits that request sequencing provides. Nonetheless, we made request sequencing optional for those cases where it is not needed.

3.5 The RAIDmap

Previous sections have presented the policy issues; this one discusses an implementation technique we found useful.
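Before turning to the RAIDmap itself, the request states and abort propagation of Section 3.4.3 can be made concrete with a small sketch. It is an illustration only, not the prototype's code (which was written in C): the class name `Sequencer`, its methods, and the use of strings as request identifiers are assumptions made for the example. It deliberately leaves out garbage collection, the time-outs on the NotIssued and Aborted states, implicit dependencies between overlapping requests, and the backup sequencer.

```python
from enum import Enum


class State(Enum):
    NOT_ISSUED = "NotIssued"
    UNRESOLVED = "Unresolved"
    RESOLVED = "Resolved"
    IN_PROGRESS = "InProgress"
    COMPLETED = "Completed"
    ABORTED = "Aborted"


class Sequencer:
    """Hypothetical centralized sequencer: one state table for all requests."""

    def __init__(self):
        self.state = {}       # request id -> State
        self.deps = {}        # request id -> ids it explicitly depends on
        self.dependents = {}  # request id -> ids that depend on it

    def _ensure(self, rid):
        # A request that has only been referenced so far starts as NotIssued.
        self.state.setdefault(rid, State.NOT_ISSUED)
        self.deps.setdefault(rid, set())
        self.dependents.setdefault(rid, set())

    def issue(self, rid, depends_on=()):
        self._ensure(rid)
        if self.state[rid] is not State.NOT_ISSUED:
            raise ValueError(f"request {rid!r} was already issued or aborted")
        self.deps[rid] = set(depends_on)
        for dep in self.deps[rid]:
            self._ensure(dep)
            self.dependents[dep].add(rid)
        self.state[rid] = State.UNRESOLVED
        self._advance(rid)
        # Requests that were waiting for this one to arrive may now be Resolved.
        for waiter in self.dependents[rid]:
            self._advance(waiter)

    def _advance(self, rid):
        # Only requests that are still waiting on dependencies can move.
        if self.state[rid] not in (State.UNRESOLVED, State.RESOLVED):
            return
        dep_states = [self.state[d] for d in self.deps[rid]]
        if State.ABORTED in dep_states:
            self.abort(rid)
        elif State.NOT_ISSUED in dep_states:
            self.state[rid] = State.UNRESOLVED
        elif all(s is State.COMPLETED for s in dep_states):
            self.state[rid] = State.IN_PROGRESS
        else:
            self.state[rid] = State.RESOLVED

    def complete(self, rid):
        if self.state.get(rid) is not State.IN_PROGRESS:
            raise ValueError(f"request {rid!r} is not in progress")
        self.state[rid] = State.COMPLETED
        for waiter in self.dependents[rid]:
            self._advance(waiter)

    def abort(self, rid):
        self._ensure(rid)
        if self.state[rid] in (State.ABORTED, State.COMPLETED):
            return
        self.state[rid] = State.ABORTED
        # Propagate transitively to everything that depends on this request.
        for waiter in self.dependents[rid]:
            self.abort(waiter)
```

With this sketch, a host that needs a metadata write to reach the disk before the data it describes could call `issue("data", depends_on=["meta"])` and `issue("meta")` back to back, in either order; the data write only becomes InProgress once `complete("meta")` has been recorded, and `abort("meta")` would abort it as well.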
Our first design for sequencing and coordinating requests retained a great deal of centralized authority: the originator tried to coordinate the actions taking place at each of the different nodes (reads and writes, parity calculations). We soon found ourselves faced with the messy problem of coping with a complex set of interdependent actions taking place on multiple remote nodes, and coordinating these proved exceedingly complex, especially so when potential failure modes were taken into consideration.

Stripe   node 0       node 1       node 2       node 3       Type
0        unused       unused       2,0,3 data   -,0,parity   small stripe
1        4,1,2 data   5,1,2 data   -,1,parity   3,1,2 data   full stripe
2        unused       -,2,parity   6,2,1 data   7,2,1 data   large stripe

[Fig. 7. RAIDmap example for a write request spanning logical blocks 2 through 7. Each column represents the blocks on one disk (which equates to a node); each cell contains a four-part tuple: the logical block number, the physical block number on this disk, the node number to send parity data to, and a block type. The stripe type in the rightmost column is discussed in Section 3.2.1.]

To avoid this complexity, we developed a new approach: rather than having the originator tell each worker what it had to do, and coordinate the stream of asynchronous events that resulted, we delegated management of its own work to each worker, and then coded everything to assume that all the nodes were doing what they were supposed to without any further prompting. So, once the workers are told about the original request (its starting point and length, and whether it is a read or a write) and given any data they needed, they can proceed on their own. For example, if node A needs data from node B, A can rely on B to generate and ship the data to A with no further prompting. We call this approach collaborative execution; it is characterized by each node assuming that other nodes are already doing their part of the request. It proved to be an enormous simplification.

To orchestrate all the work, we developed a structure known as a RAIDmap: a two-dimensional array with an entry for each column (worker) and each stripe.³ Each worker builds its column of the RAIDmap, filling in the blanks as a function of the operation (read or write), the layout policy, and the execution policy. The layout policy determines where data and parity blocks are placed in the RAID array (e.g., mirroring, or RAID 4 or 5). The execution policy determines the algorithm used to service the request (e.g., where parity is to be calculated). A simplified RAIDmap is shown in Figure 7.

³ Although the idea of the RAIDmap is more simply described as if the full array was present, in practice there is no need to actually generate all the rows of the array for a very long request, since the inner full-stripe descriptions all look pretty much alike.

One component of the RAIDmap is a state table for each block (the states are described in Table V). It is this table that the layout and execution policies fill out. A request enters the states in order, leaving each one when the associated function has been completed, or immediately if the state is marked as "not needed" for this request. For example, a read request will enter state 1 (to read the data), skip through state 2 to state 3 (to send it to the originator), and then skip through the remaining states.

Table V. State-Space for each Block at a Worker Node

State   Function                                        Other information
1       Read old data                                   disk address
...
5       XOR incoming data with local old data/parity
6       Write new data or parity                        disk address

The RAIDmap proved to be a flexible mechanism, allowing us to test out several different policy alternatives (e.g., whether to calculate partial parity results locally, or whether to send the data to the originator or parity nodes). Additionally, the same technique is used in failure mode: the RAIDmap indicates to each node how it is to behave, but now it is filled out in such a way as to reflect the failure mode operations. Finally, the RAIDmap simplified configuring a centralized implementation, using the same policies and assumptions as in the distributed case.

The goal of any RAID implementation is to maximize disk utilization and minimize request latency. To help achieve this, the RAIDmap computation is overlapped with other operations, such as moving data or accessing disks. For the same reason, workers send data needed elsewhere before servicing their own parity computations or local disk transfers. It also proved important to optimize the disk accesses themselves. When we delayed writes in our prototype implementation until the parity data was available, throughput increased by 25-30% because the data and parity writes were coalesced together, reducing the number of disk seeks needed.

To implement the two-phase commit protocols described in Section 3.3, additional states were added to the worker and originator node state tables. The placement of these additional states determines the commit policy: for early commit, as soon as possible; for late commit, just before the writes.

3.6 Scheduling Disk Accesses

Unlike traditional centralized RAID designs, TickerTAIP provides request atomicity and sequencing to support multiple outstanding requests.
As a result, more than one disk access request can be queued at a worker node at one time, which means that it is beneficial to consider more sophisticated request-scheduling policies inside the array (preserving the write-order invariant determined by the sequencing algorithms, of course). In theory, a worker node could use any of the algorithms proposed in the (fairly extensive) literature on disk scheduling. In practice, we are mostly interested in those that are inexpensive and yet give good performance. We report here on our experiments with four such algorithms.

- first come first served (FCFS): that is, no request reordering; this is what is implemented in the working prototype described below, and is used for the majority of the results we present;
- shortest seek time first (SSTF): the request that has the shortest seek time from the current disk head position is served first;
- shortest access time first (SATF): the request that has the shortest access time (seek time + rotation time) from the current disk head position is served first [Seltzer et al. 1990; Jacobson and Wilkes 1991];
- batched nearest neighbor (BNN): like SATF, except that requests are batched; each time it runs, the scheduler takes all the requests currently in the queue as a batch, and runs the SATF algorithm over them; it does not attempt to serve any new requests until the current batch is finished.

Among these algorithms, SATF generally gives the best throughput when applied to Unix system-like workloads, but can potentially starve requests. BNN remedies this at the cost of a small reduction in throughput. We found that scheduling improved both the throughput and average response time of requests. The improvement depended on the workload and load condition of the array, and (as expected) was largest under heavy loads. The results are reported in Section 4.7.
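The four policies differ only in how the next request is picked from a worker's queue, which a few lines of Python can show. This is a minimal sketch under a toy disk model: the linear seek cost, the 15 ms rotation period, and the names `seek_ms`, `access_ms`, and `BNN` are illustrative assumptions, not the disk model used in the simulator. Requests are `(cylinder, sector)` pairs, and the caller is assumed to remove each chosen request from its queue.

```python
REV_MS = 15.0            # assumed rotation period, roughly 4000 RPM
SECTORS = 72             # sectors per track
SEEK_MS_PER_CYL = 0.01   # assumed linear seek cost; real seek curves are not linear


def seek_ms(head_cyl, cyl):
    return abs(head_cyl - cyl) * SEEK_MS_PER_CYL


def access_ms(head_cyl, now_ms, req):
    """Seek time plus the rotational wait until the target sector comes around."""
    cyl, sector = req
    seek = seek_ms(head_cyl, cyl)
    sector_start = sector / SECTORS * REV_MS          # offset within one revolution
    wait = (sector_start - (now_ms + seek)) % REV_MS  # always in [0, REV_MS)
    return seek + wait


def fcfs(pending, head_cyl, now_ms):
    return pending[0]                                 # no reordering


def sstf(pending, head_cyl, now_ms):
    return min(pending, key=lambda r: seek_ms(head_cyl, r[0]))


def satf(pending, head_cyl, now_ms):
    return min(pending, key=lambda r: access_ms(head_cyl, now_ms, r))


class BNN:
    """SATF applied to a snapshot of the queue; new arrivals wait for the next batch."""

    def __init__(self):
        self.batch = []

    def __call__(self, pending, head_cyl, now_ms):
        if not self.batch:
            self.batch = list(pending)                # take the whole queue as a batch
        choice = satf(self.batch, head_cyl, now_ms)
        self.batch.remove(choice)
        return choice


# Usage: the head is at cylinder 0 at time 0 with two queued requests.
far, near = (100, 60), (5, 70)
queue = [far, near]
first_by_sstf = sstf(queue, 0, 0.0)   # nearest cylinder wins
first_by_satf = satf(queue, 0, 0.0)   # shortest seek-plus-rotation wins
```

In this toy setup SSTF and SATF disagree: the nearer cylinder has the longer rotational wait. The `BNN` object makes the starvation argument visible, since a request that arrives while a batch is being served cannot jump ahead of the requests already in that batch, however well placed it is.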
3.7 Memory Management

The main functionality issue we did not address explicitly in this work is memory buffer management at the originator nodes. In a real system, memory limitations would complicate some of the algorithms presented here. For example, additional flow control might be needed to ensure that the originator would not get swamped if the array was presented with many large requests. However, these costs will be small: by definition, they only show up if the requests are larger than would fit comfortably into the memory of an originator node, so the cost of the flow control will be largely hidden by the cost of moving the data. Alternatively, the originator node might choose to break very large requests up into chunks with some maximum size. This approach is commonly used in disk drive controllers today; the main difference would be the use of much larger chunk sizes to ensure that the array could deliver data at close to its full potential bandwidth.

4. EVALUATING TICKERTAIP

This section presents the vehicles we used to evaluate the design choices and performance of the TickerTAIP architecture.

4.1 The Prototype

We first constructed a working prototype implementation of the TickerTAIP design, including all the fault tolerance features described above. The intent of this implementation was a functional testbed, to help ensure that we had made our design complete. (For example, we were able to test our fault recovery code by telling nodes to "die.") The prototype also let us measure path lengths and obtain early performance data.

Table VI. Characteristics of the HP97560 Disk Drive

Property                      Value
diameter                      5.25"
surfaces                      19 data, 1 servo
formatted capacity            1.3GB
track size                    72 sectors
sector size                   512 bytes
rotation speed                4002 RPM
disk transfer rate            2.2MB/s
SCSI bus transfer rate        5MB/s
controller overhead           1ms
...                           1.67ms
seek time (short seeks)       1.28 + 1.15√d ms
seek time (long seeks)        4.84 + 0.193√d + 0.00494d ms

We implemented the design on an array of seven storage nodes, each comprised of a Parsytec MSC card with a T800 transputer, 4MB of RAM, and a local SCSI interface, connected to a local SCSI disk drive. The disks were spin-synchronized for these experiments. A stripe unit (the block-level interleave unit) was 4KB. Each node had a local SCSI disk, an HP97560 [Hewlett-Packard 1991] with the properties shown in Table VI.

The prototype was built in C to run on Helios [Perihelion 1991], a small, lightweight operating system nucleus. We measured the one-way message latency for short messages between directly connected nodes to be 110 µs, and the peak internode bandwidth to be 1.6MB/s. Performance was limited by the relatively slow processors, and because the design of the Parsytec cards means that they cannot overlap computation and data transfer across their SCSI bus. Nevertheless, the prototype provided useful comparative performance data for design choices, and served as the calibration point of our simulator.

Our prototype comprised a total of 13.3k lines of code, including comments and test routines. About 12k lines of this was directly associated with the RAID functionality.

4.2 The Simulator

We also built a detailed event-driven simulator using the AT&T C++ tasking library [AT&T 1989]. This enabled us to explore the effects of changing link and processor speeds, and to experiment with larger configurations than our prototype. Our model encompassed the following components:

- Workloads: both fixed (all requests of a single type) and imitative (patterns that simulate existing workload patterns); we used a closed queueing model, and the method of independent replications [Pawlikowski 1990] to obtain steady-state measurements.
- Host: a collection of workloads sharing an access port to the TickerTAIP array; disk driver path lengths were estimated from measurements made on our local HP-UX systems.
- TickerTAIP nodes (workers and originators): code path lengths were derived from measurements of the algorithms running on the working prototype and on HP-UX workstations (we assumed that the Parsytec MSC limitations would not occur in a real design).
- Disk: we modeled the HP97560 disks as used on the prototype implementation, using data taken from measurements of the real hardware. The disk model was fairly detailed, and included:
  - the seek time profile from Table VI;
  - longer settling times for writes than reads (the disk can afford to be optimistic about head positioning for reads, but not for writes);
  - track- and cylinder-skews, including track- and cylinder-switch times incurred during a data transfer;
  - rotation position; and
  - SCSI bus and controller overheads, including overlapped data transfers from the mechanism into a disk track buffer and transmissions across the SCSI bus (the granularity used was 4KB).
- Links: represent communication channels such as the small-area network and the SCSI buses. We report here data from a complete point-to-point interconnect design with a DMA engine per link, since this is both the simplest topology and the one from which it is easiest to extrapolate to the effects of other designs. Our preliminary studies suggest that similar results would be obtained from mesh-based switching fabrics. We did not assume multicast capabilities.

Under the same design choice and performance parameters, our simulation results agreed with the prototype (real) implementation within 3% most of the time, and always within 6%. This gave us confidence in the predictive abilities of the simulator.
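The method of independent replications mentioned for the workload model amounts to running the same experiment several times with independent random streams, discarding a warm-up prefix in each run, and reporting the mean together with the spread across runs. A minimal sketch follows; the function names, the stand-in `toy_run` workload, and the choice of the sample standard deviation are assumptions for illustration, not details taken from the simulator.

```python
import random
import statistics


def replicate(run_once, replications, seed=0):
    """Run independent replications; return (mean, relative standard deviation)."""
    # Each replication gets its own seeded generator so the runs are independent
    # of one another and the whole experiment is repeatable.
    results = [run_once(random.Random(seed + i)) for i in range(replications)]
    mean = statistics.mean(results)
    rel_sd = statistics.stdev(results) / mean   # needs at least two replications
    return mean, rel_sd


def toy_run(rng, warmup=100, measured=2000):
    """Stand-in for one simulation run: mean latency after discarding warm-up."""
    samples = [rng.expovariate(1 / 30.0) for _ in range(warmup + measured)]
    return statistics.mean(samples[warmup:])


mean_latency, rel_sd = replicate(toy_run, replications=5)
```

Reporting `rel_sd` alongside each mean is what allows statements of the form "all relative standard deviations were less than 2%" to be checked against the raw runs.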
The system we evaluate here is a RAID5 disk array with left-symmetric parity [Lee 1990] (the same data layout shown in Table II and Figure 7), stripes composed from a 4KB block on each disk, spin-synchronized disks, FIFO disk scheduling (except where noted), and without any data replication, spare blocks, floating parity, or indirect-writes for data or parity [English and Stepanov 1992; Menon and Kasson 1992]. The configuration simulated had 4 hosts and 11 worker nodes, with each worker node having a single HP97560 disk attached to it via a 5MB/s SCSI bus. Four of the nodes were both originators and workers; for simplicity, and since we were most concerned here about exploring the effects of the internal design choices, we used only a single infinite-speed connection between each host and the array.

Table VII. Read performance for fixed-size workloads, with varying link speeds (all relative standard deviations were less than 2%)

Request size   Throughput (MB/s)   Latency (ms) at link speed
                                   1MB/s    10MB/s        100MB/s
4KB            0.94                33       31            30
40KB           1.79                38       34            33
1MB            15.2                178      (illegible)   76
10MB           21.1                1520     610           520

Except for the results in Section 4.7, the throughput numbers reported were obtained when the system was driven to saturation; response times were obtained with only one request in the system at a time. For the throughput measurements we timed 10,000 requests in each run; for latency we timed 2000 requests. Each data point on a graph represents the average of two independent runs, with all relative standard deviations less than 1.59%, and most less than 0.5%. Each value in a table is the average of five such runs; the relative standard deviations are reported with each table. In Section 4.7, our throughput and response time numbers are means of 5 simulations, each consisting of 10,000 requests. Nearly all relative standard deviations for the data points in Section 4.7 are less than 1.0%, although a few (on the OLTP workload) were as high as 6.0%. In all cases, 100 requests were run to completion through the simulator before any measurements were taken, to minimize startup effects.

4.3 Read Performance

Table VII shows the performance of our simulated n-disk array for random read requests across a range of link speeds. The data show no significant difference in throughput for any link speed above 1MB/s, but 10MB/s or more are needed to minimize request latencies for the larger transfers.

4.4 Write Performance: An Exploration of the Design Alternatives

We first consider the effect of the large-stripe policy. Figure 8 shows the result: the difference is small, but enabling the large-write policy always resulted in a slight improvement in throughput at the expense of a slight increase in latency. We chose to enable the large-stripe mode for the remainder of the experiments.

[Fig. 8. Effect of enabling the large-stripe parity computation policy for writes larger than one half the stripe. Panels: throughput and response time versus request size (KB, log scale) at 10 MIPS and 10MB/s, with and without the policy.]

Next, we compared the at-originator and at-parity policies for parity calculation. Figure 9 gives the results: at-parity is significantly better than at-originator, with the differences largest (as expected) at larger write sizes and with lower processor speeds. This is due to the at-parity algorithm spreading the parity calculation across several processors more evenly, so we used it for the remainder of our experiments.

[Fig. 9. Effect of parity calculation policy on throughput and response times for 1MB random writes. Panels: throughput and response time versus link speed (MB/s) and CPU speed (MIPS), at-originator versus at-parity.]

The effect of the late-commit protocol on performance is shown in Figure 10: the effect on throughput is small (< 2%), but the effect on response time is more marked, with the late commit point increasing response time by up to 20%. This is because the commit point is acting as a synchronization barrier, which prevents some of the usual overlaps between disk accesses and other operations. For example, a disk that is only doing writes for a request will not start seeking until the commit message is given. The delay that results could presumably be reduced by sending the disk a seek command to position its head while the parity computation was occurring, although we have not performed this experiment because the effect will only show up on an otherwise-idle disk array.

Although not shown, the performance of early commit is slightly better than that of late commit, but not as good as no commit. As a result, we recommend late commit as the preferred design choice: its throughput is almost as good as no commit protocol at all, and it is much easier to implement than early commit.

[Fig. 10. Effect of the late-commit protocol on write throughput and response time. Panels: throughput and response time versus request size (KB, log scale) at 10 MIPS and 10MB/s, late commit versus no commit.]

4.5 Comparison with a Centralized RAID Array

How would a TickerTAIP array compare with a traditional centralized RAID array? This section answers that question. We simulated both the same n-node TickerTAIP system as before and an n-disk centralized RAID. The simulation components and algorithms used in the two cases were the same: our goal was to provide a direct comparison of the two architectures, uncontaminated by other factors. The centralized array was modeled as a single, dedicated originator node, together with a set of worker nodes that did read and write operations only when directed to do so. For amusement, we also provide data for a 10-disk nonfault-tolerant striping array implemented using the TickerTAIP architecture.

The results for 10MIPS processors with 10MB/s links are shown in Figure 11: clearly a nondisk bottleneck is limiting the throughput of the centralized system for request sizes larger than 256KB, and its response time for requests larger than 32KB. The obvious candidate is a CPU bottleneck from parity calculations, and this is indeed what we found. To show this, we plot performance as a function of CPU and link speed (Figure 12), and both varying together (Figure 13). These graphs show that changing the CPU speed has a marked effect on the performance of the centralized case for 1MB writes, but a much smaller effect on TickerTAIP. They also show that the TickerTAIP architecture is successfully exploiting load balancing to achieve similar (or better) throughput and response times with less powerful processors than the centralized architecture. For 1MB write requests, TickerTAIP's 5MIPS processors and 2-5MB/s links give comparable throughput to 25MIPS and 5MB/s for the centralized array. The centralized array needs a 50MIPS processor to get similar response times as TickerTAIP.
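The parity arithmetic whose cost drives this comparison is a byte-wise XOR over blocks. The sketch below shows the standard RAID 5 identities in Python; the function names are hypothetical, the code illustrates the arithmetic only and says nothing about how the prototype schedules it. In an at-parity arrangement, each data node would compute the XOR of its old and new data and ship that result to the parity node, which folds it into the old parity, so the XOR work is spread over several processors instead of landing on one.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of one or more equally sized blocks."""
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def full_stripe_parity(data_blocks):
    """Parity for a full-stripe write: XOR of every data block in the stripe."""
    return xor_blocks(*data_blocks)


def small_write_parity(old_parity, old_data, new_data):
    """Read-modify-write update: new parity = old parity ^ old data ^ new data."""
    return xor_blocks(old_parity, old_data, new_data)


def rebuild_block(parity, surviving_data_blocks):
    """Recover a single lost data block from the parity and the surviving blocks."""
    return xor_blocks(parity, *surviving_data_blocks)


# Usage: a three-data-block stripe, then an update to the middle block.
stripe = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
parity = full_stripe_parity(stripe)
new_block = bytes([0xFF, 0x00, 0xAA, 0x55])
updated_parity = small_write_parity(parity, stripe[1], new_block)
```

Every byte written costs at least one XOR pass through a processor, which is why a single originator doing all of this work saturates long before the disks do.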
Finally, we looked at the effect of scaling the number of workers in the array, with both constant request size (400KB) and a varying one with a fixed amount of data per disk (ten full stripes, however large a stripe becomes). In these experiments, four of the worker nodes were also originators. The results are seen in Figure 14. With constant request size, the performance grows only slightly with larger number of disks. This is exactly as expected: as the number of disks increases, the fixed-size 400KB request touches a smaller fraction of the stripe size, so the disks get to do less useful work. On the other hand, the performance improvement shown as the request size is scaled up with the number of disks shows almost perfect linearity. (In practice, at some point the host links would become a bottleneck.) We believe these data are a strong vindication of our scalability claims for TickerTAIP.

[Fig. 11. Write throughput and response time for three different array architectures. Panels: throughput and response time versus request size (KB, log scale) with 10 MIPS CPUs and 10MB/s links, for the centralized, TickerTAIP, and striping arrays.]

4.6 Synthetic Workloads

The results reported so far have been from fixed, constant-sized workloads. To test our hypothesis that TickerTAIP performance would scale as well over some other workload mixtures, we tested a number of additional workloads designed to model "real-world" applications:

- OLTP: based on the TPC-A database benchmark [Dietrich et al. 1992];
- timeshare: based on measurements of a local Unix timesharing system [Ruemmler and Wilkes 1993];
- scientific: based on measurements taken from supercomputer applications running on a Cray [Miller and Katz 1991]; "large" has a mean request size of about 0.3 MB; "small" has a mean around 30KB.

[Fig. 12. Throughput and response time as a function of both CPU and link speed. 1MB random writes.]

Table VIII gives the throughputs of the disk arrays under these workloads for a range of processor and link speeds. As expected, TickerTAIP outperforms the centralized architecture at lower CPU speeds, although both are eventually able to drive the disks to saturation, mostly because the request sizes are quite small. TickerTAIP's availability is still higher, of course.

4.7 The Effect of Scheduling Individual Disk Accesses

Our previous results used the simplest possible request-scheduling algorithm, FCFS, at the disk device drivers in the worker nodes. In this section we explore the effects of changing this scheduling algorithm. Clearly, this will have little effect when the queue sizes seen at the disk are small, but our early experiments led us to believe that they can sometimes get quite large (especially when operating near saturation), at which point a better scheduling algorithm is quite likely to produce a marked improvement in performance. This is indeed what we found.

[Fig. 13. Throughput and response time as a function of both CPU and link speeds. 1MB random writes.]

To demonstrate this, Figure 15 shows the results of applying the SSTF, SATF, and BNN scheduling algorithms on workloads comprised of fixed-sized 40KB writes and 1MB random writes, as well as the synthetic OLTP workload. The graphs show that better scheduling algorithms can nearly double the throughput under the 40KB write and OLTP workloads. The improvement is smaller for 1MB writes: the gaps between the individual I/Os are larger, so the effect of improving the scheduling is less noticeable. Similar results are seen on the response time graphs. Our initial results suggest that both SATF and BNN are good candidates for scheduling algorithms. Currently, we prefer BNN because of its inherent starvation-resistant properties.

[Fig. 14. Effect of TickerTAIP array size on performance. Panels: throughput versus number of nodes (n), for 400KB writes and for scaled write sizes with a request size of (n-1) x 40KB.]

[Table VIII. Throughputs, in MB/s, of the three array architectures under different workloads (OLTP, timeshare, small scientific, large scientific) for a range of processor (MIPS) and link (MB/s) speeds. The shading highlights comparable total-MIPS configurations. Relative standard deviations are shown in parentheses.]
[Figure 15 panels: throughput (left) and mean response time (right) vs. number of loads per host, for the SATF, BNN, SSTF, and FCFS policies. (a), (b) Fixed-size 40KB writes. (c), (d) Fixed-size 1MB writes. (e), (f) Synthetic OLTP workload.]

Fig. 15. Effect of different disk-level request-scheduling algorithms on TickerTAIP performance. The left-hand graphs display throughput as a function of load and scheduling policy, the right-hand ones the response time.

5. CONCLUSIONS

TickerTAIP is a new parallel architecture for RAID arrays. Our experience is that it is eminently practical (for example, our prototype implementation took only 12k lines of commented code). The TickerTAIP architecture exploits its physical redundancy to tolerate any single point of failure, including single failures in its distributed controller; it is scalable across a range of system sizes, with smooth incremental growth; and its worker/originator node model provides considerable configuration flexibility. Further, we have shown how to provide (and have prototyped) partial-write ordering to support clean semantics in the face of multiple outstanding requests and multiple faults, such as power failures. We have also demonstrated that, at least in this application, eleven 5MIPS processors are just as good as a single 50MIPS one, and provided quantitative data on how central and parallel RAID array implementations compare. Finally, we have demonstrated the performance improvements available from more sophisticated request-scheduling algorithms.

Most of the performance differences between the TickerTAIP and centralized designs result from the cost of doing parity calculations, and this turns out to be the main thing that changes with the processor speed: most of the other CPU-intensive work is hidden by disk delays. One suggestion that has been made is to improve the lackluster performance of the centralized case by constructing dedicated XOR engines. Unfortunately, it seems that the resulting systems can be unwieldy, in part because of the high memory speeds at which they have to operate. Because the cost of a processor increases faster than linearly with its performance (much of the cost is in the memory system rather than the processor itself), tackling the XOR-speed problem this way is unproductive. It is our contention that off-the-shelf microprocessors are, in fact, cost-effective XOR engines, and they bring with them all their advantages of economies of scale in manufacturing cost, design time, and reliability. Thus it is better, we believe, to use an approach like TickerTAIP to divide up and parallelize the work to the point where such microprocessors can be used.

With small request sizes, it is easy for either architecture to saturate the disks. With larger requests, the difference becomes more marked as the cost of performing parity calculations becomes more significant. The difference is also likely to increase as multiple disks are added to each worker, and as smarter disk-scheduling algorithms are included. Both improvements are obvious upgrade paths for TickerTAIP (indeed, we have demonstrated one of them here); both will make the TickerTAIP architecture even more attractive than the centralized model.

We recommend the TickerTAIP parallel RAID architecture to future disk array implementers. Additionally, it is well suited for use in multicomputers with locally attached disks. In this case, it can provide RAID resilience without any dedicated or specialized hardware beyond that already provided for the multicomputer itself.

ACKNOWLEDGMENTS

The TickerTAIP work was done as part of the DataMesh research project at Hewlett-Packard Laboratories [Wilkes 1992]. The prototype implementation is based loosely on a centralized version written by Janet Wiener, and uses the SCSI disk driver developed by Chia Chao. David Jacobson improved the AT&T tasking library to use a double for its time value. Chris Ruemmler helped us improve our disk models. Federico Malucelli provided significant input into our understanding of the sequencing and request-scheduling options. We also thank the IEEE for allowing us permission to publish this article, and the anonymous ACM reviewers for helping us to improve it.

Finally, whence the name? Because tickerTAIP is used in all pa(rallel)RAIDs!

REFERENCES

AT&T. 1989. UNIX System V AT&T C++ language system release 2.0 selected readings. Select Code 307-144. AT&T, Indianapolis.

BARTLETT, J., BARTLETT, W., CARR, R., GARCIA, D., GRAY, J., HORST, R., JARDINE, R., LENOSKI, D., AND MCGUIRE, D. 1990. Fault tolerance in Tandem computer systems. Tech. Rep. 90.5, Tandem Computers, Cupertino, Calif.

BORAL, H. 1988. Parallelism and data management. Tech. Rep. ACA-ST-156-88, Microelectronics and Computer Technology Corporation, Austin, Tex.

CHEN, P. M., GIBSON, G. A., KATZ, R. H., AND PATTERSON, D. A. 1990. An evaluation of redundant arrays of disks using an Amdahl 5890. In Proceedings of ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. ACM, New York, 74-85.

CLARK, B. E., LAWLOR, F. D., SCHMIDT-STUMPF, W. E., STEWART, T. J., AND TIMMS, G. D. JR. 1988. Parity spreading to enhance storage access. U.S. Patent 4,761,785; filed 12 June 1986; granted 2 August 1988.

COPELAND, G., ALEXANDER, W., BOUGHTER, E., AND KELLER, T. 1988. Data placement in Bubba. In Proceedings of 1988 SIGMOD International Conference on Management of Data. ACM, New York, 99-108.

DEWITT, D. J., GERBER, R. H., GRAEFE, G., HEYTENS, M. L., KUMAR, K. B., AND MURALIKRISHNA, M. 1986. GAMMA-a high performance dataflow database machine. In Proceedings of the 12th International Conference on Very Large Data Bases. VLDB Endowment, 228-237.

DEWITT, D. J., GHANDEHARIZADEH, S., AND SCHNEIDER, D. 1988. A performance analysis of the Gamma database machine. In Proceedings of 1988 SIGMOD International Conference on Management of Data. ACM, New York, 350-360.

DIETRICH, S. W., BROWN, M., CORTES-RELLO, E., AND WUNDERLIN, S. 1992. A practitioner's introduction to database performance benchmarks and measurements. Comput. J. 35 (Aug.), 322-331.

DRAPEAU, A. L., SHIRRIFF, K. W., HARTMAN, J. H., MILLER, E. L., SESHAN, S., KATZ, R. H., LUTZ, K., PATTERSON, D. A., LEE, E. K., CHEN, P. M., AND GIBSON, G. A. 1994. RAID-II: A high-bandwidth network file server. In Proceedings of 21st International Symposium on Computer Architecture. IEEE, New York, 234-244.

DUNPHY, R. H. JR., WALSH, R., AND BOWERS, J. H. 1990. Disk drive memory. U.S. Patent 4,914,656; filed 28 June 1988; granted 3 April 1990.

ENGLISH, R. M. AND STEPANOV, A. A. 1992. Loge: A self-organizing storage device. In Proceedings of USENIX Winter'92 Technical Conference. USENIX Assoc., Berkeley, Calif., 237-251.

GIBSON, G. A., HELLERSTEIN, L., KARP, R. M., KATZ, R. H., AND PATTERSON, D. A. 1989. Failure correction techniques for large disk arrays. In Proceedings of 3rd International Conference on Architectural Support for Programming Languages and Operating Systems. Oper. Syst. Rev. 23 (Apr.), 123-132.

GRAY, J. 1988. A comparison of the Byzantine agreement problem and the transaction commit problem. Tech. Rep. 88.6, Tandem Computers, Cupertino, Calif.

GRAY, J. N. 1978. Notes on data base operating systems. In Operating Systems: An Advanced Course. Lecture Notes in Computer Science, vol. 60. Springer-Verlag, Berlin, 393-481.

GRAY, J., HORST, B., AND WALKER, M. 1990. Parity striping of disc arrays: Low-cost reliable storage with acceptable throughput. In Proceedings of 16th International Conference on Very Large Data Bases. VLDB Endowment, 148-159.

HEWLETT-PACKARD. 1988b. HP 7936 and HP 7937 Disc Drives: Operating and Installation Manual. Part No. 07937-90902. Hewlett-Packard Company, Boise, Idaho.

HEWLETT-PACKARD. 1991. HP 97556, 97558, and 97560 5.25-inch SCSI Disk Drives: Technical Reference Manual. Part No. 5960-0115. Hewlett-Packard Company, Boise, Idaho.

HOLLAND, M. AND GIBSON, G. A. 1992. Parity declustering for continuous operation in redundant disk arrays. In Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems. Comput. Arch. News 20, 23-35.

JACOBSON, D. M. AND WILKES, J. 1991. Disk scheduling algorithms based on rotational position. Tech. Rep. HPL-CSP-91-7, Hewlett-Packard Laboratories, Palo Alto, Calif.

LAMPSON, B. W. AND STURGIS, H. E. 1981. Atomic transactions. In Distributed Systems - Architecture and Implementation: An Advanced Course. Lecture Notes in Computer Science, vol. 105. Springer-Verlag, New York, 246-265.

LAWLOR, F. D. 1981. Efficient mass storage parity recovery mechanism. IBM Tech. Disclos. Bull. 24, 2 (July), 986-987.

LEE, E. K. 1990. Software and performance issues in the implementation of a RAID prototype. Tech. Rep. UCB/CSD 90/573, Computer Science Div., Dept. of Electrical Engineering and Computer Science, Univ. of California, Berkeley, Calif.

MCKUSICK, M. K., JOY, W. N., LEFFLER, S. J., AND FABRY, R. S. 1984. A fast file system for UNIX. ACM Trans. Comput. Syst. 2, 3 (Aug.), 181-197.

MENON, J. AND COURTNEY, J. 1993. The architecture of a fault-tolerant cached RAID controller. In Proceedings of 20th International Symposium on Computer Architecture. IEEE, New York, 76-86.

MENON, J. AND KASSON, J. 1992. Methods for improved update performance of disk arrays. In Proceedings of 25th International Conference on System Sciences. Vol. 1. IEEE, New York, 74-83.

MILLER, E. L. AND KATZ, R. H. 1991. Analyzing the I/O behavior of supercomputer applications. In Digest of Papers, 11th IEEE Symposium on Mass Storage Systems. IEEE, New York, 51-59.

MUNTZ, R. R. AND LUI, J. C. S. 1990. Performance analysis of disk arrays under failure. In Proceedings of 16th International Conference on Very Large Data Bases. VLDB Endowment, 162-173.

PATTERSON, D. A., CHEN, P., GIBSON, G., AND KATZ, R. H. 1989. Introduction to redundant arrays of inexpensive disks (RAID). In Spring COMPCON'89. IEEE, New York, 112-117.

PATTERSON, D. A., GIBSON, G., AND KATZ, R. H. 1988. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of 1988 SIGMOD International Conference on Management of Data. ACM, New York.

PAWLIKOWSKI, K. 1990. Steady-state simulation of queueing processes: A survey of problems and solutions. ACM Comput. Surv. 22, 2 (June), 123-170.

PCI. 1994. PCI Specification. Intel Corporation, Hillsboro, Or.

PERIHELION SOFTWARE. 1991. The Helios Parallel Operating System. Prentice-Hall International, London.

RUEMMLER, C. AND WILKES, J. 1993. UNIX disk access patterns. In Proceedings of Winter 1993 USENIX. USENIX Assoc., Berkeley, Calif., 405-420.

SCHULZE, M. E. 1988. Considerations in the design of a RAID prototype. Tech. Rep. UCB/CSD 88/448, Computer Science Div., Dept. of Electrical Engineering and Computer Science, Univ. of California, Berkeley, Calif.

SCHULZE, M., GIBSON, G., KATZ, R., AND PATTERSON, D. 1989. How reliable is a RAID? In Spring COMPCON'89. IEEE, New York, 118-123.

SCSI. 1991. Draft proposed American National Standard for information systems - Small Computer System Interface-2 (SCSI-2). Draft ANSI standard X3T9.2/86-109, 2 February 1991 (revision 10d). Secretariat, Computer and Business Equipment Manufacturers Association.

SELTZER, M., CHEN, P., AND OUSTERHOUT, J. 1990. Disk scheduling revisited. In Proceedings of Winter 1990 USENIX Conference. USENIX Assoc., Berkeley, Calif., 313-323.

SHIN, K. G. 1991. HARTS: A distributed real-time architecture. IEEE Comput. 24, 5 (May), 25-35.

SIEWIOREK, D. P. AND SWARZ, R. S. 1992. Reliable Computer Systems: Design and Evaluation. 2nd ed. Digital Press, Bedford, Mass.

SLOAN, R. D. 1992. A practical implementation of the database machine - Teradata DBC/1012. In Proceedings of 25th International Conference on System Sciences. Vol. 1. IEEE, New York, 320-327.

STONEBRAKER, M. 1989. Distributed RAID - a new multiple copy algorithm. Tech. Rep. UCB/ERL M89/56, Electronics Research Lab., Univ. of California, Berkeley, Calif.

WILKES, J. 1991. The DataMesh research project. In Transputing'91. Vol. 2. IOS Press, Amsterdam, 547-553.

WILKES, J. 1992. DataMesh research project, phase 1. In USENIX Workshop on File Systems. USENIX Assoc., Berkeley, Calif., 63-69.

Received October 1993; revised May 1994; accepted June 1994
