RAID Survey



       P. M. Chen, U. Michigan
       E. K. Lee, DEC SRC
       G. A. Gibson, CMU
       R. H. Katz, U. C. Berkeley
       D. A. Patterson, U. C. Berkeley
• The seven RAID organizations
• Why RAID-1, RAID-3 and RAID-5 are the most popular organizations
• The small write problem occurring with RAID-5
   – Possible solutions
• Review of actual implementations
            Original Motivation
• Replacing large and expensive mainframe hard
  drives (IBM 3380) with several cheaper
  Winchester disk drives
• Will work but introduces a data reliability problem:
   – Assume the MTTF (mean time to failure) of a disk
     drive is 30,000 hours
   – The MTTDL (mean time to data loss) for a set of
     n drives is 30,000/n hours
      • n = 10 means an MTTDL of 3,000 hours
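
The arithmetic behind these figures is easy to check; a minimal Python
sketch using the slide's numbers (the function name is illustrative):

    def mttdl(mttf_hours: float, n_drives: int) -> float:
        # Data are lost as soon as any one of the n drives fails,
        # so the array's MTTDL is the single-drive MTTF divided by n.
        return mttf_hours / n_drives

    print(mttdl(30_000, 1))    # 30000.0 hours for one drive
    print(mttdl(30_000, 10))   # 3000.0 hours, i.e. about four months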
          Today’s Motivation
• “Cheap” SCSI hard drives are now big enough
  for most applications
• We use RAID today for
   – Increasing disk throughput by allowing parallel accesses
   – Eliminating the need to make disk backups
      • Disk drives are too big to be backed up in
        an efficient fashion
                   RAID 0
• Spread data over multiple disk drives
• Advantage
   – Simple to implement
   – Fast
• Disadvantage
   – Very unreliable
      • RAID 0 with n disks has an MTTF equal to 1/n
        of the MTTF of a single disk
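
A minimal sketch of the round-robin block mapping behind striping; the
function name is illustrative, and real arrays usually stripe in
multi-block chunks:

    def raid0_map(logical_block: int, n_disks: int) -> tuple[int, int]:
        # Consecutive logical blocks land on consecutive drives, so
        # sequential transfers keep all drives busy in parallel.
        disk = logical_block % n_disks
        offset = logical_block // n_disks
        return disk, offset

    # Logical blocks 0..7 on a 4-disk array hit disks 0,1,2,3,0,1,2,3.
    for b in range(8):
        print(b, raid0_map(b, 4))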
                     RAID 1
• Mirroring
   – Two copies of each disk block on
     two separate drives
• Advantages
   – Simple to implement and fault-tolerant
• Disadvantage
   – Requires twice the disk capacity of a normal file system
                    RAID 2
• Instead of duplicating the data blocks, we use an
  error correction code
• Very bad idea because disk drives either work
  correctly or do not work at all
   – Only possible errors are omission errors
   – We need an omission correction code
      • A parity bit is enough to correct a single
        omission
                    RAID 2

[Figure: data disks plus error-correction disks]
                    RAID 3
• Requires N+1 disk drives
  – N drives contain data
      • 1/N of each data block on each drive
      • Block b[k] now partitioned into N fragments
        b[k,1], b[k,2], ... b[k,N]
  – Parity drive contains the exclusive or (XOR) of
    these N fragments
            p[k] = b[k,1] ⊕ b[k,2] ⊕ ... ⊕ b[k,N]
                    RAID 3

[Figure: data disks plus a single parity disk]

A stripe consists of a single block
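
Parity generation and the reconstruction of a lost fragment are the same
XOR operation; a minimal sketch, with made-up fragment values:

    def xor_parity(fragments: list[bytes]) -> bytes:
        # Byte-wise XOR of all fragments; with the parity included,
        # the same function regenerates any single missing fragment.
        out = bytearray(len(fragments[0]))
        for frag in fragments:
            for i, byte in enumerate(frag):
                out[i] ^= byte
        return bytes(out)

    frags = [b'\x0f\x00', b'\xf0\x00', b'\xff\x0f']
    p = xor_parity(frags)                          # parity fragment p[k]
    recovered = xor_parity([frags[0], frags[2], p])
    assert recovered == frags[1]                   # lost fragment rebuilt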
                    RAID 4
• Requires N+1 disk drives
  – N drives contain data (individual blocks)
  – Parity drive contains the exclusive or of the
    N blocks in the stripe
         p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
               RAID 4

[Figure: data disks plus a parity disk; a stripe now contains
multiple blocks – with four drives, 25% parity and 75% data]
                     RAID 5
• The single parity drive of RAID-4 is involved in
  every write
   – Will limit parallelism
• RAID-5 distributes the parity blocks among the
  N+1 drives
     RAID 5

[Figure: parity blocks distributed across all disks; each stripe
is still 25% parity and 75% data]
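
A minimal sketch of one possible placement rule, assuming a simple
rotation (actual implementations use variants such as the
left-symmetric layout):

    def parity_disk(stripe: int, n_disks: int) -> int:
        # Rotate the parity block across all n_disks (= N+1) drives so
        # no single drive absorbs every parity update.
        return stripe % n_disks

    for stripe in range(5):
        print(f"stripe {stripe}: parity on disk {parity_disk(stripe, 4)}")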
       The small write problem
• Specific to RAID 5
• Happens when we want to update a single block
   – Block belongs to a stripe
   – How can we compute the new value of the
     parity block?

      b[k]    b[k+1]    b[k+2]   ...    p[k]
                First solution
• Read the values of the N-1 other blocks in the stripe
• Recompute
   p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]

• Solution requires
   – N-1 reads
   – 2 writes (new block and parity block)
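
A sketch of this reconstruct-write approach, assuming equal-size blocks
(the function name is illustrative):

    def parity_by_reconstruction(new_block: bytes,
                                 other_blocks: list[bytes]) -> bytes:
        # XOR the new block with the N-1 unchanged blocks of the stripe;
        # the caller then writes new_block and this parity (2 writes).
        parity = bytearray(new_block)
        for blk in other_blocks:
            for i, byte in enumerate(blk):
                parity[i] ^= byte
        return bytes(parity)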
             Second solution
• Assume we want to update block b[m]
• Read old values of b[m] and parity block p[k]
• Compute
   p[k] = new b[m] ⊕ old b[m] ⊕ old p[k]

• Solution requires
   – 2 reads (old values of block and parity block)
   – 2 writes (new block and parity block)
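
A sketch of this read-modify-write approach; because XOR is its own
inverse, removing the old data from the parity and folding in the new
data is a single expression:

    def parity_by_rmw(new_b: bytes, old_b: bytes, old_p: bytes) -> bytes:
        # new parity = new data XOR old data XOR old parity
        return bytes(n ^ o ^ p for n, o, p in zip(new_b, old_b, old_p))

Either way, a small write costs four disk accesses instead of two, which
is why the throughput table below charges RAID 5 small writes a factor
of max(1/G, 1/4).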
                   RAID 6
• Each stripe has two redundant blocks:
   – P + Q redundancy
• Advantage
   – Much higher reliability
• Disadvantage:
   – Costlier updates
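
The slides leave the P + Q encoding abstract; one common realization is
Reed-Solomon-style coding over GF(2^8). A minimal sketch, assuming the
customary generator g = 2 and field polynomial 0x11D (an illustration,
not a prescribed code):

    def gf_mul(a: int, b: int) -> int:
        # Multiply two bytes in GF(2^8), polynomial x^8+x^4+x^3+x^2+1.
        p = 0
        for _ in range(8):
            if b & 1:
                p ^= a
            carry = a & 0x80
            a = (a << 1) & 0xFF
            if carry:
                a ^= 0x1D
            b >>= 1
        return p

    def pq_parity(data_blocks: list[bytes]) -> tuple[bytes, bytes]:
        # P is plain XOR parity; Q weights block i by g**i in GF(2^8),
        # which lets any two lost blocks be recovered by solving two
        # independent equations.
        size = len(data_blocks[0])
        P, Q = bytearray(size), bytearray(size)
        coeff = 1                       # g**i, starting at g**0 = 1
        for block in data_blocks:
            for j, byte in enumerate(block):
                P[j] ^= byte
                Q[j] ^= gf_mul(coeff, byte)
            coeff = gf_mul(coeff, 2)    # advance to the next power of g
        return bytes(P), bytes(Q)

    P, Q = pq_parity([b'\x01\x02', b'\x03\x04', b'\x05\x06'])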
• Focus on system throughput
• Measure it against system cost expressed in
  number of disk drives
         Throughput per dollar
  (G = parity group size; values are relative to RAID 0)

          Small        Small          Large      Large
          read         write          read       write
RAID 0      1            1              1          1
RAID 1      1           1/2             1         1/2
RAID 3     1/G          1/G          (G-1)/G    (G-1)/G
RAID 5      1      max(1/G, 1/4)       1        (G-1)/G
RAID 6      1      max(1/G, 1/6)       1        (G-2)/G
• Performance per dollar of RAID 3 is always less
  than or equal to that of a RAID 5 system
• For small writes,
   – RAID 3, 5 and 6 are equally cost-effective at
     small group sizes
   – RAID 5 and 6 are better for large group sizes
• Theoretical reliability is very high
   – Especially for RAID 6
• In practice,
   – System crashes can cause
     parity inconsistencies
   – Uncorrectable bit errors can happen during
     repair times (one in 10^14 bits)
   – Correlated disk failures happen!
 Impact of parity inconsistencies
• Happen when the system crashes during an update
   – New data were written but parity block was
     not updated
• Has little impact on RAID 3 (bad block)
• Significant impact on RAID 5
• Bigger impact on RAID 6
   – Same as simultaneous failures of both P & Q blocks
• System crashes and unrecoverable bit errors
  have the biggest effect on MTTDL
• P + Q redundant disks protect against correlated
  disk failures and unrecoverable bit errors
   – Still vulnerable to system crashes
   – Should use NVRAM for write buffers
• Must prevent users from reading corrupted data
  from a failed disk
   – Mark blocks located on the failed disk
   – Mark reconstructed blocks valid
• To avoid regenerating all parity blocks after a crash
   – Must keep track of parity consistency and
     store it in stable storage
• Maintaining consistent/inconsistent state
  information for all parity blocks is a problem for
  software RAID systems
   – Rarely have NVRAM
• If updates are local, keep track in stable storage
  of the small number of parity blocks that could be
  inconsistent
• Otherwise use group commits
• Asynchronous writes can help if future updates
  overwrite previous ones
• Caching recently read blocks can help if old data
  necessary to compute new parity are in cache
• Caching recently written parity can also help
   – Parity is computed over many logically
     consecutive blocks
• Floating Parity
   – Make parity updates cheaper by putting the parity
     in a rotationally nearby unallocated block
   – Requires directories for locations of nearby
     unallocated blocks
   – Should be implemented at controller level
• Parity Logging:
   – Defers cost of parity update by logging XOR
     of old data and new data
   – Replay log file later to update parity
   – Reduces update cost to two blocking writes
     (if we already have the old data block in RAM)
   – It works because nearly all storage systems
     have idle times.
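
A minimal sketch of the logging idea; class and method names are
illustrative, and a real system appends the log to disk rather than
keeping it in memory:

    class ParityLog:
        def __init__(self) -> None:
            self.log: list[tuple[int, bytes]] = []  # (parity block id, delta)

        def small_write(self, pid: int, old_b: bytes, new_b: bytes) -> None:
            # Log the XOR of old and new data instead of updating parity now.
            delta = bytes(o ^ n for o, n in zip(old_b, new_b))
            self.log.append((pid, delta))

        def replay(self, parity: dict[int, bytearray]) -> None:
            # During idle time, fold each logged delta into its parity block.
            for pid, delta in self.log:
                for i, byte in enumerate(delta):
                    parity[pid][i] ^= byte
            self.log.clear()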
          Declustered Parity (I)
• Addresses the issue of the high read cost when
  recovering from a failure
• Looking at the example:
   – A failure of disk 2 generates additional read
     requests to disks 0, 1 and 3 every time a read
     request is made for a block that was stored on
     disk 2
          Declustered Parity (II)

[Figure: declustered parity layout across overlapping groups]
          Declustered Parity (III)
• With declustered parity:
   – The same disk belongs to several groups
• Looking at the example:
   – Disk 2 is in groups (0, 1, 2, 3), (4, 5, 2, 3) and
     so on
   – Additional read requests caused by a failure
     of disk 2 are now spread among all remaining
     disks
         Declustered Parity (IV)
• Extra workload caused by the failure of a disk is
  now shared by all remaining disks
• Sole Disadvantage:
   – A failure of any two disks will now result in
     data loss
   – In a standard set of RAID arrays, the two failed
     disks had to be in the same array
  Exploiting On-Line Spare Disks
• Distributed Sparing:
   – No dedicated spare disk
   – Each disk has 1/(N+1) of its capacity reserved
• Parity Sparing:
   – Also spreads the spare space but uses it to
     store additional parity blocks
      • Can split groups into half groups
      • More …
        Distributed Sparing

[Figure: distributed sparing layout; S0, S1 and S2 represent spare blocks]
             CASE STUDIES
• TickerTAIP
• AutoRAID
   – See presentation
                TickerTAIP (I)
• Traditional RAID architectures have
   – A central RAID controller interfacing to the
     host and processing all I/O requests
   – Disk drives organized in strings
    – One disk controller per disk string
               TickerTAIP (II)
• Capabilities of RAID controller are crucial to the
  performance of RAID
   – Can become memory-bound
   – Presents a single point of failure
   – Can become a bottleneck
• Having a spare controller is an expensive solution
             TickerTAIP (III)
•  Uses a cooperating set of
   array controller nodes
• Major benefits are:
  – Fault-tolerance
  – Scalability
  – Smooth incremental growth
  – Flexibility: can mix and match components
                TickerTAIP (IV)

[Figure: a TickerTAIP array built from cooperating controller nodes]
             TickerTAIP (V)
A TickerTAIP array consists of:
• Worker nodes connected to one or more
    local disks through a bus
• Originator nodes interfacing with host
    computer clients
• A high-performance small area network:
   – Mesh based switching network (Datamesh)
   – PCI backplanes for small networks
              TickerTAIP (VI)
• Can combine or separate worker and originator
  roles
• Parity calculations are done in a decentralized
  fashion
   – Bottleneck is memory bandwidth, not the CPU
   – Cheaper than having faster paths to a
     dedicated parity engine
                  Conclusion
• RAID's original purpose was to take advantage of
  Winchester drives that were smaller and cheaper
  than conventional disk drives
   – Replace a single large drive by an array of
     smaller drives
• Nobody does that anymore!
• The main purpose of RAID today is to build fault-tolerant
  file systems that do not need backups
