RAID Survey


       P. M. Chen, U. Michigan
       E. K. Lee, DEC SRC
       G. A. Gibson, CMU
       R. H. Katz, U. C. Berkeley
       D. A. Patterson, U. C. Berkeley
• The seven RAID organizations
• Why RAID-1, RAID-3 and RAID-5 are the most
• The small write problem occurring with RAID-5
   – Possible solutions
• Review of actual implementations
            Original Motivation
• Replacing large and expensive mainframe hard
  drives (IBM 3310) by several cheaper
  Winchester disk drives
• Will work but introduces a data reliability problem:
   – Assume MTTF of a disk drive is 30,000 hours
   – MTTDL for a set of n drives is 30,000/n
      • n = 10 means MTTDL of 3,000 hours
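The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes independent drive failures and no redundancy, so the first failure loses data; the function name `mttdl_no_redundancy` is illustrative and does not come from the survey.

```python
def mttdl_no_redundancy(disk_mttf_hours, n_drives):
    """MTTDL of n drives with no redundancy: any single failure loses data."""
    if n_drives < 1:
        raise ValueError("need at least one drive")
    return disk_mttf_hours / n_drives


# MTTF of 30,000 hours per drive, as assumed on the slide.
for n in (1, 10, 100):
    hours = mttdl_no_redundancy(30_000, n)
    print(f"n = {n:3d}: MTTDL = {hours:8.0f} hours ({hours / 24:.0f} days)")
```

With n = 10 the array is expected to lose data after about 125 days, which is why redundancy is needed.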
          Today’s Motivation
• “Cheap” SCSI hard drives are now big enough
  for most applications
• We use RAID today for
   – Increasing disk throughput by allowing parallel
   – Eliminating the need to make disk backups
      • Disk drives are too big to be backed up in
        an efficient fashion
                   RAID 0
• Spread data over multiple disk drives
• Advantage
   – Simple to implement
   – Fast
• Disadvantage
   – Very unreliable
      • RAID 0 with n disks has MTTF equal to 1/n
        of MTTF of a single disk
                     RAID 1
• Mirroring
   – Two copies of each disk block on
     two separate drives
• Advantages
   – Simple to implement and fault-tolerant
• Disadvantage
   – Requires twice the disk capacity of normal file
                    RAID 2
• Instead of duplicating the data blocks we use an
  error correction code
• Very bad idea because disk drives either work
  correctly or do not work at all
   – Only possible errors are omission errors
   – We need an omission correction code
      • A parity bit is enough to correct a single
[Figure: RAID 2, data disks and error correction disks]

                    RAID 3
• Requires N+1 disk drives
  – N drives contain data
      • 1/N of each data block on each drive
      • Block b[k] now partitioned into N fragments
        b[k,1], b[k,2], ... b[k,N]
  – Parity drive contains exclusive or of these N fragments
            p[k] = b[k,1] ⊕ b[k,2] ⊕ ... ⊕ b[k,N]
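The parity rule can be sketched in Python as follows. This is only an illustration: the helper names (`xor_bytes`, `parity`, `split_block`) and the 16-byte block are assumptions made for the example, not part of the survey. It also shows why one parity fragment is enough: XORing the surviving fragments with the parity yields the missing fragment.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise exclusive or of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def parity(fragments):
    """p[k] = b[k,1] xor b[k,2] xor ... xor b[k,N]."""
    return reduce(xor_bytes, fragments)


def split_block(block, n):
    """Partition a block into n equal-size fragments, one per data drive."""
    if len(block) % n:
        raise ValueError("block length must be a multiple of n")
    size = len(block) // n
    return [block[i * size:(i + 1) * size] for i in range(n)]


fragments = split_block(bytes(range(16)), 4)  # N = 4 data drives
p = parity(fragments)

# Drive 3 (index 2) fails: rebuild its fragment from the survivors and parity.
survivors = fragments[:2] + fragments[3:]
rebuilt = parity(survivors + [p])
print(rebuilt == fragments[2])
```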
[Figure: RAID 2 (data disks, error correction disks) and RAID 3 (data disks, parity disk)]

A stripe consists of a single block
                    RAID 4
• Requires N+1 disk drives
  – N drives contain data (individual blocks)
  – parity drive contains exclusive or of the
    N blocks in stripe
         p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
[Figure: RAID 4, data disks and parity disk; chart labels: 25% parity, 75% data]

A stripe now contains multiple blocks
                     RAID 5
• Single parity drive of RAID-4 is involved in every
   – Will limit parallelism
• RAID-5 distributes the parity blocks among the
  N+1 drives
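One way to picture the distribution is a simple rotation of the parity block from stripe to stripe. The sketch below assumes one particular rotation, chosen only for illustration; actual arrays use several different placements, and the function names are hypothetical.

```python
def parity_disk(stripe, n_disks):
    """Disk holding the parity block of a stripe, under an assumed rotation."""
    return (n_disks - 1 - stripe) % n_disks


def layout(n_stripes, n_disks):
    """One row per stripe: 'P' marks the parity block, 'D' a data block."""
    rows = []
    for stripe in range(n_stripes):
        p = parity_disk(stripe, n_disks)
        rows.append(["P" if disk == p else "D" for disk in range(n_disks)])
    return rows


# N + 1 = 4 drives: every drive holds parity for one stripe out of four,
# so no single drive is involved in every parity update.
for row in layout(4, 4):
    print(" ".join(row))
```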
[Figure: RAID 5; labels: data disks, parity disk; 25% parity, 75% data]
       The small write problem
• Specific to RAID 5
• Happens when we want to update a single block
   – Block belongs to a stripe
   – How can we compute the new value of the
     parity block

      b[k]    b[k+1]    b[k+2]   ...    p[k]
                First solution
• Read values of N-1 other blocks in stripe
• Recompute
   p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]

• Solution requires
   – N-1 reads
   – 2 writes (new block and parity block)
             Second solution
• Assume we want to update block b[m]
• Read old values of b[m] and parity block p[k]
• Compute
   p[k] = new b[m] ⊕ old b[m] ⊕ old p[k]

• Solution requires
   – 2 reads (old values of block and parity block)
   – 2 writes (new block and parity block)
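The two ways of recomputing parity for a small write can be compared in a short Python sketch. It assumes a toy stripe of five 4-byte blocks held in memory; the function names are illustrative and not taken from the survey. Both approaches must produce the same parity block, and they differ only in how many blocks have to be read first.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise exclusive or of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_by_rereading(new_block, other_blocks):
    """Recompute parity from the new block and the N-1 other blocks (N-1 reads)."""
    return reduce(xor_bytes, other_blocks, new_block)


def parity_by_delta(new_block, old_block, old_parity):
    """p[k] = new b[m] xor old b[m] xor old p[k] (2 reads)."""
    return xor_bytes(xor_bytes(new_block, old_block), old_parity)


# Toy stripe with N = 5 data blocks.
stripe = [bytes([i] * 4) for i in (1, 2, 3, 4, 5)]
old_parity = reduce(xor_bytes, stripe)

m = 2                      # index of the block being updated
new_block = b"\xff" * 4
others = stripe[:m] + stripe[m + 1:]

p_reread = parity_by_rereading(new_block, others)
p_delta = parity_by_delta(new_block, stripe[m], old_parity)
print(p_reread == p_delta)
```

The delta form needs fewer reads whenever N - 1 > 2, that is, for stripes with more than three data blocks.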
                   RAID 6
• Each stripe has two redundant blocks:
   – P + Q redundancy
• Advantage
   – Much higher reliability
• Disadvantage:
   – Costlier updates
• Focus on system throughput
• Measure it against system cost expressed in
  number of disk drives
     Throughputs per dollar
           Small   Small             Large     Large
           read    write             read      write
RAID 0     1       1                 1         1
RAID 1     1       ½                 1         ½
RAID 3     1/G     1/G               (G-1)/G   (G-1)/G
RAID 5     1       max(1/G, 1/4)     1         (G-1)/G
RAID 6     1       max(1/G, 1/6)     1         (G-2)/G
• Performance per dollar of RAID 3 is always less
  than or equal to that of a RAID 5 system
• For small writes,
   – RAID 3, 5 and 6 are equally cost-effective at
     small group sizes
   – RAID 5 and 6 are better for large group sizes
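The entries of the throughput-per-dollar table can be evaluated for concrete group sizes G to check these observations. The sketch below simply transcribes the table's formulas; the function name `throughput_per_dollar` and the sample group sizes are assumptions made for illustration.

```python
from fractions import Fraction


def throughput_per_dollar(g):
    """Table rows for group size g, relative to RAID 0.

    Columns: small read, small write, large read, large write.
    """
    one = Fraction(1)
    half = Fraction(1, 2)
    return {
        "RAID 0": (one, one, one, one),
        "RAID 1": (one, half, one, half),
        "RAID 3": (Fraction(1, g), Fraction(1, g),
                   Fraction(g - 1, g), Fraction(g - 1, g)),
        "RAID 5": (one, max(Fraction(1, g), Fraction(1, 4)),
                   one, Fraction(g - 1, g)),
        "RAID 6": (one, max(Fraction(1, g), Fraction(1, 6)),
                   one, Fraction(g - 2, g)),
    }


for g in (4, 10, 20):
    print(f"G = {g}")
    for level, row in throughput_per_dollar(g).items():
        print(f"  {level}: " + "  ".join(f"{str(v):>5}" for v in row))
```

For G = 4 the small-write column is 1/4 for RAID 3, 5 and 6 alike; for G = 20 it is 1/20, 1/4 and 1/6 respectively.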
• Theoretical reliability is very high
   – Especially for RAID 6
• In practice,
   – System crashes can cause
     parity inconsistencies
   – Uncorrectable bit errors can happen during
     repair times (one in 10^14 bits)
   – Correlated disk failures happen!
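To get a feel for the bit-error figure, the sketch below estimates the chance of hitting at least one uncorrectable bit error while reading all surviving disks during a repair. It assumes independent bit errors at a rate of one per 10^14 bits read; the disk capacity and group size used in the example are hypothetical values, not numbers from the survey.

```python
import math


def p_bit_error_during_rebuild(disk_bytes, surviving_disks, ber=1e-14):
    """Probability of at least one uncorrectable bit error during a rebuild.

    Computes 1 - (1 - ber) ** bits_read in a numerically stable way.
    """
    bits_read = 8 * disk_bytes * surviving_disks
    return -math.expm1(bits_read * math.log1p(-ber))


# Hypothetical example: 4 surviving disks, for several disk capacities.
for capacity in (10**9, 10**11, 10**12):
    p = p_bit_error_during_rebuild(capacity, surviving_disks=4)
    print(f"{capacity:>15,d} bytes per disk: P(error) = {p:.6f}")
```

The probability grows with the amount of data that has to be read, so larger disks and larger groups make this failure mode more likely.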
 Impact of parity inconsistencies
• Happen when system crashes during an update
   – New data were written but parity block was
     not updated
• Has little impact on RAID 3 (bad block)
• Significant impact on RAID 5
• Bigger impact on RAID 6
   – Same as simultaneous failures of both P and Q
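A small Python sketch can illustrate why an inconsistent parity block is dangerous. It assumes a toy stripe of three 2-byte data blocks plus parity; all names and values are illustrative. The crash leaves new data on disk with stale parity, and a later rebuild of a different block then returns wrong contents without any error being reported.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise exclusive or of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(blocks):
    """Exclusive or of all blocks in the list."""
    return reduce(xor_bytes, blocks)


# Consistent stripe: three data blocks and their parity.
data = [b"\x01\x01", b"\x02\x02", b"\x03\x03"]
stored_parity = parity(data)

# Crash during an update: the new data block reaches the disk,
# but the parity block is never rewritten.
data[0] = b"\x0f\x0f"

# Later, the disk holding data[1] fails and is rebuilt from the others.
rebuilt = parity([data[0], data[2], stored_parity])
print(rebuilt == b"\x02\x02")  # the rebuilt block does not match the lost one
```

Note that the damaged block is one that was not even being written when the crash happened.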
• System crashes and unrecoverable bit errors
  have biggest effect on MTTDL
• P + Q redundant disks protect against correlated
  disk failures and unrecoverable bit errors
   – Still vulnerable to system crashes
   – Should use NVRAM for write buffers
• Must prevent users from reading corrupted data
  from a failed disk
   – Mark blocks located on the failed disk
   – Mark reconstructed blocks valid
• To avoid regenerating all parity blocks after a crash
   – Must keep track of parity consistency and
     store it in stable storage
• Maintaining consistent/inconsistent state
  information for all parity blocks is a problem for
  software RAID systems
   – Rarely have NVRAM
• If updates are local, keep track in stable storage
  of a small number of parity blocks that could be inconsistent
• Otherwise use group commits
• Asynchronous writes can help if future updates
  overwrite previous ones
• Caching recently read blocks can help if old data
  necessary to compute new parity are in cache
• Caching recently written parity can also help
   – Parity is computed over many logically
     consecutive blocks
• Floating Parity
   – Make parity update cheaper, by putting parity
     in a rotationally-nearby unallocated block
   – Requires directories for locations of nearby
     unallocated blocks
   – Should be implemented at controller level
• Parity Logging :
   – Defers cost of parity update by logging XOR
     of old data and new data
   – Replay log file later to update parity
   – Reduces update cost to two blocking writes
     (if we have the old data block in RAM)
   – It works because nearly all storage systems
     have idle times.
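The parity logging idea can be sketched as a toy single-stripe model in Python. This is an assumption-laden illustration, not the published design: the class name is hypothetical, an in-memory list stands in for the on-disk log, and the old data block is assumed to be available in RAM at write time.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise exclusive or of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


class ParityLoggingStripe:
    """Toy stripe: data blocks, one parity block, and a log of parity deltas."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.parity = reduce(xor_bytes, self.blocks)
        self.log = []  # stands in for the on-disk parity log

    def write(self, index, new_block):
        """Write the data block and log (old xor new); parity is left untouched."""
        old_block = self.blocks[index]  # assumed to be cached in RAM
        self.log.append(xor_bytes(old_block, new_block))
        self.blocks[index] = new_block

    def replay_log(self):
        """Apply the logged deltas to the parity block, e.g. during idle time."""
        for delta in self.log:
            self.parity = xor_bytes(self.parity, delta)
        self.log.clear()

    def parity_is_consistent(self):
        return self.parity == reduce(xor_bytes, self.blocks)


stripe = ParityLoggingStripe([b"\x01", b"\x02", b"\x03"])
stripe.write(0, b"\x10")
stripe.write(2, b"\x20")
before = stripe.parity_is_consistent()  # parity is stale until the log is replayed
stripe.replay_log()
after = stripe.parity_is_consistent()
print(before, after)
```

Each update costs one data write and one log append; the parity read and write are postponed until the log is replayed.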
          Declustered Parity (I)
• Addresses issue of high read cost when
  recovering from a failure
• Looking at example:
   – A failure of disk 2 generates additional read
     requests to disks 0, 1 and 3 every time a read
     request is made for a block that was stored on
     disk 2
Declustered Parity (II)
          Declustered Parity (III)
• With declustered parity:
   – Same disk belongs to different groups
• Looking at example:
   – Disk 2 is in groups (0, 1, 2, 3), (4, 5, 2, 3) and
     so on
   – Additional read requests caused by a failure
     of disk 2 are now spread among all remaining disks
         Declustered Parity (IV)
• Extra workload caused by the failure of a disk is
  now shared by all remaining disks
• Sole Disadvantage:
   – A failure of any two disks will now result in
     data loss
   – In a standard set of RAID arrays, the two failed
     disks had to be in the same array
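The trade-off can be made concrete with a small Python sketch comparing a standard set of arrays with a declustered layout. It assumes 8 disks and groups of 4, and it builds the declustered groups as all 4-disk subsets, which is only one possible construction chosen for simplicity; the function names are illustrative.

```python
from itertools import combinations


def standard_groups(n_disks, g):
    """Disjoint parity groups: disks 0..g-1, g..2g-1, and so on."""
    if n_disks % g:
        raise ValueError("n_disks must be a multiple of g")
    return [tuple(range(start, start + g)) for start in range(0, n_disks, g)]


def declustered_groups(n_disks, g):
    """Assumed declustered layout: every g-disk subset forms a parity group."""
    return list(combinations(range(n_disks), g))


def disks_helping(groups, failed):
    """Surviving disks that share at least one parity group with the failed disk."""
    return sorted({d for grp in groups if failed in grp for d in grp if d != failed})


def fatal_pairs(groups):
    """Pairs of disks whose simultaneous failure loses data (they share a group)."""
    return {pair for grp in groups for pair in combinations(sorted(grp), 2)}


for name, groups in (("standard", standard_groups(8, 4)),
                     ("declustered", declustered_groups(8, 4))):
    print(name)
    print("  disks helping after failure of disk 2:", disks_helping(groups, 2))
    print("  fatal disk pairs:", len(fatal_pairs(groups)))
```

In the standard layout only the three other disks of the same array absorb the extra reads and 12 of the 28 disk pairs are fatal; in the declustered layout all seven survivors share the load, but every pair is fatal.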
  Exploiting On-Line Spare Disks
• Distributed Sparing:
   – No dedicated spare disk
   – Each disk has 1/(N+1) of its capacity reserved
• Parity Sparing:
   – Also spreads the spare space but uses it to
     store additional parity blocks
      • Can split groups into half groups
      • More …
       Distributed Sparing

S0, S1 and S2 represent spare blocks
             CASE STUDIES
• TickerTAIP
• AutoRAID
   – See presentation
                TickerTAIP (I)
• Traditional RAID architectures have
   – A central RAID controller interfacing to the
     host and processing all I/O requests
   – Disk drives organized in strings
   – One disk controller per disk string (mostly
               TickerTAIP (II)
• Capabilities of RAID controller are crucial to the
  performance of RAID
   – Can become memory-bound
   – Presents a single point of failure
   – Can become a bottleneck
• Having a spare controller is an expensive
             TickerTAIP (III)
•  Uses a cooperating set of
   array controller nodes
• Major benefits are:
  – Fault-tolerance
  – Scalability
  – Smooth incremental growth
  – Flexibility: can mix and match components
                TickerTAIP (IV)


                Controller nodes
            TickerTAIP (V)
A TickerTAIP array consists of:
• Worker nodes connected with one or more
    local disks through a bus
• Originator nodes interfacing with host
    computer clients
• A high-performance small area network:
   – Mesh-based switching network (Datamesh)
   – PCI backplanes for small networks
             TickerTAIP (VI)
• Can combine or separate worker and originator nodes
• Parity calculations are done in decentralized
   – Bottleneck is memory bandwidth not CPU
   – Cheaper than having faster paths to a
     dedicated parity engine
• RAID's original purpose was to take advantage of
  Winchester drives that were smaller and cheaper
  than conventional disk drives
   – Replace a single drive by an array of smaller drives
• Nobody does that anymore!
• Main purpose of RAID is to build fault-tolerant
  file systems that do not need backups