RAID by primusboy


RAID, Can it Fail? If it Does, is Data Recovery Possible?
Data Recovery - What is RAID? Can data be recovered when one fails?
Originally, as envisaged in 1987 by Patterson, Gibson and Katz of the
University of California, Berkeley, the acronym RAID stood for a
"Redundant Array of Inexpensive Disks". In short, a larger number of
smaller, cheaper disks could be used in place of a single much more
expensive large hard disk, or even to create a disk that was larger than
any then available.
They went a stage further and postulated a variety of options that would
not only result in getting a big disk for a lower cost, but could improve
performance or increase reliability at the same time. The options for
improved reliability were needed partly because using multiple disks
reduces the Mean Time Between Failures (MTBF): divide the MTBF for a
drive in the array by the number of drives and, theoretically, a RAID
without redundancy will fail more quickly than a single disk.
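As a rough illustration of that arithmetic, the sketch below divides a
single-drive MTBF by the number of drives. The function name and the
500,000 hour figure are assumptions chosen for the example, not values
from any data sheet, and the model assumes identical drives that fail
independently at a constant rate.

```python
def array_mtbf_hours(drive_mtbf_hours: float, drive_count: int) -> float:
    """Approximate MTBF of an array with no redundancy.

    Any single drive failure counts as an array failure, so the
    expected time to the first failure shrinks with the drive count.
    """
    if drive_count < 1:
        raise ValueError("an array needs at least one drive")
    return drive_mtbf_hours / drive_count


# Hypothetical single-drive MTBF, used only for illustration.
single_drive_mtbf = 500_000.0

for drives in (1, 4, 8):
    print(drives, "drives:", array_mtbf_hours(single_drive_mtbf, drives), "hours")
```

The redundant RAID levels exist to offset this effect: they let the array
survive the first failure rather than delaying it.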
Today RAID is usually described as a "Redundant Array of Independent
Disks", since technology has moved on and even the most costly disks are
not particularly expensive.
Six levels of RAID, numbered 0 to 5, are traditionally described, some
geared towards performance, others towards improved fault tolerance,
though the first of these does not have any redundancy or fault tolerance
so might not truly be considered RAID.
RAID 0 - Striped and not really "RAID"
RAID 0 provides capacity and speed but not redundancy. Data is striped
across the drives, with all of the performance benefits that gives, but
if one drive fails the RAID is dead, just as if a single hard disk drive
had failed.
This is good for transient storage where performance matters but the data
is either non-critical or a copy is also kept elsewhere. Other RAID
levels are more suited for critical systems where backups might not be
up-to-the-minute, or down-time is undesirable.
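To make the striping of RAID 0 concrete, the following is a minimal
sketch of how a logical block number might be mapped to a drive and a
block on that drive. The function name, the 4 drive array and the 128
block strip size are assumptions for illustration; real controllers
choose their own strip sizes and may reserve space for metadata.

```python
def locate_block(logical_block: int, disk_count: int, blocks_per_strip: int):
    """Return (disk index, block number on that disk) for a simple RAID 0."""
    # Which strip the block falls in, and where inside that strip.
    strip_number, offset = divmod(logical_block, blocks_per_strip)
    # Strips are dealt out to the disks in turn, one row at a time.
    stripe_row, disk = divmod(strip_number, disk_count)
    return disk, stripe_row * blocks_per_strip + offset


# Hypothetical array: 4 drives, 128 blocks per strip.
for block in (0, 127, 128, 130, 512):
    print(block, "->", locate_block(block, 4, 128))
```

Consecutive strips land on consecutive drives, which is where the speed
comes from, and also why losing one drive removes a slice from almost
every large file.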
RAID 1 - Mirroring
RAID 1 is often used for the boot devices in servers or for critical data
where reliability requirements are paramount. Usually 2 hard disk drives
are used and any data written to one disk is also written to the other.
In the event of a failure of one drive the system can switch to single
drive operation; the failed drive is then replaced and the data copied to
the replacement drive to rebuild the mirror.
RAID 2 - Error Correction Code
RAID 2 introduced error correction code generation to compensate for
drives that did not have their own error detection. There are no such
drives now, and have not been for a long time. RAID 2 is not really used.
RAID 3 - Dedicated Parity
RAID 3 uses striping, down to the byte level. This adds a hardware
overhead for no apparent benefit. It also introduces "parity", or error
correction data, held on a separate drive, so an additional hard disk is
needed that gives greater security but no additional space.
RAID 4 - Dedicated Parity
RAID 4 stripes to the block level, and like RAID 3 stores parity
information on a dedicated drive.
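Parity in RAID 3 and RAID 4 is normally a bytewise XOR of the data
strips. The sketch below assumes that scheme and uses made-up 4 byte
strips; the helper name xor_strips is hypothetical. It shows that any one
missing strip can be recomputed from the others plus the parity.

```python
from functools import reduce


def xor_strips(strips):
    """XOR a list of equal-length byte strings together, byte by byte."""
    if len({len(s) for s in strips}) != 1:
        raise ValueError("strips must be non-empty and of equal length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


# Three data strips, as if held on three data drives.
data = [b"AAAA", b"BBBB", b"CCCC"]

# The strip that would be written to the dedicated parity drive.
parity = xor_strips(data)

# Pretend the second drive has failed: XOR the survivors with the parity.
rebuilt = xor_strips([data[0], data[2], parity])
print(rebuilt == data[1])
```

Because XOR is its own inverse, the same routine both generates the
parity and regenerates a lost strip.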
RAID 5 - The most common format
RAID 5 stripes at the block level but does not use a single dedicated
drive for storing parity. Instead, parity is interspersed with the data:
each stripe holds one strip of parity alongside its data strips, and the
disk holding the parity changes from one stripe to the next.
This could mean, for example, that in a 3 disk RAID 5 there are data
strips on disks 0 and 1 followed by a parity strip on disk 2. For the
next stripe the data is on disks 0 and 2 with the parity on disk 1, then
data on disks 1 and 2 with parity on disk 0.
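The rotation in that 3 disk example can be sketched as below. The formula
is an assumption that reproduces the example in the text; controllers
differ in the direction of rotation and in how data strips are ordered
around the parity, so this is one possible layout rather than the layout.

```python
def parity_disk(stripe: int, disk_count: int) -> int:
    """Disk holding parity for a stripe, rotating from the last disk down."""
    return (disk_count - 1 - stripe % disk_count) % disk_count


def stripe_layout(stripe: int, disk_count: int):
    """Return 'P' for the parity disk and 'D' for data disks in one stripe."""
    p = parity_disk(stripe, disk_count)
    return ["P" if disk == p else "D" for disk in range(disk_count)]


for stripe in range(3):
    print("stripe", stripe, stripe_layout(stripe, 3))
```

For three disks this places parity on disk 2, then disk 1, then disk 0,
and the pattern repeats from the fourth stripe onwards.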
RAID 5 is generally faster for smaller reads, so it is eminently suitable
for server systems shared by large numbers of users creating smaller data
files or accessing smaller amounts of data each time. For other
applications, however, RAID 4 will outperform RAID 5 quite considerably.
Beyond RAID 5?
Advances on RAID 5 do exist, though in general these use RAID 5
techniques and enhance them, for example by mirroring two RAID 5 arrays,
or by having 2 parity strips in each stripe.
RAID data recovery
It might be imagined that, with all of this fault tolerance, data
recovery would not be a requirement, but things will still go wrong.
With all RAID levels logical corruption, that is damage to the file
system, has just as devastating an effect as with a single hard disk. You
might have a robustly stored file system, but it is a robustly stored and
corrupted file system.
With RAID 0 the failure of one disk is terminal for the RAID. If data
cannot be recovered from the failed disk then a percentage of the data is
lost for good, and since RAID 0 uses data striping this could be like
losing 1 MB of data out of every 4 MB, so the chances of that leaving any
major files intact are low. Smaller files, those less than the sum of a
strip from each of the working drives, may fortunately be intact; for
larger files (e.g. Exchange or SQL databases) there will be considerable
data loss and structural damage, and low level work will be required to
salvage any useful data from them.
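The effect of file size can be sketched as follows. The code assumes each
file occupies one contiguous run of bytes on the array, which real file
systems do not guarantee, and the 4 drive array with 64 KiB strips and
the function names are hypothetical.

```python
def disks_touched(start: int, length: int, strip_size: int, disk_count: int):
    """Set of disk indexes holding any part of a contiguous byte range."""
    if length <= 0:
        return set()
    first_strip = start // strip_size
    last_strip = (start + length - 1) // strip_size
    # A range spanning disk_count strips or more touches every disk.
    span = min(last_strip - first_strip + 1, disk_count)
    return {(first_strip + i) % disk_count for i in range(span)}


def file_survives(start, length, strip_size, disk_count, failed_disk):
    """True if no part of the file was stored on the failed disk."""
    return failed_disk not in disks_touched(start, length, strip_size, disk_count)


KIB = 1024
# Hypothetical array: 4 drives, 64 KiB strips, drive 2 has failed.
print(file_survives(0, 100 * KIB, 64 * KIB, 4, 2))        # small file on disks 0 and 1
print(file_survives(128 * KIB, 10 * KIB, 64 * KIB, 4, 2))  # small file on disk 2
print(file_survives(0, 1024 * KIB, 64 * KIB, 4, 2))        # large file on all disks
```

Only files that happen to sit entirely on the working drives escape, and
no file larger than the strips of the working drives combined can do so.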
For RAID levels where there is parity and the chance to recover from a
single disk failure, the most common problems we see are:
Degraded running
A single disk fails and is ignored, or there is not a spare available and
so one is ordered. Either way the RAID unit stays in operation but with a
disk missing so there is no longer any redundancy.
Usually the hard disks in a RAID are part of the same manufacturing batch
and have been stored and run in the same environment; if the unit has
been mis-handled then each disk in the RAID has been mis-handled. So
there is quite a good chance that another drive will fail sometime soon,
if not for any of the reasons just given then because bad things don't
happen singly.
Multiple failure
Striped RAID with parity is fault tolerant if a single drive fails nice
and cleanly. If multiple drives fail then the RAID is lost, and the same
is true if one drive fails and de-stabilises the SCSI bus. This can
result in multiple drives appearing to fail; the RAID unit believes that
they have failed, and so the RAID will not operate.
Configuration loss
When a RAID is configured, information is stored about the order of the
disks, the size of a strip of data and so on. If there is a failure
within the RAID controller and this information is lost then the RAID
will not operate, and it is not always practicable to re-instate it.
Some RAID controllers will treat re-programming the RAID configuration as
a rebuild request and re-write to each of the disks, destroying the data
on them.
People making it worse
One of the worst sounds we hear with RAID problems is that of human
panic, and frantic attempts to repair the problem. "We're just going to
try one more thing" is often the sound that signals the end of the data
as a RAID is repaired with the disks in the wrong slots, or rebuilt and
set back to its original state.
What to do when a RAID fails
Make sure that anything you do is going to be non-destructive.
Get Advice
Do not let anyone push you into precipitous action. They might have a
deadline and be applying pressure, but they will quickly forget their
part in driving proceedings when the RAID is fatally damaged by a hurried
repair attempt.
How can data be recovered from a RAID?
Much of RAID recovery is the same as for a single disk recovery: data
must be secured and backed up to guarantee that the problem will not be
exacerbated. For logical problems the difficult work is all in the
analysis of the file system; that it is from a RAID makes no major
difference once the RAID scheme has been identified and the correct
access to it worked out.
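Securing the data usually means taking a sector by sector copy of each
disk and working only on the copies. The sketch below shows the general
idea under simplifying assumptions: read_sector is a hypothetical
callable that raises OSError for an unreadable sector, the whole image is
held in memory, and the flaky reader merely simulates a bad sector.

```python
def image_sectors(read_sector, sector_count, sector_size=512):
    """Copy every readable sector; return (image bytes, bad sector numbers)."""
    image = bytearray()
    bad = []
    for n in range(sector_count):
        try:
            data = read_sector(n)
        except OSError:
            # Keep offsets aligned with a zero-filled placeholder and
            # remember the sector so it can be repaired later.
            data = bytes(sector_size)
            bad.append(n)
        image += data
    return bytes(image), bad


def flaky_reader(n):
    """Simulated source disk on which sector 2 cannot be read."""
    if n == 2:
        raise OSError("simulated unreadable sector")
    return bytes([n]) * 512


image, bad = image_sectors(flaky_reader, 4)
print(len(image), bad)
```

The list of bad sectors matters as much as the image itself, since it
tells the later stages which data has to come from a mirror or be rebuilt
from parity.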
For mirrored RAID, data can be "mixed and matched" from the good sectors
of two drives to rebuild a good drive. With striped RAID schemes that use
parity, data can be rebuilt at the stripe level rather than on a per
drive basis, so if there are bad sectors throughout more than one drive
these can be corrected individually.
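For the mirrored case, the "mix and match" might look like the sketch
below. It assumes each drive has already been copied into a list of
sectors together with a set of sector numbers that could not be read; the
function name and the sample data are hypothetical.

```python
def merge_mirrors(copy_a, copy_b, bad_a, bad_b):
    """Build one good copy from two partly unreadable mirror copies.

    Returns (merged sectors, sector numbers unreadable on both drives).
    """
    merged = []
    lost = []
    for n, (a, b) in enumerate(zip(copy_a, copy_b)):
        if n not in bad_a:
            merged.append(a)      # first drive read this sector cleanly
        elif n not in bad_b:
            merged.append(b)      # fall back to the mirror
        else:
            merged.append(None)   # neither drive could supply it
            lost.append(n)
    return merged, lost


copy_a = [b"s0", None, b"s2", None]
copy_b = [b"s0", b"s1", None, None]
merged, lost = merge_mirrors(copy_a, copy_b, bad_a={1, 3}, bad_b={2, 3})
print(merged, lost)
```

Data is only lost where the same sector is unreadable on both drives.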
With non-redundant RAID schemes each sector that cannot be read from a
disk results in data loss from the RAID set. For redundant RAID schemes,
however, there is much that can be done to rebuild when data is missing.
Whilst a RAID controller will take a disk off-line when it fails and
operate in degraded mode, rebuilding the data from the missing disk on
demand, a data recovery process can be somewhat more sophisticated. With
properly written recovery software the level of granularity can be one
sector rather than one disk, so for each sector that fails the data can
be rebuilt so long as the same sector can be recovered from the remainder
of the disks. Even if the next failed sector is on a different drive in
the set, so long as the same sector can be read from the other disks then
a complete rebuild can be made.
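A minimal sketch of that per-sector approach for a single parity scheme
follows. It assumes XOR parity and that each disk has been read into a
list of sectors with None marking an unreadable one; the function names
and the 2 byte sectors are illustrative only. Because data and parity in
a row always XOR to zero, the code does not need to know which disk holds
the parity for a given row.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_sectors(disks):
    """Repair sectors missing from exactly one disk in their row.

    Returns (repaired disks, sector numbers missing from two or more disks).
    """
    repaired = [list(sectors) for sectors in disks]
    unrecoverable = []
    for n in range(len(disks[0])):
        missing = [d for d, sectors in enumerate(disks) if sectors[n] is None]
        if len(missing) == 1:
            survivors = [s[n] for s in disks if s[n] is not None]
            repaired[missing[0]][n] = xor_blocks(survivors)
        elif len(missing) > 1:
            unrecoverable.append(n)
    return repaired, unrecoverable


# Hypothetical 3 disk set; the third strip of each row is the XOR of the rest.
d0 = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]
d1 = [b"\x04\x08", b"\x40\x80", b"\x0c\x0d"]
d2 = [xor_blocks([a, b]) for a, b in zip(d0, d1)]

damaged = [list(d0), list(d1), list(d2)]
damaged[0][0] = None   # sector 0 bad on disk 0
damaged[1][1] = None   # sector 1 bad on disk 1
damaged[0][2] = None   # sector 2 bad on disks 0 and 2
damaged[2][2] = None

repaired, unrecoverable = rebuild_sectors(damaged)
print(repaired[0][0] == d0[0], repaired[1][1] == d1[1], unrecoverable)
```

A controller working disk by disk would have given up here, since two
different drives have bad sectors, yet only the one row with two failures
is actually beyond repair.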
For levels of RAID that have greater redundancy, the number of failed
sectors across a set of disks can be even greater without data loss.
Even as data recovery specialists we are, however, still bound by the
rules of mathematics. If sector 99 is missing from both disks 0 and 4 in
a RAID 5 set then rebuilding of the missing data is not a possibility.
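That limit can be expressed as a simple count: a row is recoverable only
while the number of disks missing that sector does not exceed the number
of parity strips per stripe. The sketch below assumes the same sector
number on every disk belongs to the same row; the function name and the
sample bad-sector sets are hypothetical.

```python
from collections import Counter


def unrecoverable_sectors(bad_by_disk, parity_strips=1):
    """Sector numbers missing from more disks than the parity can cover."""
    counts = Counter(n for bad in bad_by_disk for n in bad)
    return sorted(n for n, count in counts.items() if count > parity_strips)


# Five disk set: sector 99 is bad on disks 0 and 4, sector 12 on disk 2 only.
bad_by_disk = [{99}, set(), {12}, set(), {99}]

print(unrecoverable_sectors(bad_by_disk, parity_strips=1))  # single parity
print(unrecoverable_sectors(bad_by_disk, parity_strips=2))  # dual parity
```

With single parity sector 99 cannot be rebuilt, whereas a scheme with two
parity strips per stripe could still recover it.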
Once the RAID/disk issues have been resolved then the data recovery
process can continue just as it would for a single disk.
The author has been working as a data recovery engineer and software
developer for the past 25 years in the UK, Germany and the US and has now
started his own data recovery business aimed at providing a technical
rather than sales-led service.