It may seem that parity only complicates a RAID array, so why not just stick with something like RAID 0 or RAID 1 and leave parity out of the equation? For starters, RAID 0 provides no fault tolerance, so it's not suitable for high-availability environments. RAID 1 doesn't use parity, but it is very inefficient in its use of disk space: because the data is simply mirrored, a full 50 percent of the raw storage goes to the duplicate copy. Using parity and RAID 3, 4, or 5, you can create a highly available disk array that can tolerate the loss of one of its disks. The data can be rebuilt using the parity information stored in the array, and these RAID levels make much more efficient use of the available disk space.
What happens when parity goes bad?
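To make the rebuild claim concrete, here is a minimal Python sketch of how XOR parity lets an array recover a lost block. It assumes a simple single-parity layout in the style of RAID 3, 4, or 5; the function names (`xor_parity`, `rebuild_missing`, `usable_fraction`) and the sample blocks are made up for illustration and do not correspond to any controller's firmware or API.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must be the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single lost block from the survivors plus the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])


def usable_fraction(disks, parity_disks=1):
    """Fraction of raw capacity left for data after parity overhead."""
    return (disks - parity_disks) / disks


# One stripe spread across three data disks, plus its parity block.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(stripe)

# Simulate losing the second disk and rebuilding its block from the rest.
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
```

Because XOR is its own inverse, XORing the surviving blocks with the parity block yields the missing one, so `recovered` equals the block that was on the lost disk. The `usable_fraction` helper shows the space argument: a five-disk single-parity array keeps four-fifths of its raw capacity for data, against one-half for a two-disk mirror.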
With a single drive failure under any of RAID levels 1, 3, 4, 5, or 6, the failed drive can be replaced. The RAID array controller will automatically regenerate the data on the new drive, using the mirror copy under RAID 1 or the parity information from the other drives under the parity-based levels, and restore fault tolerance to the entire array.
Although RAID provides an extra level of protection in the event of drive failure, parity errors can still crop up. A parity error indicates that there is bad data on the drive. If the data cannot be corrected, you may need to load the data off to a backup tape. You know that the data cannot be corrected if you try to open a file or run an application that reads that particular portion of the disk, and the file will not open, or the application crashes or doesn't run at all. In many instances, an error message will notify you that there was a problem reading from the disk. Often, the problem becomes evident during the system backup, when all of the data on the disk is read in one sweep. In a RAID array, when a parity error is detected, the source data is reread to try to get it right.
With or without RAID, parity errors can be caused by a number of factors other than a failed disk. For example, parity errors may occur if the drive cables aren't properly connected or shielded, or if the wrong type of cable is being used to connect the disks to the controller. If you notice a significant number of parity errors, try swapping the cables and testing the controller card to make sure it hasn't gone bad. Also, check the SCSI terminators to see if one may have come loose. Most RAID controllers come with diagnostic programs that handle some of the troubleshooting, so be sure to make good use of these packages, too.
You should also investigate the physical connections to your SCSI devices to determine whether they are the source of the parity problems. First, make sure you're using the right SCSI cable.
Ram Electronics has pictures of many common SCSI connectors and the SCSI Trade Association (STA)-endorsed terms and specifications for each type of connector. Most internal SCSI cables are of the ribbon variety, with any number of individual wires running through the ribbon. If even one of those wires is exposed, shorting out, cut, or not fully attached to the connector on the end, it may create data transfer problems. Make sure the SCSI cable is properly connected to both the controller card and the drive, and that the pins on the devices line up with the pins on the SCSI connector.
Testing a controller card is a little more difficult. The easiest way is to use the diagnostic program that comes with many SCSI and RAID adapters. During system installation for certain servers, such as those from Dell and Compaq, utilities are written to a small partition on a disk array. Among these utilities are programs that test the array controller. You can run these programs at system boot time by pressing a key combination on the keyboard, which interrupts the boot process and runs the system utilities instead. Newer systems also include Windows-based array utilities that perform many of the same functions. Dell, for example, includes its Array Manager product for servers shipping with an array controller, which you can install with the rest of the system management suite.
A second controller testing method involves moving the controller to another machine and testing it with different hardware. This isn't the preferred method, because it could result in more downtime and assumes that you have spare hardware lying about for the test.
How does the parity become corrupted?
A number of issues could cause the corruption of parity on a disk, including:
- System crashes: When a system crashes, any data not written to the disk is lost. In the event that data was being written to a RAID array, it is possible that either the data or the parity was written to disk, but not both. In a situation such as this, you can't rely on the parity to reconstruct the data on the disk. Reducing the number of system crashes by making use of UPS units and redundant power supplies will help protect against this type of parity corruption.
- Uncorrectable bit errors: A hard disk in an array is nothing more than a bunch of magnetic bits that gradually lose the ability to hold data over time. Eventually, bit errors are detected when an attempt is made to read data back from the drive. Many RAID arrays use embedded software that monitors the individual disks and informs an operator when a disk may be about to fail. When I am informed of an impending disk failure, I generally run a diagnostic on the RAID array to make sure the controller is working properly and verify that the error message was correct. If the verification comes back with a problem, I either replace the RAID card -- which rarely happens -- or replace any drives that the diagnostics identified as bad.
- A disk failure: Like a system crash, a disk failure can have a negative impact on parity. Disks can fail for a variety of reasons: age, overuse, excessive powering up and down, or power surges. When a disk in an array fails, replace it immediately and run a diagnostic on the array. A single disk failure may indicate more failures to come.
- Other possible causes: If the array checks out okay and the cables have been tested, the power supply in the system may be delivering too much power to a disk in the array, causing parity problems. You can test for such an issue with a voltmeter -- but be careful, because electric shock is always a risk when working with a voltmeter on live circuits. First, disconnect the system from the power source and insert the probes of the voltmeter into the wall socket. Next, verify the output against the local standard (110 to 120 volts in North America). Once you plug the system back into the wall, you can disconnect the drive array from the power supply and use the voltmeter to test the individual power leads in the same way. Exact power specifications for the leads can be found in the system guide or on the manufacturer's Web site.
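The system-crash case in the list above can be illustrated with a short Python sketch. It assumes a single-parity stripe protected by byte-wise XOR; the helper names (`xor_blocks`, `parity_is_consistent`) and the sample data are hypothetical and are not taken from any real controller or diagnostic utility.

```python
def xor_blocks(blocks):
    """Return the byte-wise XOR of equal-length blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def parity_is_consistent(data_blocks, parity):
    """Consistency check: does the stored parity match the stored data?"""
    return xor_blocks(data_blocks) == parity


# A healthy stripe: three data blocks and the parity computed from them.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulated crash: new data reaches disk 0, but the parity update is lost.
data[0] = b"ZZZZ"
stale_parity = parity

# The check now reports a mismatch between data and parity.
consistent = parity_is_consistent(data, stale_parity)

# If disk 1 fails at this point, rebuilding it from the stale parity
# produces bytes that differ from what disk 1 actually held (b"BBBB").
rebuilt_disk1 = xor_blocks([data[0], data[2], stale_parity])
```

Here `consistent` is false, and `rebuilt_disk1` does not match the original contents of the second disk. That is why parity written out of step with its data cannot be trusted for a rebuild, and why running the array's verification or diagnostic utility after an unclean shutdown is worthwhile.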
Luckily, most of today's RAID and SCSI controllers are very good at making sure parity errors are not introduced onto the disk. If it does happen, however, follow the suggestions above to minimize the risk of data corruption and failure. If you aren't using a parity-enabled RAID scheme on a mission-critical system, do a cost/benefit analysis and get RAID installed. It will cost much less than a disk failure. An excellent discussion of RAID advantages and disadvantages can be found at Advanced Computer & Network Corporation's RAID.edu Web site.






