Reliability
The reliability of the hardware and software can also be verified from customer references and industry analysts. Beyond that, you should consider performing what I call an empirical component reliability analysis. This requires the following steps:
- Review and analyze problem management logs.
- Review and analyze supplier logs.
- Acquire feedback from operations personnel.
- Acquire feedback from support personnel.
- Acquire feedback from supplier repair personnel.
- Compare experiences with other shops.
- Study reports from industry analysts.
An analysis of problem logs should reveal any unusual patterns of failure. You should study them by supplier, product, using department, time and day of failures, frequency of failures, and time to repair. Suppliers often keep on-site repair logs you can use to conduct a similar analysis.
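As a rough illustration of this kind of log analysis, the sketch below groups failure records by a chosen attribute and reports the failure count and mean time to repair for each group. The record layout, the supplier and product names, and the sample entries are all hypothetical; a real problem management log would need its own parsing step.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Failure:
    """One problem-log entry (hypothetical layout)."""
    supplier: str
    product: str
    department: str
    failed_at: datetime
    repaired_at: datetime


def summarize(failures, key):
    """Group failures by key(failure); report count and mean hours to repair."""
    hours_by_group = defaultdict(list)
    for f in failures:
        hours = (f.repaired_at - f.failed_at).total_seconds() / 3600
        hours_by_group[key(f)].append(hours)
    return {
        group: {
            "failures": len(hours),
            "mean_hours_to_repair": sum(hours) / len(hours),
        }
        for group, hours in hours_by_group.items()
    }


# Made-up sample entries, for illustration only.
LOG = [
    Failure("Acme", "router-x", "finance",
            datetime(2024, 1, 8, 7, 30), datetime(2024, 1, 8, 8, 0)),
    Failure("Acme", "router-x", "finance",
            datetime(2024, 1, 15, 7, 45), datetime(2024, 1, 15, 8, 45)),
    Failure("Globex", "disk-array", "sales",
            datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 18, 0)),
]

by_supplier = summarize(LOG, lambda f: f.supplier)
# Key on (weekday, hour); weekday() returns 0 for Monday.
by_weekday_hour = summarize(LOG, lambda f: (f.failed_at.weekday(), f.failed_at.hour))
```

Passing a different key function, such as `lambda f: f.product` or `lambda f: f.department`, gives the other breakdowns without changing the summary logic. A cluster of failures in one weekday and hour slot is the sort of unusual pattern worth investigating further.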
You'll find that feedback from operations personnel can often be candid and revealing as to how components are truly performing. This can especially be the case for off-site operators. For example, they may be doing numerous resets on a particular network component every morning prior to start-up, but they may not bother to log it since it always comes up. Conversations with support personnel such as systems administrators, network administrators, and database administrators may elicit similar revelations.
You might think that feedback from suppliers' repair personnel would be biased, but in my experience they can be just as candid and revealing about the true reliability of their products as the people using them. This makes them another valuable source of information for evaluating component reliability, as is comparing experiences with other shops. Shops that are closely aligned with your own in terms of platforms, configurations, services offered, and customers can be especially helpful. Reports from reputable industry analysts can also be used to predict component reliability.
Repairability
Repairability is the relative ease with which service technicians can repair or replace failing components. Two common metrics used to evaluate this trait are how long it takes to do the actual repair and how often the repair work needs to be repeated. In more sophisticated systems, repairs can be handled from remote diagnostic centers, where failures are detected and circumvented and arrangements are made for permanent resolution with little or no involvement of operations personnel.
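The second metric, how often repair work has to be repeated, can be estimated from a simple repair history. The sketch below counts a repair as a repeat when the same component was already repaired within a fixed window; the 30-day window, the component names, and the dates are assumptions chosen for illustration, not a standard definition.

```python
from datetime import date, timedelta


def repeat_repair_rate(repairs, window_days=30):
    """Fraction of repairs that follow an earlier repair of the same
    component within window_days. repairs is a list of (component, date)."""
    window = timedelta(days=window_days)
    last_repair = {}
    repeats = 0
    # Process in date order so each repair is compared with the previous one.
    for component, day in sorted(repairs, key=lambda r: r[1]):
        previous = last_repair.get(component)
        if previous is not None and day - previous <= window:
            repeats += 1
        last_repair[component] = day
    return repeats / len(repairs) if repairs else 0.0


# Made-up repair history, for illustration only.
HISTORY = [
    ("srv-1", date(2024, 3, 1)),
    ("srv-1", date(2024, 3, 10)),  # 9 days after the previous fix: a repeat
    ("sw-2", date(2024, 3, 5)),
    ("srv-1", date(2024, 6, 1)),   # outside the 30-day window
]

rate = repeat_repair_rate(HISTORY)
```

A high rate for one supplier or product suggests that repairs are not addressing the underlying fault, even if each individual repair is quick.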
Recoverability
Recoverability refers to the ability to overcome a momentary failure in such a way that there is no impact on end-user availability. It could be as small as a portion of main memory recovering from a single-bit memory error, and as large as having an entire server system switch over to its standby system with no loss of data or transactions. Recoverability also includes retries of attempted reads and writes out to disk or tape, as well as the retrying of transmissions down network lines.
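The retry behavior described above can be sketched in a few lines. This is a minimal illustration rather than how any particular disk or network stack implements it: `TransientError`, `with_retries`, and the simulated `flaky_read` are hypothetical names, and the doubling delay between attempts is one common choice among several.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for a momentary failure worth retrying."""


def with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on TransientError, wait and try again.

    The delay doubles after each failed attempt. The final failure is
    re-raised so the caller can fall back to another recovery path.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated read that fails twice and then succeeds.
calls = {"n": 0}


def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("simulated momentary failure")
    return b"payload"


delays = []  # record the waits instead of actually sleeping
result = with_retries(flaky_read, sleep=delays.append)
```

From the caller's point of view the read simply succeeds, which is the point of recoverability: the momentary failure is absorbed without affecting end-user availability.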
Responsiveness
Responsiveness is the sense of urgency all people involved with high availability need to exhibit. This includes having well-trained suppliers and in-house support personnel who can respond to problems quickly and efficiently. It also pertains to how quickly the automated recovery of resources, such as disks or servers, can be enacted.
Robustness
The final characteristic of high availability is robustness, which describes the overall design of the availability process. A robust process will be able to withstand a variety of forces -- both internal and external -- that could easily disrupt and undermine availability in a weaker environment. Robustness puts a high premium on documentation and training to withstand technical changes as they relate to platforms, products, services, and customers; personnel changes as they relate to turnover, expansion, and rotation; and business changes as they relate to new direction, acquisitions, and mergers.
Understanding and applying these seven characteristics of high availability can help transform the continuous uptime of your infrastructure into what may be the most significant R of all, a reality.