The Sombers Group, Inc., a custom software house, has been building large real-time systems for its customers for over thirty years. Our multi-platform expertise includes Compaq Himalaya (Tandem), Compaq Alpha (Digital), IBM, Unisys, Stratus, UNIX, and PC systems. Sombers' focus on mission-critical systems means that we have had to become experts in building reliability into our systems. Reliability implies high availability (five 9s), fast recovery, transaction integrity, and data base integrity. The following paper, published in the ITUG journal "The Connection," explores the calculation of availability and the study of transaction integrity.
Dr. Bill Highleyman, Chairman, The Sombers Group, Inc.
Abstract

There are two aspects controlling the reliability of a transaction processing system: the availability of the system, and the integrity it guarantees for the transactions it is processing. Availability lends itself to computational analysis, requiring only a minimal understanding of the system and a modest grounding in simple mathematics. The analysis of transaction integrity requires a detailed understanding of the system's software architecture and a bulldog approach to scouting out failure windows. Detected cases of transaction corruption can be corrected by redesign or, in some cases, rationalized by cost analysis.

Introduction

The now-classic world of high volume, mission-critical transaction processing systems which drive most of our businesses today has spawned a new initiative, the Zero Latency Enterprise, or ZLE. ZLE demands the immediate availability of data as soon as it is captured by the system. The concept of ZLE is so powerful that businesses that build on this technology will become ever more dependent on their systems, and the cost of system failures will continue to escalate. As important as it may have been in the past, it is even more important now that systems be continuously available and that transactions not be lost or corrupted by any reasonable system failure.

Reliability Analysis concerns itself with quantifying and improving the reliability of a system. There are many aspects to reliability, and the reliability profile of one system may be quite different from that of another. However, there are two major aspects of reliability that are common to all systems:

- Availability is the proportion of time that all critical functions are available.
- Transaction Integrity is the requirement that no transaction or critical data be lost or corrupted.
These two characteristics are quite independent. A system may have a very high availability, but transactions may be lost or corrupted during the unlikely occurrence of a failure. On the other hand, a system may never lose a transaction but might be down quite often. This paper considers the analysis of availability and transaction integrity in transaction processing systems.
The 9s Game

We have all heard a system's reliability being characterized as three 9s, or five 9s, or whatever. Three 9s, for example, means that a system will be available 99.9% of the time. This implies that it will be down 0.1% of the time, which equates to 8.76 hours of down time per year, or almost 44 minutes per month. Not too good for a mission-critical system. Various "9" specifications result in down time as follows:

Nines   % Available   Down Time (Hours/Year)   Down Time (Minutes/Month)
2       99%           87.60                    438
3       99.9%         8.76                     44
4       99.99%        .88                      4.4
5       99.999%       .09                      .44
6       99.9999%      .01                      .04
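The table's down-time figures follow directly from the number of 9s. A minimal Python check (the function name is my own):

```python
def downtime_hours_per_year(nines):
    """Hours of down time per year for a given number of 9s,
    assuming 8,760 hours in a year."""
    unavailability = 10 ** -nines   # e.g. three 9s -> 0.001
    return 8760 * unavailability

# Three 9s: about 8.76 hours of down time per year, matching the table.
print(round(downtime_hours_per_year(3), 2))
```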
Windows NT servers are now reporting three 9s or better. Most high-end UNIX servers are striving for four 9s, while Compaq Himalaya systems are reaching for six 9s.

Calculating Availability

Let us make the following definitions:

U = mean time between system failures (MTBF), or uptime.
D = mean time to repair the system (MTR), or downtime.
a = system availability.

System availability is the proportion of time that the system is up. Since the system can only be either up or down, then
a = U / (U + D) = 1 / (1 + D/U)    (1)
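Equation (1) is easy to check numerically. The figures below (a 4,000-hour MTBF and a 4-hour MTR) are hypothetical:

```python
def availability(U, D):
    """Equation (1): the proportion of time the system is up,
    given uptime U (MTBF) and downtime D (MTR)."""
    return U / (U + D)

# A hypothetical subsystem: MTBF of 4,000 hours, MTR of 4 hours,
# giving just under three 9s of availability.
a = availability(4000, 4)
print(round(a, 4))  # 0.999
```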
Digging a little deeper (and for those who remember probability theory), if two systems with availability a1 and a2 must both operate in order for the system to be up, then the combined system availability is
as = a1 a2    (2)
The probability that either system has failed is (1-a1) or (1-a2). If either system must operate, but not both, then the combined system has failed only if both systems have failed. This will occur with probability (1-a1)(1-a2), and the combined system availability is therefore

as = 1 - (1-a1)(1-a2)    (3)

Doubling the 9s

Assume that we have a system with an availability of .99. We need more reliability, so we come up with some clever technique to provide a backup for that system such that, if one subsystem fails, the other immediately takes over without losing any data. Now we have a system made up of two subsystems, each with an availability of .99. The availability of the new system is, from equation (3),

as = 1 - (.01)^2 = .9999

We have gone from two 9s to four 9s. Repeating this exercise, we would see that the use of two subsystems with three 9s availability would lead to a system availability of six 9s, four 9s leads to eight 9s, and so on. In short, the provision of a backup doubles the 9s. This, of course, is the rationale behind the age-old technique of backup. We have simply quantified it here.

More Complex Systems

These concepts can be expanded to cover a system made up of multiple identical system elements, only some of which need to be operational for the system to be operational. Let:

U = MTBF of a system element.
D = MTR of a system element.
n = number of system elements in the system.
r = number of system elements that must be operational.
Us = MTBF of the entire system.
Ds = MTR of the entire system.
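Before turning to the general formulas, equations (2) and (3) and the doubling of the 9s can be verified with a short Python sketch (function names are mine):

```python
def series_availability(a1, a2):
    """Equation (2): both subsystems must be up for the system to be up."""
    return a1 * a2

def parallel_availability(a1, a2):
    """Equation (3): the system is down only if both subsystems are down."""
    return 1 - (1 - a1) * (1 - a2)

# A backup doubles the 9s: two subsystems of two 9s each yield four 9s,
# and two subsystems of three 9s each yield six 9s.
print(round(parallel_availability(0.99, 0.99), 6))    # 0.9999
print(round(parallel_availability(0.999, 0.999), 8))  # 0.999999
```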
Then it can be shown¹ that the system MTBF is

Us = U (U/D)^(n-r) / [n C(n-1, r-1)]    (4)

where the binomial coefficient C(n-1, r-1) is

C(n-1, r-1) = (n-1)! / [(r-1)! (n-r)!]

The system MTR is

Ds = D / (n-r+1)    (5)

and the system availability is, from equation (1),

as = Us / (Us + Ds)    (6)
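Equations (4) through (6) are straightforward to evaluate in code. A minimal sketch (the function name is my own), using an element availability of .995 — say, U = 995 hours and D = 5 hours — which reproduces the four and eight processor figures derived later in the text:

```python
from math import comb

def multi_element_availability(U, D, n, r):
    """MTBF, MTR, and availability of a system of n identical elements,
    r of which must be operational (equations 4 through 6)."""
    Us = U * (U / D) ** (n - r) / (n * comb(n - 1, r - 1))  # equation (4)
    Ds = D / (n - r + 1)                                    # equation (5)
    a_s = Us / (Us + Ds)                                    # equation (6)
    return Us, Ds, a_s

# Element availability .995 (U = 995 hours, D = 5 hours), r = n-1.
_, _, a4 = multi_element_availability(995, 5, n=4, r=3)
_, _, a8 = multi_element_availability(995, 5, n=8, r=7)
print(round(a4, 6), round(a8, 6))  # 0.999849 0.999293
```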
Equations (4) through (6) express the availability relations for most multiprocessor and multi-computer systems today.

An Interesting Insight: More Is Not Always Better

To many of us, it is an axiom that the more computers we have in a multiple computer system, the more reliable it will be. But this is not necessarily so. Consider a system made up of n system elements (say a multiple computer system with n processors). Let us assume that the operational requirement is that the system will continue to operate if any one processor is down, but that the failure of two or more processors is deemed to be a system failure. Furthermore, let us assume that the availability of a single processor and its peripherals is .995. The requirement that only one processor can be down means that there is only one redundant system element, or

r = n - 1
¹ See “Multiple Processor Systems for Real-Time Applications,” Burt H. Liebowitz, John H. Carson; Prentice-Hall, 1985.
That is, the number of processors that must be up, r, is one less than the total number of processors in the system, n. For this relationship, equations (4) through (6) become

Us = U^2 / [n(n-1) D]    (4a)

Ds = D/2    (5a)

Using equations (4a) and (5a),

as = 1 / [1 + (1/2) n(n-1) (D/U)^2]    (6a)
As noted above, equation (1) can be written as a = 1/(1 + D/U), which gives the general relation between a and D/U. From this relation, we have

D/U = (1-a) / a    (7)

Since we have assumed that the availability of a processor and its peripherals is .995, the ratio of its down time to its up time is

D/U = (1 - .995) / .995 = .005025
Let us now consider a four processor system and an eight processor system. From equation (6a), the availability of a four processor system, a4, is

a4 = .999849

and for an eight processor system, a8 is

a8 = .999293
The eight processor system is less reliable than the four processor system. The eight processor system will be down .000707 of the time, almost five times as often as the four processor system's down time probability of .000151. Adding processors to such a system makes it less reliable since, when one processor fails, there is a greater chance that a second failure will bring the system down.

A Case Study

Of course, things are never that simple. Most systems of this type can survive multiple failures so long as two processors which are backing each other up do not both go down. The following is a true case study illustrating the analysis of such a system.

The capacity requirements for this system dictated that an eight processor system was required. The original design assumed that the dual ported disks and dual ported communication controllers would be distributed among the eight processors, as would the primary and backup application processes. In general, the failure of any two processors was likely to take out one or more disks, communication lines, and/or critical processes. Thus, the system could tolerate at most one processor failure.

The availability requirement for this system was .9998. Each processor and its associated peripherals had an availability of .995. Thus, as calculated above, the system availability was .999293, short of the specification.

A rearrangement of the system solved the problem. The eight processor system was segmented into four two-processor pairs. Not only did each processor pair control its own dual-ported peripherals, but all process pairs were assigned to a particular processor pair. Now the system could withstand multiple processor failures so long as both processors in a particular segment did not fail. The availability of the resulting configuration can be calculated as follows.
Since the availability of a single processor and its peripherals is .995, the availability of a processor pair is, from equation (3),

a2 = 1 - (.005)^2 = .999975

In the system, there are four such processor pairs. If any one of these should fail, then the system will fail. Analogous to equation (2), then, the system availability is

as = (.999975)^4 = .9999
This is greater than the availability of the more general configuration (.999293) and exceeds the required availability of .9998. Though the difference between these availability values seems rather small, note that the down time probability of .0001 for the final system is half that of the specification (.0002) and better than one-seventh that of the general configuration (.000707). Thus, the final configuration would be down only half as much as allowed by the specification. The general configuration would be down over 7 times as much as the final configuration.
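The segmented configuration's availability can likewise be checked in a few lines, using the values from the case study above:

```python
# Each processor and its peripherals has availability .995.
processor = 0.995

# A pair fails only if both of its processors fail (equation 3).
pair = 1 - (1 - processor) ** 2

# All four pairs must be up for the system to be up (equation 2,
# applied across the four pairs).
system = pair ** 4

print(round(pair, 6), round(system, 4))  # 0.999975 0.9999
```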
Transaction Integrity

Unfortunately, the issue of transaction integrity does not lend itself to a convenient mathematical analysis as availability does. Even worse, there are no testing strategies that can guarantee that all transactions will be handled properly no matter what failure modes might occur. The analysis of transaction integrity is a brute force, time consuming process of trying to envision all possible failure sequences and convincing oneself that a transaction will either survive each failure sequence or will be properly rolled back out of the system.

The Two Phase Commit Protocol…

Probably the most important advance for transaction integrity is the two phase commit protocol, with which most of us are familiar. Used mainly for database updates, this protocol ensures that either all updates associated with a transaction are made to the data base or that none are. Thus, the data base always reflects a consistent state. The earliest transaction managers that used this technique were IBM's CICS and IMS. Tandem improved commit performance significantly with its Pathway product, and the industry then standardized on the XA protocol for open two phase commit. BEA's Tuxedo, Transarc's Encina, and Microsoft's MTS are leading examples of XA-compliant transaction managers. Most of today's relational database management systems offer an XA interface so that they can participate in transactions managed by an XA transaction manager as well as provide transaction management for transactions using only their database product.

…Is Not a Panacea

Transaction management for disk updates can today be considered rock solid. Used properly, the data base will always be in a consistent state, and the submitter of a transaction will always know that the transaction completed properly or that it failed and must be resubmitted. However, not all cases are covered, leaving a lot of room for transactional worry:

- In typical systems, not all events generated by a transaction are covered. For instance, updates to a memory resident table are typically not protected. Also, a message sent to an external system cannot be recalled, nor can a printed report. If a transaction is aborted, any of these actions have been irretrievably taken.

- If the data base is distributed over a wide area, it can be left in an unknown state by a communication error. This is due to the nature of the two phase commit protocol. During the first phase, the transaction manager asks each resource manager (the disks) if it is prepared to commit. It waits for acknowledgements from each RM and then issues its second phase command, a commit (or an abort if one RM cannot acknowledge). If a communication failure occurs between the prepare response and the commit response, the transaction manager does not know what the remote resource manager has done. Meanwhile, the remote resource manager may be holding one or more records locked if it did not receive the commit command, and it cannot wait forever. It must ultimately decide whether to commit or abort the transaction, leaving the transaction, and thus the data base, in an indeterminate state.

- Finally, transaction protection is not cheap. Transaction managers cost license and maintenance fees, they can be difficult to install and manage, and they impose their toll on system performance. Especially in systems with rather simple transactions, the developers may well try to design around the need for a commercial transaction manager.
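The in-doubt window described above can be made concrete with a small sketch. This is not the XA interface — just a hypothetical coordinator and resource-manager shape illustrating the two phases and the point at which a communication failure leaves the outcome unknown:

```python
class ResourceManager:
    """A toy in-memory resource manager with a prepare/commit/abort interface."""
    def __init__(self, can_prepare=True):
        self.can_prepare = can_prepare
        self.state = "active"

    def prepare(self):
        # Phase 1: promise to commit if asked (record locks would be held here).
        self.state = "prepared" if self.can_prepare else "aborted"
        return self.can_prepare

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(resource_managers):
    # Phase 1: every RM must acknowledge the prepare.
    if not all(rm.prepare() for rm in resource_managers):
        for rm in resource_managers:
            rm.abort()
        return "aborted"
    # Phase 2: all RMs are prepared, so order the commit.
    for rm in resource_managers:
        try:
            rm.commit()
        except ConnectionError:
            # The in-doubt window: the remote RM may or may not have
            # committed, and may be holding record locks while it decides.
            return "in doubt"
    return "committed"
```

A coordinator whose link to one RM drops during phase 2 ends up "in doubt": it cannot tell whether that RM committed, which is precisely the failure described above.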
Failure Windows

Thus, no matter what, one is left with the task of analyzing transactions and failure modes to try to detect cases in which a transaction may be corrupted. This usually results in the detection of failure windows. A failure window is a brief period of time during which an improbable sequence of events could cause transaction corruption.

It is often tempting to argue that the failure window is so small that the probability of a transaction being corrupted by a critical failure is inconsequential. In fact, in some cases a corrupted transaction is not a disaster. Its cost may simply be the time of someone to straighten out the data base. In these cases, a cost analysis of transaction corruption is certainly a valid approach.

But be careful! A tiny failure window of 5 milliseconds in a transaction environment of ten transactions per second means only a 5% chance (5 msec. out of 100) of trouble following a critical failure. But in a 1000 transaction per second environment, it means that an average of five transactions will be corrupted following such a failure. The question has changed from "What is the probability that a transaction will be corrupted?" to "How many transactions will be corrupted?"
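The arithmetic in the warning above generalizes to a simple expected-value calculation (the function name is my own):

```python
def expected_corruptions(window_seconds, transactions_per_second):
    """Average number of transactions caught in a failure window:
    window length multiplied by transaction rate."""
    return window_seconds * transactions_per_second

# A 5 millisecond window: a 5% chance of trouble at 10 tps,
# but five corrupted transactions on average at 1000 tps.
low = expected_corruptions(0.005, 10)     # 0.05
high = expected_corruptions(0.005, 1000)  # 5.0
```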
Transaction Integrity Analysis

Thus, the analysis of a system for transaction integrity is a lot of work. It entails:

- a detailed understanding of how the system works and of the transactions that flow through it.
- a search for all possible failure windows.
- a determination of design changes to close the failure windows, or a determination that the cost of the corruption of a transaction by a failure window is acceptable.
A Case Study

In a UNIX-based trading system, all orders were to be kept in a memory-resident order book to obtain the maximum throughput for the order matching process. To protect the book, the matching engine was replicated, and the memory-resident replicated book was to be kept synchronized via checkpoint messages. When a buy/sell match was found between an incoming order and an order posted to the book, a transaction was to be initiated which:

- updated the order in the memory-resident order book.
- wrote two trade records to the trade log, one for each side of the trade.
- checkpointed the order book change to the backup's memory-resident copy.
Following the successful completion of the transaction, the trade log was used to generate execution reports to the two parties and to broadcast the trade information. Should the transaction abort, the trade was rolled back in the primary's memory-resident order book, and a backout checkpoint message was sent to the backup to roll back its order book.

Upon analysis, it was realized that the checkpoints were at risk. If the primary processor just happened to fail between the time the update checkpoint was made and the time the transaction completed, or between an abort and the time the backout checkpoint could be made, then the backup's order book would show that the trade had been executed, which was not
the case. The owners of those orders would never get that trade executed. These brief periods of time – between the update checkpoint and the transaction commit, or between the transaction abort and the backout checkpoint – represent examples of failure windows.
[Figure: transaction flow — update local order book; checkpoint order book to backup.]
The situation was even worse if the update checkpoint was made following the transaction. In this case, if the primary failed between the completion of the transaction and the checkpoint, the backup's order book would show that the order had not been executed. It would now be executed again, and the trader would not be pleased to learn that he had bought his million dollars' worth of IBM twice.

The solution was to keep the update checkpoint within the transaction but to mark the order locked in the backup's order book. When the transaction completed (or aborted), the backup would be informed to unlock (or roll back) the order. If the primary failed before this could be done, the backup would find the highest priority order on one side or the other of its book locked. It would check the trade log to see if that order had been executed and would unlock or roll back the order as appropriate. Since only one order could be affected, this did not significantly affect recovery time.

Note that in this case the failure window was extremely short – probably measured in microseconds. It would be tempting to argue that the probability of contaminating the order book was so small that it could be ignored. However, the cost of such contamination could be very high indeed. The cost of the design change to fix the problem was not serious at all.
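The recovery rule described above — unlock the order if its trade is in the log, roll it back if it is not — is simple enough to sketch. The data structures here are hypothetical stand-ins for the real order book and trade log:

```python
def reconcile_on_takeover(order_book, trade_log):
    """Backup takeover: resolve any order left locked by the failed primary.

    order_book maps order id -> {"locked": bool, "executed": bool};
    trade_log is the set of order ids with committed trade records.
    """
    for order_id, order in order_book.items():
        if order["locked"]:
            if order_id in trade_log:
                order["locked"] = False    # trade committed: keep the update
            else:
                order["executed"] = False  # trade never committed: roll back
                order["locked"] = False
    return order_book
```

Since at most one order can be left locked at takeover, this reconciliation costs essentially nothing against recovery time, matching the observation above.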
Summary

The analysis of a system's reliability must focus on two independent areas – availability and transaction integrity. The first lends itself to computational analysis; the second requires brute force. The first can be done with a minimal knowledge of the system but requires a reasonable grounding in simple mathematics. The second requires an in-depth knowledge of the software architecture and of the transaction flow in the system, as well as a bulldog approach to the search for failure windows. Both are equally important to building a system that won't have the telephones ringing off the hook at the help desk or, worse still, making very undesirable headlines in the national newspapers.