Making Web Services Dependable

L. E. Moser and P. M. Melliar-Smith
Electrical and Computer Engineering
University of California, Santa Barbara
Santa Barbara, CA 93106

Wenbing Zhao
Electrical and Computer Engineering
Cleveland State University
Cleveland, OH 44115

Abstract

Web Services offer great promise for integrating and automating software applications within and between enterprises over the Internet. However, ensuring that Web Services are dependable, and can satisfy their clients' requests when the clients need them, is a real challenge because, typically, a business activity involves multiple Web Services and a Web Service involves multiple components, each of which must be dependable. In this paper, we describe fault tolerance techniques, including replication, checkpointing, and message logging, in addition to reliable messaging and transaction management for which Web Services specifications exist. We discuss how those techniques can be applied to the components of the Web Services involved in the business activities to render them dependable.

1 Introduction

Web Services enable the software of one enterprise to interact with that of another enterprise over the Internet, even if those enterprises use different hardware, different operating systems, and different programming languages, thus allowing disparate computing systems and applications to be coupled together. Web Services enable direct computer-to-computer interaction by automatically invoking operations of the enterprises that would otherwise be invoked manually by a human through a browser and, thus, they streamline business activities. Web Services can run not only on mainframe computers and server computers but also on client desktop computers and mobile handsets.

The potential widespread use and benefits of Web Services are very compelling, because they facilitate:

• Automation of business processes distributed across multiple enterprises
• Collaboration among multiple enterprises by coupling together the business processes running on their various computers.

Web Services create opportunities for efficient ecosystems of consumers and suppliers, collaborating and competing for products and services over the Internet.

Web Services standards define the syntax of Web Services documents, the format of messages, and the means to describe and find Web Services. They do not define implementation mechanisms or application program interfaces, which remain proprietary to individual vendors. Different vendors can implement Web Services infrastructures in different ways. Thus, Web Services standards provide interoperability between Web Services that have been implemented on different infrastructures, but they do not provide portability of application programs from one vendor's infrastructure to another.

The basic Web Services standards comprise:

• The eXtensible Markup Language (XML), which defines the syntax of Web Services documents, so that the information in those documents is self-describing
• The Simple Object Access Protocol (SOAP) for XML messaging and mapping of data types, so that applications can communicate with one another
• The Web Services Description Language (WSDL) for describing a Web Service, its name, the operations that can be called on it, the parameters of those operations, and the location to which to send requests
• The Universal Description Discovery and Integration (UDDI) standard, which is used by the service registry where service providers publish and advertise their services, and clients query and search for services to discover what the services offer and how to access them.

Web Services introduce new problems into the operation of enterprise computing systems.

• A problem in one participant of a multi-enterprise business activity can affect another enterprise, and can damage relationships between that enterprise and its customers, suppliers and partners.
• Business activities that span multiple enterprises present challenges for reliability, availability, data consistency, concurrency, scalability and security.

These and other problems become more challenging as business activities become more automated, as one Web Service triggers other Web Services, and as business activities involve more enterprises and more steps. In this paper, we investigate some of the problems that result from automating business activities as Web Services that span multiple enterprises, and possible strategies for solving them. We focus, in particular, on reliability, high availability, and data consistency.

[Diagram: Company A (customer), Company B (distributor), Company C (supplier), Company D (credit card company) and Company E (shipper), each running a Web Service or Web Service client on Web Services middleware over an operating system and other tiers. Labeled steps: 1. Request a quote; 2. Check availability; 3. Respond with a quote; 4. Order the product; 5. Make payment; 6. Arrange shipping and provide shipping information.]
Figure 1. Use of Web Services in business-to-business activities that span multiple enterprises.

1.1 High Availability

High availability must be provided for all of the Web Services of a business activity, and all of the components of those Web Services. If one of the components of a Web Service is not available, all of the others will be affected. The availability of a business activity can be much less than the availability of any of the components of the Web Services that comprise that business activity, as the following simple example shows.

Let $n$ be the number of tiers in a Web Services architecture within an enterprise and let $m$ be the number of Web Services of different enterprises that are involved in a business activity. Assume that $n$ is the same for all of the enterprises and that $m$ is the same for all of the business activities. Assume further that the processes within the different tiers and within the different enterprises are independent.

Let $p$ be the probability that the processes in any one of the tiers within an enterprise fail. Then $1-p$ is the probability that they do not fail. If all of the processes within those tiers are operational at the start of the business activity, then

$q = (1-p)^{mn}$

is the probability that no fault occurs. For example, if $p = 0.00001$, $m = 4$ and $n = 3$, then the probability that no fault occurs is $q = (1-p)^{12}$. The values of $q$ for different values of $1-p$ are shown in Figure 2.

For $l$ independent business activities (e.g., $l$ business activities per day), the probability that no fault occurs in any of them is

$r = q^{l} = (1-p)^{lmn}$

With the same values of $m$ and $n$ as above, i.e., $m = 4$ and $n = 3$, and with $1-p = 0.99999$, the probability that no fault occurs is $r = (0.99999)^{12l}$. The values of $r$ for different values of $l$ are shown in Figure 2.

      m = 4, n = 3             1-p = 0.99999
    1-p        q               l         r
    0.9        0.282           10        0.99880
    0.99       0.886           100       0.98807
    0.999      0.9881          1000      0.88692
    0.9999     0.9988          10000     0.30119
    0.99999    0.99988         100000    0.00001

Figure 2. The availability $q$ of a single business activity based on the availability $1-p$ of a single component, assuming $m = 4$ enterprises and $n = 3$ tiers, and the availability $r$ of a number $l$ of business activities.

Many business systems must process 100,000 business activities per day and, as Figure 2 shows, the probability of complete success can be astonishingly low. Because of the nature of Web Services and business activities, all of the components of all of the Web Services involved in a business activity must be highly available in order to achieve a high probability that all of the business activities will complete successfully. Even with careful programming and testing, it is unlikely that the probability of a fault in a step of a business activity will be reduced below 0.00001. Therefore, the required levels of availability cannot be achieved realistically without fault recovery and retry. Consequently, fault recovery and retry must be regarded as essential for reliable operation of business activities using Web Services.

1.2 Data Consistency

Maintaining the consistency of business data is essential to enterprise computing. Data consistency is crucial for Web Services, where a business activity can span multiple enterprises and where detecting and correcting inconsistencies can be difficult, time consuming, and expensive.

The use of transactions to maintain data consistency within an enterprise is one of the great successes of enterprise computing: effective, efficient, and easy to use. Unfortunately, transactions have not been as successful in wide-area distributed systems. The participants in a distributed transaction incur a risk that their data will be locked, and will be inaccessible, for an arbitrary period of time if the transaction coordinator fails. In theory, this risk can be mitigated by a three-phase commit protocol or by replicating the transaction coordinator [15, 27]. In practice, few transaction processing systems use three-phase commit because of the high overheads in the fault-free case, and replication presents challenging problems as discussed below. Moreover, transaction commit scales poorly as the number of participants in the transaction increases.

Currently, business activities are typically implemented using multiple local transactions, with compensating transactions to abort business activities that cannot be completed. Unfortunately, compensating transactions are difficult to design and program, have a high error rate, and incur a high risk of data inconsistencies.

[Plot: probability that the database becomes potentially inconsistent (0.0 to 1.0) against the number of business activities (0 to 10^9, logarithmic scale), with curves labeled 10^-3, 10^-4, 10^-5 and 10^-6.]
Figure 3. The probability that a database is left in a potentially inconsistent state after $l$ business activities, when using compensating transactions.

Figure 3 shows the probability of potential inconsistency for a business activity, where compensating transactions are assumed to incur the same fault rate as regular transactions though, realistically, their fault rate is probably higher. While the risk of data being locked for a substantial period of time (because the transaction coordinator failed) is unacceptable, the risk of data inconsistency resulting from the use of compensating transactions is even more unacceptable. Consequently, mechanisms that prevent both locking of data by failed transactions, and potential inconsistency of data resulting from incorrect compensation, are essential for reliable operation of business activities.

2 WS Dependability Specifications

The Web Services (WS) community has published several specifications related to reliable messaging and transaction management. The aim of those specifications is to provide assurance not only that messages are delivered to the destination applications but also that they are correctly processed by the destination applications.

2.1 Reliable Messaging

The Web Services Reliable Messaging specifications include WS-ReliableMessaging and WS-Reliability. Both of these specifications define an application-level reliable messaging protocol that operates over SOAP. If a SOAP message is not successfully delivered (e.g., because it has an incomplete address), the sender application gets a response containing a SOAP fault element that gives status or error information.

SOAP typically operates over HTTP, which in turn operates over TCP, which operates over IP, as shown in Figure 4.

[Diagram: identical protocol stacks at the client and at the Web Service, from top to bottom: Application, RMP, SOAP, HTTP, TCP, IP.]
Figure 4. Reliable messaging protocol stack.

Even though TCP delivers messages reliably, and in order, to the Transport Layer, there is no guarantee that SOAP messages sent over HTTP are reliably delivered all the way up the protocol stack to the destination application. Moreover, the reliable message delivery of TCP is not coordinated with fault handling and recovery for Web Services.

Both WS-Reliability and WS-ReliableMessaging provide reliable messaging for SOAP using acknowledgments and retransmissions with different quality of service levels, including at least once, at most once, exactly once, and source ordered delivery.

Pallickara, Fox and Pallickara provide an analysis of the WS-Reliability and WS-ReliableMessaging specifications. They identify the similarities and differences of the two specifications, and recommend extensions to the protocols to ensure ordered delivery across sets of messages and across multiple destinations. They also discuss how the two specifications might be used together.

Neither the WS-Reliability specification nor the WS-ReliableMessaging specification addresses the topics of message persistence and recovery from faults. However, these topics are essential for reliable operation of business activities composed of one or more Web Services, and they are tightly coupled to the reliable handling of messages.

If a Web Service fails after a message has been delivered to it and after it has acknowledged receipt of the message but before it has fully processed the message (e.g., because it has invoked a nested request), the following actions are required.

• The recovering Web Service must be restored to a checkpointed state it had at some moment preceding the fault.
• The TCP connections must be restored.
• Messages received subsequent to checkpointing the state must be replayed from a log on distributed or persistent storage.
• Messages generated by the recovering Web Service that have already been delivered to other Web Services must be detected and suppressed.

2.2 Transactions and Business Activities

The Web Services specifications [6, 7, 8] for both short-running transactions and long-running business activities that span multiple enterprises aim to provide data consistency and protection against faults.

The Web Services Transaction (WS-Transaction) specification includes protocols for atomic distributed transaction commitment, based on the Two-Phase Commit (2PC) protocol. Transaction processing based on the 2PC protocol, as defined by the WS-Transaction specification, provides data consistency for Web Services applications. However, if the transaction coordinator fails and all of the participants in a transaction have voted to commit but have not received a commit from the coordinator, the 2PC protocol blocks and requires the participants to wait for the coordinator to recover, which can take a relatively long time.

The Web Services Business Activity (WS-BusinessActivity) protocols support long-running transactions that span multiple enterprises, are not two-phase, and allow the business logic to determine whether the business activity should roll forward or roll backward.

The Web Services Coordination (WS-Coordination) specification describes a framework for plugging in protocols that coordinate the actions of distributed applications, including those that require strict consistency and those that require agreement of only a proper subset of the participants. A Web Service creates a context that is used to propagate an activity to other Web Services and to register for a particular coordination protocol. Participants make heuristic decisions regarding the outcome of transactions. However, continued processing without waiting for coordinator recovery can compromise the consistency of the data.

3 Fault Tolerance Technology

Fault tolerance technology can be used to increase the level of reliability, availability and data consistency of Web Services. This technology includes reliable messaging, replication, checkpointing and restoration, message logging and replay, and transactions, as discussed below.

3.1 Reliable Messaging

WS-Reliability and WS-ReliableMessaging can be readily extended to make the re-establishment of the connections of a Web Service transparent to remote clients and servers so that they do not need to reissue requests or replies. We refer to this capability as transparent SOAP connection failover.

Transparent SOAP connection failover involves the reliable message header and body for the group or sequence of messages and, for WS-Reliability, the state that was negotiated for the agreement before transmitting the messages in that set. The messages can be logged on disk for local restart, or in the volatile memory of a backup computer for failover to the backup computer.

Aghdaie and Tamir have investigated failover of Web server connections and replay of messages for Web servers up to the HTTP layer of the protocol stack, by modifying the Linux kernel and the Apache Web server. That work is similar to our transparent TCP connection failover, which is intended for general kinds of applications, including Web Services. Implementing failover support in the operating system kernel improves efficiency, but is not portable across different operating systems.

3.2 Replication

Replication is used in fault-tolerant systems to protect an application against faults, so that if one replica becomes faulty, another replica is available to provide the service to the clients. The most commonly used replication strategies are passive, active, and semi-active replication, summarized below.

• In passive replication, there is a single primary replica that executes operations invoked on the Web Service, and one or more backup replicas that do not execute those operations. The replication infrastructure transfers a checkpoint of the primary to the backups periodically or at the end of each remote invocation.
• In active replication, all of the replicas execute the operations invoked on the Web Service independently and at approximately, but not necessarily exactly, the same physical time. A checkpoint is used only to bring up a new active replica.
• In semi-active replication, both the primary and the backup replicas execute each invoked operation. The primary provides directives (such as the order in which to process messages) to the backups. The backups follow those directives, and thus lag slightly behind the primary in executing the operations. Only the primary communicates results and makes further invocations.

The most challenging aspect of replication is maintaining replica consistency, as operations are invoked on the service, as the states of the replicas change, and as faults occur. Replica consistency is obviously critical for active and semi-active replication, which must maintain the consistency of two or more concurrently executing replicas. Less obviously, replica consistency is also important for passive replication because a recovering replica must repeat the computations and the communications with other Web Services that have taken place since the most recent checkpoint. Those computations and communications must be consistent with the prior computations and communications to avoid disrupting other Web Services. Maintaining replica consistency requires the sanitization of non-deterministic operations and also the handling of side effects, as discussed below.

3.3 Checkpointing and Restoration

Checkpointing is used by all replication strategies but in different ways. Passive replication uses checkpointing during normal operation. Active replication and semi-active replication do not use checkpointing during normal operation, but use it to initialize a new or recovering replica.

The checkpoints of an application process can be stored on disk, or can be transmitted to and stored in another processor. If a fault occurs, the application process is then restarted on the same or a different processor, and the most recent checkpoint is used to restore the process to the state it had at the time of the checkpoint. There are basically two kinds of checkpointing, application-aware and application-transparent.

• In application-aware checkpointing, the application programmer implements a getState() method and a setState() method. The getState() method captures particular parts of the application state and encodes that state into a byte sequence, and the setState() method decodes the byte sequence and restores the application from the checkpoint.
• In application-transparent checkpointing, the checkpointing infrastructure uses operating system mechanisms to capture the state of the application process (including file descriptors, thread stacks, etc.), without the need for the application programmer to implement the getState() and setState() methods.

For applications that involve multiple threads within a process or data structures that contain pointers, it is difficult to implement the getState() and setState() methods of application-aware checkpointing. On the other hand, application-transparent checkpointing does not produce checkpoints that are portable across hardware architectures, because a checkpoint taken as a binary image contains values of variables that differ for different architectures, such as memory addresses.

3.4 Message Logging and Replay

For simple restart without replication or checkpointing, and for all replication and checkpointing strategies, a fault tolerance infrastructure must provide message logging and replay. Again, the messages can be logged either in the memory of another processor or on disk. However, logging the messages on disk and subsequently replaying them from disk can have adverse performance impacts.

For restart without either replication or checkpointing, the entire message history (from the first message in the set to the most recent message) must be retained and all of the messages must be replayed. For replication and checkpointing, only the messages since the most recent checkpoint need to be replayed to the new or recovering replica.

3.5 Sanitizing Non-Deterministic Operations

There has been extensive research on the topic of sanitizing non-deterministic operations. Messaging is one source of replica non-determinism, because messages can be received by the replicas in different orders, due to loss of messages and retransmissions, delays in the network, etc. To maintain replica consistency, messages must be delivered to the replicas reliably and in the same order. Such a message delivery service is sometimes called atomic broadcast. The Java Message Service (JMS) has been extended to provide atomic multicasts. For passive replication, the infrastructure must log messages on disk or in the memory of another processor so that, if the primary replica fails, the messages after the checkpoint can be replayed. For both active and passive replication, the infrastructure must detect and suppress duplicate invocations and duplicate responses.

Another source of replica non-determinism is multithreading. If two threads within a replica share data, they must claim and release mutexes that protect that shared data. However, the threads in two different replicas will likely run at slightly different speeds and, thus, they might claim mutexes in different orders. To maintain replica consistency, the mutexes must be granted to the threads within the replicas in the same order.

Other sources of replica non-determinism include operating system functions that return values local to the processor on which they are executed, such as rand() and gettimeofday(), or inputs for the replicas from different redundant sources, or system exceptions due to, say, exhaustion of memory on one of the processors. Such sources of replica non-determinism must be sanitized, so that all of the replicas see the same values of the functions, the same inputs from the redundant sources, and the same system exceptions. This sanitization must be done, whatever replication strategy is used.

3.6 Handling Side-Effects

In addition to sanitizing non-deterministic operations, side-effects that occur as the result of a client invoking operations on a Web Service must be handled properly to achieve replica consistency.

In particular, if a Web Service writes data to files or a database, those operations must be handled correctly. The actions taken depend on whether each replica has its own copy of the database or the files, or the replicas have a single shared copy. Similarly, if a Web Service sends messages to, or invokes operations on, other Web Services, those operations can have side-effects that must be handled properly.

3.7 Transactions and Business Activities

Within a single enterprise, transactions have been successfully used to provide data consistency and to protect data against faults by means of their ACID properties.

It is possible to achieve higher availability of Web Services that employ transactions by using transactions and replication together. In particular, by replicating the transaction coordinator, the 2PC protocol can be rendered non-blocking and exactly-once semantics can be provided for the clients' invocations. Moreover, by replicating the middle-tier components and using transparent transaction retry, roll-forward recovery can be achieved. We have implemented an infrastructure for CORBA that replicates the transaction coordinator and also the middle-tier application objects to protect the business logic processing and to avoid potentially long service disruptions caused by failure of the coordinator. A similar infrastructure for Web Services that unifies transactions and replication can provide both data consistency and high availability.

4 WS Architectures and Dependability

Web Services applications are typically programmed in Java or .NET. We refer here to a Java-based implementation.

A three-tier Web Services architecture involves clients in the first tier, a Web server, a servlet engine and/or a J2EE application server in the middle tier, and a database system in the third tier, as shown in Figure 5. The clients communicate with the Web server and invoke operations on the Web Service, which is deployed in a server-side container. The server-side container can be a servlet container such as Tomcat, or an EJB container in a J2EE application server such as JBoss or Geronimo. The Axis SOAP engine works with both types of containers. Typically, there are multiple clients and multiple Web servers that are used for load balancing the clients' requests.

[Diagram: Client Tier (application, browser, Axis SOAP engine); Middle Tier, divided into an Application Presentation Tier (Apache Web server, Tomcat servlet engine with Axis SOAP engine and application servlet) and an Application Logic Tier (JBoss application server with session beans and entity beans holding the application logic); Database Tier (database).]
Figure 5. Three-tier Web Services architecture.

In a typical Web Services use case, the client accesses a Web page, which consists of HTML and Java Servlets or JSP scripts. The client can make invocations by clicking on one or more links provided in that page. Once a request reaches the servlet engine, a servlet creates dynamic content for a response Web page, and might also perform simple business logic processing and communicate with the database system. Applications that involve complex business logic processing typically use a J2EE application server. Multiple J2EE application servers might be used to load balance the clients' requests. The servlet creates dynamic content for the presentation of the Web page and communicates with the EJB container, which contains session beans that perform the business logic processing and entity beans that correspond to the database records.

Within an enterprise, the components of a Web Service can be made dependable using fault tolerance technology, as shown in Figures 6 and 7 and discussed below.

[Diagram: a client with the protocol stack RMP, SOAP, HTTP, TCP, IP, connected to a primary Web server and a backup Web server, each with the same stack topped by an FT layer.]
Figure 6. Fault-tolerant Web server.

4.1 Web Server

A Web server is sometimes regarded as "stateless" in that it does not retain any application state between its response to a client's request and the client's next request. The application state is either stored in the database or returned to the client in a URL or a cookie.
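As a minimal sketch of what "stateless" means in this sense, the following Python fragment returns all of the application state to the client as a cookie value and rebuilds it from the next request, so that nothing is retained on the server between requests. The function names, the shopping-cart state, and the JSON-plus-Base64 encoding are assumptions made for illustration only; the paper does not prescribe an encoding, and a real deployment would also need to sign or encrypt the value.

```python
import base64
import json
from typing import Tuple


def encode_state(state: dict) -> str:
    """Serialize application state into a cookie-safe ASCII string."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_state(cookie_value: str) -> dict:
    """Rebuild the state from a cookie value; an empty cookie means no state yet."""
    if not cookie_value:
        return {}
    raw = base64.urlsafe_b64decode(cookie_value.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


def handle_add_item(cookie_value: str, item: str) -> Tuple[str, str]:
    """Hypothetical request handler: the state arrives with the request
    and leaves with the response; the server keeps nothing in between."""
    state = decode_state(cookie_value)
    cart = state.get("cart", [])
    cart.append(item)
    state["cart"] = cart
    body = f"{len(cart)} item(s) in cart"
    return body, encode_state(state)


if __name__ == "__main__":
    body, cookie = handle_add_item("", "book")      # first request, no cookie
    body, cookie = handle_add_item(cookie, "pen")   # next request carries the state
    print(body, decode_state(cookie))
```

Because each request carries the state it needs, any server instance could serve the second request. That property holds only between requests.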
However, during the processing of a client's request, the Web server does maintain application state and also hidden internal state, such as the progress of nested invocations or disk I/Os or the state of the connections with the Tomcat servlet engine or the database. The state that results from actions that are visible to other processes must be captured, and restored if a fault occurs.

If the Web server fails while processing a client's request, either the unreplicated Web server must be restarted and the client must reissue the request, or the replicated Web server must be failed over from the primary to a backup on another processor, as shown in Figure 6. In the latter case, the fault tolerance infrastructure must replay the client's requests to the restarted or backup Web server after the checkpoint has been restored. In either case, the Web server must not write the state to the database, or send a response to the client, more than once.

In addition, the infrastructure at the servlet engine or the J2EE application server must ensure that the restarted or backup Web server receives its response messages reliably and in the correct order, if the Web server fails.

4.2 Servlet

To achieve high availability, fault tolerance must be provided for the Tomcat containers and the servlets contained within them using replication, checkpointing, and message logging.

Even if the state of the servlet application, such as the state stored in a session object, is written to the database before the servlet sends a response back to the Web server, a checkpoint must be taken that includes the state within the Tomcat containers, such as the connections with the Web server and the J2EE application server or database server. Subsequently, if the servlet engine fails while it is processing a client's request and interacting with the J2EE application server or the database server, the servlet engine must be brought back up or failed over to a backup servlet engine. Its state must be restored from the checkpoint, and the messages after the checkpoint must be replayed.

Hanik describes in-memory session replication for the Tomcat servlet engine that uses the Java Groups group communication toolkit. That strategy exploits Java's serializability to checkpoint and restore the session state, but restricts what can be put into a session object to ensure serializability. Moreover, with that strategy, storing request data in decentralized sources, such as temporary files, can result in data inconsistencies.

Bartoli, Prica and di Muro describe a framework for program-to-program interaction across unreliable networks and an implementation in a Tomcat servlet container. The prototype is based on replication of HTTP client session data and replication of a counter. The framework provides the same consistency guarantees as a non-replicated service with respect to the order of execution of requests. Moreover, it ensures that, even if a client issues duplicate requests (e.g., because of a service fault), the service executes the client's request at most once.

4.3 J2EE Application Server

Some three-tier Web Services architectures use J2EE application servers. The J2EE standard derives from the CORBA standard, and mandates the use of CORBA's Internet Inter-ORB Protocol. The CORBA Object Transaction Service provides data consistency through atomic commitment of distributed transactions. The Fault Tolerant CORBA standard provides high availability by replicating the application objects. There exist several implementations of Fault Tolerant CORBA (e.g., [18, 26]).

Fault tolerance must be provided for the EJB/J2EE containers and the beans within them even if the application is coded as one or more transactions with rollback and if the beans are entity beans whose states are stored in a database. Again, the reason is that the EJB container contains considerable hidden internal state. If application-transparent checkpointing is used, there are interesting interactions and overlap between the checkpoint of the EJB/J2EE container process and the states of the entity beans stored in the database that must be reconciled.

Babaoglu, Bartoli, Maverick, Patarin and Wu describe a framework for prototyping J2EE replication algorithms. They divide the replication code into two layers: the framework itself, which is common to all replication algorithms, and a specific replication algorithm that is plugged into the framework.

[Diagram: unreplicated clients (client browsers A, B and C) in Tier 1; replicated Web servers (W1, W2), servlet/JSP engines (S1, S2) and J2EE servers (E1, E2) in Tier 2; replicated databases (D1, D2) in Tier 3.]
Figure 7. Fault-tolerant three-tier Web Services architecture.

4.4 Transaction Coordinator

To achieve both data consistency and high availability for transaction-based applications, the transaction coordinator must be replicated. Replication of the transaction coordinator renders the 2PC protocol non-blocking and achieves exactly-once semantics for the clients' invocations. Furthermore, if the middle-tier servers are also replicated and transparent transaction retry is used, roll-forward recovery can be achieved.

4.5 Database Server

Much work has been done on improving the reliability and availability of database systems. Vaysburd has provided an excellent survey of commercial packages that provide fault tolerance for database systems, with respect to such requirements as persistence, data consistency, and availability of service.

4.6 Web Services Registry

The UDDI registry for Web Services that contains the WSDL descriptions of the Web Services must be available for the clients that are looking for them. Availability of the UDDI registry can be achieved using the fault tolerance techniques of replication, checkpointing, and message logging, described previously.

The OASIS standards organization has published a specification for lazy replication of the UDDI registry, where the updates are propagated point-to-point from one replica to another replica. Sun, Lin and Kemme have implemented the OASIS lazy replication strategy for the UDDI registry, as well as an eager replication strategy. The eager replication strategy is based on middleware that employs a multicast group communication protocol. They provide response time, propagation time, and execution results for both replication strategies in a LAN and in a WAN.

4.7 Business Activities

Distributed transactions based on the 2PC protocol are seldom used for business activities that span multiple enterprises, because they unavoidably involve one enterprise's locking the data records of another enterprise. Instead, extended transactions, as defined by the WS-Coordination and WS-BusinessActivity specifications, are used. Extended transactions typically involve local transactions and compensating transactions that offset committed local transactions when a business activity is rolled back. However, compensating transactions can have undesirable effects, such as one transaction's seeing the results of another transaction before the compensating transaction is applied.

Elsewhere, we have presented a reservation-based extended transaction protocol for Web Services that coordinates business activities and that avoids the use of compensating transactions. Each task within a business activity is executed as two steps. The first step involves an explicit reservation of resources according to the business logic. The second step involves the confirmation or cancellation of the reservation. Each step is executed as a separate traditional transaction.

5 Conclusion

If Web Services are to achieve their objective of automating business activities across multiple enterprises, they must be made dependable. The existing Web Services reliable messaging, transactions and business activity specifications must be augmented with additional mechanisms to provide higher levels of reliability, availability, and consistency.

In this paper, we have described various fault tolerance techniques for increasing the reliability, availability, and consistency of Web Services, including transparent SOAP connection failover, replication, checkpointing, and message logging, and have shown how to apply these techniques to a Web Services architecture. In the future, we plan to investigate various security mechanisms that could be integrated with these mechanisms.

6 Acknowledgments

This research has been supported in part by MURI/AFOSR Grant F49620-00-1-0330 for the authors at the University of California, Santa Barbara, and by a faculty startup award for the author at Cleveland State University.

References

R. Jimenez-Peris, M. Patino-Martinez, G. Alonso, and S. Arevalo, "A low-latency non-blocking commit service," Proceedings of the International Conference on Distributed Computing, Lisbon, Portugal, October 2001, 93-107.

R. Koch, S. Hortikar, L.E. Moser, and P.M. Melliar-Smith, "Transparent TCP connection failover," Proceedings of the IEEE International Conference on Dependable Systems and Networks, San Francisco, CA, June 2003, 383-392.

A. Kupsys, S. Pleisch, A. Schiper, and M.
Wiesmann, “To- wards JMS compliant group communication,” Proceedings  N. Aghdaie and Y. Tamir, “Implementation and evaluation of of the IEEE International Symposium on Network Computing transparent fault-tolerant Web Service with kernel-level sup- and Applications, Cambridge, MA, August 2004, 131-140. port,” Proceedings of the IEEE International Conference on  P. Narasimhan, L.E. Moser, and P.M. Melliar-Smith, Computer Communications and Networks, Miami, FL, Octo- “Strongly consistent replication and recovery of fault- ber 2002, 63-68. tolerant CORBA applications,” Computer Science and En-  O. Babaoglu, A. Bartoli, V. Maverick, S. Patarin, and H. Wu, gineering Journal 17, 2, March 2002, 103-114. “A framework for prototyping J2EE replication algorithms,”  Object Management Group, Fault Tolerant CORBA, OMG Proceedings of the International Symposium on Distributed Technical Committee Document formal/02-06-59, Chapter Objects and Applications, Agia Napa, Cyprus, October 2004, 23, CORBA/IIOP 3.0, 2000, http://www.omg.org. 1413-1426.  Object Management Group, Transaction Service Speciﬁca-  A. Bartoli, M. Prica, and E. Antoniutti, “A replication frame- tion v1.2 (ﬁnal draft), OMG Technical Committee Document work for program-to-program interaction across unreliable ptc/2000-11-07, 2000, http://www.omg.org. networks and its implementation in a servlet container,” Con-  S. Pallickara, G. Fox, and S.L. Pallickara, “An analysis of re- currency and Computation: Practice and Experience, John liable delivery speciﬁcations for Web SAervices,” Proceed- Wiley & Sons. ings of the IEEE Conference on Information Technology, Las  R. Bilorusets, et al., Web Services Reliable Messaging Pro- Vegas, NV, April 2005, 360-365. tocol (WS-ReliableMessaging), February 2005, http://www-  M. Pasin, M. Riveill, and T. Weber, “High-available enter- 128.ibm.com/developerworks/webservices/library/ws-rm/. prise Java Beans using group communication system sup-  D. Booth, H. Hass, F. McCabe, E. 
Newcomer, M. Champion, port,” Proceedings of the European Research Seminar on Ad- C. Ferris, and D. Orchard, Web Services Architecture, Febru- vances in Distributed Systems, Bologna, Italy, May 2001. ary 2004, http://www.w3.org/TR/ws-arch.  T. Rutt, M. Peel, D. Bunting, K. Iwasa, and J. Durand,  L.F. Cabrera, et al., Web Services Coordination, Septem- Web Services Reliability (WS-Reliability), August 2004, ber 2003, http://www.ibm.com/developerworks/library/ws- http://oasis-open.org/committees/tc home.php?wg abbrev= coor/. wsrm.  L.F. Cabrera, et al., Web Services Transaction, August 2002,  C. Sun, Y. Lin, and B. Kemme, “Comparison of UDDI reg- http://www.ibm.com/developerworks/library/ws-transpec/. istry replication strategies,” Proceedings of the IEEE Inter-  L.F. Cabrera, et al., Web Services Business Activity Frame- national Conference on Web Services, San Diego, CA, July work, January 2004, http://www.ibm.com/developerworks/ 2004, 218-225. library/ws-busact/.  A. Vaysburd, “Fault tolerance in three-tier applications: Fo-  L. Clement, et al., UDDI Version 2.03 Replication Spec- cusing on the database tier,” Proceedings of the IEEE Sym- iﬁcation, http://uddi.org/pubs/Replication-V2.03-Published- posium on Reliable Distributed Systems, Lausanne, Switzer- 20020719.pdf, July 2002. land, October 1999, 322-327.  X. Defago, A. Schiper, and P. Urban, “Total order broadcast  W. Zhao, L.E. Moser, and P.M. Melliar-Smith, “Design and and multicast algorithms: Taxonomy and survey,” Computing implementation of a pluggable fault tolerant CORBA infras- Surveys 36, 4, December 2004, 372-421. tructure,” Cluster Computing: The Journal of Networks, Soft-  H. Garcia-Molina and K. Salem, “Sagas,” Proceedings of the ware Tools and Applications, Special Issue on Dependable ACM SIGMOD Conference on the Management of Data, San Distributed Systems 7, 4, October 2004, 317-330. Francisco, CA, May 1987, 249-259.  W. Zhao, L.E. Moser, and P.M. Melliar-Smith, “Uniﬁca-  F. 
Hanik, “In-memory session replication with Tomcat 4,” tion of transactions and replication in three-tier architectures April 2002, http://www.TheServerSide.com. based on CORBA,” IEEE Transactions on Dependable and  D. Ingham, S. Shrivastava, and F. Panzieri, “Constructing de- Secure Computing 2, 1, January-March 2005, 20-33. pendable Web Services,” IEEE Internet Computing, vol. 4,  W. Zhao, L.E. Moser, and P.M. Melliar-Smith, “A no. 1, January/February 2000, 25-33. reservation-based coordination protocol for business activi-  G. Janakiraman, J, Santos, D. Subhraveti, and Y. Turner, ties” Proceedings of the IEEE International Conference on “Cruz: Application-transparent distributed checkpoint- Web Services, Orlando, FL, July 2005, 49-56. restart on standard operating systems,” Proceedings of the IEEE International Conference on Dependable Systems and Networks, Yokohama, Japan, June 2005, 260-269.
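
As an illustration of the recovery procedure described for the servlet engine (restore the state from the checkpoint, then replay the messages received after the checkpoint), the following is a minimal sketch rather than an actual infrastructure. It assumes a deterministic application, and it keeps the checkpoint and the message log in memory, whereas a real system would hold them on stable storage or at the backup. The class names SessionCounter and LoggingServer are hypothetical.

```python
import copy


class SessionCounter:
    """Toy deterministic application state: a running total per session."""

    def __init__(self):
        self.totals = {}

    def handle(self, message):
        session_id, amount = message
        self.totals[session_id] = self.totals.get(session_id, 0) + amount
        return self.totals[session_id]


class LoggingServer:
    """Hypothetical wrapper that adds checkpointing and message logging."""

    def __init__(self, app):
        self.app = app
        self.checkpoint = copy.deepcopy(app)  # last saved state
        self.log = []                         # messages delivered since that checkpoint

    def deliver(self, message):
        self.log.append(message)              # log first, then process
        return self.app.handle(message)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.app)
        self.log.clear()                      # earlier messages are covered by the checkpoint

    def recover(self):
        """Build a backup: restore the checkpoint, then replay the log in order."""
        backup = copy.deepcopy(self.checkpoint)
        for message in self.log:
            backup.handle(message)
        return backup


primary = LoggingServer(SessionCounter())
primary.deliver(("a", 1))
primary.deliver(("b", 5))
primary.take_checkpoint()
primary.deliver(("a", 2))                     # arrives after the checkpoint
backup_app = primary.recover()                # state a backup would hold after failover
```

Replay reproduces the primary's state only because handle is deterministic and the messages are replayed in their original order; nondeterministic applications would need additional mechanisms that this sketch does not cover.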
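
The difference between the lazy and eager replication strategies for the UDDI registry can also be sketched in a few lines. The sketch below is an assumption-laden toy, not the OASIS specification or the implementation of Sun, Lin and Kemme: replicas are plain dictionaries, the lazy registry queues point-to-point updates until propagate is called, and the eager registry's loop merely stands in for a multicast group communication protocol. All class, method, and service names are hypothetical.

```python
class LazyRegistry:
    """Lazy replication: an update commits at one replica and reaches the others later."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (target replica index, key, value) awaiting propagation

    def publish(self, key, value, origin=0):
        self.replicas[origin][key] = value       # visible at the origin immediately
        for target in range(len(self.replicas)):
            if target != origin:
                self.pending.append((target, key, value))

    def propagate(self):
        for target, key, value in self.pending:  # point-to-point, in publish order
            self.replicas[target][key] = value
        self.pending.clear()


class EagerRegistry:
    """Eager replication: an update is applied at every replica before publish returns."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]

    def publish(self, key, value):
        for replica in self.replicas:            # stand-in for a group multicast
            replica[key] = value


url = "http://example.org/quote.wsdl"            # placeholder WSDL location

lazy = LazyRegistry(3)
lazy.publish("quote-service", url)
stale = lazy.replicas[2].get("quote-service")    # read before propagation
lazy.propagate()
fresh = lazy.replicas[2].get("quote-service")    # read after propagation

eager = EagerRegistry(3)
eager.publish("quote-service", url)
```

The lazy variant returns to the publisher sooner but lets a client querying another replica miss a recent entry; the eager variant removes that window at the cost of contacting every replica on each update.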
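
The two-step structure of the reservation-based approach to business activities (an explicit reservation, followed by confirmation or cancellation, each executed as a separate traditional transaction) can be illustrated as follows. This is a sketch under stated assumptions and not the published protocol: the inventory table, the function names, and the all-or-nothing policy in run_business_activity are hypothetical, and a single in-memory SQLite database stands in for the local databases of the participating enterprises.

```python
import sqlite3


def open_inventory(stock):
    """Create an in-memory table of available and reserved units per item."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE inventory "
        "(item TEXT PRIMARY KEY, available INTEGER, reserved INTEGER)"
    )
    db.executemany("INSERT INTO inventory VALUES (?, ?, 0)", list(stock.items()))
    db.commit()
    return db


def reserve(db, item, qty):
    """Step 1: set resources aside in their own local transaction."""
    with db:  # commits on success, rolls back on an exception
        cur = db.execute(
            "UPDATE inventory SET available = available - ?, reserved = reserved + ? "
            "WHERE item = ? AND available >= ?",
            (qty, qty, item, qty),
        )
        return cur.rowcount == 1  # no row updated means not enough units


def confirm(db, item, qty):
    """Step 2 (confirmation): consume the reserved units."""
    with db:
        db.execute(
            "UPDATE inventory SET reserved = reserved - ? WHERE item = ?", (qty, item)
        )


def cancel(db, item, qty):
    """Step 2 (cancellation): return the reserved units to the available pool."""
    with db:
        db.execute(
            "UPDATE inventory SET available = available + ?, reserved = reserved - ? "
            "WHERE item = ?",
            (qty, qty, item),
        )


def run_business_activity(db, tasks):
    """Assumed policy: confirm every task only if every reservation succeeded."""
    held = []
    for item, qty in tasks:
        if not reserve(db, item, qty):
            for held_item, held_qty in held:
                cancel(db, held_item, held_qty)
            return False
        held.append((item, qty))
    for held_item, held_qty in held:
        confirm(db, held_item, held_qty)
    return True


def snapshot(db):
    """Return {item: (available, reserved)} for inspection."""
    rows = db.execute("SELECT item, available, reserved FROM inventory")
    return {item: (available, reserved) for item, available, reserved in rows}
```

Because reserved units are held apart from the available pool rather than consumed and later compensated, another activity reading the table never observes units as sold that are subsequently restored; a cancelled activity simply releases what it held.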