
Exchanging Intensional XML Data

TOVA MILO, INRIA and Tel-Aviv University
SERGE ABITEBOUL, INRIA
BERND AMANN, Cedric-CNAM and INRIA-Futurs
OMAR BENJELLOUN and FRED DANG NGOC, INRIA

XML is becoming the universal format for data exchange between applications. Recently, the emergence of Web services as standard means of publishing and accessing data on the Web introduced a new class of XML documents, which we call intensional documents. These are XML documents where some of the data is given explicitly while other parts are defined only intensionally by means of embedded calls to Web services. When such documents are exchanged between applications, one has the choice of whether or not to materialize the intensional data (i.e., to invoke the embedded calls) before the document is sent. This choice may be influenced by various parameters, such as performance and security considerations. This article addresses the problem of guiding this materialization process. We argue that—like for regular XML data—schemas (à la DTD and XML Schema) can be used to control the exchange of intensional data and, in particular, to determine which data should be materialized before sending a document, and which should not. We formalize the problem and provide algorithms to solve it. We also present an implementation that complies with real-life standards for XML data, schemas, and Web services, and is used in the Active XML system. We illustrate the usefulness of this approach through a real-life application for peer-to-peer news exchange.

Categories and Subject Descriptors: H.2.5 [Database Management]: Heterogeneous Databases
General Terms: Algorithms, Languages, Verification
Additional Key Words and Phrases: Data exchange, intensional information, typing, Web services, XML

This work was partially supported by EU IST project DBGlobe (IST 2001-32645). This work was done while T. Milo, O. Benjelloun, and F. D. Ngoc were at INRIA-Futurs.
Authors' current addresses: T.
Milo, School of Computer Science, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel; email: milo@cs.tau.ac.il; S. Abiteboul and B. Amann, INRIA-Futurs, Parc Club Orsay-University, 4 Rue Jean Monod, 91893 Orsay Cedex, France; email: {serge.abiteboul, bernd.amann}@inria.fr; O. Benjelloun, Gates Hall 4A, Room 433, Stanford University, Stanford, CA 94305-9040; email: benjelloun@db.stanford.edu; F. D. Ngoc, France Telecom R&D and LRI, 38–40, rue du Général Leclerc, 92794 Issy-les-Moulineaux, France; email: Frederic.dangngoc@rd.francetelecom.com.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2005 ACM 0362-5915/05/0300-0001 $5.00
ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 1–40.

1. INTRODUCTION

XML, a self-describing semistructured data model, is becoming the standard format for data exchange between applications. Recently, the use of XML documents where some parts of the data are given explicitly, while others consist of programs that generate data, started gaining popularity. We refer to such documents as intensional documents, since some of their data are defined by programs. We term materialization the process of evaluating some of the programs included in an intensional XML document and replacing them by their results. The goal of this article is to study the new issues raised by the exchange of such intensional XML documents between applications, and, in particular, how to decide which parts of the data should be materialized before the document is sent and which should not.
This work was developed in the context of the Active XML system [Abiteboul et al. 2002, 2003b] (see also the Active XML homepage at http://www-rocq.inria.fr/verso/Gemo/Projects/axml). The latter is centered around the notion of Active XML documents, which are XML documents where parts of the content are explicit XML data, whereas other parts are generated by calls to Web services. In the present article, we are only concerned with certain aspects of Active XML that are also relevant to many other systems. Therefore, we use the more general term of intensional documents to denote documents with such features.

To understand the problem, let us first highlight an essential difference between the exchange of regular XML data and that of intensional XML data. In frameworks such as those of Sun [1] or PHP [2], intensional data is provided by programming constructs embedded inside documents. Upon request, all the code is evaluated and replaced by its result to obtain a regular, fully materialized HTML or XML document, which is then sent. In other terms, only extensional data is exchanged. This simple scenario has recently changed due to the emergence of standards for Web services such as SOAP, WSDL [3], and UDDI [4]. Web services are becoming the standard means to access, describe, and advertise valuable, dynamic, up-to-date sources of information over the Web. Recent frameworks such as Active XML, but also Macromedia MX [5] and Apache Jelly [6], started allowing for the definition of intensional data, by embedding calls to Web services inside documents. This new generation of intensional documents has a property that we view here as crucial: since Web services can essentially be called from everywhere on the Web, one does not need to materialize all the intensional data before sending a document.
Instead, a more flexible data exchange paradigm is possible, where the sender sends an intensional document, and gives the receiver the freedom to materialize the data if and when needed. In general, one can use a hybrid approach, where some data is materialized by the sender before the document is sent, and some by the receiver.

[1] See Sun's Java Server Pages (JSP) online at http://java.sun.com/products/jsp.
[2] See the PHP hypertext preprocessor at http://www.php.net.
[3] See the W3C Web services activity at http://www.w3.org/2002/ws.
[4] UDDI stands for Universal Description, Discovery, and Integration of Business for the Web. Go online to http://www.uddi.org.
[5] Macromedia Coldfusion MX. Go online to http://www.macromedia.com/.
[6] Jelly: Executable XML. Go online to http://jakarta.apache.org/commons/sandbox/jelly.

As a simple example, consider an intensional document for the Web page of a local newspaper. It may contain some extensional XML data, such as its name, address, and some general information about the newspaper, and some intensional fragments, for example, one for the current temperature in the city, obtained from a weather forecast Web service, and a list of current art exhibits, obtained, say, from the TimeOut local guide. In the traditional setting, upon request, all calls would be activated, and the resulting fully materialized document would be sent to the client. We allow for more flexible scenarios, where the newspaper reader could also receive a (smaller) intensional document, or one where some of the data is materialized (e.g., the art exhibits) and some is left intensional (e.g., the temperature). A benefit that can be seen immediately is that the user is now able to get the weather forecast whenever she pleases, just by activating the corresponding service call, without having to reload the whole newspaper document.
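To make the hybrid exchange concrete, here is a minimal Python sketch of the newspaper example, with an intensional document encoded as a nested dict and embedded calls as placeholder entries. The service stubs (`Get_Temp`, `TimeOut`) and the dict encoding are illustrative assumptions; the article's actual Active XML syntax is shown in Section 7.

```python
def materialize(doc, services, to_materialize):
    """Invoke only the selected embedded calls; leave the rest intensional."""
    if isinstance(doc, dict) and "call" in doc:
        if doc["call"] in to_materialize:
            return services[doc["call"]](*doc.get("args", []))
        return doc  # left intensional: the receiver may invoke it later
    if isinstance(doc, dict):
        return {k: materialize(v, services, to_materialize) for k, v in doc.items()}
    return doc

# Hypothetical stubs standing in for the weather and TimeOut services.
services = {
    "Get_Temp": lambda city: {"temp": "12C"},
    "TimeOut":  lambda city: ["Monet at Orsay", "Rodin retrospective"],
}

newspaper = {
    "title": "The Daily",
    "weather": {"call": "Get_Temp", "args": ["Paris"]},
    "exhibits": {"call": "TimeOut", "args": ["Paris"]},
}

# The sender materializes the exhibits but ships the temperature intensionally,
# so the reader can refresh it later without reloading the whole page.
sent = materialize(newspaper, services, to_materialize={"TimeOut"})
```

After this step, `sent["exhibits"]` holds extensional data, while `sent["weather"]` still contains the embedded call, ready to be invoked by the receiver.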
Before getting to the description of the technical solution we propose, let us first see some of the considerations that may guide the choice of whether or not to materialize some intensional data:

— Performance. The decision of whether to execute calls before or after the data transfer may be influenced by the current system load or the cost of communication. For instance, if the sender's system is overloaded or communication is expensive, the sender may prefer to send smaller files and delegate as much materialization of the data as possible to the receiver. Otherwise, it may decide to materialize as much data as possible before transmission, in order to reduce the processing on the receiver's side.

— Capabilities. Although Web services may in principle be called remotely from everywhere on the Internet, it may be the case that the particular receiver of the intensional document cannot perform them; for example, a newspaper reader's browser may not be able to handle the intensional parts of a document. And even if it does, the user may not have access to a particular service, for example, because of the lack of access rights. In such cases, it is compulsory to materialize the corresponding information before sending the document.

— Security. Even if the receiver is capable of invoking service calls, she may prefer not to do so for security reasons. Indeed, service calls may have side effects. Receiving intensional data from an untrusted party and invoking the calls embedded in it may thus lead to severe security violations. To overcome this problem, the receiver may decide to refuse documents with calls to services that do not belong to some specific list. It is then the responsibility of a helpful sender to materialize all the data generated by such service calls before sending the document.

— Functionalities. Last but not least, the choice may be guided by the application.
In some cases, for example, for a UDDI-like service registry, the origin of the information is what is truly requested by the receiver, and hence service calls should not be materialized. In other cases, one may prefer to hide the true origin of the information, for example, for confidentiality reasons, or because it is an asset of the sender, so the data must be materialized. Finally, calling services might also involve some fees that should be paid by one or the other party.

Observe that the data returned by a service may itself contain some intensional parts. As a simple example, TimeOut may return a list of 10 exhibits, along with a service call to get more. Therefore, the decision of materializing some information or not is inherently a recursive process. For instance, for clients who cannot handle intensional documents, the newspaper server needs to recursively materialize the whole document before sending it.

How can one guide the materialization of data? For purely extensional data, schemas (like DTD and XML Schema) are used to specify the desired format of the exchanged data. Similarly, we use schemas to control the exchange of intensional data and, in particular, the invocation of service calls. The novelty here is that schemas also entail information about which parts of the data are allowed to be intensional, which service calls may appear in the documents, and where. Before sending information, the sender must check if the data, in its current structure, matches the schema expected by the receiver. If not, the sender must perform the required calls for transforming the data into the desired structure, if this is possible.

A typical such scenario is depicted in Figure 1. The sender and the receiver, based on their personal policies, have agreed on a specific data exchange schema.

[Fig. 1. Data exchange scenario for intensional documents.]
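The recursive nature of full materialization noted above can be sketched as follows. As before, the dict encoding and the `TimeOut`/`More` service stubs are invented for illustration; in particular, `TimeOut` here returns two exhibits plus an embedded call to fetch more, so materializing its result requires recursing into it.

```python
def materialize_all(doc, services):
    """Recursively invoke every embedded call; a call's result may itself
    contain further calls (e.g., TimeOut returning a 'get more' call)."""
    if isinstance(doc, dict) and "call" in doc:
        result = services[doc["call"]](*doc.get("args", []))
        return materialize_all(result, services)  # result may be intensional too
    if isinstance(doc, dict):
        return {k: materialize_all(v, services) for k, v in doc.items()}
    if isinstance(doc, list):
        out = []
        for item in doc:
            r = materialize_all(item, services)
            # splice the forest returned by a call into its parent's children
            if isinstance(item, dict) and "call" in item and isinstance(r, list):
                out.extend(r)
            else:
                out.append(r)
        return out
    return doc

# Hypothetical stubs: TimeOut returns two exhibits plus a call for more.
services = {
    "TimeOut": lambda: ["Monet at Orsay", "Rodin retrospective", {"call": "More"}],
    "More":    lambda: ["Picasso ceramics"],
}

page = {"exhibits": {"call": "TimeOut"}}
full = materialize_all(page, services)
```

This is what a sender would run for a client that cannot handle intensional documents at all: the output contains no remaining calls.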
Now, consider some particular data t to be sent (represented by the grey triangle in the figure). In fact, this document represents a set of equivalent, increasingly materialized, pieces of information—the documents that may be obtained from t by materializing some of the service calls (q, g, and f). Among them, the sender must find at least one document conforming to the exchange schema (e.g., the dashed one) and send it.

This schema-based approach is particularly relevant in the context of Web services, since their input parameters and their results must match particular XML Schemas, which are specified in their WSDL descriptions. The techniques presented in this article can be used to achieve that.

The contributions of the article are as follows:

(1) We provide a simple but flexible XML-based syntax to embed service calls in XML documents, and introduce an extension of XML Schema for describing the required structure of the exchanged data. This consists of adding new type constructors for service call nodes. In particular, our typing distinguishes between accepting a concrete type, for example, a temperature element, and accepting a service call returning some data of this type, for example, () → temperature.

(2) Given a document t and a data exchange schema, the sender needs to decide which data has to be materialized. We present algorithms that, based on schema and data analysis, find an effective sequence of call invocations, if such a sequence exists (or detect a failure if it does not). The algorithms provide different levels of guarantee of success for this rewriting process, ranging from "sure" success to a "possible" one.
(3) At a higher level, in order to check compatibility between applications, the sender may wish to verify that all the documents generated by its application may be sent to the target receiver, which involves comparing two schemas. We show that this problem can be easily reduced to the previous one.

(4) We illustrate the flexibility of the proposed paradigm through a real-life application: peer-to-peer news syndication. We show that Web services can be customized by using and enforcing several exchange schemas.

As explained above, our algorithms find an effective sequence of call invocations, if one exists, and detect failure otherwise. In a more general context, an error may arise because of type discrepancies between the caller and the receiver. One may then want to modify the data and convert it to the right structure, using data translation techniques such as those provided by Cluet et al. [1998] and Doan et al. [2001]. As a simple example, one may need to convert a temperature from Celsius degrees to Fahrenheit. In our context, this would amount to plugging in (possibly automatically) intermediary external services to perform the needed data conversions. Existing data conversion algorithms can be adapted to determine when conversion is needed, and our typing algorithms can be used to check that the conversions lead to matching types. Data conversion techniques are thus complementary and could be added to our framework; the focus here, however, is on partially materializing the given data to match the specified schema.

The core technique of this work is based on automata theory. For presentation reasons, we first detail a simplified version of the main algorithm. We then describe a more dynamic, optimized one, which is based on the same core idea and is used in our implementation.
Although the problems studied in this article are related to standard typing problems in programming languages [Mitchell 1990], they differ here due to the regular expressions present in XML schemas. Indeed, the general problem that will be formalized here was recently shown to be undecidable by Muscholl et al. [2004]. We introduce a restriction that is practically founded and leads to a tractable solution.

All the ideas presented here have been implemented and tested in the context of the Active XML system [Abiteboul et al. 2002] (see also the Active XML homepage at http://www-rocq.inria.fr/verso/Gemo/Projects/axml). This system provides persistent storage for intensional documents with embedded calls to Web services, along with active features to automatically trigger these services and thus enrich/update the intensional documents. Furthermore, it allows developers to declaratively specify Web services that support intensional documents as input and output parameters. We used the algorithms described here to implement a module that controls the types of documents being sent to (and returned by) these Web services. This module is in charge of materializing the appropriate data fragments to meet the interface requirements.

In the following, we assume that the reader is familiar with XML and its typing languages (DTD or XML Schema). Although some basic knowledge about SOAP and WSDL might be helpful to understand the details of the implementation, it is not necessary.

The article is organized as follows: Section 2 describes a simple data model and schema specification language and formalizes the general problem. Additional features for a richer data model that facilitate the design of real-life applications are also introduced informally. Section 3 focuses on difficulties that arise in this context, and presents the key restriction that we consider.
It also introduces the notions of "safe" and "possible" rewritings, which are studied in Sections 4 and 5, respectively. The problem of checking compatibility between intensional schemas is considered in Section 6. The implementation is described in Section 7. Then, we present in Section 8 an application of the algorithms to Web services customization, in the context of peer-to-peer news syndication. The last section discusses related work and concludes the article.

2. THE MODEL AND THE PROBLEM

To simplify the presentation, we start by formalizing the problem using a simple data model and a DTD-like schema specification. More precisely, we define the notion of rewriting, which corresponds to the process of invoking some service calls in an intensional document, in order to make it conform to a given schema. Once this is clear, we explain how things can be extended to provide the features ignored by the first simple model, and in particular we show how richer schemas are taken into account.

2.1 The Simple Model

We first define documents, then move to schemas, before formalizing the key notion of rewritings, and stating the results obtained in this setting, which will be detailed in the following sections.

[Fig. 2. An intensional document before/after a call.]

2.1.1 Simple Intensional XML Documents. We model intensional XML documents as ordered labeled trees consisting of two types of nodes: data nodes and function nodes. The latter correspond to service calls. We assume the existence of some disjoint domains: 𝒩 of nodes, ℒ of labels, ℱ of function names [7], and 𝒟 of data values. In the sequel we use v, u, w to denote nodes, a, b, c to denote labels, and f, g, q to denote function names.

Definition 2.1. An intensional document d is an expression (T, λ), where T = (N, E, <) is an ordered tree.
Here N ⊂ 𝒩 is a finite set of nodes, E ⊂ N × N is the set of edges, < associates with each node in N a total order on its children, and λ : N → ℒ ∪ ℱ ∪ 𝒟 is a labeling function for the nodes, where only leaf nodes may be assigned data values from 𝒟.

Nodes with a label in ℒ ∪ 𝒟 are called data nodes, while those with a label in ℱ are called function nodes. The children subtrees of a function node are the function parameters. When the function is called, these subtrees are passed to it. The return value then replaces the function node in the document. This is illustrated in Figure 2, where data nodes are represented by circles, function nodes are represented by squares, and data values are quoted. Here, the Get Temp Web service is invoked with the city name as a parameter. It returns a temp element, which replaces the function node. An example of the actual XML representation of intensional documents is given in Section 7. Observe that the parameter subtrees and the return values may themselves be intensional documents, that is, contain function nodes.

2.1.2 Simple Schemas. We next define simple DTD-like schemas for intensional documents. The specification associates (1) a regular expression with each element name, describing the structure of the corresponding elements, and (2) a pair of regular expressions with each function name, describing the function signature, namely, its input and output types.

Definition 2.2. A document schema s is an expression (L, F, τ), where L ⊂ ℒ and F ⊂ ℱ are finite sets of labels and function names, respectively; τ is a function that maps each label name l ∈ L to a regular expression over L ∪ F or to the keyword "data" (for atomic data), and maps each function name f ∈ F to a pair of such expressions, called the input and output type of f and denoted by τin(f) and τout(f).

[7] We assume in this model that function names identify Web service operations. This translates in the implementation to several parameters (URL, operation name, ...
) that allow one to invoke the Web services.

For instance, the following is an example of a schema:

(∗)
data:
    τ(newspaper) = title.date.(Get Temp | temp).(TimeOut | exhibit*)
    τ(title) = data
    τ(date) = data
    τ(temp) = data
    τ(city) = data
    τ(exhibit) = title.(Get Date | date)
functions:
    τin(Get Temp) = city        τout(Get Temp) = temp
    τin(TimeOut) = data         τout(TimeOut) = (exhibit | performance)*
    τin(Get Date) = title       τout(Get Date) = date

We next define the semantics of a schema, that is, the set of its instances. To do so, if R is a regular expression over L ∪ F, we denote by lang(R) the regular language defined by R. The expression lang(data) denotes the set of data values in 𝒟.

Definition 2.3. An intensional document t is an instance of a schema s = (L, F, τ) if for each data node (respectively, function node) n ∈ t with label l ∈ L (respectively, l ∈ F), the labels of n's children form a word in lang(τ(l)) (respectively, in lang(τin(l))). For a function name f ∈ F, a sequence t1, ..., tn of intensional trees is an input instance (respectively, output instance) of f if the labels of the roots form a word in lang(τin(f)) (respectively, lang(τout(f))), and all the trees are instances [8] of s.

It is easy to see that the document of Figure 2(a) is an instance of the schema of (∗), but not of a schema with τ′ identical to τ above, except for

(∗∗)    τ′(newspaper) = title.date.temp.(TimeOut | exhibit*)

However, since τout(Get Temp) = temp, the document can always be turned into an instance of the schema of (∗∗), by invoking the Get Temp service call and replacing it by its return value. On the other hand, consider a schema with τ′′ identical to τ, except for

(∗∗∗)   τ′′(newspaper) = title.date.temp.exhibit*

According to its signature, a call to TimeOut may also return performance elements.
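Definitions 2.1 and 2.3 can be sketched in Python. The tree encoding and the regex compilation below are illustrative simplifications (names with spaces are written with underscores so each label or function name is a single token); this is not the article's implementation.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    """Ordered labeled tree of Definition 2.1; children order = list order."""
    label: str
    children: list = field(default_factory=list)

def compile_type(expr):
    """Compile a schema regular expression over label/function names,
    e.g. 'title.date.(Get_Temp|temp).(TimeOut|exhibit*)', into a Python
    regex over space-terminated tokens."""
    tokens = re.findall(r"[A-Za-z_]+|[()|*]", expr.replace(".", " "))
    parts = [f"(?:{t} )" if t[0].isalpha() else t for t in tokens]
    return re.compile("".join(parts) + r"\Z")

def conforms(node, expr):
    """Definition 2.3, one level: do the children's labels spell a word
    in lang(expr)?"""
    word = "".join(c.label + " " for c in node.children)
    return bool(compile_type(expr).match(word))

# The Figure 2(a) newspaper: title and date are extensional, the temperature
# and the exhibit list are embedded calls (parameter subtrees elided).
doc = Node("newspaper", [Node("title"), Node("date"),
                         Node("Get_Temp"), Node("TimeOut")])

star  = "title.date.(Get_Temp|temp).(TimeOut|exhibit*)"   # schema (*)
star2 = "title.date.temp.(TimeOut|exhibit*)"              # schema (**)

assert conforms(doc, star)        # an instance of (*)
assert not conforms(doc, star2)   # not yet an instance of (**)

# Invoking Get_Temp replaces the function node by its temp result ...
doc.children[2] = Node("temp")
assert conforms(doc, star2)       # ... and now the document matches (**)
```

The check is per node, exactly as in Definition 2.3: a full instance test would apply `conforms` recursively to every data and function node of the tree.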
Therefore, in general, the document may not become an instance of the schema of (∗∗∗). However, it is possible that it becomes one (if TimeOut returns a sequence of exhibits). The only way to know is to call the service. This type of "on-line" testing is fine if the calls have no side effects or do not cost money. If they do, we might want to warn the sender, before invoking the call, that the overall process may not succeed, and see if she wants to proceed nevertheless.

[8] Like in DTDs, every subtree conforms to the same schema as the whole document.

2.1.3 Rewritings. When the proper invocation of service calls leads for sure to the desired structure, we say that the rewriting is safe, and when it only possibly does, that this is a possible rewriting. These notions are formalized next.

Definition 2.4. For a tree t, we say that t →v t′ if t′ is obtained from t by selecting a function node v in t with some label f and replacing it by an arbitrary output instance of f. [9] If t →v1 t1 →v2 t2 ··· →vn tn, we say that t rewrites into tn, denoted t →* tn. The nodes v1, ..., vn are called the rewriting sequence. The set of all trees t′ such that t →* t′ is denoted ext(t).

Note that in the rewriting process, the replacement of a function node v by its output instance is independent of any function semantics. In particular, we may replace two occurrences of the same function by two different output instances. Stressing somewhat the semantics, this can be interpreted as if the value returned by the function changes over time. This captures the behavior of real-life Web services, like a temperature or stock exchange service, where two consecutive calls may return a different result.

Definition 2.5. Let t be a tree and s a schema. We say that t possibly rewrites into s if ext(t) contains some instance of s.
We say that t safely rewrites into s either if t is already an instance of s, or if there exists some node v in t such that all trees t′ where t →v t′ safely rewrite into s.

The fact that t safely rewrites into s means that we can be sure, without actually making any call, that we can choose a sequence of calls that will turn t into an instance of s. For instance, the document of Figure 2(a) safely rewrites into the schema of (∗∗) but only possibly rewrites into that of (∗∗∗).

Finally, to check compatibility between applications, we may want to check whether all documents generated by one application (e.g., the sender application) can be safely rewritten into the structure required by the second application (e.g., the agreed data exchange format).

Definition 2.6. Let s be a schema with some distinguished label r called the root label. We say that s safely rewrites into another schema s′ if all the instances t of s with root label r rewrite safely into instances of s′.

For instance, consider the schema of (∗) presented above with newspaper as the root label. This schema safely rewrites into the schema of (∗∗) but does not safely rewrite into the one of (∗∗∗).

[9] By replacing the node by an output instance we mean that the node v and the subtree rooted at it are deleted from t, and the forest trees t1, ..., tn of some output instance of f are plugged in at the place of v (as children of v's parent).

2.1.4 The Results. Going back to the data exchange scenario described in the introduction, we can now specify our main contributions:

(1) We present an algorithm that tests whether a document t can be safely rewritten into some schema s and, if so, provides an effective rewriting sequence; and

(2) When safe rewriting is not possible, we present an algorithm that tests whether t may be possibly rewritten into s, and finds a possibly successful rewriting sequence, if one exists.
(3) We also provide an algorithm for testing, given two schemas, whether one can be safely rewritten into the other.

2.2 A Richer Data Model

In order to make our presentation clear, and to simplify the definition of document and schema rewritings, we used a very simple data model and schema language. We now present some useful extensions that bring more expressive power, and facilitate the design of real-life applications.

2.2.1 Function Patterns. The schemas we have seen so far specify that a particular function, identified by its name, may appear in the document. But sometimes, one does not know in advance which functions will be used at a given place, and yet may want to allow their usage, provided that they conform to certain conditions. For instance, we may have several editions of the newspaper of Figure 2(a), for different cities. A common intensional schema for such documents should not require the use of a particular Get Temp function, but rather allow for a set of functions which have a suitable signature: they should accept as single parameter a city element, and return a temperature element, as previously defined in τ. The particular weather forecast service that will be used may depend on the city and be, for instance, retrieved from some UDDI service registry. One may also want to enforce some security policies, for example, by specifying that the allowed functions should return only extensional results.

To specify such sets of functions, we use function patterns. A function pattern definition consists of a Boolean predicate over function names and a function signature. A function belongs to the pattern if its name satisfies the Boolean predicate and its signature is the same as the required one. A more liberal definition would be one that requires only that the function signature be subsumed by the one specified in the definition, that is, that every instance of the former also be an instance of the latter.
This is possible but computationally heavier, since it entails checking inclusion of the tree languages defined by the two schemas. In terms of implementation, one can assume that this new Boolean predicate is implemented as a Web service that takes a function name as input and returns true or false.

To take this feature into account in our model, we define 𝒫 to be a domain of function pattern names. A schema s = (L, F, P, τ) now also contains, in addition to the elements and functions, a set of function patterns P ⊂ 𝒫. τ associates with each function pattern p ∈ P a signature and a Boolean predicate over function names. We can now, for instance, write a schema for our local newspapers as

    τ(newspaper) = title.date.(Forecast | temp).(TimeOut | exhibit*)
    τname(Forecast) = UDDIF ∧ InACL
    τin(Forecast) = city
    τout(Forecast) = temp

This schema enforces the fact that the function used in the document has the proper signature and satisfies the Boolean predicates UDDIF and InACL. The first predicate (UDDIF) is a Web service that checks if the given function (service) is registered in some particular UDDI registry. Predicate InACL then verifies that the caller has the necessary access privileges for executing the given function (calling the service). More generally, any Web service that allows the verification of some property of the particular function node in the document (here, the weather forecast service), possibly with respect to some contextual information (e.g., the identity of the caller, the system date, etc.), can be used.

2.2.2 Wildcards. Together with function patterns, one may also use wildcards in schemas. Their use is already common for data. In XML Schema, the keyword any expresses the fact that a certain part of a document may contain an arbitrary element, attribute, or even an unconstrained subtree. XML Schema further allows one to restrict wildcards to (or exclude from them) certain domains of data, based on their namespace. [10] This extends naturally to our context. We consider the namespace of a function node in an intensional document to be the namespace of the called Web service. [11] Therefore, we can use wildcards to allow certain document parts to contain arbitrary subtrees with arbitrary functions, or to restrict them to (respectively, exclude from them) certain classes of functions. We believe that the combination of wildcards and function patterns provides a good level of flexibility for describing the structure of documents. For instance, one may specify that the temperature is obtained from an arbitrary function that returns a correct temp element, but may take any argument, be it data or a function call.

2.2.3 Restricted Service Invocations. Another interesting extension is the following: we assumed so far that all the functions appearing in a document may be invoked in a rewriting, in order to match a given schema. This is not always the case, for the same reasons as mentioned in the Introduction (security, cost, access rights, etc.). The logic of rewritings has to take this into account, essentially by considering, among all possible rewritings, only a proper subset. For that, the function names/patterns in the schema can be partitioned into two disjoint groups of invocable and noninvocable ones. A legal rewriting is then one that invokes only invocable functions. The notions of safe and possible rewritings extend naturally to consider only legal rewritings. Since we are interested here only in such rewritings, whenever we talk in the sequel about a function invocation, we mean an invocable one.

[10] The W3C XML activity. Go online to www.w3.org/XML.
[11] The namespace is described in the service's WSDL description and, in our model, is one of the components of the function name.
2.2.4 XML, XML Schema, and WSDL. The simple XML trees considered above ignore a number of features of XML, such as attributes, and use a single domain for data values. A richer setting may be obtained by using the full-fledged XML data model (see footnote 10). Similarly, richer schemas may be defined by adopting XML Schema (see footnote 10), rather than the simple DTD-like schemas used above. Indeed, our implementation is based on the full XML model and on an extension of XML Schema. In our prototype, function calls embedded in XML documents are represented by special function elements that identify the Web services to be invoked and specify the value of input parameters. XML Schemas are enriched for intensional documents (to form XML Schema_int) by function and function pattern definitions. In both cases, things are very much along the lines of the simple model we used above. We will see an example and more details of this in Section 7. Function signatures are usually specified by service providers as WSDL definitions. We similarly extend WSDL to allow the use of XML Schema_int instead of just XML Schema for type specifications, and we term this extended language WSDL_int. While intensional XML documents use a standard XML syntax, XML Schema_int schemas do not comply with the XML Schema syntax. The extension is minimal, and very much along the lines of the simple syntax we used above. We will also see an example and more details in Section 7. Note that this is not the case for WSDL, since its specification does not enforce the use of a specific schema language. Therefore WSDL_int documents are valid WSDL documents.

3. EXCHANGING INTENSIONAL DATA

We start by considering document rewriting. Schema rewriting is considered later, in Section 6. Given a document t that the sender wishes to send, and a data exchange schema s, the sender needs to rewrite t into s.
A possible process is the following:

(1) Check if t safely rewrites to s and, if so, find a rewriting sequence, namely, a sequence of functions that need to be invoked to transform t into the required structure (preferably the shortest or cheapest one, according to some criteria).

(2) If a safe rewriting does not exist, check whether at least t may rewrite to s. If it is acceptable to do so (the sender accepts that the rewriting may fail), try to find a successful rewriting sequence if one exists (preferably with the least side effects on the path to find it, and at the least cost).

A variant is to combine safe and possible rewritings. For instance, one could consider a mixed approach that first invokes some function calls and then attempts from there to find safe rewritings. There are many alternative strategies. We will first consider safe document rewriting, then move to possible rewriting, and finally consider the mixed approach. As in the previous section, to simplify the presentation, we first consider the problems in the context of the simple data model defined above. Then, in Section 7, we will show that the proposed solutions naturally extend to richer data/schemas, and in particular to the context of full-fledged XML and XML Schema. Before presenting solutions, let us first explain some of the difficulties that one encounters when attempting to rewrite a document to a desired exchange schema. While the examples given in the previous sections were rather simple—and one could determine by simple observation of the document which service calls need to be materialized—things may be much more complex in general. We explain next why this is the case and present a restriction that will make the problem tractable.
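The two-step process above (try a safe rewriting first, then fall back to a possible one when the sender tolerates failure) can be sketched as a small driver. The two rewriting procedures are hypothetical stubs standing in for the algorithms of Sections 4 and 5; the document and schema encodings are our own toy simplifications.

```python
# Sketch of the sender's decision procedure (steps (1)-(2) above).
# safe_rewriting and possible_rewriting are hypothetical stubs: the
# real procedures are the algorithms of Sections 4 and 5.

def safe_rewriting(t, s):
    # Stub: a safe sequence (here: the empty one) exists iff the
    # document's label word already belongs to the exchange schema.
    return [] if t in s else None

def possible_rewriting(t, s):
    # Stub: optimistically propose invoking every embedded call of t.
    return [f for f in t if f.startswith("call:")] or None

def choose_rewriting(t, s, accept_failure=False):
    seq = safe_rewriting(t, s)
    if seq is not None:
        return ("safe", seq)
    if accept_failure:              # the sender accepts a possible failure
        seq = possible_rewriting(t, s)
        if seq is not None:
            return ("possible", seq)
    return ("fail", None)
```
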
3.1 Going Back and Forth

The rewriting sequence may depend on the answers returned by the functions: we may call one function at some place in the document, and then decide, possibly based on its answer, that another function in the new data or in a different part of the document needs to be called, and so on. In general, this may force us to analyze the same portion of the document many times, reexamining the same function call again and again, deciding at each iteration whether, based on the answers returned so far, the function now needs to be called or not. Such an iterative process may naturally be very expensive. We thus restrict our attention here to a simpler class of "one-pass" left-to-right rewritings^12 where, for each node, the children are processed from left to right, and once a child function is invoked, no further invocations are applied to its left-hand sibling functions (i.e., successive invocations are limited to the new children functions possibly returned by the call, plus the right-hand siblings). This restriction also applies to the results of function calls, which are also processed in a left-to-right manner. Observe that, in general, with this restriction, one can miss a successful rewriting that is not left-to-right. In all the real-life examples that we considered, left-to-right rewritings were not limiting.

3.2 Infinite Search Space

The essence of safe rewriting is that it succeeds no matter what specific answers, among the possible ones, the invoked functions return. The domain of the possible answers of each function is determined by its output type. Since the regular expression defining this type may contain starred ("*") subexpressions, the domain is infinite, and the safe rewriting should account for each possible element in this infinite domain. Moreover, the result of a service call may contain intensional data, namely, other function calls. In general, the number of such new functions may be unbounded.
For instance, consider a Get_Exhibits function, with output type τ_out(Get_Exhibits) = Get_Exhibit*. When Get_Exhibits is invoked, an arbitrarily large number of Get_Exhibit functions may be returned, and one has to check, for each of the occurrences, whether this particular function call needs to be invoked and whether, after the invocation, the document can still be (safely) rewritten into the desired schema.

3.3 Recursive Calls

As explained above, when a function is invoked, the returned data may itself contain new calls. To conform to the target schema, these calls may need to be triggered as well. The answer again may contain some new calls, etc. This may lead to infinite computations. Observe that such recursive situations do occur in practice. For example, a search engine Web service may return, for a given keyword, some document URLs plus (possibly) a function node for obtaining more answers. Calling this function, one can obtain a new list and perhaps another function node, etc. If the target schema requires plain XML data, we need to repeatedly call the functions until all the data has been obtained. In this example, and often in general, one may want to bound the recursion. This suggests the following definition and our corresponding restriction:

Definition 3.1. For a rewriting sequence t →^{v_1} t_1 ··· →^{v_n} t_n, we say that a function node v_j depends on a function node v_i if v_j ∈ t_i but v_j ∉ t_{i−1} (namely, if the node v_j was returned by the invocation of the function v_i). We say that a rewriting sequence is of depth k if the dependency graph among the nodes contains no paths of length greater than k.

The restriction. The restriction that we will impose below is the following: we will consider only k-depth left-to-right rewritings.

^12 One could similarly choose right-to-left.
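Definition 3.1 can be sketched directly: record, for every function node that appears during a rewriting, which invocation returned it, and bound the longest dependency chain by k. The mapping-based encoding below is an assumption made for illustration.

```python
# Sketch of the k-depth check of Definition 3.1. produced_by maps a
# function node to the invocation that returned it (None for nodes
# already present in the original document); the depth of a rewriting
# is the length of the longest chain in this dependency graph.

def rewriting_depth(produced_by):
    def chain_length(v):
        n = 0
        while produced_by[v] is not None:
            v = produced_by[v]
            n += 1
        return n
    return max((chain_length(v) for v in produced_by), default=0)

def is_k_depth(produced_by, k):
    return rewriting_depth(produced_by) <= k
```

For example, if invoking v1 returned v2, and invoking v2 returned v3, the dependency graph contains a path of length 2, so the rewriting is 2-depth but not 1-depth.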
Note that while this restriction limits the search space, the latter remains infinite, due to the starred subexpressions appearing in the schema. However, under this restriction, we can exhibit a finite representation (based on automata) of the search space and use automata-based techniques to solve the safe rewriting problem. Even with this restriction, the framework is general enough to handle most practical cases. The problem of arbitrary safe rewriting (without the left-to-right k-depth restriction) was recently shown to be undecidable [Muscholl et al. 2004]. Further work by the same authors [Muscholl et al. 2004; Segoufin 2003] has shown that the left-to-right safe rewriting problem is actually decidable, without the k-depth restriction, but the corresponding algorithms have a much higher complexity (EXPTIME or 2EXPTIME, depending on whether the target language is deterministic or not)—and thus are mostly of theoretical interest.

4. SAFE REWRITING

In this section, we present an algorithm for k-depth safe rewriting. We are given a document tree t and a schema s_0 = (L_0, F_0, τ_0) describing the signature of all the functions in the document (as well as the elements/functions used in these signatures). This corresponds to having a WSDL description for each service being used, which is a normal requirement for Web services. We are also given a data exchange schema s = (L, F, τ), and our goal is to safely rewrite t into s (with a k-depth left-to-right rewriting). To simplify, we assume that function types are the same in s_0 and s, including definitions of the corresponding subelements. This is reasonable since the function definitions represent the WSDL description of the functions, as given by the service providers. While this assumption simplifies the rewriting process, it is not essential.
The algorithm can be extended to handle distinct signatures. For clarity, we decompose the presentation of the algorithm into three parts:

(1) The first part explains how to deal with function parameters. The main point is that, since the parameters may themselves contain other function calls (with parameters), the tree rewriting starts from the deepest function calls and recursively moves upward.

(2) The second part explains how the rewriting in each such iteration is performed. The key observation is that this can be achieved by traversing the tree from top to bottom, handling one node (and its direct children) at a time.

(3) Finally, the third and most intricate part explains how each such node, and its direct children, is handled. In particular, we show how to decide which of the functions among these children need to be invoked in order to make the node fit the desired structure.

For presentation reasons, we give here a simplified version of the actual algorithm used in the implementation. To optimize the computation, a more dynamic variant, based on the same idea, is used there. We explain the main principles of this variant in Section 7.

4.1 Rewriting Function Parameters

To invoke a function, its parameters should be of the right type. If they are not, they should be rewritten to fit that type. When rewriting the parameters, again, the functions appearing in them can be invoked only if their own parameters are (or can be rewritten into) the expected input type. We thus start from the "deepest" functions, that is, those having no function occurrences in their parameters, and recursively move upward:

— For the deepest functions, we verify that their parameters are indeed instances of the corresponding input types. If not, the rewriting fails.

— Then, moving upward, we look at a function f and its parameters. All the functions appearing in these parameters were already handled—namely, their parameters can be safely rewritten to the appropriate type.
We thus ignore the parameters of these lower-level calls (together with all the functions included in them) and just try to safely rewrite f's own parameters into the required structure. If this is not possible, the rewriting fails, for the same reason as above.

At the end of this process we know that all the outermost function calls in t are fine. We can thus ignore their parameters (and whatever functions appear in them) and need to safely rewrite t into s by invoking only these outermost calls.

4.2 Top-Down Traversal

In each iteration of the above recursive procedure we are given a tree (or a forest) where the parameters of all the outermost functions have already been handled, and we need to safely rewrite the tree (forest) by invoking only these outermost functions. To do that, we can traverse the tree (forest) top down, treating at each step a single node and its immediate children. Consider a node n whose children labels form a word w. Note that the subtree rooted at n can be safely rewritten into the target schema s = (L, F, τ) if and only if (1) w can be safely rewritten into a word in lang(τ(label(n))), and (2) each of n's children subtrees can itself be safely rewritten into an instance of s. Note that since we assumed that s_0 and s agree on function types, we only need to rewrite the original children of n and not those that are returned by function invocations. Therefore, we can start from the root and, going down, for each node n try to safely rewrite the sequence of its children into a word in lang(τ(label(n))). The algorithm succeeds if all these individual rewritings succeed. The safe rewriting of a word w involves the invocation of functions in w and (recursively) of new functions that are added to w by those invocations. To conclude the description of our rewriting algorithm, we thus only need to explain how this is done.
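The top-down traversal just described can be sketched as a short recursion. Here `safe_word_rewrite` is a hypothetical stub for the word-level procedure of Section 4.3, and trees are encoded as (label, children) pairs purely for illustration.

```python
# Sketch of the top-down traversal of Section 4.2: a subtree rooted
# at n rewrites safely iff (1) the word formed by its children's
# labels rewrites into lang(tau(label(n))), and (2) each child
# subtree rewrites recursively.

def safe_tree_rewrite(node, tau, safe_word_rewrite):
    label, children = node
    word = tuple(c[0] for c in children)
    if not safe_word_rewrite(word, tau[label]):
        return False
    return all(safe_tree_rewrite(c, tau, safe_word_rewrite)
               for c in children)

# Toy setting: the "language" of a label is a finite set of allowed
# words, and the stub succeeds iff the word is already in it.
tau = {"newspaper": {("title", "date")}, "title": {()}, "date": {()}}

def stub(word, lang):
    # Hypothetical stand-in for the Section 4.3 word rewriting.
    return word in lang
```
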
4.3 Rewriting the Children of a Node n

This is the most intricate part of the algorithm. We are given a word w—the sequence of labels of n's children—and our goal is to rewrite w to fit the target schema. Namely, we need to rewrite w so that it becomes a word in the regular language R = τ(label(n)). The rewriting process invokes functions in w and (recursively) new functions that are added to w by those invocations. Each such invocation changes w, replacing the function occurrence by its returned answer. The possible changes that the invocation of a function f_i may cause are determined by the output type R_{f_i} = τ_out(f_i) of f_i.^13 For instance, if w = a_1, a_2, ..., f_i, ..., a_m, invoking f_i changes w into some w′ = a_1, a_2, ..., b_1, ..., b_k, ..., a_m where b_1, ..., b_k ∈ lang(R_{f_i}). Since the function signatures, as well as the target schema, are given in terms of regular expressions, it is convenient to reason about them, and about the overall rewriting process, by analyzing the relationships between their corresponding finite state automata. We assume some basic knowledge of regular languages and finite state automata, and use in our algorithm standard notions such as the intersection and complement of regular languages and the Cartesian product of automata. For basic material, see for instance Hopcroft and Ullman [1979].

Fig. 3. Safe rewriting of w into R.

Given the word w, the output types R_{f_1}, ..., R_{f_n} of the available functions, and the target regular language R, the algorithm in Figure 3 tests if w can be safely rewritten into a word in R. Then, if the answer is positive, the algorithm presented in Section 4.4 finds a safe rewriting sequence. We give the intuition behind the first algorithm next.

^13 Recall from the discussion above that the input parameters can be ignored.
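Since the algorithm relies on complementation and Cartesian products of finite automata, the standard constructions it invokes can be sketched as follows. Complete DFAs are encoded here as (states, start, accepting, delta) tuples; this encoding is our own, for illustration only.

```python
# Sketch of the standard automata operations used by the algorithm:
# complementing a complete DFA (flip the accepting states) and
# building the Cartesian product (accepts the intersection).

def complement(dfa, alphabet):
    states, start, acc, delta = dfa
    # Completeness (required in step 4 of the algorithm) makes
    # complementation a matter of flipping the accepting states.
    assert all((q, a) in delta for q in states for a in alphabet)
    return (states, start, states - acc, delta)

def product(d1, d2, alphabet):
    s1, i1, a1, t1 = d1
    s2, i2, a2, t2 = d2
    states = {(p, q) for p in s1 for q in s2}
    delta = {((p, q), a): (t1[(p, a)], t2[(q, a)])
             for (p, q) in states for a in alphabet}
    acc = {(p, q) for (p, q) in states if p in a1 and q in a2}
    return (states, (i1, i2), acc, delta)

def accepts(dfa, word):
    _, q, acc, delta = dfa
    for a in word:
        q = delta[(q, a)]
    return q in acc

SIGMA = {"a", "b"}
# Example DFAs over SIGMA: words ending in 'a', and words with an
# even number of 'a's.
ends_a = ({0, 1}, 0, {1},
          {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0})
even_a = ({"e", "o"}, "e", {"e"},
          {("e", "a"): "o", ("e", "b"): "e",
           ("o", "a"): "e", ("o", "b"): "o"})
```

Testing emptiness of L1 ∩ complement(L2), as the algorithm does, then reduces to checking that the product automaton accepts no word.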
To illustrate, we use the newspaper document in Figure 2(a). Assume that we look at the root newspaper node. Its children labels form the word w = title.date.Get_Temp.TimeOut. Assume that we want to find a safe rewriting of this word into a word in the regular language τ(newspaper) of the schema of (**), namely, R = title.date.temp.(TimeOut | exhibit*). The process of rewriting involves choosing some functions in w and replacing them by a possible output; then choosing some other functions (which might have been returned by the previous calls) and replacing them by their output, and so on, up to depth k. For each function occurrence we have two choices: either to leave it untouched, or to replace it by some word in its output type.

Fig. 4. The A^1_w automaton from the newspaper document.

Fig. 5. The complement automaton A for schema (**).

The automaton A^k_w constructed in steps 5–10 of the algorithm represents precisely all the words that can be generated by such a k-depth rewriting process. The fork nodes are the nodes where a choice (i.e., invoking the function or not) exists, and the two fork options represent the possible consequent steps in the automaton, depending on which of the two choices was made. Going back to the above example, Figure 4 shows the 1-depth automaton A^1_w for the word w = title.date.Get_Temp.TimeOut, with the signatures of the Get_Temp and TimeOut functions defined as in Section 2. q2 and q3 are the fork nodes, and their two outgoing edges represent the fork options for Get_Temp and TimeOut, respectively. An ε edge represents the choice of invoking the function, while a function edge represents the choice not to invoke it. Suppose first that we want to verify that all possible rewritings lead to a "good" word, that is, that they belong to the target language R.
To put things in regular-language terms, the intersection of the language of A^k_w, consisting of these words, with the complement of the target language R should be empty. A standard way to test that the intersection of two regular languages is empty is to (i) construct an automaton A for the complement of the language R, (ii) build a Cartesian product automaton A^× = A^k_w × A for the two automata A^k_w and A, and (iii) check whether it accepts no words. The Cartesian product automaton of A^k_w and A is built in step 11 of the algorithm. To continue with the above example, the complement automaton for the regular language R = τ(newspaper) of the schema of (**) is given in Figure 5. The accepting states are p0, p1, p2, and p6. For brevity we use "*" to denote all possible alphabet transitions besides those appearing on other outgoing edges. The Cartesian product automaton A^× = A^1_w × A (where A^1_w and A are the automata of Figures 4 and 5, respectively) is given in Figure 6. The initial state is [q0, p0] and the final accepting one is [q4, p6].

Fig. 6. The Cartesian product automaton A^×.

Fig. 7. The complement automaton A for schema (***).

Note, however, that, when searching for a safe rewriting, one does not need to verify that all possible rewritings lead to a "good" word, that is, that none of the words in A^k_w belongs to A. We only have to verify that, for each function, there is some fork option (i.e., invoking the function or not) that, if taken, will not lead to an accepting state. Since we are looking for left-to-right safe rewritings, we need to check that, traversing the input from left to right, at least one such "good" fork option exists for each function call on the way. The marking of nodes in steps 15–17 of the algorithm achieves just that. Recall that we required in step 4 that the complement automaton A be complete.
This is precisely what guarantees that all the fork nodes/options of A^k_w are recorded in A^× and makes the above marking possible. The marking for our particular example is illustrated in Figure 6. The colored nodes are the marked ones. As can be seen, the fork nodes [q2, p2] and [q3, p3] are not marked. For the first node, this is because its ε fork option is not marked. For the second one, it is due to the unmarked TimeOut fork option. Consequently, the initial state is not marked either, and there is a safe rewriting of the newspaper element to the schema of (**). We will see in Section 4.4 how to find this rewriting. For another example, consider the schema of (***). Here, a newspaper is required to have a structure conforming to the regular expression title.date.temp.exhibit*. The complement automaton A for this language is given in Figure 7. To test whether it is possible to safely rewrite our newspaper document into this schema, we construct a Cartesian product automaton A^× = A^1_w × A (with A^1_w as in Figure 4 and A as in Figure 7). A^× is given in Figure 8.

Fig. 8. The Cartesian product automaton A^×.

As one can see, in this case, the two fork nodes [q2, p2] and [q3, p3] have both their fork options marked. Consequently, the initial state is marked as well, and there is no safe rewriting of w into the schema of (***). Note that this is precisely what our intuitive discussion from Section 2 indicated: the invocation of TimeOut may return performance elements, hence the result may not conform to the desired structure. The following theorem states the correctness of our algorithm.

THEOREM 4.1. The above algorithm returns true if and only if a k-depth left-to-right safe rewriting exists.

PROOF.
To prove correctness we have to show that (i) when the algorithm returns a negative answer, a safe rewriting indeed does not exist, and (ii) when the answer is positive, there exists a safe rewriting.

Notations. We will use the following notations:

— For an automaton X and a state q ∈ X, we denote by L(X, q) the language accepted by X when making q the initial state. This is a subset of all suffixes of words accepted by the original automaton X.

— In the automaton A^×, we use A_0 to denote the subautomaton "originating" from A_w, and use A_j, 0 < j ≤ k, to denote the subautomaton "originating" from some A_f added to A^k_w to represent the possible outputs of f at the jth iteration of its construction. More formally, these are the projections of A^× on nodes [q, q′] such that q belongs to A_w for A_0, and to A_f for A_j. Note that, in general, several automata are added in the jth iteration. To simplify the notation, we use A_j to denote any representative of this set.

— Given a subautomaton A_j, an initial (respectively, final) state of A_j is a state [q, q′] such that q is an initial (respectively, final) state of A_w/A_{f_i}.

Completeness. We start by proving that the algorithm is complete, that is, that if it answers negatively, no k-depth left-to-right safe rewriting of A_w exists. We first number the nodes of A^× based on the order in which they got marked by the algorithm. For a regular node, its assigned number should be greater than the one of the marked successor that caused its marking. For a fork node, the assigned number should be greater than all the numbers of the nodes that caused its marking. It is easy to come up with such a numbering by following the algorithm and using a counter that is incremented by one each time a node gets marked. We also need the following lemma.

LEMMA 4.2.
If a state [q, q′] of A_j is marked, then there exists a finite path from [q, q′] to a marked final state [p, p′] of A_j, such that all the nodes on the path belong to A_j, are marked, and have decreasing numbers. Moreover, either [p, p′] is a final state of A_0, or there is an outgoing ε edge from it to a marked state of some A_{j−1}, with a smaller number.

PROOF. First, by definition of marking, there exists a finite marked path from [q, q′] to a final state of A^×, where the nodes have decreasing numbers. If [q, q′] is in A_0, note that the final state of A^× is also a final state of A_0. Otherwise, by construction of A^×, such a path must go through a marked final state of A_j and continue, via an ε edge, to a marked state in A_{j−1} with a smaller number. Among such paths, let's look at one that has the longest marked prefix in A_j.^14 We denote by [p, p′] the last node of the prefix. If [p, p′] is a final state of A_j that exits via an ε edge to a marked A_{j−1} node with a smaller number, we are done. Let's suppose it's not. If [p, p′] is a fork node, then it has at least two outgoing edges that lead to marked states: a function transition, which stays in A_j, and the corresponding ε transition, which leads to some A_{j+1}. By the definition of our numbering, both successor nodes have smaller numbers than [p, p′]. Thus, we can extend our prefix in A_j by following the function transition, which contradicts the fact that we were on the path with the longest prefix in A_j. If [p, p′] is not a fork node, then it must have a marked successor with a smaller number (as otherwise it would not be marked). Its successors can either be in A_j or be ε transitions to some A_{j−1} (if it is a final state of A_j). As we assumed above that the ε transitions do not lead to marked nodes with lower numbers, [p, p′] must have a marked successor in A_j with a lower number, that caused its marking.
Note, however, that by adding this marked node to the previous prefix, we can build a marked path with a longer prefix in A_j, having nodes with decreasing numbers. Again, a contradiction.

We are now ready to prove direction (i). We do this again by contradiction. Assume that our algorithm returns a negative answer, that is, that the initial state of A^× is marked, but that a k-depth left-to-right safe rewriting from A_w does exist. Recall that such rewritings discover the input word and the answers of functions from left to right, and make their decisions (namely, to invoke functions or not) as they proceed. Therefore, we can construct the counterexample incrementally. We do not need to provide the full input word (or the function outputs) as a whole, but only "letter by letter," as the rewriting process goes on. Also recall that, since the rewriting is supposed to be safe, we are free to choose any answer we want for a function call, as long as it matches its output type. The rewriting should succeed anyway.^14

^14 Note that it is not necessarily unique.

We will show that we can provide a finite sequence of letters (consisting of an initial word and answers to the function calls the rewriting decides to invoke) that stays on a marked path in A^× and eventually reaches one of its final states. This means, on the one hand, that the sequence represents a legal k-depth rewriting (of a word accepted by A_w) and, on the other hand, that it does not belong to the target type. Consequently, the rewriting is not safe. The sequence is constructed as follows. We begin at the initial (marked) state of A^× and start following some finite marked path in A_0 leading to a marked final state, where the nodes on the path have decreasing numbers. Such a path must exist, by Lemma 4.2. At each step, when we traverse an edge with a label in L, we simply output its label.
If the edge is labeled by a function name in F, we also output the label, but our action depends on whether the rewriting process decides to invoke the function or not: if the function is not invoked, we stay on the same path. Otherwise, we follow an ε edge from the current automaton A_i (initially i = 0) to a marked state of the next-level automaton A_{i+1}. Note that such a marked state must exist since the previous fork node was marked. Also, by definition of the numbering, its assigned number is smaller than the one of the fork node. Then we continue the same process at A_{i+1}, following a finite path, with nodes having decreasing numbers, to its final state, and on the way possibly moving to higher-level automata, as described above. Since all these paths are constructed as in Lemma 4.2, they end on a final node of A^× (for A_0), or on a final node of A_j with an ε transition to a marked node of A_{j−1}, with a smaller number. In the latter case, we simply follow this transition, which corresponds to ending the answer of a function call. Observe that, by the above arguments, we follow a path to a final state of A^× that consists only of marked nodes and corresponds to a decreasing sequence of numbers, which means that it is finite. Feeding the letters on this path to the safe rewriting makes it end on a word that is not in the target language R, a contradiction.

Soundness. We now turn to direction (ii), which states the soundness of our algorithm—namely, that if the initial state is not marked, then every word accepted by A_w can be rewritten to match the target schema. We start by proving that if the algorithm succeeds, then the following proposition holds:

PROPOSITION 4.3. Let A_j be a subautomaton of A^× corresponding to some function automaton A_{f_i} (or to A_w if j = 0).
For every nonmarked state [q, q′] of A_j originating from A_{f_i} (respectively, A_0), every word in L(A_{f_i}, q) (respectively, L(A_w, q)) has a "safe rewriting" into a word w′ such that w′ corresponds to a nonmarked path in A^× leading to a (nonmarked) final state of A_j.^15

PROOF. We use induction on j, starting from j = k and going down to j = 0, to show that every word in L(A_{f_i}, q) (respectively, L(A_w, q)) that contains only function nodes of depth ≥ j can be safely rewritten. For j = k, A_j contains no fork nodes. A state [q, q′] is thus not marked iff all states reachable from it are not marked, and the property trivially holds. We suppose now that the hypothesis holds for j + 1, and consider a word that contains function nodes of depth ≥ j. We follow its corresponding path in A_j. If all nodes are nonmarked, we get to a nonmarked accepting state of A_j, which means we are done. Otherwise, if the path contains marked nodes, we show, by a second induction on the number of function symbols in w, how to rewrite w to a "good" path. The base of the induction is a word w that doesn't contain function nodes. Then, clearly, all nodes on the corresponding path must be nonmarked, or else the first one, [q, q′], would be marked as well. Suppose we know how to deal with a word containing l function nodes, and consider a word w that contains l + 1 function nodes. We look at the first edge e = ([v, v′], [u, u′]) on the path where [v, v′] is not marked but [u, u′] is marked. By definition of marking, [v, v′] must be a fork node with e labeled by some function name f_i. This splits w into subwords w = w_1.f_i.w_2.

^15 We overload here, in a natural manner, the notion of safe rewriting, meaning that the above property holds no matter what answer the function invocations return.
Since [v, v′] is not marked, its other fork option (the ε edge corresponding to f_i) must lead to a nonmarked initial state of the subautomaton A_{j+1} corresponding to A_{f_i}. We choose to invoke this function. By the first induction hypothesis, there is a safe rewriting of the returned result into a word w′ whose corresponding path is not marked and leads to a nonmarked final state of A_{j+1}. By the construction of A^×, this final state must have an outgoing ε edge leading back to a state of A_j of the form [u, u′′]. Furthermore, observe that the latter is not marked (for otherwise the final state of A_{j+1} would also be marked). Finally, since w_2 has l function nodes, by the second induction hypothesis it can be safely rewritten into some word w′′ whose corresponding nonmarked path leads to a final state of A_j. It follows that the rewritten word w_1.w′.w′′, and its corresponding path, leads from [q, q′] to a nonmarked final state of A_j via a path consisting only of nonmarked nodes.

We are now ready to prove direction (ii). If the algorithm answers positively, a safe rewriting can be found by essentially the same construction as in the above proposition. Given any word w accepted by A_w, our goal is to find a safe rewriting that yields a word w′ whose corresponding path in A^× leads to a nonmarked state [q, p], where q is an accepting state of A_w (namely, an accepting state of A_0). Note that since the final state is not marked, p is not an accepting state of A. And since A is deterministic, this implies that w′ is a "good" word that belongs to the target language R. The fact that such a rewriting indeed exists follows immediately from the above proposition, taking the node [q, q′] of the proposition to be the initial state of A^×, and w as the particular input word. The actual rewriting can be found as described in the proof. This concludes the proof of Theorem 4.1.

Complexity. We now briefly discuss the complexity of the algorithm.
Recall that we use s0 to denote the schema of the sender and s to denote the agreed data exchange schema. The complexity of deciding whether a safe rewriting exists is determined by the size of the Cartesian product automaton: we need to construct it and then traverse and mark its nodes. More precisely, the complexity is bounded by O(|A×|^2) = O((|A^k_w| × |Ā|)^2). The size of A^k_w is at most O((|s0| + |w|)^k), and the size of the complement automaton Ā is at most exponential in the size of the automaton being complemented [Hopcroft and Ullman 1979], namely, at most exponential in the size of the target schema s. This exponential blowup may happen, however, only when s uses nondeterministic regular expressions (i.e., regular expressions whose corresponding finite state automaton is nondeterministic). Note, however, that XML Schema enforces the usage of deterministic regular expressions. Hence, for most practical cases, the complexity is polynomial in the size of the schemas s0 and s (with the exponent determined by k).

4.4 Finding a Rewriting

Fig. 9. Finding the rewriting of w into R.

The algorithm of Figure 3 checks if a safe rewriting exists. The constructive proof we used to show its soundness entails a way to find a rewriting sequence when a safe rewriting exists, which corresponds to the algorithm of Figure 9. This algorithm finds the safe rewriting sequence by following a nonmarked path. Each fork node on the path, together with its nonmarked fork option, determines what needs to be done with the corresponding function: an edge leading into the subautomaton of the function signature means "invoke the function," while an edge labeled by the function name means "do not invoke." In the previously described example, which corresponds to Figure 6, it is easy to see (following the path with colored background) that Get_Temp needs to be invoked while TimeOut should not.
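The path-following procedure of Figure 9 can be illustrated in code. The sketch below is a reconstruction, not the paper's implementation: the state names, the dict-of-edges encoding, and the ("invoke", …)/("keep", …) action labels are all hypothetical, and a breadth-first search over nonmarked states stands in for following the nonmarked path of the product automaton.

```python
from collections import deque

def find_safe_plan(edges, start, finals, marked):
    """Search for a path from start to a final state that avoids marked
    states, and collect the fork decisions (invoke/keep) taken along it."""
    # edges: state -> list of (action, next_state); action is a pair such as
    # ("invoke", "Get_Temp") at a fork node, or None for plain transitions.
    parent = {start: (None, None)}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        if s in finals:
            plan = []                      # walk back to start, collecting
            while s != start:              # the decisions made on the way
                action, prev = parent[s]
                if action is not None:
                    plan.append(action)
                s = prev
            return list(reversed(plan))
        for action, t in edges.get(s, []):
            if t not in marked and t not in parent:
                parent[t] = (action, s)
                queue.append(t)
    return None                            # no safe rewriting exists
```

On a toy graph mirroring the example of Figure 6, the returned plan says to invoke Get_Temp and to keep (not invoke) TimeOut.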
The complexity of actually performing the rewriting depends on the size of the answers returned by the called functions. If x is the maximal answer size, the length of the generated word is bounded by |w| × x^k.

4.5 A Mixed Approach

As seen above, much of the work in searching for a safe rewriting comes from the size of the automaton A^k_w that accounts for all possible outputs of function invocations. A useful heuristic is to adopt a mixed approach that starts by invoking some of the functions (e.g., the ones with no side effects or low price) to get their actual output, and then tries to safely rewrite the document. In terms of the algorithm of Figure 3, rather than using the full function signature automaton A_fi, we will use a smaller one that describes just (the type of) the actual returned result. This may greatly simplify the resulting automaton A^k_w. Moreover, the output of the already invoked calls can be reused when performing the actual rewriting, instead of reissuing these calls.

5. POSSIBLE REWRITING

Fig. 10. Possible rewriting of w into R.

We considered safe rewritings in the previous section. We now turn to possible rewritings. While function signatures provide an "upper bound" of the possible output, functions invoked with the actual given parameters may return a restricted "appropriate" output, so a rewriting that looked nonfeasible (unsafe) may turn out to be possible after some function calls. To test if a rewriting may exist, we follow a similar three-step procedure as for safe rewriting: (1) test function parameters first, (2) traverse the tree top down, and (3) check each node individually, trying to rewrite the word w consisting of the labels of its direct children. Steps (1) and (2) are exactly as before. For step (3), Figure 10 provides an algorithm to test if the children of a given node may rewrite to the target schema.
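The nonemptiness test underlying step (3) is a standard reachability check on the product automaton: mark backward from the final states and test whether the initial state gets marked. A minimal sketch, assuming a hypothetical encoding of the product automaton as a dict mapping each state to its set of successors (this is generic automata code, not the paper's exact Figure 10 construction):

```python
def may_rewrite(product_edges, initial, finals):
    """Decide language-intersection nonemptiness on a product automaton:
    repeatedly mark every state that has an edge into a marked state,
    starting from the final states; succeed iff the initial state is marked."""
    marked = set(finals)
    changed = True
    while changed:              # naive fixpoint; a worklist over incoming
        changed = False         # edges would make this linear in |A×|
        for state, succs in product_edges.items():
            if state not in marked and succs & marked:
                marked.add(state)
                changed = True
    return initial in marked
```

The naive fixpoint loop keeps the sketch short; a worklist-based variant that processes each incoming edge once runs in time linear in the size of the product automaton.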
As before, we use the automaton A^k_w that describes all the words that may be derived from the word w in a k-depth rewriting. w may rewrite to a word in the target language R iff some of these derived words belong to R, namely, iff the intersection of the two languages, L(A^k_w) and R, is not empty. To test this, we construct (in step 4 of the algorithm) the Cartesian product automaton for these two languages, and test (in step 5) whether a final state is reachable from the initial one. This is done by a standard marking process that starts from the final nodes and marks all nodes that have some edge leading to a marked node. If the initial state is marked, this means that the intersection of the two languages is not empty [Hopcroft and Ullman 1979].

Fig. 11. An automaton A for schema (***).

Fig. 12. Cartesian product automaton for possible rewriting.

For instance, consider the automaton A for the schema of (***) with newspaper structure title.date.temp.exhibit* given in Figure 11. The initial state is p0 and the final accepting states are p3 and p4. The Cartesian product automaton A× = A^1_w × A (for A^1_w as in Figure 4 and A as in Figure 11) is given in Figure 12. The initial state is [q0, p0]. The final accepting states are [q4, p3] and [q4, p4], and all states (including the initial one) have an outgoing path to a final state. The only possible fork options left in the automaton, and which may lead to a possible rewriting, are the ones requiring the invocation of both the Get_Temp and TimeOut functions. If TimeOut returns nothing but exhibits, the rewriting succeeds.

Fig. 13. Finding a possible rewriting of w into R.

The correctness of this algorithm is stated below.

PROPOSITION 5.1. The above algorithm returns true iff a k-depth possible rewriting exists.

PROOF.
Since A^k_w accepts the language of all possible words obtainable by a k-depth rewriting, the rewriting is possible iff the intersection of the language accepted by A^k_w with the target language is not empty. This is classically checked by computing the cross-product of the corresponding automata, and marking nodes as described, to check whether a final state is reachable from the initial state.

The complexity here is again determined by the size of the Cartesian product automaton. However, in this case, it uses the schema automaton A (rather than its complement, as for safe rewriting). Hence, the complexity of checking whether a rewriting may exist is polynomial in the size of the schemas s0 and s (with the exponent determined by k).

Finding an actual rewriting is done through a heuristic described by the algorithm of Figure 13. We follow a marked path, and invoke functions or not, as indicated by the fork options on the path. We have to backtrack when failing (i.e., when a function returns a value that does not correspond to an accepting path). This process ends either because we reached a final state, which means that a rewriting was found, or because all choices were explored without success.

6. SCHEMA REWRITING

So far, we considered the rewriting of a single document. At a higher level, to check compatibility between applications, the sender may wish to verify that all documents generated by her application can indeed be sent to the target receiver. Given a schema s0 for the sender documents, and some distinguished root label r, we want to verify that all instances of s0 with root r can be safely rewritten to the schema s. Interestingly, it turns out that safe rewriting for schemas is not more difficult than for documents.
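In the simplest (DTD-like) case, treated first below, the schema-level check reduces to an independent test per element definition. A sketch under assumptions: the dict encoding of element definitions is hypothetical, and `language_safe_rewrite` stands in for the language-level test of Section 6.3.

```python
def dtd_safely_rewrites(s0_defs, s_defs, language_safe_rewrite):
    """s0 safely rewrites into s iff every element definition l -> r0 of s0
    (a) has a definition for label l in s, and (b) every instance of r0
    language-safely rewrites into the regular expression s gives for l."""
    for label, r0 in s0_defs.items():
        if label not in s_defs:                            # condition (a)
            return False
        if not language_safe_rewrite(r0, s_defs[label]):   # condition (b)
            return False
    return True
```

Since the element definitions are independent, they may be checked in any order.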
We decompose the algorithm we propose for schema rewriting into two parts: first, how to check the initial schema, by traversing it top down, and second, for each type in this schema, how to check that the corresponding regular expression safely rewrites into the target schema. We first show how safe rewriting can be checked for DTDs, by checking all the element definitions of s0. Then, we sketch a top-down algorithm for checking safe rewriting for XML Schema-like schemas. Finally, we explain how it can be checked that any instance of a regular expression can be safely rewritten into a target regular expression.

6.1 Rewriting DTDs

In the simple DTD-like schemas we used so far, checking that s0 safely rewrites to s amounts to checking that, for every element definition τ0(l0) = r0 in s0, (a) there exists an element definition for the element label l0 in s, and (b) every instance of the regular expression r0 can be safely rewritten into the corresponding regular expression in s, namely τ(l0). We term this last step language safe rewriting, and give an algorithm for it in Section 6.3. Notice that, for such simple schemas, the element definitions can be checked independently from each other, in any order. s0 safely rewrites into s iff the language safe rewriting succeeds for all element definitions.

6.2 Rewriting XML Schemas

Things are more involved when we consider more expressive schema languages, in the style of XML Schema. Types are allowed to be decoupled from element labels, but it holds that the type of an element is unambiguously determined by its label and the type of its parent. In this case, schema rewriting can be checked by a top-down analysis of the initial schema s0, starting from the root. The type of the root determines the regular expression that has to be matched by its children, and the type of the root of s determines the target regular expression for the safe rewriting of types.
Then, recursively moving down, the types corresponding to the labels of the children on both sides are unambiguously determined, and so are their corresponding regular expressions. Therefore safe rewriting of types can be checked at the next level, and so on. Notice that, while proceeding this way, only pairs of types for which safe rewriting hasn't been tested yet need to be processed. This ensures that the algorithm terminates, even if schemas are recursive.

6.3 Language Safe Rewriting

Fig. 14. Language safe rewriting of R0 into R.

We explain now how to check for language safe rewriting. Given two regular expressions R0 and R, we want to check that all words in the language of R0 have a safe rewriting into a word in the language of R. The algorithm of Figure 14 checks just that. This algorithm is almost identical to the one presented in Section 4, except that the initial automaton is built to accept the language R0 instead of a single word. The following proposition states its correctness.

PROPOSITION 6.1. The above algorithm returns true if and only if every word in the language R0 has a k-depth left-to-right safe rewriting into a word of R.

PROOF. The proof of the algorithm of Section 4 naturally extends to language safe rewriting. There, the completeness of the algorithm was shown by building, whenever the algorithm answers negatively, a counterexample to the existence of a safe rewriting. The same construction holds for language rewriting, since it suffices to exhibit one word in the language R0 that does not safely rewrite into R in order to show that R0 does not safely rewrite into R. The soundness of the algorithm of Section 4 was shown in a constructive manner, by building a word corresponding to a nonmarked path in A×.
The same construction applies to each word accepted by A_R0, that is, to each word in the language R0, which establishes the correctness of this algorithm.

7. IMPLEMENTATION

The ideas and algorithms presented in the previous sections have been implemented and used in the Schema Enforcement module of the Active XML system [Abiteboul et al. 2002] (see also the Active XML homepage at http://www.rocq.inria.fr/verso/Gemo/Projects/axml). We next present how the intensional data model and schema language of the previous sections map to XML, XML Schema, SOAP, and WSDL. Then, we briefly describe the ActiveXML system and the Schema Enforcement module.

7.1 Using the Standards

In the implementation, an intensional XML document is a syntactically well-formed XML document. This is because we also use an XML-based syntax to express the intensional parts in it. To distinguish these parts from the rest of the document, we rely on the mechanism of XML namespaces (see footnote 10). More precisely, the namespace http://www.activexml.com/ns/int is defined for service calls. These calls can appear at any place where XML elements are allowed. The following example corresponds to the document of Figure 2(a):

<?xml version="1.0"?>
<newspaper xmlns:int="http://www.activexml.com/ns/int">
  <title> The Sun </title>
  <date> 04/10/2002 </date>
  <int:fun endpointURL="http://www.forecast.com/soap"
           methodName="Get_Temp"
           namespaceURI="urn:xmethods-weather">
    <int:params>
      <int:param> <city>Paris</city> </int:param>
    </int:params>
  </int:fun>
  <int:fun endpointURL="http://www.timeout.com/paris"
           methodName="TimeOut"
           namespaceURI="urn:timeout-program">
    <int:params>
      <int:param> exhibits </int:param>
    </int:params>
  </int:fun>
</newspaper>

Function nodes have three attributes that provide the necessary information to call a service using the SOAP protocol: the URL of the server, the method name, and the associated namespace. These attributes uniquely identify the called function, and are isomorphic to the function name in the abstract model. In order to define schemas for intensional documents, we use XML Schema_int, which is an extension of XML Schema. To describe intensional data, XML Schema_int introduces functions and function patterns. These are declared and used like element definitions in the standard XML Schema language. In particular, it is possible to declare functions and function patterns globally, and reference them inside complex type definitions (e.g., sequence, choice, all). We give next the XML representation of function patterns, which are described by a combination of five optional attributes and two optional subelements, params and result:

<functionPattern id = NCName
                 methodName = token
                 endpointURL = anyURI
                 namespaceURI = anyURI
                 WSDLSignature = anyURI
                 ref = NCName>
  Contents: (params?, result?)
</functionPattern>

The id attribute identifies the function pattern, which can then be referenced by another function pattern using the ref attribute. Attributes methodName, endpointURL, and namespaceURI designate the SOAP Web service that implements the Boolean predicate used to check whether a particular function matches the function pattern. It takes as input parameters the SOAP identifiers of the function to validate. As a convention, when these parameters are omitted, the predicate returns true for all functions. The contents detail the function signature, that is, the expected types for the input parameters and the result of the function. These types are also defined using XML Schema_int, and may contain intensional parts.
To illustrate this syntax, consider the function pattern Forecast, which captures any function with one input parameter of element type city, returning an element of type temp. It is simply described by

<functionPattern id="Forecast">
  <params>
    <param> <element ref="city"/> </param>
  </params>
  <result> <element ref="temp"/> </result>
</functionPattern>

Functions are declared in a similar way to function patterns, by using elements of type function. The main difference is that the three attributes methodName, endpointURL, and namespaceURI directly identify the function that can be used. As mentioned already, function and function pattern declarations may be used at any place where regular element and type declarations are allowed. For example, a newspaper element with structure title.date.(Forecast | temp).(TimeOut | exhibit*) may be defined in XML Schema_int as

<xsd:element name="newspaper">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="title"/>
      <xsd:element ref="date"/>
      <xsd:choice>
        <xsi:functionPattern ref="Forecast"/>
        <xsd:element ref="temp"/>
      </xsd:choice>
      <xsd:choice>
        <xsi:functionPattern ref="TimeOut"/>
        <xsd:element ref="exhibit" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:choice>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

Note that, just as for documents, we use a different namespace (embodied here by the use of the prefix xsi) to differentiate the intensional part of the schema from the rest of the declarations. Similarly to XML Schema, we require definitions to be unambiguous (see footnote 10): namely, when parsing a document, for each element and each function node, the subelements can be sequentially assigned a corresponding type/function pattern in a deterministic way by looking only at the element/function name. One of the major features of the WSDL language is to describe the input and output types of Web service functions using XML Schema.
We extend WSDL in the obvious way, by simply allowing these types to describe intensional data, using XML Schema_int. Finally, XML Schema_int allows WSDL or WSDL_int descriptions to be referenced in the definition of a function or function pattern, instead of defining the signature explicitly (using the WSDLSignature attribute).

7.2 The ActiveXML System

ActiveXML is a peer-to-peer system that is centered around intensional XML documents. Each peer contains a repository of intensional documents, and provides some active features to enrich them by automatically triggering the function calls they contain. It also provides some Web services, defined declaratively as queries/updates on top of the repository documents. All the exchanges between the ActiveXML peers, and with other Web service providers/consumers, use the SOAP protocol. The important point here is that both the services that an ActiveXML peer invokes and those that it provides potentially accept intensional input parameters and return intensional results. Calls to "regular" Web services should comply with the input and output types defined in their WSDL description. Similarly, when calling an ActiveXML peer, the parameters of the call should comply with its WSDL. The role of the Schema Enforcement module is (i) to verify whether the call parameters conform to the WSDL_int description of the service, (ii) if not, to try to rewrite them into the required structure, and (iii) if this fails, to report an error. Similarly, before an ActiveXML service returns its answer, the module performs the same three steps on the returned data.

7.3 The Schema Enforcement Module

To implement this module, we needed a parser of XML Schema_int. We had the choice between extending an existing XML Schema parser based on DOM level 3 or developing an implementation from scratch [Ngoc 2002].
Whereas the first solution seems preferable, we followed the second one because, at the time we started the implementation, the available (free) software we tried (Apache Xerces^16 and Oracle Schema Processor^17) appeared to have limited extensibility. Our parser relies on a standard event-based SAX parser.^16 It does not cover all the features of XML Schema, but implements the important ones, such as complex types, element/type references, and schema import. It does not check the validity of all simple types, nor does it deal with inheritance or keys. However, these features could be added rather easily to our code.

The schema enforcement algorithm we implemented in the module follows the main lines of the algorithm in Section 4, and in particular the same three stages: (1) checking function parameters recursively, starting from the innermost ones and going out, (2) traversing, in each iteration, the tree top down, and (3) rewriting the children of every node encountered in this traversal. Steps (1) and (2) are done as described in Section 4. For step (2), recall from above that XML Schema_int definitions are deterministic. This is precisely what enables the top-down traversal, since the possible type of elements/functions can be determined locally. For step (3), our implementation uses an efficient variant of the algorithm of Section 4. While the latter starts by constructing all the required automata and only then analyzes the resulting graph, our implementation builds the automaton A× in a lazy manner, starting from the initial state, and constructing only the needed parts. The construction is pruned whenever a node can be marked directly, without looking at the remaining, unexplored branches.

16 The Xerces Java parser. Go online to http://xml.apache.org/xerces-j/.
17 The Oracle XML developer's kit for Java. Go online to http://otn.oracle.com/tech/xml/.

Fig. 15. The pruned automaton.

The two main ideas that guide this process are the following:

—Sink nodes. Some accepting states in A are "sink" nodes: once you get there, you cannot get out (e.g., p6 in Figures 5 and 7). For the Cartesian product automaton A×, this means that all paths starting from such nodes are marked. When such a node is reached in the construction of A×, we can immediately mark it and prune all its outgoing branches. For example, in Figure 15, the top left shaded area illustrates which parts of the Cartesian product automaton of Figure 6 can be pruned. Nodes [q3, p6] and [q7, p6] contain the sink node p6. They can immediately be declared as marked, and the rest of the construction (the left shaded area) need not be built.

—Marked nodes. Once a node is known to be marked, there is no point in exploring its outgoing branches any further. To continue with the above example, once the node [q7, p6] gets marked, so does [q7, p3], which points to it. Hence, there is no need to explore the other outgoing branches of [q7, p3] (the shaded area on the right).

While this dynamic variant of the algorithm has the same worst-case complexity as the algorithm of Figure 3, it saves a lot of unnecessary computation in practice. Details are available in Ngoc [2002].

8. PEER-TO-PEER NEWS SYNDICATION

In this section, we illustrate the exchange of intensional documents, and the usefulness of our schema-based rewriting techniques, through a real-life application: peer-to-peer news syndication. This application was recently demonstrated in Abiteboul et al. [2003a]. The setting is the one shown in Figure 16. We consider a number of news sources (newspaper Web sites, or individual "Weblogs") that regularly publish news stories. They share this information with others in a standard XML format, called RSS.^18 Clients can periodically query/retrieve news from the sources they are interested in, or subscribe to news feeds.
News aggregators are special peers that know of several news sources and let other clients query and/or discover the news sources they know.

18 RSS 1.0 specification. Go online to http://purl.org/rss/1.0.

Fig. 16. Peer-to-peer news exchange.

All interactions between news sources, aggregators, and clients are done through calls to Web services they provide. Intensional documents can be exchanged both when passing parameters to these Web services, and in the answers they return. These exchanges are controlled by XML schemas, and documents are rewritten to match these schemas, using the safe/possible rewriting algorithms detailed in the previous sections. This mechanism is used to provide several versions of a service, without changing its implementation, merely by using different schemas for its input parameters and results. For instance, the same querying service is easily customized to be used by distinct kinds of participants, for example, various client types or aggregators, with different requirements on the type of its input/output. More specifically, for each kind of peer we consider (namely, news sources and aggregators), we propose a set of basic Web services, with intensional output and input parameters, and show how they can be customized for different clients via schema-based rewriting. We first consider the customization of intensional outputs, then that of intensional inputs.

8.1 Customizing Intensional Outputs

News sources provide news stories, using a basic Web service named getStory, which retrieves a story based on its identifier, and has the following signature:

<function id="GetStory">
  <params>
    <param> <xsd:simpleType ref="xsd:string" /> </param>
  </params>
  <result>
    <xsd:element name="story" type="xsd:string" />
  </result>
</function>

Note that the output of this service is fully extensional. News sources also allow users to search for news items by keywords,^19 using the following service:

<function id="GetNewsAbout">
  <params>
    <param> <xsd:simpleType ref="xsd:string" /> </param>
  </params>
  <result>
    <xsd:complexType ref="ItemList2" />
  </result>
</function>

This service returns an RSS list of news items, of type ItemList2, where the items are given extensionally, except for the story, which can be intensional. The definition of the corresponding function pattern, intensionalStory, is omitted.

<xsd:complexType name="ItemList2">
  <xsd:sequence>
    <xsd:element name="item" type="Item"/>
  </xsd:sequence>
</xsd:complexType>

<xsd:complexType name="Item">
  <xsd:sequence>
    <xsd:element name="title" type="xsd:string"/>
    <xsd:element name="pubDate" type="xsd:dateTime"/>
    <xsd:element name="description" type="xsd:string"/>
    <xsd:choice>
      <xsi:functionPattern ref="intensionalStory"/>
      <xsd:element name="story" type="xsd:string"/>
    </xsd:choice>
  </xsd:sequence>
  <xsd:attribute name="id" type="xsd:NMTOKEN"/>
</xsd:complexType>

A fully extensional variant of this service, aimed for instance at PDAs that download news for offline reading, is easily provided by employing the Schema Enforcement module to rewrite the previous output into one that complies with a fully extensional ItemList3 type, similar to the one above, except that the story has to be extensional.

19 More complex query languages, such as the one proposed by Edutella, could also be used (go online to http://edutella.jxta.org).

A more complex scenario allows readers to specify a desired output type at call time, as a parameter of the service call.
If there exists a rewriting of the output that matches this schema, it will be applied before sending the result; otherwise an error message will be returned. Aggregators act as "superpeers" in the network. They know a number of news sources they can use to answer user queries. They also know other aggregators, which can relay the queries to additional news sources and other aggregators, transitively. Like news sources, they provide a getNewsAbout Web service, but allow for a more intensional output, of type ItemList, where news items can be either extensional or intensional. In the latter case, they must match the intensionalNews function pattern, whose definition is omitted.

<xsd:complexType name="ItemList">
  <xsd:sequence>
    <xsd:choice>
      <xsi:functionPattern ref="intensionalNews"/>
      <xsd:element name="item" type="Item"/>
    </xsd:choice>
  </xsd:sequence>
</xsd:complexType>

When queried by simple news readers, the answer is rewritten, depending on whether the reader is an RSS customer or a PDA, into an ItemList2 or ItemList3 version, respectively. On the other hand, when queried by other aggregators that prefer compact intensional answers, which can be easily forwarded to other aggregators, no rewriting is performed: the answer remains as intensional as possible, preferably complying with the type below, which requires the information to be intensional.

<xsd:complexType name="ItemList4">
  <xsd:sequence>
    <xsi:functionPattern ref="intensionalNews"/>
  </xsd:sequence>
</xsd:complexType>

Note also that aggregators may have different capabilities. For instance, some of them may not be able to recursively invoke the service calls they get in intensional answers. This is captured by having them supply, as an input parameter, a precise type for the answer of getNewsAbout that matches their capabilities (e.g., "return me only service calls that return extensional data").

8.2 Intensional Input

So far, we considered the intensional output of services.
To illustrate the power of intensional input parameters, we define a continuous version of the getNewsAbout service provided by news sources and aggregators. Clients call this service only once, to subscribe to a news feed. Then, they periodically get new information that matches their query (a dual service exists, to unsubscribe). Here, the input parameter is allowed to be given intensionally, so that the service provider can probe it, adjusting the answer to the parameter's current value. For instance, consider a mobile user whose physical location changes, and who wants to get news about the town she is visiting. The zip code of this town can be provided by a Web service running on her device, namely a GPS service. A call to this service will be passed as an intensional query parameter, and will be invoked by the news source in order to periodically send her the relevant local information. This continuous news service is actually implemented as a wrapper around a noncontinuous getNewsAbout service, calling the latter periodically with the keyword parameter it received in the subscription. Since getNewsAbout doesn't accept an intensional input parameter, the schema enforcement module rewrites the intensional parameter given in the subscription every time it has to be called.

8.3 Demonstration Setting

To demonstrate this application [Abiteboul et al. 2003a], news sources were built as simple wrappers around RSS files provided by news websites such as Yahoo! News, BBC World, the New York Times, and CNN. The news from these sources could also be queried through two aggregators providing the GetNewsAbout service, but customized with different output schemas. The customization of intensional input parameters was demonstrated using a continuous service, as explained above, by providing a call to a getFavoriteKeyword service as a parameter for the subscription.

9. CONCLUSION AND RELATED WORK

As mentioned in the Introduction, XML documents with embedded calls to Web services are already present in several existing products. The idea of including function calls in data is certainly not a new one. Functions embedded in data were already present in relational systems [Molina et al. 2002] as stored procedures. Also, method calls form a key component of object-oriented databases [Cattell 1996]. In the Web context, scripting languages such as PHP (see footnote 2) or JSP (see footnote 1) have made popular the integration of processing inside HTML or XML documents. Combined with standard database interfaces such as JDBC and ODBC, functions are used to integrate results of queries (e.g., SQL queries) into documents. A representative example of this is Oracle XSQL (see footnote 17). Embedding Web service calls in XML documents is also done in popular products such as Microsoft Office (Smart Tags) and Macromedia MX.

While the static structure of such documents can be described by some DTD or XML Schema, our extension of XML Schema with function types is a first step toward a more precise description of XML documents embedding computation. Further work in that direction is clearly needed to better understand this powerful paradigm. There are a number of other proposals for typing XML documents, for example, Makoto [2001], Hosoya and Pierce [2000], and Cluet et al. [1998]. We selected XML Schema (see footnote 10) for several reasons. First, it is the standard recommended by the W3C for describing the structure of XML documents. Furthermore, it is the typing language used in WSDL to define the signatures of Web services (see footnote 3). By extending XML Schema, we naturally introduce function types/patterns in WSDL service signatures. Finally, one aspect of XML Schema simplifies the problem we study, namely, the unambiguity of XML Schema grammars.
In many applications, it is necessary to screen queries and/or results according to specific user groups [Candan et al. 1996]. More specifically for us, embedded Web service calls in documents that are exchanged may be a serious cause of security violation. Indeed, this was one of the original motivations for the work presented here. Controlling these calls by enforcing schemas for exchanged documents appeared to us as useful for building secure applications, and can be combined with other security and access models that were proposed for XML and Web services, for example, in Damiani et al. [2001] and WS-Security.20 However, further work is needed to investigate this aspect.

The work presented here is part of the ActiveXML [Abiteboul et al. 2002, 2003b] (see also the Active XML homepage at http://www.rocq.inria.fr/verso/Gemo/Projects/axml) project based on XML and Web services. We presented in this article what forms the core of the module that, in a peer, supports and controls the dialogue (via Web services) with the rest of the world. This particular module may be extended in several ways. First, one may introduce "automatic converters" capable of restructuring the data that is received into the format that was expected, and similarly for the data that is sent. Also, this module may be extended to act as a "negotiator" that talks to other peers to agree with them on the intensional XML Schemas that should be used to exchange data. Finally, the module may be extended to include search capabilities, for example, UDDI-style search (see footnote 4), to try to find services on the Web that provide some particular information.

In the global ActiveXML project, research is going on to extend the framework in various directions. In particular, we are working on distribution and replication of XML data and Web services [Abiteboul et al. 2003a].
Note that when some data may be found in different places and a service may be performed at different sites, the choice of which data to use and where to perform the service becomes an optimization issue. This is related to work on distributed database systems [Ozsu and Valduriez 1999] and to distributed computing at large. The novel aspect is the ability to exchange intensional information. This is in the spirit of Jim and Suciu [2001], which also considers the exchange of intensional information in a distributed query processing setting.

Intensional XML documents nicely fit in the context of data integration, since an intensional part of an XML document may be seen as a view on some data source. Calls to Web services in XML data may be used to wrap Web sources [Garcia-Molina et al. 1997] or to propagate changes for warehouse maintenance [Zhuge et al. 1995]. Note that the control of whether or not to materialize data (studied here) provides a flexible form of integration that is a hybrid of the warehouse model (everything is materialized) and the mediator model (nothing is). On the other hand, this is orthogonal to the issue of selecting the views to materialize in a warehouse, studied, for example, in Gupta [1997] and Yang et al. [1997].

To conclude, we mention some fundamental aspects of the problem we studied. Although the k-depth/left-to-right restriction is not limiting in practice and the algorithm we implemented is fast enough, it would be interesting to understand the complexity and decidability barriers of (variants of) the problem. As we mentioned already, many results were found by Muscholl et al. [2004].

20 The WS-Security specification. Go online to http://www.ibm.com/webservices/library/ws-secure/.
Namely, they proved the undecidability of the general safe rewriting problem for a context-free target language, and provided tight complexity bounds for several restricted cases. We already mentioned the connection to type theory and the novelty of our work in that setting, coming from the regular expressions in XML Schemas. Typing issues in XML Schema have recently motivated a number of interesting works, such as Milo et al. [2000], which are based on tree automata.

REFERENCES

ABITEBOUL, S., AMANN, B., BAUMGARTEN, J., BENJELLOUN, O., NGOC, F. D., AND MILO, T. 2003a. Schema-driven customization of Web services. In Proceedings of VLDB.
ABITEBOUL, S., BENJELLOUN, O., MANOLESCU, I., MILO, T., AND WEBER, R. 2002. Active XML: Peer-to-peer data and Web services integration (demo). In Proceedings of VLDB.
ABITEBOUL, S., BONIFATI, A., COBENA, G., MANOLESCU, I., AND MILO, T. 2003b. Dynamic XML documents with distribution and replication. In Proceedings of ACM SIGMOD.
CANDAN, K. S., JAJODIA, S., AND SUBRAHMANIAN, V. S. 1996. Secure mediated databases. In Proceedings of ICDE. 28–37.
CATTELL, R., Ed. 1996. The Object Database Standard: ODMG-93. Morgan Kaufmann, San Francisco, CA.
CLUET, S., DELOBEL, C., SIMÉON, J., AND SMAGA, K. 1998. Your mediators need data conversion! In Proceedings of ACM SIGMOD. 177–188.
DAMIANI, E., DI VIMERCATI, S. D. C., PARABOSCHI, S., AND SAMARATI, P. 2001. Securing XML documents. In Proceedings of EDBT.
DOAN, A., DOMINGOS, P., AND HALEVY, A. Y. 2001. Reconciling schemas of disparate data sources: A machine-learning approach. In Proceedings of ACM SIGMOD. ACM Press, New York, NY, 509–520.
GARCIA-MOLINA, H., PAPAKONSTANTINOU, Y., QUASS, D., RAJARAMAN, A., SAGIV, Y., ULLMAN, J., AND WIDOM, J. 1997. The TSIMMIS approach to mediation: Data models and languages. J. Intell. Inform. Syst. 8, 117–132.
GUPTA, H. 1997. Selection of views to materialize in a data warehouse. In Proceedings of ICDT. 98–112.
HOPCROFT, J. E. AND ULLMAN, J. D. 1979.
Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA.
HOSOYA, H. AND PIERCE, B. C. 2000. XDuce: A typed XML processing language. In Proceedings of WebDB (Dallas, TX).
JIM, T. AND SUCIU, D. 2001. Dynamically distributed query evaluation. In Proceedings of ACM PODS. 413–424.
MAKOTO, M. 2001. RELAX (Regular Language description for XML). ISO/IEC Tech. Rep. ISO/IEC, Geneva, Switzerland.
MILO, T., SUCIU, D., AND VIANU, V. 2000. Typechecking for XML transformers. In Proceedings of ACM PODS. 11–22.
MITCHELL, J. C. 1990. Type systems for programming languages. In Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics, J. van Leeuwen, Ed. Elsevier, Amsterdam, The Netherlands, 365–458.
MOLINA, H., ULLMAN, J., AND WIDOM, J. 2002. Database Systems: The Complete Book. Prentice Hall, Englewood Cliffs, NJ.
MUSCHOLL, A., SCHWENTICK, T., AND SEGOUFIN, L. 2004. Active context-free games. In Proceedings of the 21st Symposium on Theoretical Aspects of Computer Science (STACS '04; Montpellier, France, Mar. 25–27).
NGOC, F. D. 2002. Validation de documents XML contenant des appels de services [Validation of XML documents containing service calls]. M.S. thesis (DEA SIR, in French). CNAM and University of Paris VI, Paris, France.
OZSU, T. AND VALDURIEZ, P. 1999. Principles of Distributed Database Systems (2nd ed.). Prentice-Hall, Englewood Cliffs, NJ.
SEGOUFIN, L. 2003. Personal communication.
YANG, J., KARLAPALEM, K., AND LI, Q. 1997. Algorithms for materialized view design in data warehousing environment. In VLDB '97: Proceedings of the 23rd International Conference on Very Large Data Bases. Morgan Kaufmann Publishers, San Francisco, CA, 136–145.
ZHUGE, Y., GARCÍA-MOLINA, H., HAMMER, J., AND WIDOM, J. 1995. View maintenance in a warehousing environment. In Proceedings of ACM SIGMOD. 316–327.

Received October 2003; accepted March 2004

ACM Transactions on Database Systems, Vol. 30, No.
1, March 2005.

Progressive Skyline Computation in Database Systems

DIMITRIS PAPADIAS Hong Kong University of Science and Technology
YUFEI TAO City University of Hong Kong
GREG FU JP Morgan Chase
and BERNHARD SEEGER Philipps University

The skyline of a d-dimensional dataset contains the points that are not dominated by any other point on all dimensions. Skyline computation has recently received considerable attention in the database community, especially for progressive methods that can quickly return the initial results without reading the entire database. All the existing algorithms, however, have some serious shortcomings which limit their applicability in practice. In this article we develop branch-and-bound skyline (BBS), an algorithm based on nearest-neighbor search, which is I/O optimal, that is, it performs a single access only to those nodes that may contain skyline points. BBS is simple to implement and supports all types of progressive processing (e.g., user preferences, arbitrary dimensionality, etc.). Furthermore, we propose several interesting variations of skyline computation, and show how BBS can be applied for their efficient processing.

Categories and Subject Descriptors: H.2 [Database Management]; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Algorithms, Experimentation
Additional Key Words and Phrases: Skyline query, branch-and-bound algorithms, multidimensional access methods

This research was supported by the grants HKUST 6180/03E and CityU 1163/04E from Hong Kong RGC and Se 553/3-1 from DFG. Authors' addresses: D. Papadias, Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong; email: dimitris@cs.ust.hk; Y. Tao, Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Hong Kong; email: taoyf@cs.cityu.edu.hk; G. Fu, JP Morgan Chase, 277 Park Avenue, New York, NY 10172-0002; email: gregory.c.fu@jpmchase.com; B.
Seeger, Department of Mathematics and Computer Science, Philipps University, Hans-Meerwein-Strasse, Marburg, Germany 35032; email: seeger@mathematik.uni-marburg.de.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2005 ACM 0362-5915/05/0300-0041 $5.00
ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 41–82.

Fig. 1. Example dataset and skyline.

1. INTRODUCTION

The skyline operator is important for several applications involving multicriteria decision making. Given a set of objects p1, p2, . . . , pN, the operator returns all objects pi such that pi is not dominated by another object pj. Using the common example in the literature, assume in Figure 1 that we have a set of hotels and for each hotel we store its distance from the beach (x axis) and its price (y axis). The most interesting hotels are a, i, and k, for which there is no point that is better in both dimensions. Borzsonyi et al. [2001] proposed an SQL syntax for the skyline operator, according to which the above query would be expressed as: [Select *, From Hotels, Skyline of Price min, Distance min], where min indicates that the price and the distance attributes should be minimized. The syntax can also capture different conditions (such as max), joins, group-by, and so on. For simplicity, we assume that skylines are computed with respect to min conditions on all dimensions; however, all methods discussed can be applied with any combination of conditions.
Using the min condition, a point pi dominates1 another point pj if and only if the coordinate of pi on any axis is not larger than the corresponding coordinate of pj. Informally, this implies that pi is preferable to pj according to any preference (scoring) function which is monotone on all attributes. For instance, hotel a in Figure 1 is better than hotels b and e since it is closer to the beach and cheaper (independently of the relative importance of the distance and price attributes). Furthermore, for every point p in the skyline there exists a monotone function f such that p minimizes f [Borzsonyi et al. 2001].

1 According to this definition, two or more points with the same coordinates can be part of the skyline.

Skylines are related to several other well-known problems, including convex hulls, top-K queries, and nearest-neighbor search. In particular, the convex hull contains the subset of skyline points that may be optimal only for linear preference functions (as opposed to any monotone function). Böhm and Kriegel [2001] proposed an algorithm for convex hulls, which applies branch-and-bound search on datasets indexed by R-trees. In addition, several main-memory algorithms have been proposed for the case where the whole dataset fits in memory [Preparata and Shamos 1985].

Top-K (or ranked) queries retrieve the best K objects that minimize a specific preference function. As an example, given the preference function f(x, y) = x + y, the top-3 query, for the dataset in Figure 1, retrieves <i, 5>, <h, 7>, <m, 8> (in this order), where the number with each point indicates its score. The difference from skyline queries is that the output changes according to the input function and the retrieved points are not guaranteed to be part of the skyline (h and m are dominated by i).
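The dominance definition and the skyline/top-K contrast above can be illustrated with a short Python sketch (illustrative code, not from the article; the coordinates are those of the hotel dataset of Figure 1, as listed in Table I):

```python
# Dominance test, skyline, and top-K for the hotel example.
# Coordinates are (distance, price) pairs from Figure 1 / Table I.
HOTELS = {
    'a': (1, 9), 'b': (2, 10), 'c': (4, 8), 'd': (6, 7), 'e': (9, 10),
    'f': (7, 5), 'g': (5, 6), 'h': (4, 3), 'i': (3, 2), 'k': (9, 1),
    'l': (10, 4), 'm': (6, 2), 'n': (8, 3),
}

def dominates(p, q):
    """p dominates q iff p is no larger on every axis (min conditions).
    Coincident points do not dominate each other (cf. footnote 1)."""
    return p != q and all(pi <= qi for pi, qi in zip(p, q))

def skyline(points):
    return {name for name, p in points.items()
            if not any(dominates(q, p) for q in points.values())}

def top_k(points, k, score):
    """Top-K under a monotone preference function (smaller is better)."""
    return sorted(points, key=lambda name: score(*points[name]))[:k]

print(skyline(HOTELS))                        # {'a', 'i', 'k'}
print(top_k(HOTELS, 3, lambda x, y: x + y))   # ['i', 'h', 'm']
```

As in the text, the top-3 result under f(x, y) = x + y contains h and m, which are dominated by i and therefore outside the skyline.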
Database techniques for top-K queries include Prefer [Hristidis et al. 2001] and Onion [Chang et al. 2000], which are based on prematerialization and convex hulls, respectively. Several methods have been proposed for combining the results of multiple top-K queries [Fagin et al. 2001; Natsev et al. 2001].

Nearest-neighbor queries specify a query point q and output the objects closest to q, in increasing order of their distance. Existing database algorithms assume that the objects are indexed by an R-tree (or some other data-partitioning method) and apply branch-and-bound search. In particular, the depth-first algorithm of Roussopoulos et al. [1995] starts from the root of the R-tree and recursively visits the entry closest to the query point. Entries that are farther than the nearest neighbor already found are pruned. The best-first algorithm of Henrich [1994] and Hjaltason and Samet [1999] inserts the entries of the visited nodes in a heap, and follows the one closest to the query point. The relation between skyline queries and nearest-neighbor search has been exploited by previous skyline algorithms and will be discussed in Section 2.

Skylines, and other directly related problems such as multiobjective optimization [Steuer 1986], maximum vectors [Kung et al. 1975; Matousek 1991], and the contour problem [McLain 1974], have been extensively studied and numerous algorithms have been proposed for main-memory processing. To the best of our knowledge, however, the first work addressing skylines in the context of databases was Borzsonyi et al. [2001], which develops algorithms based on block nested loops, divide-and-conquer, and index scanning. An improved version of block nested loops is presented in Chomicki et al. [2003]. Tan et al. [2001] proposed progressive (or on-line) algorithms that can output skyline points without having to scan the entire data input. Kossmann et al. [2002] presented an algorithm, called NN due to its reliance on nearest-neighbor search, which applies the divide-and-conquer framework on datasets indexed by R-trees. The experimental evaluation of Kossmann et al. [2002] showed that NN outperforms previous algorithms in terms of overall performance and general applicability independently of the dataset characteristics, while it supports on-line processing efficiently.

Despite its advantages, NN also has some serious shortcomings, such as the need for duplicate elimination, multiple node visits, and large space requirements. Motivated by this fact, we propose a progressive algorithm called branch-and-bound skyline (BBS), which, like NN, is based on nearest-neighbor search on multidimensional access methods, but (unlike NN) is optimal in terms of node accesses. We experimentally and analytically show that BBS outperforms NN (usually by orders of magnitude) for all problem instances, while incurring less space overhead. In addition to its efficiency, the proposed algorithm is simple and easily extendible to several practical variations of skyline queries.

Fig. 2. Divide-and-conquer.

The rest of the article is organized as follows: Section 2 reviews previous secondary-memory algorithms for skyline computation, discussing their advantages and limitations. Section 3 introduces BBS, proves its optimality, and analyzes its performance and space consumption. Section 4 proposes alternative skyline queries and illustrates their processing using BBS. Section 5 introduces the concept of approximate skylines, and Section 6 experimentally evaluates BBS, comparing it against NN under a variety of settings. Finally, Section 7 concludes the article and describes directions for future work.

2.
RELATED WORK

This section surveys existing secondary-memory algorithms for computing skylines, namely: (1) divide-and-conquer, (2) block nested loop, (3) sort first skyline, (4) bitmap, (5) index, and (6) nearest neighbor. Specifically, (1) and (2) were proposed in Borzsonyi et al. [2001], (3) in Chomicki et al. [2003], (4) and (5) in Tan et al. [2001], and (6) in Kossmann et al. [2002]. We do not consider the sorted list scan and the B-tree algorithms of Borzsonyi et al. [2001], due to their limited applicability (only for two dimensions) and poor performance, respectively.

2.1 Divide-and-Conquer

The divide-and-conquer (D&C) approach divides the dataset into several partitions so that each partition fits in memory. Then, the partial skyline of the points in every partition is computed using a main-memory algorithm (e.g., Matousek [1991]), and the final skyline is obtained by merging the partial ones. Figure 2 shows an example using the dataset of Figure 1. The data space is divided into four partitions s1, s2, s3, s4, with partial skylines {a, c, g}, {d}, {i}, {m, k}, respectively. In order to obtain the final skyline, we need to remove those points that are dominated by some point in other partitions. Obviously all points in the skyline of s3 must appear in the final skyline, while those in s2 are discarded immediately because they are dominated by any point in s3 (in fact s2 needs to be considered only if s3 is empty). Each skyline point in s1 is compared only with points in s3, because no point in s2 or s4 can dominate those in s1. In this example, points c, g are removed because they are dominated by i. Similarly, the skyline of s4 is also compared with points in s3, which results in the removal of m. Finally, the algorithm terminates with the remaining points {a, i, k}.
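A minimal main-memory sketch of this partition-then-merge scheme follows (illustrative Python; the median-based split and the size threshold are assumptions of the sketch, not details from the article):

```python
def dominates(p, q):
    # p dominates q iff p is no larger on every axis (min conditions)
    return p != q and all(a <= b for a, b in zip(p, q))

def local_skyline(points):
    """Main-memory skyline by pairwise comparison."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]

def dc_skyline(points, threshold=4):
    """D&C sketch: split 2-D points at the coordinate medians into four
    partitions, compute partial skylines recursively, then merge by
    filtering out points dominated across partitions."""
    if len(points) <= threshold:              # partition fits "in memory"
        return local_skyline(points)
    xs = sorted(p[0] for p in points)
    ys = sorted(p[1] for p in points)
    mx, my = xs[len(xs) // 2], ys[len(ys) // 2]
    parts = [[], [], [], []]
    for p in points:
        parts[(p[0] >= mx) + 2 * (p[1] >= my)].append(p)
    if max(len(part) for part in parts) == len(points):
        return local_skyline(points)          # split made no progress
    partials = [q for part in parts for q in dc_skyline(part, threshold)]
    return local_skyline(partials)            # merge the partial skylines
```

On the dataset of Figure 1 this returns the points of hotels a, i, and k, matching the example above.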
D&C is efficient only for small datasets (e.g., if the entire dataset fits in memory then the algorithm requires only one application of a main-memory skyline algorithm). For large datasets, the partitioning process requires reading and writing the entire dataset at least once, thus incurring significant I/O cost. Further, this approach is not suitable for on-line processing because it cannot report any skyline point until the partitioning phase completes.

2.2 Block Nested Loop and Sort First Skyline

A straightforward approach to compute the skyline is to compare each point p with every other point, and report p as part of the skyline if it is not dominated. Block nested loop (BNL) builds on this concept by scanning the data file and keeping a list of candidate skyline points in main memory. At the beginning, the list contains the first data point, while for each subsequent point p there are three cases: (i) if p is dominated by any point in the list, it is discarded as it is not part of the skyline; (ii) if p dominates any point in the list, it is inserted, and all points in the list dominated by p are dropped; and (iii) if p is neither dominated by, nor dominates, any point in the list, it is simply inserted without dropping any point. The list is self-organizing because every point found dominating other points is moved to the top. This reduces the number of comparisons, as points that dominate multiple other points are likely to be checked first.

A problem of BNL is that the list may become larger than the main memory. When this happens, all points falling in the third case (cases (i) and (ii) do not increase the list size) are added to a temporary file. This fact necessitates multiple passes of BNL. In particular, after the algorithm finishes scanning the data file, only points that were inserted in the list before the creation of the temporary file are guaranteed to be in the skyline and are output.
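The three insertion cases and the self-organizing window can be sketched as follows (a single in-memory pass in illustrative Python; the temporary-file machinery for overflowing windows is omitted):

```python
def dominates(p, q):
    # p dominates q iff p is no larger on every axis (min conditions)
    return p != q and all(a <= b for a, b in zip(p, q))

def bnl_skyline(stream):
    """One in-memory pass of BNL over an iterable of points; assumes the
    candidate window never outgrows memory (no temporary file)."""
    window = []
    for p in stream:
        # Case (i): p is dominated by some window point -> discard p and
        # move the dominating point to the top (self-organizing list).
        dominator = next((q for q in window if dominates(q, p)), None)
        if dominator is not None:
            window.remove(dominator)
            window.insert(0, dominator)
            continue
        survivors = [q for q in window if not dominates(p, q)]
        if len(survivors) < len(window):      # case (ii): p dropped points,
            survivors.insert(0, p)            #   so move p to the top
        else:                                 # case (iii): plain insert
            survivors.append(p)
        window = survivors
    return window
```

Under the in-memory assumption, a single call over the whole data stream yields the complete skyline; the multipass variant described next handles overflow.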
The remaining points must be compared against the ones in the temporary file. Thus, BNL has to be executed again, this time using the temporary (instead of the data) file as input.

The advantage of BNL is its wide applicability, since it can be used for any dimensionality without indexing or sorting the data file. Its main problems are the reliance on main memory (a small memory may lead to numerous iterations) and its inadequacy for progressive processing (it has to read the entire data file before it returns the first skyline point). The sort first skyline (SFS) variation of BNL alleviates these problems by first sorting the entire dataset according to a (monotone) preference function. Candidate points are inserted into the list in ascending order of their scores, because points with lower scores are likely to dominate a large number of points, thus rendering the pruning more effective. SFS exhibits progressive behavior because the presorting ensures that a point p dominating another point p′ must be visited before p′; hence we can immediately output the points inserted to the list as skyline points.

Table I. The Bitmap Approach

  id   Coordinate   Bitmap Representation
  a    (1, 9)       (1111111111, 1100000000)
  b    (2, 10)      (1111111110, 1000000000)
  c    (4, 8)       (1111111000, 1110000000)
  d    (6, 7)       (1111100000, 1111000000)
  e    (9, 10)      (1100000000, 1000000000)
  f    (7, 5)       (1111000000, 1111110000)
  g    (5, 6)       (1111110000, 1111100000)
  h    (4, 3)       (1111111000, 1111111100)
  i    (3, 2)       (1111111100, 1111111110)
  k    (9, 1)       (1100000000, 1111111111)
  l    (10, 4)      (1000000000, 1111111000)
  m    (6, 2)       (1111100000, 1111111110)
  n    (8, 3)       (1110000000, 1111111100)
Nevertheless, SFS has to scan the entire data file to return a complete skyline, because even a skyline point may have a very large score and thus appear at the end of the sorted list (e.g., in Figure 1, point a has the third largest score for the preference function 0 · distance + 1 · price). Another problem of SFS (and BNL) is that the order in which the skyline points are reported is fixed (and decided by the sort order), while, as discussed in Section 2.6, a progressive skyline algorithm should be able to report points according to user-specified scoring functions.

2.3 Bitmap

This technique encodes in bitmaps all the information needed to decide whether a point is in the skyline. Toward this, a data point p = (p1, p2, . . . , pd), where d is the number of dimensions, is mapped to an m-bit vector, where m is the total number of distinct values over all dimensions. Let ki be the total number of distinct values on the ith dimension (i.e., m = k1 + k2 + · · · + kd). In Figure 1, for example, there are k1 = k2 = 10 distinct values on the x, y dimensions and m = 20. Assume that pi is the ji-th smallest number on the ith axis; then it is represented by ki bits, where the leftmost (ki − ji + 1) bits are 1, and the remaining ones 0. Table I shows the bitmaps for points in Figure 1. Since point a has the smallest value (1) on the x axis, all bits of a1 are 1. Similarly, since a2 (= 9) is the ninth smallest on the y axis, the first 10 − 9 + 1 = 2 bits of its representation are 1, while the remaining ones are 0.

Consider that we want to decide whether a point, for example, c with bitmap representation (1111111000, 1110000000), belongs to the skyline. The rightmost bits equal to 1 are the fourth and the eighth, on dimensions x and y, respectively. The algorithm creates two bit-strings, cX = 1110000110000 and cY = 0011011111111, by juxtaposing the corresponding bits (i.e., the fourth and eighth) of every point.
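The encoding and juxtaposition steps can be sketched as follows (illustrative Python over the dataset of Figure 1; the bitmaps are built as strings for readability rather than as machine words, and function names are assumptions of the sketch):

```python
# Bitmap skyline test sketch for the dataset of Figure 1 / Table I.
POINTS = {
    'a': (1, 9), 'b': (2, 10), 'c': (4, 8), 'd': (6, 7), 'e': (9, 10),
    'f': (7, 5), 'g': (5, 6), 'h': (4, 3), 'i': (3, 2), 'k': (9, 1),
    'l': (10, 4), 'm': (6, 2), 'n': (8, 3),
}

def build_bitmaps(points):
    """Rank every value per dimension and encode each coordinate as
    (k_i - j_i + 1) leading 1s followed by 0s, as in Table I."""
    dims = len(next(iter(points.values())))
    ranks = []
    for i in range(dims):
        values = sorted({p[i] for p in points.values()})
        ranks.append({v: j + 1 for j, v in enumerate(values)})   # j_i
    bitmaps = {}
    for name, p in points.items():
        vecs = []
        for i, v in enumerate(p):
            k = len(ranks[i])
            ones = k - ranks[i][v] + 1
            vecs.append('1' * ones + '0' * (k - ones))
        bitmaps[name] = vecs
    return ranks, bitmaps

def in_skyline(name, points, ranks, bitmaps):
    """Juxtapose, per dimension, every point's bit at the position of the
    rightmost 1 of `name`, AND the columns, and count the 1s. A single 1
    (the point itself) means the point is in the skyline (coincident
    points, cf. footnote 2, are ignored in this sketch)."""
    p = points[name]
    columns = []
    for i, v in enumerate(p):
        pos = len(ranks[i]) - ranks[i][v]     # index of the rightmost 1
        columns.append([bitmaps[q][i][pos] == '1' for q in points])
    dominators = sum(all(bits) for bits in zip(*columns))
    return dominators == 1
```

For point c the AND of the juxtaposed columns has 1s for c, h, and i, so c is rejected, while a, i, and k each yield a single 1 and are reported.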
In Table I, these bit-strings (shown in bold) contain 13 bits (one from each object, starting from a and ending with n). The 1s in the result of cX & cY = 0010000110000 indicate the points that dominate c, that is, c, h, and i. Obviously, if there is more than a single 1, the considered point is not in the skyline.2 The same operations are repeated for every point in the dataset to obtain the entire skyline.

The efficiency of bitmap relies on the speed of bit-wise operations. The approach can quickly return the first few skyline points according to their insertion order (e.g., alphabetical order in Table I), but, as with BNL and SFS, it cannot adapt to different user preferences. Furthermore, the computation of the entire skyline is expensive because, for each point inspected, it must retrieve the bitmaps of all points in order to obtain the juxtapositions. Also the space consumption may be prohibitive, if the number of distinct values is large. Finally, the technique is not suitable for dynamic datasets where insertions may alter the rankings of attribute values.

2.4 Index

The index approach organizes a set of d-dimensional points into d lists such that a point p = (p1, p2, . . . , pd) is assigned to the ith list (1 ≤ i ≤ d) if and only if its coordinate pi on the ith axis is the minimum among all dimensions, or formally, pi ≤ pj for all j ≠ i. Table II shows the lists for the dataset of Figure 1. Points in each list are sorted in ascending order of their minimum coordinate (minC, for short) and indexed by a B-tree.

Table II. The Index Approach

  List 1                 List 2
  a (1, 9)   minC = 1    k (9, 1)            minC = 1
  b (2, 10)  minC = 2    i (3, 2), m (6, 2)  minC = 2
  c (4, 8)   minC = 4    h (4, 3), n (8, 3)  minC = 3
  g (5, 6)   minC = 5    l (10, 4)           minC = 4
  d (6, 7)   minC = 6    f (7, 5)            minC = 5
  e (9, 10)  minC = 9
A batch in the ith list consists of points that have the same ith coordinate (i.e., minC). In Table II, every point of list 1 constitutes an individual batch because all x coordinates are different. Points in list 2 are divided into five batches {k}, {i, m}, {h, n}, {l}, and {f}. Initially, the algorithm loads the first batch of each list, and handles the one with the minimum minC. In Table II, the first batches {a}, {k} have identical minC = 1, in which case the algorithm handles the batch from list 1. Processing a batch involves (i) computing the skyline inside the batch, and (ii) among the computed points, adding the ones not dominated by any of the already-found skyline points into the skyline list. Continuing the example, since batch {a} contains a single point and no skyline point has been found so far, a is added to the skyline list. The next batch {b} in list 1 has minC = 2; thus, the algorithm handles batch {k} from list 2. Since k is not dominated by a, it is inserted in the skyline. Similarly, the next batch handled is {b} from list 1, where b is dominated by point a (already in the skyline). The algorithm proceeds with batch {i, m}, computes the skyline inside the batch, which contains a single point i (i.e., i dominates m), and adds i to the skyline. At this step, the algorithm does not need to proceed further, because both coordinates of i are smaller than or equal to the minC (i.e., 4, 3) of the next batches (i.e., {c}, {h, n}) of lists 1 and 2. This means that all the remaining points (in both lists) are dominated by i, and the algorithm terminates with {a, i, k}.

2 The result of "&" will contain several 1s if multiple skyline points coincide. This case can be handled with an additional "or" operation [Tan et al. 2001].

Fig. 3. Example of NN.
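The list construction and the batch-by-batch scan with early termination can be sketched as follows (illustrative Python; B-trees are replaced by sorted in-memory lists, coincident points are ignored, and the termination test is a simplified, conservative form of the per-list check described above):

```python
def dominates(p, q):
    # p dominates q iff p is no larger on every axis (min conditions)
    return p != q and all(a <= b for a, b in zip(p, q))

def build_lists(points):
    """Assign each point to the list of the axis on which its coordinate
    is minimal, sorted ascending by that minimum coordinate (minC)."""
    d = len(points[0])
    lists = [[] for _ in range(d)]
    for p in points:
        axis = min(range(d), key=lambda j: p[j])
        lists[axis].append((p[axis], p))
    for lst in lists:
        lst.sort()
    return lists

def index_skyline(points):
    lists = build_lists(points)
    cursors = [0] * len(lists)
    sky = []
    while True:
        heads = [(lists[i][cursors[i]][0], i)
                 for i in range(len(lists)) if cursors[i] < len(lists[i])]
        if not heads:
            break
        minc, i = min(heads)      # next batch to handle (list 1 on ties)
        # Early termination: a found skyline point whose coordinates are
        # all <= every pending minC dominates all remaining points.
        if any(all(c <= minc for c in s) for s in sky):
            break
        batch = []                # all entries of list i with this minC
        while (cursors[i] < len(lists[i])
               and lists[i][cursors[i]][0] == minc):
            batch.append(lists[i][cursors[i]][1])
            cursors[i] += 1
        for p in batch:           # skyline inside the batch, then filter
            if not any(dominates(q, p) for q in sky + batch):
                sky.append(p)
    return sky
```

Run on the dataset of Figure 1, the sketch reproduces the trace above: a, then k, then i are added, and the scan stops before reaching batches {c} and {h, n}.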
Although this technique can quickly return skyline points at the top of the lists, the order in which the skyline points are returned is fixed, not supporting user-defined preferences. Furthermore, as indicated in Kossmann et al. [2002], the lists computed for d dimensions cannot be used to retrieve the skyline on any subset of the dimensions, because the list that an element belongs to may change according to the subset of selected dimensions. In general, for supporting queries on arbitrary dimensions, an exponential number of lists must be precomputed.

2.5 Nearest Neighbor

NN uses the results of nearest-neighbor search to partition the data universe recursively. As an example, consider the application of the algorithm to the dataset of Figure 1, which is indexed by an R-tree [Guttman 1984; Sellis et al. 1987; Beckmann et al. 1990]. NN performs a nearest-neighbor query on the R-tree (using an existing algorithm, such as one of those proposed by Roussopoulos et al. [1995] or Hjaltason and Samet [1999]) to find the point with the minimum distance (mindist) from the beginning of the axes (point o). Without loss of generality,3 we assume that distances are computed according to the L1 norm, that is, the mindist of a point p from the beginning of the axes equals the sum of the coordinates of p. It can be shown that the first nearest neighbor (point i with mindist 5) is part of the skyline. On the other hand, all the points in the dominance region of i (shaded area in Figure 3(a)) can be pruned from further consideration. The remaining space is split in two partitions based on the coordinates (ix, iy) of point i: (i) [0, ix) × [0, ∞) and (ii) [0, ∞) × [0, iy). In Figure 3(a), the first partition contains subdivisions 1 and 3, while the second one contains subdivisions 1 and 2. The partitions resulting after the discovery of a skyline point are inserted in a to-do list.
While the to-do list is not empty, NN removes one of the partitions from the list and recursively repeats the same process. For instance, point a is the nearest neighbor in partition [0, ix) × [0, ∞), which causes the insertion of partitions [0, ax) × [0, ∞) (subdivisions 5 and 7 in Figure 3(b)) and [0, ix) × [0, ay) (subdivisions 5 and 6 in Figure 3(b)) in the to-do list. If a partition is empty, it is not subdivided further.

3 NN (and BBS) can be applied with any monotone function; the skyline points are the same, but the order in which they are discovered may be different.

Fig. 4. NN partitioning for three dimensions.

In general, if d is the dimensionality of the data space, a new skyline point causes d recursive applications of NN. In particular, each coordinate of the discovered point splits the corresponding axis, introducing a new search region towards the origin of the axis. Figure 4(a) shows a three-dimensional (3D) example, where point n with coordinates (nx, ny, nz) is the first nearest neighbor (i.e., skyline point). The NN algorithm will be recursively called for the partitions (i) [0, nx) × [0, ∞) × [0, ∞) (Figure 4(b)), (ii) [0, ∞) × [0, ny) × [0, ∞) (Figure 4(c)), and (iii) [0, ∞) × [0, ∞) × [0, nz) (Figure 4(d)). Among the eight space subdivisions shown in Figure 4, the eighth one will not be searched by any query since it is dominated by point n. Each of the remaining subdivisions, however, will be searched by two queries; for example, a skyline point in subdivision 2 will be discovered by both the second and third queries.

In general, for d > 2, the overlapping of the partitions necessitates duplicate elimination. Kossmann et al. [2002] proposed the following elimination methods:

—Laisser-faire: A main memory hash table stores the skyline points found so far.
When a point p is discovered, it is probed and, if it already exists in the hash table, p is discarded; otherwise, p is inserted into the hash table. The technique is straightforward and incurs minimum CPU overhead, but results in very high I/O cost since large parts of the space will be accessed by multiple queries.

50 • D. Papadias et al.

—Propagate: When a point p is found, all the partitions in the to-do list that contain p are removed and repartitioned according to p. The new partitions are inserted into the to-do list. Although propagate does not discover the same skyline point twice, it incurs high CPU cost because the to-do list is scanned every time a skyline point is discovered.
—Merge: The main idea is to merge partitions in to-do, thus reducing the number of queries that have to be performed. Partitions that are contained in other ones can be eliminated in the process. Like propagate, merge also incurs high CPU cost since it is expensive to find good candidates for merging.
—Fine-grained partitioning: The original NN algorithm generates d partitions after a skyline point is found. An alternative approach is to generate 2^d nonoverlapping subdivisions. In Figure 4, for instance, the discovery of point n will lead to six new queries (i.e., 2^3 − 2, since subdivisions 1 and 8 cannot contain any skyline points). Although fine-grained partitioning avoids duplicates, it generates the more complex problem of false hits, that is, it is possible that points in one subdivision (e.g., subdivision 4) are dominated by points in another (e.g., subdivision 2) and should be eliminated.

According to the experimental evaluation of Kossmann et al. [2002], the performance of laisser-faire and merge was unacceptable, while fine-grained partitioning was not implemented due to the false hits problem.
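The laisser-faire method reduces to a plain hash-set membership test on each discovered point. A minimal sketch, with a hypothetical stream of discoveries coming from overlapping partitions:

```python
# Laisser-faire duplicate elimination: overlapping partitions may rediscover
# the same skyline point; a hash set keeps each point exactly once.
found = set()

def report(p):
    """Return True if p is a newly discovered skyline point, False if it is a duplicate."""
    if p in found:
        return False
    found.add(p)
    return True

# Simulated discovery stream from two overlapping partitions (hypothetical points).
stream = [(4, 1), (2, 6), (4, 1), (7, 2), (2, 6)]
unique = [p for p in stream if report(p)]
```

The probe itself is O(1); the cost criticized in the text is the redundant I/O spent producing the duplicates in the first place, which the hash table does nothing to prevent.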
Propagate was significantly more efficient, but the best results were achieved by a hybrid method combining propagate and laisser-faire.

2.6 Discussion About the Existing Algorithms

We summarize this section with a comparison of the existing methods, based on the experiments of Tan et al. [2001], Kossmann et al. [2002], and Chomicki et al. [2003]. Tan et al. [2001] examined BNL, D&C, bitmap, and index, and suggested that index is the fastest algorithm for producing the entire skyline under all settings. D&C and bitmap are not favored by correlated datasets (where the skyline is small) as the overhead of partition-merging and bitmap-loading, respectively, does not pay off. BNL performs well for small skylines, but its cost increases fast with the skyline size (e.g., for anticorrelated datasets, high dimensionality, etc.) due to the large number of iterations that must be performed. Tan et al. [2001] also showed that index has the best performance in returning skyline points progressively, followed by bitmap. The experiments of Chomicki et al. [2003] demonstrated that SFS is in most cases faster than BNL without, however, comparing it with other algorithms. According to the evaluation of Kossmann et al. [2002], NN returns the entire skyline more quickly than index (hence also more quickly than BNL, D&C, and bitmap) for up to four dimensions, and their difference increases (sometimes to orders of magnitude) with the skyline size. Although index can produce the first few skyline points in shorter time, these points are not representative of the whole skyline (as they are good on only one axis while having large coordinates on the others). Kossmann et al. [2002] also suggested a set of criteria (adopted from Hellerstein et al. [1999]) for evaluating the behavior and applicability of progressive skyline algorithms:

(i) Progressiveness: the first results should be reported to the user almost instantly and the output size should gradually increase.
(ii) Absence of false misses: given enough time, the algorithm should generate the entire skyline.
(iii) Absence of false hits: the algorithm should not discover temporary skyline points that will be later replaced.
(iv) Fairness: the algorithm should not favor points that are particularly good in one dimension.
(v) Incorporation of preferences: the users should be able to determine the order according to which skyline points are reported.
(vi) Universality: the algorithm should be applicable to any dataset distribution and dimensionality, using some standard index structure.

All the methods satisfy criterion (ii), as they deal with exact (as opposed to approximate) skyline computation. Criteria (i) and (iii) are violated by D&C and BNL since they require at least a scan of the data file before reporting skyline points and they both insert points (in partial skylines or the self-organizing list) that are later removed. Furthermore, SFS and bitmap need to read the entire file before termination, while index and NN can terminate as soon as all skyline points are discovered. Criteria (iv) and (vi) are violated by index because it outputs the points according to their minimum coordinates in some dimension and cannot handle skylines in some subset of the original dimensionality. All algorithms, except NN, defy criterion (v); NN can incorporate preferences by simply changing the distance definition according to the input scoring function. Finally, note that progressive behavior requires some form of preprocessing, that is, index creation (index, NN), sorting (SFS), or bitmap creation (bitmap). This preprocessing is a one-time effort since it can be used by all subsequent queries provided that the corresponding structure is updateable in the presence of record insertions and deletions.
The maintenance of the sorted list in SFS can be performed by building a B+-tree on top of the list. The insertion of a record in index simply adds the record in the list that corresponds to its minimum coordinate; similarly, deletion removes the record from the list. NN can also be updated incrementally as it is based on a fully dynamic structure (i.e., the R-tree). On the other hand, bitmap is aimed at static datasets because a record insertion/deletion may alter the bitmap representation of numerous (in the worst case, of all) records.

3. BRANCH-AND-BOUND SKYLINE ALGORITHM

Despite its general applicability and performance advantages compared to existing skyline algorithms, NN has some serious shortcomings, which are described in Section 3.1. Then Section 3.2 proposes the BBS algorithm and proves its correctness. Section 3.3 analyzes the performance of BBS and illustrates its I/O optimality. Finally, Section 3.4 discusses the incremental maintenance of skylines in the presence of database updates.

3.1 Motivation

A recursive call of the NN algorithm terminates when the corresponding nearest-neighbor query does not retrieve any point within the corresponding space.

Fig. 5. Recursion tree.

Let us call such a query empty, to distinguish it from nonempty queries that return results, each spawning d new recursive applications of the algorithm (where d is the dimensionality of the data space). Figure 5 shows a query processing tree, where empty queries are illustrated as transparent circles. For the second level of recursion, for instance, the second query does not return any results, in which case the recursion will not proceed further. Some of the nonempty queries may be redundant, meaning that they return skyline points already found by previous queries.
Let s be the number of skyline points in the result, e the number of empty queries, ne the number of nonempty ones, and r the number of redundant queries. Since every nonempty query either retrieves a skyline point or is redundant, we have ne = s + r. Furthermore, the number of empty queries in Figure 5 equals the number of leaf nodes in the recursion tree, that is, e = ne · (d − 1) + 1. By combining the two equations, we get e = (s + r) · (d − 1) + 1. Each query must traverse a whole path from the root to the leaf level of the R-tree before it terminates; therefore, its I/O cost is at least h node accesses, where h is the height of the tree. Summarizing the above observations, the total number of accesses for NN is: NA_NN ≥ (e + s + r) · h = (s + r) · h · d + h > s · h · d. The value s · h · d is a rather optimistic lower bound since, for d > 2, the number r of redundant queries may be very high (depending on the duplicate elimination method used), and queries normally incur more than h node accesses. Another problem of NN concerns the to-do list size, which can exceed that of the dataset for as low as three dimensions, even without considering redundant queries. Assume, for instance, a 3D uniform dataset (cardinality N) and a skyline query with the preference function f(x, y, z) = x. The first skyline point n (nx, ny, nz) has the smallest x coordinate among all data points, and adds partitions Px = [0, nx) × [0, ∞) × [0, ∞), Py = [0, ∞) × [0, ny) × [0, ∞), Pz = [0, ∞) × [0, ∞) × [0, nz) to the to-do list. Note that the NN query in Px is empty because there is no other point whose x coordinate is below nx. On the other hand, the expected volume of Py (Pz) is 1/2 (assuming unit axis length on all dimensions), because the nearest neighbor is decided solely on x coordinates, and hence ny (nz) distributes uniformly in [0, 1].
Following the same reasoning, a NN in Py finds the second skyline point, which introduces three new partitions such that one partition leads to an empty query, while the volumes of the other two are 1/4. Pz is handled similarly, after which the to-do list contains four partitions with volumes 1/4, and 2 empty partitions. In general, after the ith level of recursion, the to-do list contains 2^i partitions with volume 1/2^i, and 2^(i−1) empty partitions.

Fig. 6. R-tree example.

The algorithm terminates when 1/2^i < 1/N (i.e., i > log N) so that all partitions in the to-do list are empty. Assuming that the empty queries are performed at the end, the size of the to-do list can be obtained by summing the number e of empty queries at each recursion level i: Σ_{i=1..log N} 2^(i−1) = N − 1. The implication of the above equation is that, even in 3D, NN may behave like a main-memory algorithm (since the to-do list, which resides in memory, is of the same order of size as the input dataset). Using the same reasoning, for arbitrary dimensionality d > 2, e = Θ((d − 1)^(log N)), that is, the to-do list may become orders of magnitude larger than the dataset, which seriously limits the applicability of NN. In fact, as shown in Section 6, the algorithm does not terminate in the majority of experiments involving four and five dimensions.

3.2 Description of BBS

Like NN, BBS is also based on nearest-neighbor search. Although both algorithms can be used with any data-partitioning method, in this article we use R-trees due to their simplicity and popularity. The same concepts can be applied with other multidimensional access methods for high-dimensional spaces, where the performance of R-trees is known to deteriorate. Furthermore, as claimed in Kossmann et al. [2002], most applications involve up to five dimensions, for which R-trees are still efficient.
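The counting argument of Section 3.1 can be checked numerically. The sketch below uses a hypothetical cardinality N (a power of 2, so the geometric sum is exact) and verifies that the empty queries accumulated over all recursion levels total N − 1, i.e., the to-do list grows to the order of the dataset itself:

```python
import math

# Section 3.1, d = 3: after level i the to-do list holds 2**i partitions of
# volume 1/2**i; recursion stops once 1/2**i < 1/N, i.e., after about log2(N)
# levels. Summing the 2**(i-1) empty queries per level gives N - 1.
N = 1024                        # hypothetical dataset cardinality
levels = int(math.log2(N))      # number of recursion levels until termination
empty_total = sum(2 ** (i - 1) for i in range(1, levels + 1))
assert empty_total == N - 1     # to-do list is of the same order as the dataset
```

For d > 2 the corresponding count is Θ((d − 1)^(log N)), which grows even faster, matching the text's claim that the to-do list can dwarf the input.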
For the following discussion, we use the set of 2D data points of Figure 1, organized in the R-tree of Figure 6 with node capacity = 3. An intermediate entry ei corresponds to the minimum bounding rectangle (MBR) of a node Ni at the lower level, while a leaf entry corresponds to a data point. Distances are computed according to the L1 norm, that is, the mindist of a point equals the sum of its coordinates and the mindist of an MBR (i.e., intermediate entry) equals the mindist of its lower-left corner point. BBS, similar to the previous algorithms for nearest neighbors [Roussopoulos et al. 1995; Hjaltason and Samet 1999] and convex hulls [Böhm and Kriegel 2001], adopts the branch-and-bound paradigm. Specifically, it starts from the root node of the R-tree and inserts all its entries (e6, e7) in a heap sorted according to their mindist. Then, the entry with the minimum mindist (e7) is "expanded". This expansion removes the entry (e7) from the heap and inserts its children (e3, e4, e5).

Table III. Heap Contents
Action      | Heap Contents                              | S
Access root | <e7, 4><e6, 6>                             | Ø
Expand e7   | <e3, 5><e6, 6><e5, 8><e4, 10>              | Ø
Expand e3   | <i, 5><e6, 6><h, 7><e5, 8><e4, 10><g, 11>  | {i}
Expand e6   | <h, 7><e5, 8><e1, 9><e4, 10><g, 11>        | {i}
Expand e1   | <a, 10><e4, 10><g, 11><b, 12><c, 12>       | {i, a}
Expand e4   | <k, 10><g, 11><b, 12><c, 12><l, 14>        | {i, a, k}

Fig. 7. BBS algorithm.

The next expanded entry is again the one with the minimum mindist (e3), in which the first nearest neighbor (i) is found. This point (i) belongs to the skyline, and is inserted to the list S of skyline points. Notice that up to this step BBS behaves like the best-first nearest-neighbor algorithm of Hjaltason and Samet [1999]. The next entry to be expanded is e6.
Although the nearest-neighbor algorithm would now terminate, since the mindist (6) of e6 is greater than the distance (5) of the nearest neighbor (i) already found, BBS will proceed because node N6 may contain skyline points (e.g., a). Among the children of e6, however, only the ones that are not dominated by some point in S are inserted into the heap. In this case, e2 is pruned because it is dominated by point i. The next entry considered (h) is also pruned, as it is dominated by point i. The algorithm proceeds in the same manner until the heap becomes empty. Table III shows the ids and the mindist of the entries inserted in the heap (skyline points are bold). The pseudocode for BBS is shown in Figure 7. Notice that an entry is checked for dominance twice: before it is inserted in the heap and before it is expanded. The second check is necessary because an entry (e.g., e5) in the heap may become dominated by some skyline point discovered after its insertion (therefore, the entry does not need to be visited). Next we prove the correctness of BBS.

LEMMA 1. BBS visits (leaf and intermediate) entries of an R-tree in ascending order of their distance to the origin of the axes.

Fig. 8. Entries of the main-memory R-tree.

PROOF. The proof is straightforward since the algorithm always visits entries according to their mindist order preserved by the heap.

LEMMA 2. Any data point added to S during the execution of the algorithm is guaranteed to be a final skyline point.

PROOF. Assume, on the contrary, that point pj was added into S, but it is not a final skyline point. Then pj must be dominated by a (final) skyline point, say, pi, whose coordinate on any axis is not larger than the corresponding coordinate of pj, and at least one coordinate is smaller (since pi and pj are different points).
This in turn means that mindist(pi) < mindist(pj). By Lemma 1, pi must be visited before pj. In other words, at the time pj is processed, pi must have already appeared in the skyline list, and hence pj should be pruned, which contradicts the fact that pj was added in the list.

LEMMA 3. Every data point will be examined, unless one of its ancestor nodes has been pruned.

PROOF. The proof is obvious since all entries that are not pruned by an existing skyline point are inserted into the heap and examined.

Lemmas 2 and 3 guarantee that, if BBS is allowed to execute until its termination, it will correctly return all skyline points, without reporting any false hits. An important issue regards the dominance checking, which can be expensive if the skyline contains numerous points. In order to speed up this process we insert the skyline points found in a main-memory R-tree. Continuing the example of Figure 6, for instance, only points i, a, k will be inserted (in this order) to the main-memory R-tree. Checking for dominance can now be performed in a way similar to traditional window queries. An entry (i.e., node MBR or data point) is dominated by a skyline point p if its lower-left point falls inside the dominance region of p, that is, the rectangle defined by p and the edge of the universe. Figure 8 shows the dominance regions for points i, a, k and two entries; e is dominated by i and k, while e′ is not dominated by any point (therefore it should be expanded). Note that, in general, most dominance regions will cover a large part of the data space, in which case there will be significant overlap between the intermediate nodes of the main-memory R-tree.
Unlike traditional window queries that must retrieve all results, this is not a problem here because we only need to retrieve a single dominance region in order to determine that the entry is dominated (by at least one skyline point). To conclude this section, we informally evaluate BBS with respect to the criteria of Hellerstein et al. [1999] and Kossmann et al. [2002], presented in Section 2.6. BBS satisfies property (i) as it returns skyline points instantly in ascending order of their distance to the origin, without having to visit a large part of the R-tree. Lemma 3 ensures property (ii), since every data point is examined unless one of its ancestors is dominated (in which case the point is dominated too). Lemma 2 guarantees property (iii). Property (iv) is also fulfilled because BBS outputs points according to their mindist, which takes into account all dimensions. Regarding user preferences (v), as we discuss in Section 4.1, the user can specify the order of skyline points to be returned by appropriate preference functions. Furthermore, BBS also satisfies property (vi) since it does not require any specialized indexing structure, but (like NN) it can be applied with R-trees or any other data-partitioning method. Furthermore, the same index can be used for any subset of the d dimensions that may be relevant to different users.

3.3 Analysis of BBS

In this section, we first prove that BBS is I/O optimal, meaning that (i) it visits only the nodes that may contain skyline points, and (ii) it does not access the same node twice. Then we provide a theoretical comparison with NN in terms of the number of node accesses and memory consumption (i.e., the heap versus the to-do list sizes). Central to the analysis of BBS is the concept of the skyline search region (SSR), that is, the part of the data space that is not dominated by any skyline point. Consider for instance the running example (with skyline points i, a, k).
The SSR is the shaded area in Figure 8 defined by the skyline and the two axes. We start with the following observation.

LEMMA 4. Any skyline algorithm based on R-trees must access all the nodes whose MBRs intersect the SSR.

For instance, although entry e′ in Figure 8 does not contain any skyline points, this cannot be determined unless the child node of e′ is visited.

LEMMA 5. If an entry e does not intersect the SSR, then there is a skyline point p whose distance from the origin of the axes is smaller than the mindist of e.

PROOF. Since e does not intersect the SSR, it must be dominated by at least one skyline point p, meaning that p dominates the lower-left corner of e. This implies that the distance of p to the origin is smaller than the mindist of e.

THEOREM 6. The number of node accesses performed by BBS is optimal.

PROOF. First we prove that BBS only accesses nodes that may contain skyline points. Assume, to the contrary, that the algorithm also visits an entry (let it be e in Figure 8) that does not intersect the SSR. Clearly, e should not be accessed because it cannot contain skyline points. Consider a skyline point that dominates e (e.g., k). Then, by Lemma 5, the distance of k to the origin is smaller than the mindist of e. According to Lemma 1, BBS visits the entries of the R-tree in ascending order of their mindist to the origin. Hence, k must be processed before e, meaning that e will be pruned by k, which contradicts the fact that e is visited. In order to complete the proof, we need to show that an entry is not visited multiple times. This is straightforward because entries are inserted into the heap (and expanded) at most once, according to their mindist.
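Stripped of the R-tree (so every heap entry is a bare data point rather than a node MBR), the branch-and-bound loop of Section 3.2 reduces to the sketch below; the dataset is hypothetical, not the paper's Figure 1:

```python
import heapq

def dominates(p, q):
    """p dominates q: p is <= q on every axis and the points differ."""
    return all(a <= b for a, b in zip(p, q)) and p != q

def bbs_points(points):
    """Point-level BBS sketch: pop entries in ascending L1 mindist; a popped
    point not dominated by any reported skyline point is itself a final
    skyline point (Lemmas 1 and 2), so nothing is ever retracted."""
    heap = [(sum(p), p) for p in points]   # (mindist, point)
    heapq.heapify(heap)
    skyline = []
    while heap:
        _, p = heapq.heappop(heap)         # minimum mindist first
        if not any(dominates(s, p) for s in skyline):
            skyline.append(p)              # progressive, never-retracted output
    return skyline

points = [(4, 1), (2, 6), (6, 3), (1, 9), (5, 5), (7, 2)]
```

The real algorithm additionally pushes intermediate R-tree entries (keyed by the mindist of their lower-left corner) and prunes dominated entries both before insertion and before expansion; the point-level loop above preserves the same correctness argument.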
Assuming that each leaf node visited contains exactly one skyline point, the number NA_BBS of node accesses performed by BBS is at most s · h (where s is the number of skyline points, and h the height of the R-tree). This bound corresponds to a rather pessimistic case, where BBS has to access a complete path for each skyline point. Many skyline points, however, may be found in the same leaf nodes, or in the same branch of a nonleaf node (e.g., the root of the tree!), so that these nodes only need to be accessed once (our experiments show that in most cases the number of node accesses at each level of the tree is much smaller than s). Therefore, BBS is at least d (= s · h · d / s · h) times faster than NN (as explained in Section 3.1, the cost NA_NN of NN is at least s · h · d). In practice, for d > 2, the speedup is much larger than d (several orders of magnitude) as NA_NN = s · h · d does not take into account the number r of redundant queries. Regarding the memory overhead, the number of entries n_heap in the heap of BBS is at most (f − 1) · NA_BBS. This is a pessimistic upper bound, because it assumes that a node expansion removes from the heap the expanded entry and inserts all its f children (in practice, most children will be dominated by some discovered skyline point and pruned). Since for independent dimensions the expected number of skyline points is s = Θ((ln N)^(d−1)/(d − 1)!) [Buchta 1989], n_heap ≤ (f − 1) · NA_BBS ≈ (f − 1) · h · s ≈ (f − 1) · h · (ln N)^(d−1)/(d − 1)!. For d ≥ 3 and typical values of N and f (e.g., N = 10^5 and f ≈ 100), the heap size is much smaller than the corresponding to-do list size, which as discussed in Section 3.1 can be in the order of (d − 1)^(log N). Furthermore, a heap entry stores d + 2 numbers (i.e., entry id, mindist, and the coordinates of the lower-left corner), as opposed to 2d numbers for to-do list entries (i.e., d-dimensional ranges).
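The heap-versus-to-do-list comparison above can be made concrete with a back-of-the-envelope computation. The parameter values below are the ones quoted in the text (N = 10^5, f ≈ 100); the formulas are plugged in directly:

```python
import math

# Heap size of BBS vs. to-do list size of NN, using the bounds in the text.
N, f, d = 10**5, 100, 4
h = math.ceil(math.log(N, f))                        # R-tree height estimate
s = math.log(N) ** (d - 1) / math.factorial(d - 1)   # expected skyline size, (ln N)^(d-1)/(d-1)!
heap_bound = (f - 1) * h * s                         # n_heap <= (f-1) * h * s
todo_bound = (d - 1) ** math.log2(N)                 # to-do list order: (d-1)^(log N)
assert heap_bound < todo_bound                       # heap is orders of magnitude smaller
```

With these values the heap bound is on the order of 10^4 to 10^5 entries while the to-do bound is on the order of 10^7, consistent with the text's conclusion that BBS's memory footprint is far smaller.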
In summary, the main-memory requirement of BBS is of the same order as the size of the skyline, since both the heap and the main-memory R-tree sizes are of this order. This is a reasonable assumption because (i) skylines are normally small and (ii) previous algorithms, such as index, are based on the same principle. Nevertheless, the size of the heap can be further reduced.

Fig. 9. Reducing the size of the heap.

Consider that in Figure 9 intermediate node e is visited first and its children (e.g., e1) are inserted into the heap. When e′ is visited afterward (e and e′ have the same mindist), e1 can be immediately pruned, because there must exist at least one (not yet discovered) point in the bottom edge of e′1 that dominates e1. A similar situation happens if node e′ is accessed first. In this case e1 is inserted into the heap, but it is removed (before its expansion) when e′1 is added. BBS can easily incorporate this mechanism by checking the contents of the heap before the insertion of an entry e: (i) all entries dominated by e are removed; (ii) if e is dominated by some entry, it is not inserted. We chose not to implement this optimization because it induces some CPU overhead without affecting the number of node accesses, which is optimal (in the above example e1 would be pruned during its expansion since by that time e′1 will have been visited).

3.4 Incremental Maintenance of the Skyline

The skyline may change due to subsequent updates (i.e., insertions and deletions) to the database, and hence should be incrementally maintained to avoid recomputation. Given a new point p (e.g., a hotel added to the database), our incremental maintenance algorithm first performs a dominance check on the main-memory R-tree.
If p is dominated (by an existing skyline point), it is simply discarded (i.e., it does not affect the skyline); otherwise, BBS performs a window query (on the main-memory R-tree), using the dominance region of p, to retrieve the skyline points that will become obsolete (i.e., those dominated by p). This query may not retrieve anything (e.g., Figure 10(a)), in which case the number of skyline points increases by one. Figure 10(b) shows another case, where the dominance region of p covers two points i, k, which are removed (from the main-memory R-tree). The final skyline consists of only points a, p.

Fig. 10. Incremental skyline maintenance for insertion.
Fig. 11. Incremental skyline maintenance for deletion.

Handling deletions is more complex. First, if the point removed is not in the skyline (which can be easily checked by the main-memory R-tree using the point's coordinates), no further processing is necessary. Otherwise, part of the skyline must be reconstructed. To illustrate this, assume that point i in Figure 11(a) is deleted. For incremental maintenance, we need to compute the skyline with respect only to the points in the constrained (shaded) area, which is the region exclusively dominated by i (i.e., not including areas dominated by other skyline points). This is because points (e.g., e, l) outside the shaded area cannot appear in the new skyline, as they are dominated by at least one other point (i.e., a or k). As shown in Figure 11(b), the skyline within the exclusive dominance region of i contains two points h and m, which substitute i in the final skyline (of the whole dataset). In Section 4.1, we discuss skyline computation in a constrained region of the data space. Except for the above case of deletion, incremental skyline maintenance involves only main-memory operations.
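The insertion path of the maintenance algorithm fits in a few lines. In this sketch a plain list stands in for the main-memory R-tree (so the "window query" is a linear scan), and the points are hypothetical:

```python
def dominates(p, q):
    """p dominates q: p is <= q on every axis and the points differ."""
    return all(a <= b for a, b in zip(p, q)) and p != q

def insert_point(skyline, p):
    """Incremental maintenance for an insertion: discard p if it is dominated
    by an existing skyline point; otherwise remove the skyline points inside
    the dominance region of p and add p."""
    if any(dominates(s, p) for s in skyline):
        return skyline                           # p does not affect the skyline
    return [s for s in skyline if not dominates(p, s)] + [p]

skyline = [(4, 1), (2, 6), (1, 9)]               # hypothetical current skyline
skyline = insert_point(skyline, (5, 5))          # dominated by (4, 1): no change
skyline = insert_point(skyline, (1, 2))          # removes (2, 6) and (1, 9)
```

Deletion, as the text explains, is harder: removing a skyline point requires a constrained skyline search over its exclusive dominance region, which touches disk.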
Given that the skyline points constitute only a small fraction of the database, the probability of deleting a skyline point is expected to be very low. In extreme cases (e.g., bulk updates, large number of skyline points) where insertions/deletions frequently affect the skyline, we may adopt the following "lazy" strategy to minimize the number of disk accesses: after deleting a skyline point p, we do not compute the constrained skyline immediately, but add p to a buffer. For each subsequent insertion, if p is dominated by a new point p′, we remove it from the buffer because all the points potentially replacing p would become obsolete anyway, as they are dominated by p′ (the insertion of p′ may also render other skyline points obsolete). When there are no more updates or a user issues a skyline query, we perform a single constrained skyline search, setting the constraint region to the union of the exclusive dominance regions of the remaining points in the buffer, which is emptied afterward.

Fig. 12. Constrained query example.

4. VARIATIONS OF SKYLINE QUERIES

In this section we propose novel variations of skyline search, and illustrate how BBS can be applied for their processing. In particular, Section 4.1 discusses constrained skylines, Section 4.2 ranked skylines, Section 4.3 group-by skylines, Section 4.4 dynamic skylines, Section 4.5 enumerating and K-dominating queries, and Section 4.6 skybands.

4.1 Constrained Skyline

Given a set of constraints, a constrained skyline query returns the most interesting points in the data space defined by the constraints. Typically, each constraint is expressed as a range along a dimension and the conjunction of all constraints forms a hyperrectangle (referred to as the constraint region) in the d-dimensional attribute space.
Consider the hotel example, where a user is interested only in hotels whose prices (y axis) are in the range [4, 7]. The skyline in this case contains points g, f, and l (Figure 12), as they are the most interesting hotels in the specified price range. Note that d (which also satisfies the constraints) is not included, as it is dominated by g. The constrained query can be expressed using the syntax of Borzsonyi et al. [2001] and the where clause: Select *, From Hotels, Where Price ∈ [4, 7], Skyline of Price min, Distance min. In addition, constrained queries are useful for incremental maintenance of the skyline in the presence of deletions (as discussed in Section 3.4). BBS can easily process such queries. The only difference with respect to the original algorithm is that entries not intersecting the constraint region are pruned (i.e., not inserted in the heap). Table IV shows the contents of the heap during the processing of the query in Figure 12. The same concept can also be applied when the constraint region is not a (hyper-)rectangle, but an arbitrary area in the data space. The NN algorithm can also support constrained skylines with a similar modification. In particular, the first nearest neighbor (e.g., g) is retrieved in the constraint region using constrained nearest-neighbor search [Ferhatosmanoglu et al. 2001]. Then, each space subdivision is the intersection of the original subdivision (area to be searched by NN for the unconstrained query) and the constraint region. The index method can benefit from the constraints, by

Table IV.
Heap Contents for Constrained Query
Action      | Heap Contents                  | S
Access root | <e7, 4><e6, 6>                 | Ø
Expand e7   | <e3, 5><e6, 6><e4, 10>         | Ø
Expand e3   | <e6, 6><e4, 10><g, 11>         | Ø
Expand e6   | <e4, 10><g, 11><e2, 11>        | Ø
Expand e4   | <g, 11><e2, 11><l, 14>         | {g}
Expand e2   | <f, 12><d, 13><l, 14>          | {g, f, l}

starting with the batches at the beginning of the constraint ranges (instead of the top of the lists). Bitmap can avoid loading the juxtapositions (see Section 2.3) for points that do not satisfy the query constraints, and D&C may discard, during the partitioning step, points that do not belong to the constraint region. For BNL and SFS, the only difference with respect to regular skyline retrieval is that only points in the constraint region are inserted in the self-organizing list.

4.2 Ranked Skyline

Given a set of points in the d-dimensional space [0, 1]^d, a ranked (top-K) skyline query (i) specifies a parameter K and a preference function f which is monotone on each attribute, and (ii) returns the K skyline points p that have the minimum score according to the input function. Consider the running example, where K = 2 and the preference function is f(x, y) = x + 3y^2. The output skyline points should be <k, 12>, <i, 15> in this order (the number with each point indicates its score). Such ranked skyline queries can be expressed using the syntax of Borzsonyi et al. [2001] combined with the order by and stop after clauses: Select *, From Hotels, Skyline of Price min, Distance min, order by Price + 3·sqr(Distance), stop after 2. BBS can easily handle such queries by modifying the mindist definition to reflect the preference function (i.e., the mindist of a point with coordinates x and y equals x + 3y^2). The mindist of an intermediate entry equals the score of its lower-left point. Furthermore, the algorithm terminates after exactly K points have been reported. Due to the monotonicity of f, it is easy to prove that the output points are indeed skyline points.
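The ranked variant changes only two things in the point-level sketch: the heap key becomes the monotone preference function, and the loop stops after K reports. A minimal illustration over hypothetical points (R-tree entries again omitted), using a preference function of the same shape as the one in the text:

```python
import heapq

def dominates(p, q):
    """p dominates q: p is <= q on every axis and the points differ."""
    return all(a <= b for a, b in zip(p, q)) and p != q

def ranked_skyline(points, score, k):
    """Top-K skyline sketch: entries are popped in ascending score order
    (valid for any monotone score), and the loop stops after K reports."""
    heap = [(score(p), p) for p in points]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        sc, p = heapq.heappop(heap)
        if not any(dominates(s, p) for s, _ in result):
            result.append((p, sc))         # (skyline point, its score)
    return result

# Hypothetical data; f(x, y) = x + 3*y**2 mirrors the preference function in the text.
points = [(4, 1), (2, 6), (6, 3), (1, 9), (5, 5), (7, 2)]
top2 = ranked_skyline(points, lambda p: p[0] + 3 * p[1] ** 2, 2)
```

Because the score is monotone, any dominator of a popped point has already been popped (it cannot score higher), so the dominance check against the reported points suffices.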
The only change with respect to the original algorithm is the order of entries visited, which does not affect the correctness or optimality of BBS because in any case an entry will be considered after all entries that dominate it. None of the other algorithms can answer this query efficiently. Specifically, BNL, D&C, bitmap, and index (as well as SFS if the scoring function is different from the sorting one) require first retrieving the entire skyline, sorting the skyline points by their scores, and then outputting the best K ones. On the other hand, although NN can be used with all monotone functions, its application to ranked skyline may incur almost the same cost as that of a complete skyline. This is because, due to its divide-and-conquer nature, it is difficult to establish the termination criterion. If, for instance, K = 2, NN must perform d queries after the first nearest neighbor (skyline point) is found, compare their results, and return the one with the minimum score. The situation is more complicated when K is large, where the output of numerous queries must be compared.

4.3 Group-By Skyline

Assume that for each hotel, in addition to the price and distance, we also store its class (i.e., 1-star, 2-star, . . . , 5-star). Instead of a single skyline covering all three attributes, a user may wish to find the individual skyline in each class. Conceptually, this is equivalent to grouping the hotels by their classes, and then computing the skyline for each group; that is, the number of skylines equals the cardinality of the group-by attribute domain. Using the syntax of Borzsonyi et al. [2001], the query can be expressed as Select *, From Hotels, Skyline of Price min, Distance min, Class diff (i.e., the group-by attribute is specified by the keyword diff).
One straightforward way to support group-by skylines is to create a separate R-tree for the hotels in the same class, and then invoke BBS in each tree. Separating one attribute (i.e., class) from the others, however, would compromise the performance of queries involving all the attributes.4 In the following, we present a variation of BBS which operates on a single R-tree that indexes all the attributes. For the above example, the algorithm (i) stores the skyline points already found for each class in a separate main-memory 2D R-tree and (ii) maintains a single heap containing all the visited entries. The difference is that the sorting key is computed based only on price and distance (i.e., excluding the group-by attribute). Whenever a data point is retrieved, we perform the dominance check at the corresponding main-memory R-tree (i.e., for its class), and insert it into the tree only if it is not dominated by any existing point. On the other hand, the dominance check for each intermediate entry e (performed before its insertion into the heap, and during its expansion) is more complicated, because e is likely to contain hotels of several classes (we can identify the potential classes included in e by its projection on the corresponding axis). First, its MBR (i.e., a 3D box) is projected onto the price-distance plane and the lower-left corner c is obtained. We need to visit e only if c is not dominated in some main-memory R-tree corresponding to a class covered by e. Consider, for instance, that the projection of e on the class dimension is [2, 4] (i.e., e may contain only hotels with 2, 3, and 4 stars). If the lower-left point of e (on the price-distance plane) is dominated in all three classes, e cannot contribute any skyline point. When the number of distinct values of the group-by attribute is large, the skylines may not fit in memory. In this case, we can perform the algorithm in several passes, each pass covering a number of continuous values.
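The single-tree idea can be condensed into a few lines. The sketch below is ours: a global sort stands in for the shared heap, and plain lists stand in for the per-class main-memory 2D R-trees; the example tuples are invented.

```python
def group_by_skyline(points):
    """Per-class skylines in one pass over (class, price, distance) tuples.

    Sketch of the BBS variant of Section 4.3: points are processed in
    ascending price + distance order -- the sorting key excludes the
    group-by attribute -- and a separate in-memory skyline is kept per
    class (a plain list standing in for the 2D main-memory R-tree).
    """
    skylines = {}
    for cls, *coords in sorted(points, key=lambda p: p[1] + p[2]):
        p = tuple(coords)
        sky = skylines.setdefault(cls, [])
        # dominance check only against the skyline of p's own class
        if not any(all(q[i] <= p[i] for i in range(2)) and q != p
                   for q in sky):
            sky.append(p)
    return skylines
```

Because points arrive in ascending key order, a later point can never dominate an earlier one within the same class, so each per-class list is a valid skyline at all times.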
The processing cost will be higher as some nodes (e.g., the root) may be visited several times. It is not clear how to extend NN, D&C, index, or bitmap for group-by skylines beyond the naïve approach, that is, invoking the algorithms for every value of the group-by attribute (e.g., each time focusing on points belonging to a specific group), which, however, would lead to high processing cost. BNL and SFS can be applied in this case by maintaining separate temporary skylines for each class value (similar to the main-memory R-trees of BBS).

4 A 3D skyline in this case should maximize the value of the class (e.g., given two hotels with the same price and distance, the one with more stars is preferable).

4.4 Dynamic Skyline

Assume a database containing points in a d-dimensional space with axes d1, d2, . . . , dd. A dynamic skyline query specifies m dimension functions f1, f2, . . . , fm such that each function fi (1 ≤ i ≤ m) takes as parameters the coordinates of the data points along a subset of the d axes. The goal is to return the skyline in the new data space with dimensions defined by f1, f2, . . . , fm. Consider, for instance, a database that stores the following information for each hotel: (i) its x and (ii) y coordinates, and (iii) its price (i.e., the database contains three dimensions). Then, a user specifies his/her current location (ux, uy), and requests the most interesting hotels, where preference must take into consideration the hotels' proximity to the user (in terms of Euclidean distance) and the price. Each point p with coordinates (px, py, pz) in the original 3D space is transformed to a point p′ in the 2D space with coordinates (f1(px, py), f2(pz)), where the dimension functions f1 and f2 are defined as f1(px, py) = sqrt((px − ux)^2 + (py − uy)^2), and f2(pz) = pz.
The terms original and dynamic space refer to the original d-dimensional data space and the space with computed dimensions (from f1, f2, . . . , fm), respectively. Correspondingly, we refer to the coordinates of a point in the original space as original coordinates, while to those of the point in the dynamic space as dynamic coordinates. BBS is applicable to dynamic skylines by expanding entries in the heap according to their mindist in the dynamic space (which is computed on-the-fly when the entry is considered for the first time). In particular, the mindist of a leaf entry (data point) e with original coordinates (ex, ey, ez) equals sqrt((ex − ux)^2 + (ey − uy)^2) + ez. The mindist of an intermediate entry e whose MBR has ranges [ex0, ex1] × [ey0, ey1] × [ez0, ez1] is computed as mindist([ex0, ex1] × [ey0, ey1], (ux, uy)) + ez0, where the first term equals the mindist between point (ux, uy) and the 2D rectangle [ex0, ex1] × [ey0, ey1]. Furthermore, notice that the concept of dynamic skylines can be employed in conjunction with ranked and constrained queries (e.g., find the top five hotels within 1 km, given that the price is twice as important as the distance). BBS can process such queries by appropriate modification of the mindist definition (the z coordinate is multiplied by 2) and by constraining the search region (f1(x, y) ≤ 1 km). Regarding the applicability of the previous methods, BNL still applies because it evaluates every point, whose dynamic coordinates can be computed on-the-fly. The optimizations of SFS, however, are now useless since the order of points in the dynamic space may be different from that in the original space. D&C and NN can also be modified for dynamic queries with the transformations described above, suffering, however, from the same problems as the original algorithms.
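A small sketch of the transformation (ours; the coordinates and user location below are invented): each stored (x, y, price) point gets its dynamic coordinates computed on the fly, and the skyline is taken in the dynamic space.

```python
import math

def dynamic_skyline(points, user):
    """Skyline in the dynamic space (distance to user, price).

    Sketch of the Section 4.4 transformation: each stored point
    (x, y, price) is mapped on the fly to dynamic coordinates
    (sqrt((x - ux)^2 + (y - uy)^2), price), and points are processed
    in ascending dynamic mindist (coordinate-sum) order, as BBS would
    process its heap entries.
    """
    ux, uy = user
    def dyn(p):                       # dynamic coordinates of p
        return (math.hypot(p[0] - ux, p[1] - uy), p[2])
    skyline = []                      # pairs (dynamic coords, point)
    for p in sorted(points, key=lambda p: sum(dyn(p))):
        d = dyn(p)
        if not any(all(qd[i] <= d[i] for i in range(2)) and qd != d
                   for qd, _ in skyline):
            skyline.append((d, p))
    return [p for _, p in skyline]    # winners, in original coordinates
```

A user at the origin asking `dynamic_skyline([(0, 3, 5), (4, 0, 2), (3, 4, 9)], (0, 0))` gets back the hotels that are undominated in (distance, price) space.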
Bitmap and index are not applicable because these methods rely on precomputation, which provides little help when the dimensions are defined dynamically.

4.5 Enumerating and K-Dominating Queries

Enumerating queries return, for each skyline point p, the number of points dominated by p. This information provides some measure of "goodness" for the skyline points. In the running example, for instance, hotel i may be more interesting than the other skyline points since it dominates nine hotels, as opposed to two for hotels a and k. Let us call num(p) the number of points dominated by point p. A straightforward approach to process such queries involves two steps: (i) first compute the skyline and (ii) for each skyline point p apply a window query in the data R-tree and count the number of points num(p) falling inside the dominance region of p. Notice that since all points (except for the skyline) are dominated, all the nodes of the R-tree will be accessed by some query. Furthermore, due to the large size of the dominance regions, numerous R-tree nodes will be accessed by several window queries. In order to avoid multiple node visits, we apply the inverse procedure, that is, we scan the data file and for each point we perform a query in the main-memory R-tree to find the dominance regions that contain it. The corresponding counters num(p) of the skyline points are then increased accordingly. An interesting variation of the problem is the K-dominating query, which retrieves the K points that dominate the largest number of other points. Strictly speaking, this is not a skyline query, since the result does not necessarily contain skyline points. If K = 3, for instance, the output should include hotels i, h, and m, with num(i) = 9, num(h) = 7, and num(m) = 5.
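The inverse procedure can be sketched as follows (our illustration, with a plain list of skyline points standing in for the main-memory R-tree and a list of tuples standing in for the data file):

```python
def enumerate_skyline(points):
    """num(p) for every skyline point p via a single data scan.

    Sketch of the inverse procedure of Section 4.5: instead of one
    window query per skyline point, scan the dataset once and, for each
    point, increment the counter of every skyline point whose dominance
    region contains it (minimization on all axes assumed).
    """
    d = len(points[0])
    def dominates(q, p):
        return all(q[i] <= p[i] for i in range(d)) and q != p
    # step (i): compute the skyline (brute force here for brevity)
    skyline = [p for p in points
               if not any(dominates(q, p) for q in points)]
    # step (ii): one scan of the "data file" updates all counters
    num = {p: 0 for p in skyline}
    for p in points:
        for s in skyline:
            if dominates(s, p):
                num[s] += 1
    return num
```

For example, `enumerate_skyline([(1, 1), (2, 2), (3, 0), (0, 3)])` reports that (1, 1) dominates one point while the other two skyline points dominate none.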
In order to obtain the result, we first perform an enumerating query that returns the skyline points and the number of points that they dominate. This information for the first K = 3 points is inserted into a list sorted according to num(p), that is, list = <i, 9>, <a, 2>, <k, 2>. The first element of the list (point i) is the first result of the 3-dominating query. Any other point potentially in the result should be in the (exclusive) dominance region of i, but not in the dominance region of a or k (i.e., in the shaded area of Figure 13(a)); otherwise, it would dominate fewer points than a or k. In order to retrieve the candidate points, we perform a local skyline query S′ in this region (i.e., a constrained query), after removing i from S and reporting it to the user. S′ contains points h and m. The new skyline S1 = (S − {i}) ∪ S′ is shown in Figure 13(b). Since h and m do not dominate each other, they may each dominate at most seven points (i.e., num(i) − 2), meaning that they are candidates for the 3-dominating query. In order to find the actual number of points dominated, we perform a window query in the data R-tree using the dominance regions of h and m as query windows. After this step, <h, 7> and <m, 5> replace the previous candidates <a, 2>, <k, 2> in the list. Point h is the second result of the 3-dominating query and is output to the user. Then, the process is repeated for the points that belong to the dominance region of h, but not in the dominance regions of other points in S1 (i.e., shaded area in Figure 13(c)). The new skyline S2 = (S1 − {h}) ∪ {c, g} is shown in Figure 13(d). Points c and g may dominate at most five points each (i.e., num(h) − 2), meaning that they cannot outnumber m. Hence, the query terminates with <i, 9> <h, 7> <m, 5> as the final result. In general, the algorithm can be thought of as skyline "peeling," since it computes local skylines at the points that have the largest dominance.
Fig. 13. Example of 3-dominating query.

Figure 14 shows the pseudocode for K-dominating queries. It is worth pointing out that the exclusive dominance region of a skyline point for d > 2 is not necessarily a hyperrectangle (e.g., in 3D space it may correspond to an "L-shaped" polyhedron derived by removing a cube from another cube). In this case, the constraint region can be represented as a union of hyperrectangles (constrained BBS is still applicable). Furthermore, since we only care about the number of points in the dominance regions (as opposed to their ids), the performance of window queries can be improved by using aggregate R-trees [Papadias et al. 2001] (or any other multidimensional aggregate index). All existing algorithms can be employed for enumerating queries, since the only difference with respect to regular skylines is the second step (i.e., counting the number of points dominated by each skyline point). Actually, the bitmap approach can avoid scanning the actual dataset, because information about num(p) for each point p can be obtained directly by appropriate juxtapositions of the bitmaps. K-dominating queries require an effective mechanism for skyline "peeling," that is, discovery of skyline points in the exclusive dominance region of the last point removed from the skyline. Since this requires the application of a constrained query, all algorithms are applicable (as discussed in Section 4.1).

Fig. 14. K-dominating BBS algorithm.

Fig. 15. Example of 2-skyband query.

4.6 Skyband Query

Similar to K nearest-neighbor queries (which return the K NNs of a point), a K-skyband query reports the set of points which are dominated by at most K points.
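One way to compute a K-skyband in the BBS style can be sketched as follows (our illustration; sorted points stand in for heap entries, and a point is discarded only when more than K already-kept points dominate it):

```python
def skyband(points, k):
    """K-skyband: the points dominated by at most k other points.

    Sketch of the BBS-style computation: process points in ascending
    mindist (coordinate-sum) order and keep a point unless more than k
    already-kept points dominate it. Any dominator not kept had more
    than k dominators itself, all of which also dominate the current
    point, so counting against the kept set is sufficient.
    """
    d = len(points[0])
    band = []
    for p in sorted(points, key=sum):          # ascending mindist order
        dominated = sum(1 for q in band
                        if all(q[i] <= p[i] for i in range(d)) and q != p)
        if dominated <= k:
            band.append(p)
    return band
```

With k = 0 this degenerates to a conventional skyline, matching the statement that K = 0 corresponds to the ordinary case.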
Conceptually, K represents the thickness of the skyline; the case K = 0 corresponds to a conventional skyline. Figure 15 illustrates the result of a 2-skyband query containing hotels {a, b, c, g, h, i, k, m}, each dominated by at most two other hotels. A naïve approach to check if a point p with coordinates (p1, p2, . . . , pd) is in the skyband would be to perform a window query in the R-tree and count the number of points inside the range [0, p1) × [0, p2) × . . . × [0, pd). If this number is smaller than or equal to K, then p belongs to the skyband. Obviously, the approach is very inefficient, since the number of window queries equals the cardinality of the dataset. On the other hand, BBS provides an efficient way for processing skyband queries. The only difference with respect to conventional skylines is that an entry is pruned only if it is dominated by more than K discovered skyline points. Table V shows the contents of the heap during the processing of the query in Figure 15.

Table V. Heap Contents of 2-Skyband Query
  Action       Heap Contents                                                  S
  Access root  <e7, 4> <e6, 6>                                                Ø
  Expand e7    <e3, 5> <e6, 6> <e5, 8> <e4, 10>                               Ø
  Expand e3    <i, 5> <e6, 6> <h, 7> <e5, 8> <e4, 10> <g, 11>                 {i}
  Expand e6    <h, 7> <e5, 8> <e1, 9> <e4, 10> <e2, 11> <g, 11>               {i, h}
  Expand e5    <m, 8> <e1, 9> <e4, 10> <n, 11> <e2, 11> <g, 11>               {i, h, m}
  Expand e1    <a, 10> <e4, 10> <n, 11> <e2, 11> <g, 11> <b, 12> <c, 12>      {i, h, m, a}
  Expand e4    <k, 10> <n, 11> <e2, 11> <g, 11> <b, 12> <c, 12> <l, 14>       {i, h, m, a, k, g, b, c}

Note that the skyband points are reported
in ascending order of their scores, therefore maintaining the progressiveness of the results. BNL and SFS can support K-skyband queries with similar modifications (i.e., insert a point in the list if it is dominated by no more than K other points). None of the other algorithms is applicable, at least in an obvious way.

Table VI. Applicability Comparison
                D&C  BNL  SFS  Bitmap  Index  NN   BBS
  Constrained   Yes  Yes  Yes  Yes     Yes    Yes  Yes
  Ranked        No   No   No   No      No     No   Yes
  Group-by      No   Yes  Yes  No      No     No   Yes
  Dynamic       Yes  Yes  Yes  No      No     Yes  Yes
  K-dominating  Yes  Yes  Yes  Yes     Yes    Yes  Yes
  K-skyband     No   Yes  Yes  No      No     No   Yes

4.7 Summary

Finally, we close this section with Table VI, which summarizes the applicability of the existing algorithms for each skyline variation. A "no" means that the technique is inapplicable, inefficient (e.g., it must perform a postprocessing step on the basic algorithm), or its extension is nontrivial. Even if an algorithm (e.g., BNL) is applicable for a query type (group-by skylines), it does not necessarily imply that it is progressive (the criteria of Section 2.6 also apply to the new skyline queries). Clearly, BBS has the widest applicability since it can process all query types effectively.

5. APPROXIMATE SKYLINES

In this section we introduce approximate skylines, which can be used to provide immediate feedback to the users (i) without any node accesses (using a histogram on the dataset), or (ii) progressively, after the root visit of BBS. The problem with computing approximate skylines is that, even for uniform data, we cannot probabilistically estimate the shape of the skyline based only on the dataset cardinality N. In fact, it is difficult to predict the actual number of skyline points (as opposed to their order of magnitude [Buchta 1989]). To illustrate this, Figures 16(a) and 16(b) show two datasets that differ in the position of a single point, but have different skyline cardinalities (1 and 4, respectively).
Thus, instead of obtaining the actual shape, we target a hypothetical point p such that its x and y coordinates are the minimum among all the expected coordinates in the dataset. We then define the approximate skyline using the two line segments enclosing the dominance region of p. As shown in Figure 16(c), this approximation can be thought of as a "low-resolution" skyline.

Fig. 16. Skylines of uniform data.

Next we compute the expected coordinates of p. First, for uniform distribution, it is reasonable to assume that p falls on the diagonal of the data space (because the data characteristics above and below the diagonal are similar). Assuming, for simplicity, that the data space has unit length on each axis, we denote the coordinates of p as (λ, λ) with 0 ≤ λ ≤ 1. To derive the expected value for λ, we need the probability P{λ ≤ ξ} that λ is no larger than a specific value ξ. To calculate this, note that λ > ξ implies that all the points fall in the dominance region of (ξ, ξ) (i.e., a square with side length 1 − ξ). For uniform data, a point has probability (1 − ξ)^2 of falling in this region, and thus P{λ > ξ} (i.e., the probability that all points are in this region) equals [(1 − ξ)^2]^N. So, P{λ ≤ ξ} = 1 − (1 − ξ)^{2N}, and the expected value of λ is given by

    E(λ) = ∫₀¹ ξ · (dP(λ ≤ ξ)/dξ) dξ = 2N ∫₀¹ ξ · (1 − ξ)^{2N−1} dξ.    (5.1)

Solving this integral, we have

    E(λ) = 1/(2N + 1).    (5.2)

Following similar derivations for d-dimensional spaces, we obtain E(λ) = 1/(d · N + 1). If the dimensions of the data space have different lengths, then the expected coordinate of the hypothetical skyline point on dimension i equals ALi/(d · N + 1), where ALi is the length of the axis.

Fig. 17. Obtaining the approximate skyline for nonuniform data.
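Equation (5.2) is easy to sanity-check numerically. In the sketch below (ours), λ is realized as the largest ξ such that every point dominates (ξ, ξ), i.e., the minimum over points of min(x, y) for uniform points in the unit square:

```python
import random

def mean_lambda(N, trials=20000, seed=1):
    """Monte Carlo estimate of E(lambda) for Equation (5.2).

    lambda = min over the N uniform points of min(x, y): the largest
    xi for which all points lie in the dominance region of (xi, xi).
    Equation (5.2) predicts E(lambda) = 1 / (2N + 1).
    """
    rng = random.Random(seed)       # fixed seed for reproducibility
    total = 0.0
    for _ in range(trials):
        total += min(min(rng.random(), rng.random()) for _ in range(N))
    return total / trials
```

For N = 10, for instance, the estimate should be close to 1/21 ≈ 0.0476.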
Based on the above analysis, we can obtain the approximate skyline for an arbitrary data distribution using a multidimensional histogram [Muralikrishna and DeWitt 1988; Acharya et al. 1999], which typically partitions the data space into a set of buckets and stores for each bucket the number (called density) of points in it. Figure 17(a) shows the extents of 6 buckets (b1, . . . , b6) and their densities, for the dataset of Figure 1. Treating each bucket as a uniform data space, we compute its hypothetical skyline point based on its density. Then the approximate skyline of the original dataset is the skyline of all the hypothetical points, as shown in Figure 17(b). Since the number of hypothetical points is small (at most the number of buckets), the approximate skyline can be computed using existing main-memory algorithms (e.g., Kung et al. [1975]; Matousek [1991]). Due to the fact that histograms are widely used for selectivity estimation and query optimization, the extraction of approximate skylines does not incur additional requirements and does not involve I/O cost. Approximate skylines using histograms can provide some information about the actual skyline in environments (e.g., data streams, on-line processing systems) where only limited statistics of the data distribution (instead of individual data) can be maintained; thus, obtaining the exact skyline is impossible. When the actual data are available, the concept of approximate skyline, combined with BBS, enables the "drill-down" exploration of the actual one. Consider, for instance, that we want to estimate the skyline (in the absence of histograms) by performing a single node access. In this case, BBS retrieves the data R-tree root and computes by Equation (5.2), for every entry MBR, a hypothetical skyline point (i) assuming that the distribution in each MBR is almost uniform (a reasonable assumption for R-trees [Theodoridis et al.
2000]), and (ii) using the average node capacity and the tree level to estimate the number of points in the MBR. The skyline of the hypothetical points constitutes a rough estimation of the actual skyline. Figure 18(a) shows the approximate skyline after visiting the root entry, as well as the real skyline (dashed line). The approximation error corresponds to the difference of the SSRs of the two skylines, that is, the area that is dominated by exactly one skyline (shaded region in Figure 18(a)).

Fig. 18. Approximate skylines as a function of node accesses.

The approximate version of BBS maintains, in addition to the actual skyline S, a set HS consisting of points in the approximate skyline. HS is used just for reporting the current skyline approximation and not to guide the search (the order of node visits remains the same as in the original algorithm). For each intermediate entry found, if its hypothetical point p is not dominated by any point in HS, it is added into the approximate skyline and all the points dominated by p are removed from HS. Leaf entries correspond to actual data points and are also inserted in HS (provided that they are not dominated). When an entry is deheaped, we remove the corresponding (hypothetical or actual) point from HS. If a data point is added to S, it is also inserted in HS. The approximate skyline is progressively refined as more nodes are visited; for example, when the second node N7 is deheaped, the hypothetical point of N7 is replaced with those of its children and the new HS is computed as shown in Figure 18(b). Similarly, the expansion of N3 will lead to the approximate skyline of Figure 18(c). At the termination of approximate BBS, the estimated skyline coincides with the actual one.
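The HS maintenance step can be isolated in a few lines. The sketch below is ours; dominance checks over a plain list replace whatever structure an implementation would use:

```python
def update_hs(hs, p):
    """Insert point p into the approximate skyline HS (Section 5).

    Sketch of the HS maintenance step of approximate BBS: p (the
    hypothetical point of an entry, or an actual data point) is added
    only if no point in HS dominates it, and any points of HS that p
    dominates are removed. Returns the updated HS as a new list.
    """
    d = len(p)
    def dom(a, b):
        return all(a[i] <= b[i] for i in range(d)) and a != b
    if any(dom(q, p) for q in hs):
        return list(hs)               # p is dominated: HS unchanged
    return [q for q in hs if not dom(p, q)] + [p]
```

Called on every heap insertion (and mirrored by a removal on every deheap), this keeps HS a valid skyline of the currently known hypothetical and actual points.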
To show this, assume, on the contrary, that at the termination of the algorithm there still exists a hypothetical/actual point p in HS which does not belong to S. It follows that p is not dominated by the actual skyline. In this case, the corresponding (intermediate or leaf) entry producing p should be processed, contradicting the fact that the algorithm terminates. Note that for computing the hypothetical point of each MBR we use Equation (5.2) because it (i) is simple and efficient (in terms of computation cost), (ii) provides a uniform treatment of approximate skylines (i.e., the same as in the case of histograms), and (iii) has high accuracy (as shown in Section 6.8). Nevertheless, we may derive an alternative approximation based on the fact that each MBR boundary contains a data point. Assuming a uniform distribution on the MBR projections and that no point is minimum on two different dimensions, this approximation leads to d hypothetical points per MBR such that the expected position of each point is 1/((d − 1) · N + 1). Figure 19(a) shows the approximate skyline in this case after the first two node visits (root and N7). Alternatively, BBS can output an envelope enclosing the actual skyline, where the lower bound refers to the skyline obtained from the lower-left vertices of the MBRs and the upper bound refers to the skyline obtained from the upper-right vertices. Figure 19(b) illustrates the corresponding envelope (shaded region) after the first two node visits. The volume of the envelope is an upper bound for the actual approximation error, which shrinks as more nodes are accessed. The concepts of skyline approximation and envelope permit the immediate visualization of information about the skyline, enhancing the progressive behavior of BBS.

Fig. 19. Alternative approximations after visiting root and N7.
In addition, approximate BBS can be easily modified for processing the query variations of Section 4, since the only difference is the maintenance of the hypothetical points in HS for the entries encountered by the original algorithm. The computation of hypothetical points depends on the skyline variation; for example, for constrained skylines the points are computed by taking into account only the node area inside the constraint region. On the other hand, the application of these concepts to NN is not possible (at least in an obvious way), because of the duplicate elimination problem and the multiple accesses to the same node(s).

6. EXPERIMENTAL EVALUATION

In this section we verify the effectiveness of BBS by comparing it against NN, which, according to the evaluation of Kossmann et al. [2002], is the most efficient existing algorithm and exhibits progressive behavior. Our implementation of NN combined laisser-faire and propagate because, as discussed in Section 2.5, it gives the best results. Specifically, only the first 20% of the to-do list was searched for duplicates using propagate, and the rest of the duplicates were handled with laisser-faire. Following the common methodology in the literature, we employed independent (uniform) and anticorrelated5 datasets (generated in the same way as described in Borzsonyi et al. [2001]) with dimensionality d in the range [2, 5] and cardinality N in the range [100K, 10M]. The length of each axis was 10,000. Datasets were indexed by R*-trees [Beckmann et al. 1990] with a page size of 4 kB, resulting in node capacities between 204 (d = 2) and 94 (d = 5). For all experiments we measured the cost in terms of node accesses, since the diagrams for CPU time are very similar (see Papadias et al. [2003]).

Fig. 20. Node accesses vs. dimensionality d (N = 1M).
Sections 6.1 and 6.2 study the effects of dimensionality and cardinality for conventional skyline queries, whereas Section 6.3 compares the progressive behavior of the algorithms. Sections 6.4, 6.5, 6.6, and 6.7 evaluate constrained, group-by skyline, K-dominating skyline, and K-skyband queries, respectively. Finally, Section 6.8 focuses on approximate skylines. Ranked queries are not included because NN is inapplicable, while the performance of BBS is the same as in the experiments for progressive behavior. Similarly, the cost of dynamic skylines is the same as that of conventional skylines in selected dimension projections and is omitted from the evaluation.

5 For anticorrelated distribution, the dimensions are linearly correlated such that, if pi is smaller than pj on one axis, then pi is likely to be larger on at least one other dimension (e.g., hotels near the beach are typically more expensive). An anticorrelated dataset has fractal dimensionality close to 1 (i.e., points lie near the antidiagonal of the space).

6.1 The Effect of Dimensionality

In order to study the effect of dimensionality, we used the datasets with cardinality N = 1M and varied d between 2 and 5. Figure 20 shows the number of node accesses as a function of dimensionality, for independent and anticorrelated datasets. NN could not terminate successfully for d > 4 in the case of independent, and for d > 3 in the case of anticorrelated, datasets due to the prohibitive size of the to-do list (to be discussed shortly). BBS clearly outperformed NN and the difference increased fast with dimensionality. The degradation of NN was caused mainly by the growth of the number of partitions (i.e., each skyline point spawned d partitions), as well as the number of duplicates. The degradation of BBS was due to the growth of the skyline and the poor performance of R-trees
in high dimensions. Note that these factors also influenced NN, but their effect was small compared to the inherent deficiencies of the algorithm.

Fig. 21. Heap and to-do list sizes versus dimensionality d (N = 1M).

Figure 21 shows the maximum sizes (in kbytes) of the heap, the to-do list, and the dataset, as a function of dimensionality. For d = 2, the to-do list was smaller than the heap, and both were negligible compared to the size of the dataset. For d = 3, however, the to-do list surpassed the heap (for independent data) and the dataset (for anticorrelated data). Clearly, the maximum size of the to-do list exceeded the main memory of most existing systems for d ≥ 4 (anticorrelated data), which explains the missing numbers about NN in the diagrams for high dimensions. Notice that Kossmann et al. [2002] reported the cost of NN for returning up to the first 500 skyline points using anticorrelated data in five dimensions. NN can return a number of skyline points (but not the complete skyline), because the to-do list does not reach its maximum size until a sufficient number of skyline points have been found (and a large number of partitions have been added). This issue is discussed further in Section 6.3, where we study the sizes of the heap and to-do lists as a function of the points returned.

6.2 The Effect of Cardinality

Figure 22 shows the number of node accesses versus the cardinality for 3D datasets. Although the effect of cardinality was not as important as that of dimensionality, in all cases BBS was several orders of magnitude faster than NN. For anticorrelated data, NN did not terminate successfully for N ≥ 5M, again due to the prohibitive size of the to-do list. Some irregularities in the diagrams (a small dataset may be more expensive than a larger one) are due to the positions of the skyline points and the order in which they were discovered.
If, for instance, the first nearest neighbor is very close to the origin of the axes, both BBS and NN will prune a large part of their respective search spaces.

6.3 Progressive Behavior

Next we compare the speed of the algorithms in returning skyline points incrementally. Figure 23 shows the node accesses of BBS and NN as a function of the points returned for datasets with N = 1M and d = 3 (the number of points in the final skyline was 119 and 977 for independent and anticorrelated datasets, respectively). Both algorithms return the first point with the same cost (since they both apply nearest neighbor search to locate it). Then, BBS starts to gradually outperform NN and the difference increases with the number of points returned.

Fig. 22. Node accesses versus cardinality N (d = 3).

Fig. 23. Node accesses versus number of points reported (N = 1M, d = 3).

To evaluate the quality of the results, Figure 24 shows the distribution of the first 50 skyline points (out of 977) returned by each algorithm for the anticorrelated dataset with N = 1M and d = 3. The initial skyline points of BBS are evenly distributed in the whole skyline, since they were discovered in the order of their mindist (which was independent of the algorithm). On the other hand, NN produced points concentrated in the middle of the data universe because the partitioned regions, created by new skyline points, were inserted at the end of the to-do list, and thus nearby points were subsequently discovered. Figure 25 compares the sizes of the heap and to-do lists as a function of the points returned. The heap reaches its maximum size at the beginning of BBS, whereas the to-do list reaches it toward the end of NN. This happens because, before BBS discovered the first skyline point, it inserted all the entries of the visited nodes in the heap (since no entry can be pruned by existing skyline points).
The more skyline points were discovered, the more heap entries were pruned, until the heap eventually became empty. On the other hand, the to-do list size is dominated by empty queries, which occurred toward the late phases of NN, when the space subdivisions became too small to contain any points. Thus, NN could still be used to return a number of skyline points (but not the complete skyline) even for relatively high dimensionality.

Progressive Skyline Computation in Database Systems • 75
Fig. 24. Distribution of the first 50 skyline points (anticorrelated, N = 1M, d = 3).
Fig. 25. Sizes of the heap and to-do list versus number of points reported (N = 1M, d = 3).

6.4 Constrained Skyline

Having confirmed the efficiency of BBS for conventional skyline retrieval, we present a comparison between BBS and NN on constrained skylines. Figure 26 shows the node accesses of BBS and NN as a function of the constraint region volume (N = 1M, d = 3), measured as a percentage of the volume of the data universe. The locations of the constraint regions were uniformly generated, and the results were computed as the average of 50 queries. Again BBS was several orders of magnitude faster than NN. The counterintuitive observation here is that constraint regions covering more than 8% of the data space are usually more expensive than regular skylines. Figure 27(a) verifies the observation by illustrating the node accesses of BBS on independent data when the volume of the constraint region ranges between 98% and 100% (i.e., a regular skyline). Even a range very close to 100% is much more expensive than a conventional skyline. Similar results hold for NN (see Figure 27(b)) and anticorrelated data. To explain this, consider Figure 28(a), which shows a skyline S in a constraint region. The nodes that must be visited intersect the constrained skyline search region (shaded area) defined by S and the constraint region.
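Semantically, a constrained skyline is simply the skyline of the points falling inside the constraint region, which a naive sketch makes concrete. The `dominates` helper and the example points are illustrative assumptions, not the paper's R-tree-based method; note how a globally dominated point can re-enter the result once its dominator falls outside the region:

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def constrained_skyline(points, region):
    """Naive constrained skyline: restrict to the points inside the axis-parallel
    constraint region, then keep the non-dominated ones among those.
    region is a list of (lo, hi) ranges, one per dimension."""
    inside = [p for p in points
              if all(lo <= v <= hi for v, (lo, hi) in zip(p, region))]
    return [p for p in inside if not any(dominates(q, p) for q in inside if q != p)]

pts = [(1, 9), (2, 8), (4, 4), (9, 1), (5, 6), (7, 7)]
# In the region [5,10] x [5,10] only (5,6) and (7,7) survive; (5,6) is dominated
# globally by (4,4), but (4,4) lies outside the region, so (5,6) is the answer.
print(constrained_skyline(pts, [(5, 10), (5, 10)]))  # [(5, 6)]
```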
In this example, all four nodes e1, e2, e3, e4 may contain skyline points and should be accessed. On the other hand, if S were a conventional skyline, as in Figure 28(b), nodes e2, e3, and e4 could not exist, because they would have to contain at least one point that dominates part of S, which is impossible if S is a conventional skyline. In general, the only data points in the conventional SSR (shaded area in Figure 28(b)) lie on the skyline, implying that, for any node MBR, at most one of its vertices can be inside the SSR. For constrained skylines there is no such restriction, and the number of nodes intersecting the constrained SSR can be arbitrarily large. It is important to note that the constrained queries issued when a skyline point is removed during incremental maintenance (see Section 3.4) are always cheaper than computing the entire skyline from scratch. Consider, for instance, that the partial skyline of Figure 28(a) is computed for the exclusive dominance area of a deleted skyline point p on the lower-left corner of the constraint region. In this case, nodes such as e2, e3, e4 cannot exist, because otherwise they would have to contain skyline points, contradicting the fact that the constraint region corresponds to the exclusive dominance area of p.

Fig. 26. Node accesses versus volume of constraint region (N = 1M, d = 3).
Fig. 27. Node accesses versus volume of constraint region 98–100% (independent, N = 1M, d = 3).

6.5 Group-By Skyline

Next we consider group-by skyline retrieval, including only BBS because, as discussed in Section 4, NN is inapplicable in this case. Toward this, we generate datasets (with cardinality 1M) in a 3D space that involves two numerical dimensions and one categorical axis. In particular, the number cnum of categories is a parameter ranging from 2 to 64 (cnum is also the number of 2D skylines returned by a group-by skyline query).
Every data point has equal probability of falling in each category, and, for all the points in the same category, their distribution (on the two numerical axes) is either independent or anticorrelated. Figure 29 demonstrates the number of node accesses as a function of cnum. The cost of BBS increases with cnum because the total number of skyline points (in all 2D skylines) and the probability that a node may contain qualifying points in some category (and therefore should be expanded) are proportional to the size of the categorical domain.

Fig. 28. Nodes potentially intersecting the SSR.
Fig. 29. BBS node accesses versus cardinality of categorical axis cnum (N = 1M, d = 3).

6.6 K-Dominating Skyline

This section measures the performance of NN and BBS on K-dominating queries. Recall that each K-dominating query involves an enumerating query (i.e., a file scan), which retrieves the number of points dominated by each skyline point. The K skyline points with the largest counts are found, and the top-1 is immediately reported. Whenever an object is reported, a constrained skyline query is executed to find potential candidates in its exclusive dominance region (see Figure 13). For each such candidate, the number of dominated points is retrieved using a window query on the data R-tree. After this process, the object with the largest count is reported (i.e., the second best object), another constrained query is performed, and so on. Therefore, the total number of constrained queries is K − 1, and each such query may trigger multiple window queries. Figure 30 demonstrates the cost of BBS and NN as a function of K. The overhead of the enumerating and (multiple) window queries dominates the total cost, and consequently BBS and NN have very similar performance.
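The semantics of a K-dominating query (report the K points with the largest domination counts) can be sketched with the naive one-scan-per-point approach; this is an illustrative baseline on hypothetical 2D points, not the index-based method just described:

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def k_dominating(points, k):
    """Naive K-dominating query: for each point, count how many points it
    dominates (one full scan per point), then report the K points with the
    largest counts, in descending count order."""
    counts = [(sum(dominates(p, q) for q in points), p) for p in points]
    counts.sort(key=lambda cp: -cp[0])          # stable sort, highest count first
    return [p for _, p in counts[:k]]

pts = [(1, 9), (2, 8), (4, 4), (9, 1), (5, 6), (7, 7)]
# (4,4) dominates (5,6) and (7,7); (5,6) dominates (7,7); everything else dominates nothing.
print(k_dominating(pts, 2))  # [(4, 4), (5, 6)]
```

With N points this costs N full scans, which is exactly why the text likens the query to a semijoin and why the R-tree-based counting is preferable.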
Interestingly, the overhead for anticorrelated data is lower than for the independent distribution, because each skyline point dominates fewer points (and therefore the number of window queries is smaller). The high cost of K-dominating queries (compared to other skyline variations) is due to the complexity of the problem itself, not to the proposed algorithm. In particular, a K-dominating query is similar to a semijoin and could be processed accordingly. For instance, a nested-loops algorithm would (i) count, for each data point, the number of dominated points by scanning the entire database, (ii) sort all the points in descending order of the counts, and (iii) report the K points with the highest counts. Since in our case the database occupies more than 6K nodes, this algorithm would need to access 36 × 10^6 nodes (for any K), which is significantly higher than the costs in Figure 30 (especially for low K).

Fig. 30. NN and BBS node accesses versus number of objects to be reported for K-dominating queries (N = 1M, d = 2).
Fig. 31. BBS node accesses versus "thickness" of the skyline for K-skyband queries (N = 1M, d = 3).

6.7 K-Skyband

Next, we evaluate the performance of BBS on K-skyband queries (NN is inapplicable). Figure 31 shows the node accesses as a function of K, ranging from 0 (the conventional skyline) to 9. As expected, the performance degrades as K increases, because a node can be pruned only if it is dominated by more than K discovered skyline points, which becomes more difficult for higher K. Furthermore, the number of skyband points is significantly larger for anticorrelated data; for example, for K = 9, the number is 788 (6778) in the independent (anticorrelated) case, which explains the higher costs in Figure 31(b).

Fig. 32.
Approximation error versus number of minskew buckets (N = 1M, d = 3).

6.8 Approximate Skylines

This section evaluates the quality of the approximate skyline using a hypothetical point per bucket or per visited node (as shown in the examples of Figures 17 and 18, respectively). Given an estimated and an actual skyline, the approximation error corresponds to their SSR difference (see Section 5). To measure this error, we used a numerical approach: (i) we first generated a large number α of points (α = 10^4) uniformly distributed in the data space, and (ii) counted the number β of points that are dominated by exactly one of the two skylines. The error equals β/α, which approximates the volume of the SSR difference divided by the volume of the entire data space. We did not use a relative error (e.g., the volume of the SSR difference divided by the volume of the actual SSR) because such a definition is sensitive to the position of the actual skyline (i.e., a skyline near the origin of the axes would lead to a higher error even if the SSR difference remained constant). In the first experiment, we built a minskew [Acharya et al. 1999] histogram on the 3D datasets, varying the number of buckets from 100 to 1000, resulting in main-memory consumption in the range of 3K bytes (100 buckets) to 30K bytes (1000 buckets). Figure 32 illustrates the error as a function of the number of buckets. For the independent distribution, the error is very small (less than 0.01%) even with the smallest number of buckets, because the rough "shape" of the skyline for a uniform dataset can be accurately predicted using Equation (5.2). On the other hand, anticorrelated data are skewed and require a large number of buckets to achieve high accuracy. Figure 33 evaluates the quality of the approximation as a function of node accesses (without using a histogram).
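The Monte Carlo error measure described above (generate α uniform points, count the fraction β/α dominated by exactly one of the two skylines) can be sketched as follows; the two example skylines are hypothetical, and the data space is taken to be the unit square:

```python
import random

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def dominated_by_skyline(point, skyline):
    return any(dominates(s, point) for s in skyline)

def approximation_error(actual, estimated, alpha=10_000, dim=2, seed=0):
    """Monte Carlo estimate of the SSR difference between two skylines:
    draw alpha uniform points in the unit data space and return the fraction
    beta/alpha dominated by exactly one of the two skylines."""
    rng = random.Random(seed)
    beta = 0
    for _ in range(alpha):
        p = tuple(rng.random() for _ in range(dim))
        if dominated_by_skyline(p, actual) != dominated_by_skyline(p, estimated):
            beta += 1
    return beta / alpha

actual = [(0.1, 0.9), (0.4, 0.4), (0.9, 0.1)]
estimated = [(0.2, 0.8), (0.5, 0.5), (0.8, 0.2)]
print(approximation_error(actual, estimated))
```

The estimate is absolute (normalized by the volume of the whole data space), matching the paper's choice of a non-relative error.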
As discussed in Section 5, the first rough estimate of the skyline is produced when BBS visits the root entry, and the approximation is then refined as more nodes are accessed. For independent data, an extremely accurate approximation (with error 0.01%) can be obtained immediately after retrieving the root, a phenomenon similar to that in Figure 32(a). For anticorrelated data, the error is initially large (around 15% after the root visit), but decreases considerably with only a few additional node accesses. In particular, the error is less than 3% after visiting 30 nodes, and close to zero with around 100 accesses (i.e., the estimated skyline is almost identical to the actual one with about 25% of the node accesses required for the discovery of the actual skyline).

Fig. 33. BBS approximation error versus number of node accesses (N = 1M, d = 3).

7. CONCLUSION

The importance of skyline computation in database systems increases with the number of emerging applications requiring efficient processing of preference queries and with the amount of available data. Consider, for instance, a bank information system monitoring the attribute values of stock records and answering queries from multiple users. Assuming that the user scoring functions are monotonic, the top-1 result of every query is always a part of the skyline. Similarly, the top-K result is always a part of the K-skyband. Thus, the system could maintain only the skyline (or K-skyband) and avoid searching a potentially very large number of records. However, all existing database algorithms for skyline computation have several deficiencies, which severely limit their applicability. BNL and D&C are not progressive. Bitmap is applicable only to datasets with small attribute domains and cannot efficiently handle updates. Index cannot be used for skyline queries on a subset of the dimensions.
SFS, like all of the above algorithms, does not support user-defined preferences. Although NN was presented as a solution to these problems, it introduces new ones, namely poor performance and prohibitive space requirements for more than three dimensions. This article proposes BBS, a novel algorithm that overcomes all these shortcomings, since (i) it is efficient for both progressive and complete skyline computation, independently of the data characteristics (dimensionality, distribution), (ii) it can easily handle user preferences and process numerous alternative skyline queries (e.g., ranked, constrained, and approximate skylines), (iii) it does not require any precomputation (besides building the R-tree), (iv) it can be used on any subset of the dimensions, and (v) it has limited main-memory requirements.

Although in this implementation of BBS we used R-trees in order to perform a direct comparison with NN, the same concepts are applicable to any data-partitioning access method. In the future, we plan to investigate alternatives (e.g., X-trees [Berchtold et al. 1996] and A-trees [Sakurai et al. 2000]) for high-dimensional spaces, where R-trees are inefficient. Another possible solution for high dimensionality would include (i) converting the data points to subspaces with lower dimensionalities, (ii) computing the skyline in each subspace, and (iii) merging the partial skylines. Finally, a topic worth studying concerns skyline retrieval in other application domains. For instance, Balke et al. [2004] studied skyline computation for Web information systems, considering that the records are partitioned into several lists, each residing at a distributed server. The tuples in every list are sorted in ascending order of a scoring function, which is monotonic on all attributes.
Their processing method uses the main concept of the threshold algorithm [Fagin et al. 2001] to compute the entire skyline while reading the minimum number of records in each list. Another interesting direction concerns skylines in temporal databases [Salzberg and Tsotras 1999] that retain historical information. In this case, a query could ask for the most interesting objects at a past timestamp or interval.

REFERENCES

ACHARYA, S., POOSALA, V., AND RAMASWAMY, S. 1999. Selectivity estimation in spatial databases. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; Philadelphia, PA, June 1–3). 13–24.
BALKE, W., GÜNTZER, U., AND ZHENG, J. 2004. Efficient distributed skylining for Web information systems. In Proceedings of the International Conference on Extending Database Technology (EDBT; Heraklio, Greece, Mar. 14–18). 256–273.
BECKMANN, N., KRIEGEL, H., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; Atlantic City, NJ, May 23–25). 322–331.
BERCHTOLD, S., KEIM, D., AND KRIEGEL, H. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the Very Large Data Bases Conference (VLDB; Mumbai, India, Sep. 3–6). 28–39.
BÖHM, C. AND KRIEGEL, H. 2001. Determining the convex hull in large multidimensional databases. In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DaWaK; Munich, Germany, Sep. 5–7). 294–306.
BÖRZSÖNYI, S., KOSSMANN, D., AND STOCKER, K. 2001. The skyline operator. In Proceedings of the IEEE International Conference on Data Engineering (ICDE; Heidelberg, Germany, Apr. 2–6). 421–430.
BUCHTA, C. 1989. On the average number of maxima in a set of vectors. Inform. Process. Lett. 33, 2, 63–65.
CHANG, Y., BERGMAN, L., CASTELLI, V., LI, C., LO, M., AND SMITH, J. 2000. The Onion technique: Indexing for linear optimization queries.
In Proceedings of the ACM Conference on the Management of Data (SIGMOD; Dallas, TX, May 16–18). 391–402.
CHOMICKI, J., GODFREY, P., GRYZ, J., AND LIANG, D. 2003. Skyline with pre-sorting. In Proceedings of the IEEE International Conference on Data Engineering (ICDE; Bangalore, India, Mar. 5–8). 717–719.
FAGIN, R., LOTEM, A., AND NAOR, M. 2001. Optimal aggregation algorithms for middleware. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS; Santa Barbara, CA, May 21–23). 102–113.
FERHATOSMANOGLU, H., STANOI, I., AGRAWAL, D., AND EL ABBADI, A. 2001. Constrained nearest neighbor queries. In Proceedings of the International Symposium on Spatial and Temporal Databases (SSTD; Redondo Beach, CA, July 12–15). 257–278.
GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; Boston, MA, June 18–21). 47–57.
HELLERSTEIN, J., AVNUR, R., CHOU, A., HIDBER, C., OLSTON, C., RAMAN, V., ROTH, T., AND HAAS, P. 1999. Interactive data analysis: The Control project. IEEE Comput. 32, 8, 51–59.
HENRICH, A. 1994. A distance scan algorithm for spatial access structures. In Proceedings of the ACM Workshop on Geographic Information Systems (ACM GIS; Gaithersburg, MD, Dec.). 136–143.
HJALTASON, G. AND SAMET, H. 1999. Distance browsing in spatial databases. ACM Trans. Database Syst. 24, 2, 265–318.
HRISTIDIS, V., KOUDAS, N., AND PAPAKONSTANTINOU, Y. 2001. PREFER: A system for the efficient execution of multi-parametric ranked queries. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; May 21–24). 259–270.
KOSSMANN, D., RAMSAK, F., AND ROST, S. 2002. Shooting stars in the sky: An online algorithm for skyline queries. In Proceedings of the Very Large Data Bases Conference (VLDB; Hong Kong, China, Aug. 20–23). 275–286.
KUNG, H., LUCCIO, F., AND PREPARATA, F. 1975. On finding the maxima of a set of vectors. J. Assoc. Comput. Mach. 22, 4, 469–476.
MATOUSEK, J. 1991. Computing dominances in E^n. Inform. Process. Lett. 38, 5, 277–278.
MCLAIN, D. 1974. Drawing contours from arbitrary data points. Comput. J. 17, 4, 318–324.
MURALIKRISHNA, M. AND DEWITT, D. 1988. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; Chicago, IL, June 1–3). 28–36.
NATSEV, A., CHANG, Y., SMITH, J., LI, C., AND VITTER, J. 2001. Supporting incremental join queries on ranked inputs. In Proceedings of the Very Large Data Bases Conference (VLDB; Rome, Italy, Sep. 11–14). 281–290.
PAPADIAS, D., TAO, Y., FU, G., AND SEEGER, B. 2003. An optimal and progressive algorithm for skyline queries. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; San Diego, CA, June 9–12). 443–454.
PAPADIAS, D., KALNIS, P., ZHANG, J., AND TAO, Y. 2001. Efficient OLAP operations in spatial data warehouses. In Proceedings of the International Symposium on Spatial and Temporal Databases (SSTD; Redondo Beach, CA, July 12–15). 443–459.
PREPARATA, F. AND SHAMOS, M. 1985. Computational Geometry—An Introduction. Springer, Berlin, Germany.
ROUSSOPOULOS, N., KELLEY, S., AND VINCENT, F. 1995. Nearest neighbor queries. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; San Jose, CA, May 22–25). 71–79.
SAKURAI, Y., YOSHIKAWA, M., UEMURA, S., AND KOJIMA, H. 2000. The A-tree: An index structure for high-dimensional spaces using relative approximation. In Proceedings of the Very Large Data Bases Conference (VLDB; Cairo, Egypt, Sep. 10–14). 516–526.
SALZBERG, B. AND TSOTRAS, V. 1999. A comparison of access methods for temporal data. ACM Comput. Surv. 31, 2, 158–221.
SELLIS, T., ROUSSOPOULOS, N., AND FALOUTSOS, C. 1987. The R+-tree: A dynamic index for multi-dimensional objects.
In Proceedings of the Very Large Data Bases Conference (VLDB; Brighton, England, Sep. 1–4). 507–518.
STEUER, R. 1986. Multiple Criteria Optimization. Wiley, New York, NY.
TAN, K., ENG, P., AND OOI, B. 2001. Efficient progressive skyline computation. In Proceedings of the Very Large Data Bases Conference (VLDB; Rome, Italy, Sep. 11–14). 301–310.
THEODORIDIS, Y., STEFANAKIS, E., AND SELLIS, T. 2000. Efficient cost models for spatial queries using R-trees. IEEE Trans. Knowl. Data Eng. 12, 1, 19–32.

Received October 2003; revised April 2004; accepted June 2004

Advanced SQL Modeling in RDBMS

ANDREW WITKOWSKI, SRIKANTH BELLAMKONDA, TOLGA BOZKAYA, NATHAN FOLKERT, ABHINAV GUPTA, JOHN HAYDU, LEI SHENG, and SANKAR SUBRAMANIAN
Oracle Corporation

Commercial relational database systems lack support for complex business modeling. ANSI SQL cannot treat relations as multidimensional arrays and define multiple, interrelated formulas over them, operations which are needed for business modeling. Relational OLAP (ROLAP) applications have to perform such tasks using joins, SQL window functions, complex CASE expressions, and the GROUP BY operator simulating the pivot operation. The designated place in SQL for calculations is the SELECT clause, which is extremely limiting and forces the user to generate queries with nested views, subqueries, and complex joins. Furthermore, SQL query optimizers are preoccupied with determining efficient join orders and choosing optimal access methods, and largely disregard optimization of multiple, interrelated formulas. Research into execution methods has thus far concentrated on efficient computation of data cubes and cube compression rather than on access structures for random, interrow calculations.
This has created a gap that has been filled by spreadsheets and specialized MOLAP engines, which are good at specification of formulas for modeling but lack the formalism of the relational model, are difficult to coordinate across large user groups, exhibit scalability problems, and require replication of data between the tool and the RDBMS. This article presents an SQL extension called SQL Spreadsheet, which provides array calculations over relations for complex modeling. We present optimizations, access structures, and execution models for processing them efficiently. Special attention is paid to compile-time optimization of expensive operations like aggregation. Furthermore, ANSI SQL does not provide a good separation between data and computation and hence cannot support parameterization of SQL Spreadsheet models. We propose two parameterization methods for SQL. One parameterizes an ANSI SQL view using subqueries and scalars, which allows passing data to an SQL Spreadsheet. The other method parameterizes the SQL Spreadsheet formulas themselves, which supports building stand-alone SQL Spreadsheet libraries. These models are then subject to the SQL Spreadsheet optimizations at model invocation time.

Categories and Subject Descriptors: H.2.3 [Database Management]: Languages—Data manipulation languages (DML); query languages; H.2.4 [Database Management]: Systems—Query processing
General Terms: Design, Languages
Additional Key Words and Phrases: Excel, analytic computations, OLAP, spreadsheet

Authors' addresses: Oracle Corporation, 500 Oracle Parkway, Redwood Shores, CA 94065; email: {andrew.witkowski,srikanth.bellamkonda,tolga.bozkaya,nathan.folkert,abhinav.gupta,john.haydu,lei.sheng,sankar.subramanian}@oracle.com.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2005 ACM 0362-5915/05/0300-0083 $5.00
ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 83–121.
84 • A. Witkowski et al.

1. INTRODUCTION

One of the most successful analytical tools for business data is the spreadsheet. A user can enter business data, define formulas over it using two-dimensional array abstractions, construct simultaneous equations with recursive models, pivot data and compute aggregates for selected cells, apply a rich set of business functions, etc. Spreadsheets also provide flexible user interfaces like graphs and reports. Unfortunately, the analytical usefulness of the RDBMS has not measured up to that of spreadsheets [Blattner 1999; Simon 2000] or specialized MOLAP tools like Microsoft Analytical Services [Peterson and Pinkelman 2000; Thomsen et al. 1999], Oracle Analytic Workspaces [OLAP Application Developer's Guide 2004], and others [Balmin et al. 2000; Howson 2002]. It is cumbersome and in most cases inefficient to perform array calculations in SQL: a fundamental problem resulting from the lack of language constructs to treat relations as arrays and the lack of efficient random access methods for accessing them. To simulate array computations on a relation, SQL users must resort to multiple self-joins to align different rows, use ANSI SQL window functions to reach from one row into another, or use the ANSI SQL GROUP BY operator to pivot a table and simulate interrow with intercolumn computations. None of these operations is natural or efficient for array computations with multiple formulas found in spreadsheets. Spreadsheets, for example Microsoft Excel [Simon 2000], provide an excellent user interface but have their own problems.
They offer two-dimensional "row-column" addressing, which makes it hard to build a model where formulas reference data via symbolic references. In addition, they do not scale well when the data set is large. For example, a single sheet in a spreadsheet typically supports up to 64K rows with about 200 columns, and handling terabytes of sales data is practically impossible even when using multiple sheets. Furthermore, spreadsheets do not support the parallel processing necessary to process terabytes of data in small windows of time. In collaborative analysis with multiple spreadsheets, it is nearly impossible to get a complete picture of the business by querying multiple, inconsistent spreadsheets, each using its own layout and placement of data. There is no standard metadata or unified abstraction interrelating them akin to RDBMS dictionary tables and RDBMS relations.

This article proposes spreadsheet-like computations in the RDBMS through extensions to SQL, leaving the user interface aspects to be handled by OLAP tools. Here is a glimpse of our proposal:

— Relations can be viewed as n-dimensional arrays, and formulas can be defined over the cells of these arrays. Cell addressing is symbolic, using dimensional columns.
— The formulas can be automatically ordered based on the dependencies between cells.
— Recursive references and convergence conditions are supported, providing for a recursive execution model.
— Densification (filling gaps in sparse data) can be easily performed.
— Formulas are encapsulated in a new SQL query clause. Their result is a relation and can be further used in joins, subqueries, etc.
— The new clause supports logical partitioning of the data, providing a natural mechanism for parallel execution.
— Formulas support INSERT and UPDATE semantics as well as correlation between their left and right sides.
This allows us to simulate the effect of multiple joins and UNIONs using a single access structure.

Furthermore, our article addresses the lack of parameterization models in ANSI SQL. The issue is critical for model building, as this ANSI SQL shortcoming prevents us from constructing parameterized libraries of SQL Spreadsheets. We propose two new parameterization methods for SQL. One parameterizes ANSI SQL views with subqueries and scalars, allowing passing of data to inner query blocks and hence to an SQL Spreadsheet. The second is a parameterization of the SQL Spreadsheet formulas. We can declare a named set of formulas, called an SQL Spreadsheet Procedure, operating on an N-dimensional array, that can be invoked from an SQL Spreadsheet. The array is passed by reference to the SQL Spreadsheet Procedure. We support merging of formulas from an SQL Spreadsheet Procedure into the main body of an SQL Spreadsheet. This allows for global formula optimizations, like removal of unused formulas. SQL Spreadsheet Procedures are useful for building standalone SQL Spreadsheet libraries.

This article is organized as follows. Section 2 provides the SQL language extensions for spreadsheets. Section 3 provides motivating examples. Section 4 presents an overview of the evaluation of spreadsheets in SQL. Section 5 describes the analysis of the spreadsheet clause and query optimizations with spreadsheets. Section 6 discusses our execution models. Section 7 describes our parameterization models. Section 8 reports results from performance experiments on spreadsheet queries, and Section 9 contains our conclusions. The electronic appendix explains parallel execution of SQL Spreadsheets and presents our experimental results; it also discusses our future research in this area.
2. SQL EXTENSIONS FOR SPREADSHEETS

2.1 Notation

In the following examples, we use a fact table f(t, r, p, s, c) representing a data warehouse of consumer-electronics products with three dimensions, time (t), region (r), and product (p), and two measures, sales (s) and cost (c).

2.2 Spreadsheet Clause

OLAP applications divide relational attributes into dimensions and measures. To model that, we introduce a new SQL query clause, called the spreadsheet clause, which identifies, within the query result, PARTITION, DIMENSION, and MEASURES columns. The PARTITION (PBY) columns divide the relation into disjoint subsets. The DIMENSION (DBY) columns identify a unique row within each partition; such a row is called a cell. The MEASURES (MEA) columns identify expressions computed by the spreadsheet and are referenced by DBY columns. Following this, there is a sequence of formulas, each describing a computation on the cells. Thus the structure of the spreadsheet clause is

<existing parts of a query block>
SPREADSHEET PBY (cols) DBY (cols) MEA (cols)
<processing options>
(<formula>, <formula>, ..., <formula>)

It is evaluated after joins, aggregations, window functions, and final projection, but before the ORDER BY clause. Cells are referenced using an array notation in which a measure is followed by square brackets holding dimension values. Thus s['vcr', 2002] is a reference to the cell containing the sales of the 'vcr' product in 2002. If the dimensions are uniquely qualified, the cell reference is called a single-cell reference, for example, s[p='dvd', t=2002]. If the dimensions are qualified by general predicates, the cell reference refers to a set of cells and is called a range reference, for example, s[p='dvd', t<2002].
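A rough mental model of single-cell and range references treats one partition's cells as a dictionary keyed by the DBY columns. This is an illustrative analogy with hypothetical numbers, not the actual implementation:

```python
# One partition (one region); the DBY columns (p, t) index measure s,
# so a "cell" is simply a dict entry keyed by (p, t).
s = {('dvd', 2001): 10.0, ('vcr', 2000): 5.0, ('vcr', 2001): 7.0,
     ('tv', 1995): 3.0, ('tv', 1999): 5.0}

# Single-cell reference, e.g. s[p='dvd', t=2001]: the dimensions pick one cell.
print(s[('dvd', 2001)])  # 10.0

# Range reference, e.g. s[p='tv', t<2002]: a predicate selects a set of cells,
# over which an aggregate like avg(s)[p='tv', t<2002] can be computed.
tv_before_2002 = [v for (p, t), v in s.items() if p == 'tv' and t < 2002]
print(sum(tv_before_2002) / len(tv_before_2002))  # (3.0 + 5.0) / 2 = 4.0
```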
Each formula represents an assignment and contains a left side that designates the target cells and a right side that contains expressions involving cells or ranges of cells within the partition. For example:

SELECT r, p, t, s FROM f
SPREADSHEET PBY(r) DBY (p, t) MEA (s)
( s[p='dvd',t=2002] = s[p='dvd',t=2001]*1.6,
  s[p='vcr',t=2002] = s[p='vcr',t=2000]+s[p='vcr',t=2001],
  s[p='tv', t=2002] = avg(s)[p='tv',1992<t<2002] )

This query partitions table f by region r and defines that, within each region, sales of 'dvd' in 2002 will be 60% higher than 'dvd' sales in 2001, sales of 'vcr' in 2002 will be the sum of 'vcr' sales in 2000 and 2001, and sales of 'tv' in 2002 will be the average of 'tv' sales in the years between 1992 and 2002. As a shorthand, a positional notation exists, for example, s['dvd',2002] instead of s[p='dvd',t=2002].

The left side of a formula defines calculations that can span a range of cells. A new function cv() (an abbreviation for "current value") carries the current value of a dimension from the left side to the right side, thus effectively serving as a join between the right and left sides. The * operator denotes all values in the dimension. The following spreadsheet clause states that sales of every product in the 'west' region for every year after 2001 will be 20% higher than sales of the same product in the preceding year. Observe that the region and product dimensions on the right side reference function cv() to carry dimension values from the left to the right side.

SPREADSHEET DBY (r, p, t) MEA (s)
( s['west',*,t>2001] = 1.2*s[cv(r),cv(p),t=cv(t)-1] )

Formulas may specify a range of cells to be updated. A formula referring to multiple cells on its left side is called an existential formula. For existential formulas, the result may be order dependent.
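The cv()-style rule and its order dependence can be illustrated by hand-evaluating it over a dict of cells; the numbers are hypothetical, and evaluating in ascending t mimics what an ORDER BY t ASC annotation would request:

```python
# One partition (the 'west' region); cells keyed by (p, t). The rule
#   s['west', *, t>2001] = 1.2 * s[cv(r), cv(p), t=cv(t)-1]
# assigns to every existing cell with t > 2001; cv() carries the current
# left-side dimension values over to the right side.
s = {('dvd', 2001): 10.0, ('dvd', 2002): 0.0, ('dvd', 2003): 0.0,
     ('vcr', 2001): 5.0, ('vcr', 2002): 0.0}

# Ascending t order lets each year build on the value just computed for the
# preceding year; a descending order would read stale 0.0 values instead.
for (p, t) in sorted(s, key=lambda cell: cell[1]):
    if t > 2001:
        s[(p, t)] = 1.2 * s[(p, t - 1)]

print(s[('dvd', 2003)])  # 1.2 * 1.2 * 10.0, i.e. about 14.4
```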
For example, the intention of the following query is to compute the sales of ‘vcr’ for all years before 2002 as an average of the sales of the 2 preceding years:

  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  ( s[‘vcr’,t<2002] = avg(s)[‘vcr’,cv(t)-2<=t<cv(t)] )

But processing rows in ascending or descending order with regard to dimension t produces different results, as we are both updating and referencing measure s. To avoid the ambiguity, the user can specify an order in which the formula should be evaluated:

  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  ( s[‘vcr’, t<2002] ORDER BY t ASC = avg(s)[cv(p), cv(t)-2<=t<cv(t)] )

An innovative feature of SQL spreadsheet is the creation of new rows in the result set. Any formula with a single-cell reference on the left side can operate either in UPDATE or UPSERT (the default) mode. The latter creates new cells within a partition if they do not exist; otherwise it updates them. UPDATE ignores nonexistent cells. For example,

  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  ( UPSERT s[‘tv’, 2000] = s[‘black-tv’,2000] + s[‘white-tv’,2000] )

will create, for each region, a row with p=‘tv’ and t=2000 if this cell is not present in the input stream. The semantics of the UPSERT operation is obvious when the left side qualifies a single cell, as this cell is then either updated or inserted. An interesting issue is how to interpret UPSERT for an existential formula. For example,

  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  ( UPSERT s[‘tv’,*] = s[‘black-tv’,cv()]+s[‘white-tv’,cv()] )

creates a new member of the product dimension, the ‘tv’ member, for each of the values in the time dimension. In OLAP this is referred to as a calculated member. In SQL Spreadsheet, the UPSERT operation where one dimension d1 is qualified by a constant while the remaining ones d2, ..., dn are qualified by Boolean conditions c2, ..., cn is defined to be a sequence of two operations: UPSERT and UPDATE. We first determine the distinct values in the remaining dimensions:

  SELECT DISTINCT d2, ..., dn FROM input_set WHERE c2, ..., cn

and perform upserts of these distinct values with the constant on dimension d1. Then we execute the formula in the UPDATE mode, updating the upserted values. In the above example, we (logically) perform these two operations:

  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  ( UPSERT s[‘tv’, FOR t IN (SELECT DISTINCT t FROM input_set)] = NULL,
    UPDATE s[‘tv’, *] = s[‘black-tv’,cv()] + s[‘white-tv’,cv()] )

This easily generalizes to cases in which more than one dimension is qualified with constants while others are qualified by Boolean conditions.

2.3 Reference Spreadsheets

OLAP applications frequently deal with objects of different dimensionality in a single query. For example, the sales table may have region (r), product (p), and time (t) dimensions, while the budget allocation table has only a region (r) dimension. To account for that, our query block can have, in addition to the main spreadsheet, multiple read-only reference spreadsheets, which are n-dimensional arrays defined over other query blocks. Reference spreadsheets, like the main spreadsheet, have DBY and MEA clauses, indicating their dimensions and measures, respectively. For example, assume a budget table budget(r, p) containing predictions p for a sales increase for each region r. The following query predicts sales in 2002 in regions ‘east’ and ‘west’. For the ‘west’ region, the prediction is based on the prediction factor p from the budget reference table.
  SELECT r, t, s FROM f GROUP BY r, t
  SPREADSHEET
    REFERENCE budget ON (SELECT r, p FROM budget) DBY(r) MEA(p)
    DBY (r, t) MEA (sum(s) s)
  ( s[‘west’,2002] = p[‘west’]*s[‘west’,2001],
    s[‘east’,2002] = s[‘east’,2001]+s[‘east’,2000] )

The purpose of a reference spreadsheet is similar to that of a relational join, but it allows us to perform, within a spreadsheet clause, multiple joins using the same access structures (e.g., a hash table; see Section 6.1). Thus self-joins within a spreadsheet can be cheaper than performing them outside.

2.4 Ordering the Evaluation of Formulas

By default, formulas are evaluated based on the order of their dependencies, and we refer to this as AUTOMATIC ORDER. For example, in

  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  ( s[‘dvd’,2002] = s[‘dvd’,2000] + s[‘dvd’,2001],
    s[‘dvd’,2001] = 1000 )

the first formula depends on the second, and consequently we evaluate the second one first. However, there are scenarios in which lexicographical ordering of evaluation (i.e., the order in which the formulas are specified) is desired. For that, we provide an explicit processing option, SEQUENTIAL ORDER, as in the following:

  SPREADSHEET DBY(r,p,t) MEA(s) SEQUENTIAL ORDER
  (....<formulas>....)

2.5 ANSI Window Functions in SQL Spreadsheet

Many of the ANSI window functions can be emulated using aggregates on the right side of the formulas or using an ORDER BY clause on their left side. However, for user convenience, we also allow the explicit use of window functions on the right side of formulas. The window functions that are specified on the right side of a formula are computed over the range of cells defined by the left side.
For example, the following formula computes the 3-year moving sum of the sales of each product for all times within a region (the window spans 3 years because the window function specifies RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING):

  SPREADSHEET PBY(r) DBY (p, t) MEA (s, 0 mov_sum)
  ( mov_sum[*, *] = sum(s) OVER (PARTITION BY p ORDER BY t
                                 RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) )

2.6 Cycles and Recursive Models

As in existing spreadsheets, our computations may contain cycles, as in the formula s[1] = s[1]/2. Consequently, we have processing options to specify the number of iterations or the convergence criteria for cycles and recursion. The ITERATE (n) option requests iteration of the formulas n times. The optional UNTIL (<condition>) clause stops the iteration when the <condition> has been met, up to a maximum of n iterations as specified by ITERATE (n). The <condition> can reference cells before and after an iteration, facilitating the definition of convergence conditions. A helper function PREVIOUS(<cell>) returns the value of <cell> at the start of each iteration. For example,

  SPREADSHEET DBY (x) MEA (s)
    ITERATE (10) UNTIL (PREVIOUS(s[1])-s[1] <= 1)
  ( s[1] = s[1]/2 )

will execute the formula s[1] = s[1]/2 until the convergence condition is met, up to a maximum of 10 iterations (in this case, if initially s[1] is between 1024 and 2047, evaluation of the formulas will stop after 10 iterations).

2.7 Spreadsheet Processing Options and Miscellaneous Functions

There are other processing options for the SQL spreadsheet in addition to the ones for ordering of formulas and termination of cycles. For example, we can specify UPDATE or UPSERT as the default mode for the entire spreadsheet.
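The ITERATE/UNTIL semantics of Section 2.6 can be emulated directly. The following Python sketch (an emulation under the stated semantics, not the engine's implementation) reproduces the halving example: the loop stops either when the convergence test on the previous and current values succeeds or when the iteration cap is reached:

```python
def iterate_until(value, max_iter=10, tol=1.0):
    """Emulate ITERATE(max_iter) UNTIL (PREVIOUS(s[1]) - s[1] <= tol)
    for the single formula s[1] = s[1]/2."""
    iterations = 0
    for _ in range(max_iter):
        previous = value           # PREVIOUS(s[1]): value at iteration start
        value = value / 2          # the formula s[1] = s[1]/2
        iterations += 1
        if previous - value <= tol:    # UNTIL condition, checked per iteration
            break
    return value, iterations

# With an initial value between 1024 and 2047, the UNTIL test never fires
# within 10 iterations, so the ITERATE cap is what stops the loop.
final, n = iterate_until(1500.0)
```

For a smaller starting value, the UNTIL condition stops the loop early; for the values in (1024, 2047) mentioned in the text, all 10 iterations run.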
The IGNORE NAV option (where NAV refers to nonavailable values) allows us to treat NULL values in numeric operations as 0, which is convenient for cells newly inserted with the UPSERT option. The new predicate <cell> IS PRESENT indicates whether the row addressed by <cell> existed before the execution of the spreadsheet clause and is convenient for identifying upserted values.

2.8 Semantics of Updates in SQL Spreadsheets

We note two important update properties of SQL Spreadsheet. First, SQL Spreadsheet is part of a query block and hence does not cause any modification to the stored relations. Users can explicitly use an UPDATE or MERGE statement to propagate changes made by the formulas to the target relations. This involves an explicit join of the query with the spreadsheet to the target relation. For example, to propagate the calculated member ‘tv’ from query {S1 on page 6} to the fact relation f, we could use an ANSI SQL MERGE statement (note that UPDATE will not work, as it does not support insertion of nonjoining rows into the target table):

  MERGE INTO f USING
  ( SELECT r, p, t, s FROM f
    SPREADSHEET PBY(r) DBY (p, t) MEA (s)
    ( UPSERT s[‘tv’, *] = s[‘black-tv’,cv()] + s[‘white-tv’,cv()] )
  ) v
  ON f.r = v.r AND f.p = v.p AND f.t = v.t
  WHEN MATCHED THEN UPDATE SET f.s = v.s
  WHEN NOT MATCHED THEN INSERT VALUES (v.r, v.p, v.t, v.s)

Second, SQL Spreadsheet formulas compute measures which can later be used by other formulas; that is, formulas can operate on data produced by other formulas, and hence the order of their execution is important. This is in contrast to the semantics of the ANSI SQL UPDATE ... WHERE ... statement. In ANSI SQL, the WHERE condition is always applied to a (logical) copy of the target relation rather than to its updated values. This allows for cleaner but also less powerful semantics. In our case, changes made by prior formulas
are visible to the following ones, to simulate classical spreadsheet and MOLAP tools. This makes the ordering of formulas important and imposes restrictions on their optimizations, like reordering or pruning of formulas. We elaborate on this in Section 5.1.

3. MOTIVATING EXAMPLE OF SPREADSHEET USAGE

Here is an example demonstrating the expressive power of SQL Spreadsheet and its potential for efficient computation as compared to the alternative available in ANSI SQL. An analyst predicts sales for the year 2002. Based on business trends, the sales of ‘tv’ in 2002 is the sales in 2001 scaled by the average increase between 1992 and 2001. The sales of ‘vcr’ is the sum of sales in 2000 and 2001. The sales of ‘dvd’ is the average of the three previous years. Finally, the analyst wants to introduce, in every region, a new dimension member ‘video’ for the year 2002, defined as the sales of ‘tv’ plus the sales of ‘vcr’. Assuming that rows for ‘tv’, ‘dvd’, and ‘vcr’ for the year 2002 already exist, we express this as

  SELECT r, p, t, s FROM f
  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  ( F1: UPDATE s[‘tv’,2002] = s[‘tv’,2001] +
          slope(s,t)[‘tv’,1992<=t<=2001]*s[‘tv’,2001],
    F2: UPDATE s[‘vcr’, 2002] = s[‘vcr’,2000]+s[‘vcr’,2001],
    F3: UPDATE s[‘dvd’,2002] = (s[‘dvd’,1999]+s[‘dvd’,2000]+s[‘dvd’,2001])/3,
    F4: UPSERT s[‘video’, 2002] = s[‘tv’,2002]+s[‘vcr’,2002] )

To express the above query in ANSI SQL, formula F1 would require an aggregate subquery plus a join to the fact table f; formula F2, a double self-join of the fact table; formula F3, a triple self-join of the fact table; and formula F4, a union operation. Such a query would not only be difficult to generate but would also execute inefficiently. For the equivalent query using the SQL Spreadsheet clause as shown above, we need to scan the data only once to generate a point-addressable access structure, like a hash table or an index, for all formulas.
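The benefit of a single point-addressable structure can be illustrated with a small sketch (hypothetical data; a Python dict stands in for the hash table of Section 6.1): one pass over the rows builds the structure, after which each formula evaluates via O(1) lookups rather than a self-join.

```python
# One pass over the (hypothetical) rows of one region builds a
# point-addressable map standing in for the hash table of Section 6.1.
rows = [('tv', 2001, 50.0), ('tv', 2002, 55.0),
        ('vcr', 2000, 20.0), ('vcr', 2001, 22.0), ('vcr', 2002, 0.0),
        ('dvd', 1999, 9.0), ('dvd', 2000, 10.0),
        ('dvd', 2001, 12.0), ('dvd', 2002, 0.0)]
s = {(p, t): v for p, t, v in rows}

# Each formula now evaluates via O(1) lookups instead of a self-join.
# (F1 is omitted here: its slope() aggregate needs more machinery;
# the pre-existing 'tv' 2002 value is used in F4 instead.)
s[('vcr', 2002)] = s[('vcr', 2000)] + s[('vcr', 2001)]          # F2
s[('dvd', 2002)] = (s[('dvd', 1999)] + s[('dvd', 2000)]
                    + s[('dvd', 2001)]) / 3                      # F3
s[('video', 2002)] = s[('tv', 2002)] + s[('vcr', 2002)]          # F4 (upsert)
```

The same dict serves every formula, which is the point the text makes about replacing hundreds of joins with one access structure.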
The slope function as expressed above requires a scan of the data to find the rows satisfying the predicate 1992<=t<=2001. But if we can deduce from database constraints that t is an integer, then formula F1 is first transformed into

  F1: UPDATE s[‘tv’,2002] = s[‘tv’,2001] +
        slope(s,t)[‘tv’,t in (1992,...,2001)]*s[‘tv’,2001]

This way, the access structure can be used for random, multiple accesses along the time dimension, as opposed to a scan to find the rows satisfying the predicate. Formulas F2, F3, and F4 can use the structure directly. (The aggregate function slope() is a recent addition to ANSI SQL [Zemke et al. 1999] and denotes the linear regression slope; the ANSI name of this function is regr_slope(), but we use the shortened name slope() in this document.) The structure is then used multiple times, giving a performance advantage over the multiple joins required by the equivalent ANSI SQL alternative. In real applications, we expect hundreds of formulas, and consequently a single point-access structure in place of hundreds of joins provides a significant performance advantage.

Fig. 1. Cycles in the spreadsheet graph.

As another example, consider a common financial calculation: determining the maximum allowable mortgage payment for an individual. Assume that the person's income is from two sources, salary and capital gains. Salary minus mortgage interest is taxed at 38%, and capital gains are taxed at 28%. Net income is salary plus capital gains minus interest expense minus tax. The maximum allowable mortgage interest expense (tax deductible) is 30% of net income. Given the person's salary and capital gains and the rules above, we want to find the individual's net income, total taxes, and maximum allowable interest expense. To calculate this, we must solve three simultaneous equations.
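These simultaneous equations can be solved by exactly the fixed-point iteration that the spreadsheet's ITERATE/UNTIL options perform. The following Python sketch is an emulation of the ledger query shown next (IGNORE NAV is rendered as starting the unknown balances at 0; the salary and capital-gains figures are those of Table I):

```python
def solve_ledger(salary, capital_gains, max_iter=100, tol=0.01):
    """Fixed-point iteration over the three mortgage equations."""
    b = {'salary': salary, 'capital gains': capital_gains,
         'net': 0.0, 'tax': 0.0, 'interest': 0.0}   # IGNORE NAV: NULLs as 0
    for i in range(1, max_iter + 1):
        prev_net = b['net']                          # PREVIOUS(b['net'])
        # F1: allowable interest is 30% of net income
        b['interest'] = b['net'] * 0.30
        # F2: net = salary + capital gains - interest - tax
        b['net'] = (b['salary'] + b['capital gains']
                    - b['interest'] - b['tax'])
        # F3: salary less interest taxed at 38%, capital gains at 28%
        b['tax'] = (b['salary'] - b['interest']) * 0.38 \
                   + b['capital gains'] * 0.28
        if abs(b['net'] - prev_net) < tol:           # UNTIL condition
            return b, i
    return b, max_iter

balances, iterations = solve_ledger(100_000.0, 15_000.0)
```

The iteration converges to the Table I results (net income 61,382.80, tax 35,202.36, interest roughly 18,414.83) well within the 100-iteration cap.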
Assume a table ledger with two columns, account and balance, where each row in the table holds the balance for one account. Using this table, the calculations described above can be performed in a single query:

  SELECT account, b FROM ledger
  SPREADSHEET DBY (account) MEA (balance b)
    RULES IGNORE NAV ITERATE (100)
    UNTIL (ABS(b[‘net’] - PREVIOUS(b[‘net’])) < 0.01)
  ( F1: b[‘interest’] = b[‘net’] * 0.30,
    F2: b[‘net’] = b[‘salary’] + b[‘capital gains’] - b[‘interest’] - b[‘tax’],
    F3: b[‘tax’] = (b[‘salary’]-b[‘interest’]) * 0.38
                   + b[‘capital gains’] * 0.28 )

Note the two cycles in the above formulas (see Figure 1): formula F1 depends on F2, and formula F2 depends on F1; formula F2 also depends on F3, and F3 depends on F1. Since there are recursive references, the query is written using the ITERATE option with a condition to terminate the iteration. In this case, the query specifies that processing will terminate after iterating over the formulas 100 times or when the difference in the value of net income between the previous iteration and the current iteration is less than 0.01. Although it may be possible to express this complex calculation using a single ANSI SQL query, it is unlikely to perform well.

Assume that the initial content of the ledger contains values for salary and capital gains (see the Input Balance column in Table I).

Table I. Input Ledger

  Account         Input Balance    Result Balance
  Salary             100,000.00        100,000.00
  Capital gains       15,000.00         15,000.00
  Net                      0            61,382.80
  Tax                      0            35,202.36
  Interest                 0            18,414.83

After 26 iterations, we satisfy the convergence condition and find the values for taxes, interest, and net income; the result is shown in the Result Balance column of Table I.

4. SQL SPREADSHEET EVALUATION OVERVIEW

We divide SQL Spreadsheet evaluation into three broad stages.
The first stage is spreadsheet analysis and optimization (see Section 5), which analyzes the formulas to determine whether they are acyclic, which is important for choosing the execution method. This stage also performs a number of formula optimizations, like pruning of formulas, pushing predicates from the outer query blocks into the spreadsheet block, etc. The analysis is done using a graph representing dependencies between formulas, bounding-rectangle theory defining the scope of the outside filters, and known techniques for predicate transformations like predicate push and pull. The result of the analysis is a SQL Spreadsheet with transformed, more optimal formulas and a flag indicating whether the formulas are cyclic or acyclic.

The second stage involves building a random access structure on the data coming into the spreadsheet. This structure is currently a hash table (see Section 6.1) but could be another structure, like a B-tree or the prefix trees used for cube compression [Lakshmanan et al. 2003; Sismanis et al. 2002], that supports random cell access, partitioning of data, and data scans.

The third stage (see Section 6.2) evaluates the formulas produced by the first stage. We support three evaluation algorithms: one for spreadsheets with automatic order and no cycles, one for spreadsheets with automatic order which supports runtime cycle detection, and one for sequential spreadsheets. The algorithms use the hash structure built in the second stage to execute formulas in groups (called levels) so that the scans required for aggregate evaluation are minimized.
5. SPREADSHEET ANALYSIS AND OPTIMIZATION

The spreadsheet analysis determines the order of evaluation of formulas, prunes formulas whose results are fully filtered out by outer queries, restricts the formulas whose results are partially filtered, migrates predicates from outer queries into the inner WHERE clause to limit the data processed by the spreadsheet, and generates a filter condition to identify the cells that are required throughout the evaluation of the spreadsheet formulas.

The analysis also determines one of two types of execution methods: one for acyclic and one for (potentially) cyclic formulas. Because of complex predicates in formulas, the analysis cannot always ascertain the acyclicity of the formulas in the spreadsheet. Hence, we sometimes use an expensive cyclic execution method for an acyclic spreadsheet.

5.1 Formula Dependencies and Execution Order

The order of evaluation of formulas is determined from their dependency graph. Formula F1 depends on F2 (written F2 → F1) if a cell evaluated by F2 is used by F1. For example, in

  F1: s[‘video’,2000] = s[‘tv’, 2000]+s[‘vcr’, 2000]
  F2: s[‘vcr’, 2000] = s[‘vcr’,1998]+s[‘vcr’, 1999]

F2 → F1, as F1 requires the cell s[‘vcr’,2000] computed by F2. To form the → relation, for each formula F, we determine the cells that are referenced on its right side, R(F), and the cells that are modified on its left side, L(F). Obviously, F2 → F1 if and only if R(F1) intersects L(F2). In the presence of complex cell references, like s[‘tv’, t2+t3+t4<t5], it is hard to determine the intersection of predicates. In this case, we assume that the formula references all cells. This may result in an overestimation of the → relation, leading to spurious cycles in the dependency graph.

The → relation results in a graph with formulas as nodes and their dependency relationships as directed edges. The graph is then analyzed for (partial) ordering.
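Once L(F) and R(F) are approximated, the → relation can be computed mechanically. The following Python sketch uses explicit cell sets rather than the symbolic predicate reasoning the analysis actually performs, and reproduces the F1/F2 example above:

```python
# Each formula is described by the cells its left side writes, L(F),
# and the cells its right side reads, R(F).
formulas = {
    'F1': {'L': {('video', 2000)}, 'R': {('tv', 2000), ('vcr', 2000)}},
    'F2': {'L': {('vcr', 2000)},   'R': {('vcr', 1998), ('vcr', 1999)}},
}

def dependency_edges(formulas):
    """Edge (G, F) means G -> F, i.e., F depends on G:
    R(F) intersects L(G)."""
    return {(g, f)
            for f, ff in formulas.items()
            for g, gg in formulas.items()
            if f != g and ff['R'] & gg['L']}

edges = dependency_edges(formulas)
```

When a cell reference is too complex to resolve, the analysis conservatively treats R(F) as "all cells", which in this sketch would amount to intersecting with everything, producing the spurious edges the text mentions.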
A spreadsheet formula can access a range of cells (e.g., an aggregate such as avg(s)[‘tv’,*] or the left side of an existential formula such as s[*, *] = 10) and thus require a scan of the data. If two formulas are independent, that is, unrelated in the partial order derived from the graph, they can be evaluated concurrently using a single scan. For concurrent evaluation, formulas are grouped into enumerated levels such that each level contains independent formulas, and no formula in a level depends on a formula in a higher level.

The path through the partial order with the maximum number of scans represents the minimum number of total scans possible, since they are all related by the partial order. If we have an acyclic graph, then we can minimize the number of levels containing scans to this value. The following algorithm generates the levels such that the number of scans is minimized (a proof of minimality is available from the authors). Let G(F, E) be the graph of the → relation, where F is the set of formulas and E is the set of → edges. We call a formula with no incoming edges a source and a formula with only single-cell references a single ref:

  GenLevels(G)
  {
    LEVEL <- 1
    WHILE (F is not empty) {
      Find the set FS of all the SOURCES in F
      IF (cycle is detected)
        break the cycle            /* see below */
      ELSE IF (FS contains any single refs) {
        assign single refs in FS to level LEVEL;
        F = F - {single refs in FS}
      }
      ELSE IF (FS contains only scans) {
        assign formulas in FS to level LEVEL;
        F = F - FS
      }
      LEVEL <- LEVEL + 1
    }
  }

Consider the following query. Here, the spreadsheet graph has one edge: F3 → F2. The algorithm will assign the point reference F3 to level 1 and the scan F2 to level 2, but will delay assigning the scan F1 until level 2 so that F1 and F2 can share a single scan.
  SELECT * FROM f GROUP BY p, t
  SPREADSHEET DBY(p,t) MEA(sum(s) s)
  ( F1: s[‘tv’, 2000] = sum(s)[‘tv’, 1990<t<2000],
    F2: s[‘vcr’,2000] = sum(s)[‘vcr’, 1995<t<2000],
    F3: s[‘vcr’,1999] = s[‘vcr’,1997]+s[‘vcr’,1998] )

The GenLevels algorithm presented above simplifies the cyclic case. Before generating the levels, the graph is analyzed for strongly connected components using the algorithm of Tarjan [1972]. We can then isolate cyclic subgraphs from the acyclic parts of the graph and from other cyclic subgraphs. This is important because the computational complexity of cyclic evaluation is proportional to the total number of rows updated or upserted in a cycle (see the autocyclic algorithm in Section 6.2). A cyclic subgraph can be broken as follows: after assigning levels to the formulas the subgraph depends on, we repeatedly remove formulas from the subgraph and assign them to individual levels, in the same order, until the subgraph is exhausted.

For spreadsheets with a sequential order of evaluation, the dependency edges created always point from an earlier formula to a later formula. A sequential-order spreadsheet graph can therefore never be cyclic. We still generate levels in order to group independent formulas together and, hence, minimize the number of scans required for the computation of aggregates and existential rules in the spreadsheet.

5.2 Pruning Formulas

We expect that, to encapsulate common computations, applications will generate views containing spreadsheets with hundreds of formulas. Users querying these views will likely require only a subset of the result and, hence, put predicates over the views. This gives us an opportunity to prune formulas that compute cells discarded by these predicates.
For example:

  SELECT * FROM
  ( SELECT r, p, t, s FROM f
    SPREADSHEET PBY(r) DBY (p, t) MEA (s) UPDATE
    ( F1: s[‘dvd’,2000] = s[‘dvd’, 1999]*1.2,
      F2: s[‘vcr’,2000] = s[‘vcr’,1998]+s[‘vcr’,1999],
      F3: s[‘tv’, 2000] = avg(s)[‘tv’, 1990<t<2000] )
  )
  WHERE p in (‘dvd’, ‘vcr’, ‘video’)

The evaluation of formula F3 is unnecessary, as the outer query filters out the cell that F3 evaluates. The above formulas are independent, and this makes the pruning process simple. Now suppose we had a formula F4 that depends on F3, such as

  F4: s[‘video’,2000] = s[‘vcr’,2000]+s[‘tv’,2000]

Then F3 cannot be pruned, as it is referenced by F4. The evaluation of a formula becomes unnecessary when the following conditions are satisfied:

— The cells it updates are not used in the evaluation of any other formula.
— The cells updated by the formula are filtered out in the outer query block, or the measure updated by the formula is never referenced in the outer query block.

Identification of formulas that can be pruned is done by the following algorithm based on the dependency graph G. Let a sink be a formula with no outgoing edge, that is, one no other formula depends on.

  PruneFormulas(G)
  {
    Find the set FS of all SINKS
    WHILE (FS is not empty) {
      Pick a formula Fi from FS, FS = FS - {Fi}   /* remove Fi from FS */
      IF ( all the cells referenced on the left side of Fi are filtered
           out in the outer query block
           OR the measure updated by the left side of Fi is not
           referenced in the outer query block) {
        F = F - {Fi}                       /* delete Fi from the set F */
        E = E - {all incoming edges into Fi}
        IF deletion of Fi generates new ‘sink’ nodes
          insert them into the set FS
      }
    }
  }

5.3 Rewriting Formulas

Pruning formulas alone is not sufficient to avoid unnecessary computations during spreadsheet evaluation. In some cases, the results computed by a formula may be partially filtered out in the outer query block.
Consider the following query, which predicts the sales of every product in 2002 to be twice the cost of the same product in 2002, and then selects the sales and cost values of ‘dvd’ and ‘vcr’ for years >= 2000:

  SELECT * FROM
  ( SELECT r, p, t, s, c FROM f
    SPREADSHEET PBY(r) DBY (p, t) MEA (s,c) UPDATE
    ( F1: s[*,2002] = c[cv(p), 2002]*2 )
  )
  WHERE p in (‘dvd’,‘vcr’) and t >= 2000;

Formula F1 cannot be pruned away, as part of its result is needed in the outer query block. Still, we do not need to compute the s values for all products in 2002, as the outer query filters out all rows except those for products ‘dvd’ and ‘vcr’. Hence we rewrite the left side of formula F1 as follows to avoid the unnecessary computation:

  F1’: s[p in (‘dvd’,‘vcr’),2002] = c[cv(p), 2002]*2

The rewriting of formulas is done with a small extension of the algorithm PruneFormulas. In the extended PruneFormulas, we try to rewrite the formulas in all sink nodes that we cannot prune. Note that, similarly to the pruning of a formula, the rewrite of a formula may also change the dependency graph (some incoming edges of the formula might be deleted), possibly leading to the generation of new sink nodes, so it is natural that both rewriting and pruning of formulas are handled in the same procedure.

5.4 Pushing Predicates Through Spreadsheet Clauses

Pushing predicates into an inner query block [Srivastava and Ramakrishnan 1992] and its generalization, "predicate move-around" [Levy et al. 1994], are important optimizations and have been incorporated into queries with spreadsheets. We perform three types of pushing optimization: pushing on PBY and independent DBY dimensions, pushing based on bounding-rectangle analysis, and pushing through reference spreadsheets. Pushing predicates through the PBY expressions into or out of the query block is always correct, as they filter entire partitions.
For example, in

  SELECT * FROM
  ( SELECT r, p, t, s FROM f
    SPREADSHEET PBY(r) DBY (p, t) MEA (s) UPDATE
    ( F1: s[‘dvd’,2000] = s[‘dvd’,1999]+s[‘dvd’,1997],
      F2: s[‘vcr’,2000] = s[‘vcr’,1998]+s[‘vcr’,1999] )
  )
  WHERE r = ‘east’ and t = 2000 and p = ‘dvd’;

we push the predicate r = ‘east’ through the spreadsheet clause into the WHERE clause of the inner query.

Pushing can be extended to independent dimensions. A dimension d is called an independent dimension if, for every formula, the value of d referenced on the right side is the same as the value of d on the left side. For example, in the above spreadsheet, the left side of F1 refers to the same value of p as the right side. This is true for formula F2 as well, thereby making p an independent dimension; t, however, is not an independent dimension. Observe that, in the absence of UPSERT rules, independent dimensions are functionally equivalent to the partitioning dimensions and can be moved from the DBY to the PBY clause. For example, in the above spreadsheet, we could replace the PBY/DBY clauses with

  SPREADSHEET PBY(r, p) DBY (t) MEA (s) UPDATE

Consequently, we can push the predicate p = ‘dvd’ into the inner query. We also pull predicates on the PBY and independent DBY columns out of the query to effect the predicate move-around described in Levy et al. [1994].

The outer predicates on the other (not independent) DBY columns can also be pushed in, but we need to extend the predicates so they do not filter out the cells referenced by the right sides of the formulas. For each formula, we construct a predicate defining the rectangle bounding the cells referenced on the right side. For example, for F2 these predicates are p = ‘vcr’ and t in (1998, 1999), and for F1 they are p = ‘dvd’ and t in (1997, 1999). Then a bounding rectangle for the entire spreadsheet is obtained using the methods described in Guttman [1984] and Beckmann et al. [1990]; it is a union of the bounding rectangles for each formula.
In our case this is p in (‘vcr’, ‘dvd’) and t in (1997, 1998, 1999). The predicates on the DBY columns from the outer query are then extended with the corresponding predicates from the spreadsheet bounding rectangle, and these are pushed into the query. In our example, we extend the outer predicate t = 2000 with t in (1997, 1998, 1999), which results in pushing t in (1997, 1998, 1999, 2000). The predicates on the DBY expressions in the outer query block are kept in place unless the pushdown filter is the same as the outer filter and there are no upsert formulas in the spreadsheet.

We apply the above optimization if all formulas operate in the UPDATE mode or if the spreadsheet has no PBY clause. With a PBY clause, the pushed predicate could eliminate an entire partition, and the upsert of new cells would never take place for it, resulting in missing rows in the output. For example, consider

  SELECT * FROM
  ( SELECT r, p, t, s FROM f
    SPREADSHEET PBY(r) DBY (p, t) MEA (s) UPSERT
    ( s[‘dvd’,2003] = s[‘tv’,2003]*0.5 )
  )
  WHERE p IN (‘dvd’, ‘vcr’)

Based on the bounding-rectangle analysis, the unioned predicate p IN (‘dvd’, ‘vcr’, ‘tv’) is a candidate for pushing down. If, however, there is a region, say ‘west’, with no ‘dvd’, ‘vcr’, or ‘tv’ sales and the predicate is pushed down, the entire region is eliminated, and the new row (r=‘west’, p=‘dvd’, t=2003, s=null) will not be upserted, violating the spreadsheet semantics.

However, even in the presence of PBY and UPSERT formulas, the predicate can be pushed in many situations. If the upserted cells for an empty partition would be filtered out by the outer query anyway, then it does not matter whether the rows for the partition are filtered out before or after the spreadsheet computation.
For example, assume that the outside filter was

  p IN (‘dvd’,‘vcr’) AND s IS NOT NULL

Since region ‘west’ by assumption has no tv sales, the spreadsheet upserts the row (r=‘west’, p=‘dvd’, t=2003, s=null), which is subsequently eliminated by the outer filter s IS NOT NULL. Our analysis determines whether upserted measures can assume null values and whether an outside filter removes the null values of these measures. If so, we push in the predicates derived from the bounding-rectangle analysis. In practical scenarios, applications operate in upsert mode and are not interested in NULL measures, making this option useful.

A challenging scenario arises when the bounding rectangle for a formula cannot be determined at optimization time, since it may depend on a subquery S whose bounds are known only after S's execution. This is common in OLAP queries, which frequently inquire about the relationship of a measure at a child level to that of its parent (e.g., the sales of a state as a percentage of the sales of a country), or about a prior value of a measure (e.g., sales in March 2002 vs. sales in the same month a year ago or a quarter ago). These relationships are obtained by querying dimension tables. For example, assume that the primary key of the time dimension time_dt is month m and that time_dt stores the corresponding month a year ago as m_yago and the corresponding month a quarter ago as m_qago. Note that "quarter ago" means three months ago, so the quarter ago of 1999-01 is 1998-10 (see Table II). An analyst wants to compute, for the product ‘dvd’ and months (1999-01, 1999-03), the ratio of each month's sales to the sales in the corresponding months a year and a quarter ago, respectively (r_yago and r_qago).

Table II. Mapping Between m and m_yago/m_qago

  m        m_yago   m_qago
  1999-01  1998-01  1998-10
  1999-02  1998-02  1998-11
  1999-03  1998-03  1998-12

Using SQL Spreadsheet, this query, which we will call Q1, is

  Q1: SELECT p, m, s, r_yago, r_qago FROM
  ( SELECT p, m, s FROM f GROUP BY p, m
    SPREADSHEET
      REFERENCE prior ON (SELECT m, m_yago, m_qago FROM time_dt)
        DBY(m) MEA(m_yago, m_qago)
      PBY(p) DBY (m) MEA (sum(s) s, r_yago, r_qago)
    ( F1: r_yago[*] = s[cv(m)] / s[m_yago[cv(m)]],
      F2: r_qago[*] = s[cv(m)] / s[m_qago[cv(m)]] )
  )
  WHERE p = ‘dvd’ and m IN (1999-01, 1999-03)

The reference spreadsheet serves as a one-dimensional lookup table mapping month m to the corresponding month a year ago (m_yago) and a quarter ago (m_qago). An alternative formulation of the query using ANSI SQL requires the joins f ⋈ time_dt ⋈ f ⋈ f, where the first join gives the month values a year and a quarter ago for each row in the fact table, and the other two joins give the sales values in the same month a quarter ago and a year ago, respectively. The reference spreadsheet reduces the number of joins to one.

The predicate p = ‘dvd’ on the PBY column can be pushed into the inner block. However, m is not an independent dimension, nor can bounding rectangles be determined for it, as the values m_yago and m_qago are unknown. Consequently, a restriction on m cannot be pushed in, resulting in all time periods being pumped into the spreadsheet, out of which all except 1999-01 and 1999-03 are subsequently discarded in the outer query.

Let us call a dimension d a functionally independent dimension if, for every formula, the value of d referenced on the right side is either the same as the value of d on the left side or a function of the value of d on the left side via a reference spreadsheet. In query Q1 above, m is a functionally independent dimension, as the right side uses m directly or uses a function of the value of m on the left side: m_yago[cv(m)] and m_qago[cv(m)].
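The reference-spreadsheet lookups in F1 and F2 can be emulated with a dictionary playing the role of the one-dimensional reference array. In this Python sketch, the mapping comes from Table II, while the monthly ‘dvd’ sales figures are made up for illustration:

```python
# Reference spreadsheet: month -> (m_yago, m_qago), as in Table II.
prior = {'1999-01': ('1998-01', '1998-10'),
         '1999-02': ('1998-02', '1998-11'),
         '1999-03': ('1998-03', '1998-12')}

# Hypothetical monthly 'dvd' sales, keyed by month m.
s = {'1999-01': 120.0, '1999-03': 150.0,
     '1998-01': 100.0, '1998-03': 120.0,
     '1998-10': 96.0,  '1998-12': 125.0}

months = ('1999-01', '1999-03')
# F1: r_yago[*] = s[cv(m)] / s[m_yago[cv(m)]]
r_yago = {m: s[m] / s[prior[m][0]] for m in months}
# F2: r_qago[*] = s[cv(m)] / s[m_qago[cv(m)]]
r_qago = {m: s[m] / s[prior[m][1]] for m in months}
```

The nested lookup `s[prior[m][0]]` is exactly the s[m_yago[cv(m)]] pattern: one probe into the reference array, then one probe into the main spreadsheet, replacing the three-way join of the ANSI SQL formulation.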
We experimented with three transformations to push predicates through functionally independent dimensions. In the first, called ref-sub-query pushing, we add to the inner block a subquery predicate which selects all values needed by the spreadsheet and the outer query. The transform is similar to the magic set transformation [Mumick et al. 1990], which pushes a query derived from outer predicates into the inner block. In the above case, the outer query needs m IN (1999-01, 1999-03), and the spreadsheet needs these values plus their corresponding m_yago and m_qago values from the reference spreadsheet. These values can be obtained by constructing a subquery over the reference spreadsheet, as shown in Q2:

Q2: WITH ref_sub_query AS
      (SELECT m, m_yago, m_qago FROM time_dt
       WHERE m IN (1999-01, 1999-03))
    SELECT m AS m_value FROM ref_sub_query
    UNION
    SELECT m_yago AS m_value FROM ref_sub_query
    UNION
    SELECT m_qago AS m_value FROM ref_sub_query

and then pushing it into the inner block of the query:

SELECT p, m, s, r_yago, r_qago
FROM ( SELECT p, m, s FROM f
       WHERE m IN (SELECT m_value FROM Q2)
       GROUP BY p, m
       SPREADSHEET <.. as above in query Q1 .. > )
WHERE p = 'dvd' and m IN (1999-01, 1999-03)

In the second transformation, called extended pushing, we construct the pushed-in predicates by executing the reference spreadsheet query, obtaining the referenced values, building predicates on the dimension, and finally disjuncting them with the outer predicates. In the above case we execute

SELECT DISTINCT m_yago, m_qago FROM time_dt
WHERE m IN (1999-01, 1999-03)

to obtain the values of m_yago and m_qago corresponding to m IN (1999-01, 1999-03). Let's assume that the corresponding m_yago is (1998-01, 1998-03) and m_qago is (1998-10, 1998-12), that is, the corresponding months a year back and a quarter back, respectively.
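How the extended-pushing value set is assembled can be sketched in Python. This is an illustration only: the real transform builds an IN-list predicate from query results, whereas the sketch uses an in-memory dictionary standing in for time_dt.

```python
# Illustrative sketch of extended pushing: execute the reference-spreadsheet
# query for the outer months, then disjunct the returned m_yago/m_qago values
# with the outer predicate's own values. time_dt stands in for the real table.

time_dt = {
    "1999-01": ("1998-01", "1998-10"),
    "1999-02": ("1998-02", "1998-11"),
    "1999-03": ("1998-03", "1998-12"),
}

def pushed_in_months(outer_months):
    """Values the pushed-in predicate 'm IN (...)' must cover: the outer
    months plus every month reachable through the reference spreadsheet."""
    needed = set(outer_months)
    for m in outer_months:
        m_yago, m_qago = time_dt[m]
        needed.update((m_yago, m_qago))
    return sorted(needed)
```

For the outer months (1999-01, 1999-03) this yields the six-month IN-list that the transformation pushes into the inner block.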
Finally, we push this predicate into the inner query:

SELECT p, m, s, r_yago, r_qago
FROM ( SELECT p, m, s FROM f
       WHERE m IN (1999-01, 1999-03,   /* outer preds */
                   1998-01, 1998-03,   /* previous year */
                   1998-10, 1998-12)   /* previous quarter */
       GROUP BY p, m
       SPREADSHEET <.. as above in query Q1 .. > )
WHERE p = 'dvd' and m IN (1999-01, 1999-03)

In the third transformation, called formula unfolding, we transform the formulas by replacing the reference spreadsheet with its values. As in the second transformation, we execute the reference spreadsheet query and obtain its measures for each of the dimension values requested by the outer query. These values are then used to unfold the formulas. For example, for m = 1999-01 the value of m_yago is 1998-01 and that of m_qago is 1998-10, and for m = 1999-03 the value of m_yago is 1998-03 and that of m_qago is 1998-12. Thus the formulas are unfolded as

SELECT p, m, s, r_yago, r_qago
FROM ( SELECT p, m, s FROM f GROUP BY p, m
       SPREADSHEET
         REFERENCE prior ON (SELECT m, m_yago, m_qago FROM time_dt)
           DBY(m) MEA(m_yago, m_qago)
         PBY(p) DBY (m) MEA (sum(s) s, r_yago, r_qago)
         ( F1:  r_yago[1999-01] = s[1999-01] / s[1998-01],
           F1': r_yago[1999-03] = s[1999-03] / s[1998-03],
           F2:  r_qago[1999-01] = s[1999-01] / s[1998-10],
           F2': r_qago[1999-03] = s[1999-03] / s[1998-12] ) )
WHERE p = 'dvd' and m IN (1999-01, 1999-03)

Following formula unfolding, we perform the bounding rectangle analysis described above and push the resulting bounding predicate into the inner query.

In our experiments (see Section 8), the extended pushing and formula unfolding transformations resulted in similar performance, as in most cases they push in the same predicates. In comparison, the ref-sub-query push transform had inferior performance. The use of ref-sub-query gives the optimizer a choice of join method between the subquery and the main query block.
The optimizer sometimes selects a more expensive join method, thereby slowing down the query (see the experimental results in Section 8).

5.5 Optimizations of Aggregates

SQL Spreadsheet allows us to express complex business models within a single query block. Frequently the models include multiple aggregates on subsets of the data relative to the current row; hence their optimization is critical. Consider this query:

SELECT r, p, t, s, ps FROM t
SPREADSHEET PBY(r) DBY(p,t) MEA(s, t, 0 ps) UPDATE
( ps[*,*] = s[cv(p), cv(t)-1] *
            (1 + slope(s,t)[cv(p), cv(t)-5 <= t <= cv(t)-1]) )

It computes the projected sales, ps, of each product for every year. ps is computed by multiplying the actual sales s from the previous year by the rate of increase of sales (expressed as the slope aggregate function) over the last 5 years. The aggregate is relative to the current row: for each row on the left side we take its product and, within it, compute the slope over the 5 previous years.

With a naive execution, this query is expensive. The right side of the formula is computed for each row coming into the spreadsheet. The right side contains an aggregate function, which requires a full scan of table t. Hence there are as many full table scans as there are rows in table t, a prohibitively expensive execution plan.

Spreadsheet evaluation can be optimized by reducing the number of table scans. For each cell on the left side, that is, for each product and each year, we have to access the sales for the previous 5 years of that product to compute the requested slope aggregate. Suppose that before evaluating the formula we partition the data by product and order each partition by year. Then within each product partition we can consider a sliding window of the 5 past years. Thus, for year 2000 we look at years 1995-1999, for year 2001 at years 1996-2000, etc.
As we slide the window, we can compute the slope aggregate for each window frame with a single scan of the sorted data. The slope can be expressed through sum and count aggregates, and thus belongs to the family of algebraic aggregates [Gray et al. 1996]; hence it can be maintained incrementally during the sliding window operation. ANSI SQL provides window functions [Zemke et al. 1999] for that operation, and many database systems (Oracle, DB2) already provide a native implementation of them. The slope aggregate

slope(s,t)[cv(p), cv(t)-5 <= t <= cv(t)-1]

can be rewritten using the ANSI SQL window formulation as

slope(s,t) OVER (PARTITION BY p ORDER BY t
                 RANGE BETWEEN 5 PRECEDING AND 1 PRECEDING)

We can rewrite an aggregate with a window function when (1) the formula is not self-cyclic, and (2) one dimension of the aggregate defines a window relative to the current row using cv() and all other dimensions are qualified by the values from the current row, that is, by cv(). Let's denote the other dimensions as Dcv. The Dcv dimensions are used to partition the data (see the product dimension above), and the dimension defining the window is used for sorting within the partitions (see the time dimension above).

The algorithm GenLevels, which assigns formulas to execution levels, places formulas with window functions at the same level if they can share a sort. For example, the two formulas in

SPREADSHEET DBY(r,p,t) MEA(s, t, 0 r, 0 w) UPDATE
( w[*,*,*] = AVG(s)[cv(), cv(), cv(t)-5 <= t <= cv(t)],
  r[*,*,*] = SUM(s)[cv(), p = 'vcr', 1999 < t <= 2000] )

will be rewritten with two window functions:

AVG(s) OVER (PARTITION BY r, p ORDER BY t
             RANGE BETWEEN 5 PRECEDING AND CURRENT ROW),
SUM(CASE WHEN p = 'vcr' AND 1999 < t <= 2000 THEN s ELSE NULL END)
  OVER (PARTITION BY r)

and a single sort on (r, p, t) will satisfy both formulas.
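As a concrete illustration of why an algebraic aggregate can ride along with the sliding window, the Python sketch below maintains the least-squares slope from running sums over a frame of the 5 preceding years. Each row enters and leaves the sums exactly once, giving a single pass over a partition sorted by t; the data and function name are invented.

```python
# Illustrative sketch: slope(s, t) decomposes into sums (count, sum t, sum s,
# sum t*s, sum t*t), so the frame cv(t)-5 <= t <= cv(t)-1 can be maintained
# incrementally in one pass over a partition sorted by t. Data is invented.

def sliding_slopes(rows, width=5):
    """rows: (t, s) pairs sorted by t within one product partition.
    Returns {t: least-squares slope over the frame [t-width, t-1]}."""
    n = sx = sy = sxy = sxx = 0.0
    frame, out = [], {}
    for t, s in rows:
        while frame and frame[0][0] < t - width:   # row leaves the frame
            ot, os = frame.pop(0)
            n -= 1; sx -= ot; sy -= os; sxy -= ot * os; sxx -= ot * ot
        denom = n * sxx - sx * sx
        if denom:                                  # needs >= 2 distinct years
            out[t] = (n * sxy - sx * sy) / denom
        frame.append((t, s))                       # row enters later frames
        n += 1; sx += t; sy += s; sxy += t * s; sxx += t * t
    return out
```

For sales growing by 2 per year over 1995-1999, the slope reported for year 2000 is 2, computed without rescanning the partition for each row.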
5.6 Optimization of Qualified Aggregates

It is common for aggregates to apply to a predetermined set of cells that is much smaller than the total number of cells in the model. Consider aggregates applied to a window of cells around the current cell, as in this example of forecasting sales of DVDs in the next 2 years based on the 3-year moving average over the model:

SPREADSHEET DBY (p, t) MEA (s, 0 mavg)
( mavg['dvd', FOR t FROM 2000 TO 2001] =
    1.05 * AVG(s)[cv(), cv()-3 <= t <= cv()-1] )

This aggregate would normally require a scan of all the rows in the access structure to determine the rows satisfying the predicate in the aggregate cell reference. If the aggregate set is significantly smaller than the partition currently being processed, it may be more efficient to explicitly enumerate and look up each value that falls within this set rather than perform a scan. To allow this, we provide the qualified loop operator, which lets a user specify an enumerated set on which to compute aggregates in the assignment expression. An equivalent expression for the forecast above (when years are positive integers) is

SPREADSHEET DBY (p, t) MEA (s, 0 mavg)
( mavg['dvd', FOR t FROM 2000 TO 2001] =
    1.05 * AVG(s)[cv(), FOR t FROM cv()-3 TO cv()-1 INCREMENT 1] )

In this case, for each unfolded formula on the left side, there will be three direct cell lookups generated by the formula on the right side rather than a condition applied across a scan of the entire dataset. Such an aggregate is called a qualified aggregate. Each dimension in a qualified aggregate must be fully qualified either as a FOR loop or as a single-value equality expression. Individual cell values for lookup are generated by incrementing the qualified expressions from left to right through the cell index. The performance of qualified and existential aggregates is discussed in Section 8.

Table III.
Summary of Formula Transformations

  Transformation                         Major Technique
  Pruning of formulas                    Determining formula sinks and rows filtered by the outer query
  Rewriting of formulas                  Changing the scope of formulas based on the outer filters
  Pushing of predicates                  Pushing through PBY, pushing through independent dimensions, bounding rectangles on outside filters
  Data-dependent pushing of predicates   Ref-sub-query pushing, formula unfolding, extended pushing
  Optimization of aggregates             Conversion of aggregates to window functions
  Optimization of qualified aggregates   Using point access instead of scans for aggregates

Table III summarizes the transformation strategies used for SQL Spreadsheet optimizations.

6. SQL SPREADSHEET EXECUTION

Spreadsheet evaluation is handled just like any other operation in the RDBMS query evaluation engine. The spreadsheet evaluation operator takes a set of rows as input, maps these rows into a multidimensional access structure (a hash table) based on the PBY and DBY specifications in the spreadsheet clause, then evaluates the formulas of the spreadsheet to upsert new cells or to modify the measures in the cells, and finally returns these cells as a set of output rows. If reference spreadsheets are specified in the spreadsheet clause, the spreadsheet operator takes input rows for each reference spreadsheet from their respective query blocks and builds a hash table on each of them so that they can be "referenced" during formula evaluation. The hash tables for reference spreadsheets are created read-only and are discarded at the end of the spreadsheet computation. We elaborate on the evaluation steps of the spreadsheet operator below.

6.1 Access Structure

For efficient access to single cells (like s[p = 'dvd', t = 2000]), we build a two-level hash access structure. In the first level, called the hash partition, data is hash-partitioned on the PBY columns.
Please note that data for more than one spreadsheet partition may end up in the same hash partition. In the second level, a hash table is built on the PBY and DBY columns within each first-level partition; hence multiple spreadsheet partitions can exist within a hash partition. We use the term partitioning phase for splitting the data into hash partitions (the first phase). This two-level scheme enables us to evaluate spreadsheets efficiently and to reduce the memory requirement as well: the memory required at any point during spreadsheet execution equals the size of the hash partition being operated on at that time. To minimize the size and build time of the hash access structure, we build it only on the rows required by the formulas, as defined by the spreadsheet bounding rectangle (see Section 4). The number of hash partitions is chosen based on the estimated data size and the amount of memory available; the goal is to have the largest hash partition fit in memory.

After the partitioning phase, we enter the formula execution phase. In this phase, spreadsheet formulas are evaluated one spreadsheet partition at a time. We pin a hash partition in memory. As it can contain more than one spreadsheet partition, we consider one spreadsheet partition at a time and evaluate the formulas within it. We repeat this for all the spreadsheet partitions of the hash partition and then move on to the next hash partition. In some cases, as explained later, we are able to evaluate all formulas within a hash partition at once for better performance.

In most cases, a hash partition fits in memory, resulting in efficient evaluation of the formulas. There are situations, such as data skew or a shortage of runtime memory, when memory is not sufficient to hold some hash partitions. When this happens, we build a disk-based hash table.
It employs techniques such as a weighted LRU scheme for block replacement, pointer swizzling to make references lightweight, and write-back of only those disk blocks that are dirty. To overlap computation and I/O, we use asynchronous reads and writes whenever possible during the construction and use of the hash access structure. During the partitioning phase, we issue asynchronous writes of full blocks to free them for new data. Similarly, asynchronous reads are issued during scan operations.

The hash access structure supports operations such as probe, update, upsert, insert, and scan. A scan operation can return all records matching a given DBY key, or return the records within a hash or spreadsheet partition. As part of these scan operations, the hash access structure also allows the current record to be updated. Updating the current row being scanned is very useful when evaluating existential formulas, since it avoids the additional lookup otherwise needed to perform the update.

A collision occurs when records with different PBY and DBY keys are mapped to the same hash bucket. We handle collisions by chaining the colliding records in the hash bucket. Collisions degrade the performance of the lookup operation. We try to reduce collisions by sizing the hash table in spreadsheet partitions to have N times (N = 2 by default) as many buckets as there are records in the hash partition. We count the records within a hash partition during the initial partitioning step and use that number to size the hash table within the hash partition. Records within a hash bucket are clustered based on key values, making scans of records with the same key value efficient. We now describe the execution algorithms used in evaluating spreadsheet queries.

6.2 Execution

Formulas in SQL Spreadsheet operate in automatic order or sequential order.
Figure 2 classifies spreadsheets based on the evaluation order and dependency analysis, and identifies the execution algorithm. There are three algorithms: Auto-Acyclic, Auto-Cyclic, and Sequential.

6.2.1 Automatic Order. The order of evaluation of formulas in an automatic order spreadsheet is given by their dependencies (see Section 5.1). We have two methods for its execution.

Fig. 2. Classification of spreadsheet evaluation.

6.2.1.1 Auto-Acyclic Algorithm. The Auto-Acyclic algorithm is used when no cycles are detected in the formula dependency graph:

Auto-Acyclic() {
  FOR each spreadsheet partition P {
    FOR level Li from L1 to Ln {
      /* LSi = set of formulas in Li with single cell refs on left side
         LEi = set of formulas in Li with existential conditions on left side
         First, evaluate all aggregates in set LSi,
         then all formulas in that set */
      FOR each record r in P                 -- (Scan I)
        FOR each aggregate A in LSi apply r to A;
      FOR each formula F in LSi evaluate F;
      /* Evaluate all formulas in LEi */
      FOR each record r in P {               -- (Scan II)
        find formulas EF in LEi to be evaluated for r
        FOR each record r' in P              -- (Scan III)
          FOR each aggregate A in EF apply r' to A;
        FOR each formula EF evaluate EF;
      }
    }
  }
}

Notice that all the aggregates at any level are computed before the evaluation of formulas at that level, so they are available to the formulas. This requires a scan of the records in the partition for each level. In the absence of existential formulas, and in the presence of only those aggregate functions for which an inverse is defined (for example, SUM, COUNT, etc.), the aggregates for all levels are computed in a single scan. With each formula we store a list of aggregates dependent on the cell being upserted (or updated) by it.
It is possible to determine such a list because there are only single-cell references on the left side, and these values can be substituted into the aggregate cell reference predicate to find the dependent formulas. So, if a formula changes the value of its measure, the corresponding dependent aggregates are updated by applying the current value and the inverse of the old value of the measure. In the above algorithm, we can also combine Scan I with Scan II or Scan III.

An example of an acyclic spreadsheet:

SELECT r, p, t, s FROM f
SPREADSHEET PBY(r) DBY (p, t) MEA (s)
( s['tv', 2002] = s['tv', 2001] * 1.1,
  s['vcr',2002] = s['vcr', 1998] + s['vcr', 1999],
  s['dvd',2002] = (s['dvd',1997] + s['dvd',1998]) / 2,
  s[*, 2003]    = s[cv(p), 2002] * 1.2 )

The above query makes sales forecasts for the years 2002 and 2003. The formulas are split into two levels. The first level consists of the first three formulas, projecting sales for 2002, and the second level, dependent on the first, consists of the last formula, projecting sales for 2003. The Auto-Acyclic algorithm evaluates the formulas in the first level before evaluating the formula in the second level.

6.2.1.2 Auto-Cyclic Algorithm. There are also automatic order spreadsheets which are either cyclic or have complex predicates that make the existence of cycles indeterminate. In such cases (see Section 5.1), the dependency analysis approximately groups the formulas into levels by finding sets of formulas comprising strongly connected components (SCCs) and assigning the formulas in an SCC to consecutive levels. The Auto-Cyclic algorithm evaluates formulas that are not contained in SCCs as in the acyclic case, but when
formulas in SCCs are encountered, it iterates over the consecutive SCC formulas until a fixed point is reached, but only up to a maximum of N iterations, where N is the number of cells upserted (or updated) in the first iteration. If the spreadsheet was actually acyclic, the formulas will converge after at most N iterations: in the worst case, if the formulas are evaluated in exactly the opposite order of their (real) dependencies, each iteration propagates one correct value to another formula, hence requiring N iterations. Therefore, to evaluate all acyclic spreadsheets which could not be classified as acyclic, and to limit the number of iterations for cyclic spreadsheets, the maximum number of iterations for the evaluation of formulas is fixed at N. If the spreadsheet does not converge within N iterations, an error is returned to the user.

To determine whether the spreadsheet has converged after an iteration, a flag is stored with each measure; it is set whenever the measure is referenced while evaluating a formula. If a measure whose flag is set is later updated to a different value, additional iterations are required to reach a fixed point. Similarly, the insertion of a new cell (by an UPSERT formula) signals additional iterations. This technique requires resetting the flags of all measures after each iteration, an expensive proposition. Hence, instead of a single flag, two flags are stored, each used in alternate iterations: while one flag is being set, the other can be cleared.

6.2.2 Sequential Order. In a sequential order spreadsheet, formulas are evaluated in the order they appear in the spreadsheet clause. The dependency analysis still groups the formulas into levels consisting of independent formulas, so that the number of scans required for the computation of aggregate functions is minimized.
The algorithm is similar to Auto-Acyclic, but there may be multiple iterations, as specified in the ITERATE spreadsheet processing option.

6.3 Parallel Execution of SQL Spreadsheet

To improve the scalability of spreadsheet evaluation, formulas can be evaluated in parallel for different partitions. The technique for parallel evaluation of spreadsheet queries is covered in its entirety in Witkowski et al. [2003], and we omit it here for lack of space.

7. PARAMETERIZING SQL SPREADSHEET

7.1 Parameterization of the SQL Query Block

Oracle users have two ways of abstracting and persisting complex computations: ANSI SQL views and functions. A disadvantage of views is that data cannot be passed to them; the computation is always over the fixed set of objects specified in the FROM clause of the view query. Allowing the view to be parameterized, by making it possible to pass subqueries and scalars to it, would significantly expand its capabilities as a computational object.

Functions, which can return row-sets and hence participate in further SQL processing, are the other way of expressing complex computations. Users have multiple implementation languages to choose from (C, Java, PL/SQL), with Oracle PL/SQL being the most common. Functions implemented in PL/SQL can use a mix of imperative and declarative SQL styles of programming, but this flexibility comes at the expense of suboptimal plans. An SQL query Q invoked from a procedural PL/SQL function F is optimized in isolation and does not participate in interquery optimization. For example, predicates outside of F cannot be pushed into Q, Q is not merged with queries invoking F, etc.

To alleviate these disadvantages, we propose to express functions declaratively with SQL. An SQL-language function is a function whose body is an SQL query. Its parameters can be either scalars or subqueries producing row-sets.
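The idea of a function whose body is itself a query can be sketched in Python (illustrative only, not Oracle syntax): because the body is a declarative expression over a row-set parameter, invocation is plain expression composition, which is what lets an optimizer later inline the body and rewrite the whole expression, unlike a call into opaque procedural code.

```python
# Illustrative sketch: an SQL-language function is a query parameterized by a
# row-set and scalars. Here the "body" is a comprehension over the row-set
# parameter, so invocation is just expression composition. Names are invented.

def region_sales(f, region):
    """Body-as-query: SELECT r, p, t, s FROM f WHERE r = region."""
    return [(r, p, t, s) for (r, p, t, s) in f if r == region]

t = [("west", "tv", 2001, 10.0),
     ("west", "dvd", 2001, 8.0),
     ("east", "tv", 2001, 7.0)]

# The subquery argument is itself just another row-set expression.
rows = region_sales(((r, p, yr, s) for (r, p, yr, s) in t), "west")
```

A procedural function, by contrast, would hide the filter inside compiled code, leaving the caller's predicates nowhere to be pushed.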
We support two types of SQL-language functions: strongly typed, where type checking is done at function creation time, and weakly typed, where type checking is deferred to invocation time. For example,

CREATE FUNCTION region_sales_2002
  (f TABLE OF ROW (r VARCHAR, p VARCHAR, t INT, s NUMBER),
   region VARCHAR)
RETURN MULTISET LANGUAGE SQL AS
  SELECT r, p, t, s FROM f
  WHERE r = region
  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  ( s['vcr',2002] = s['vcr', 1998] + s['vcr', 1999],
    s['dvd',2002] = avg(s)['dvd', 1990 < t < 2001],
    s[*, 2003]    = s[cv(p), 2002] * 1.2 )

defines a strongly typed SQL-language function with two parameters: a subquery f and a scalar region. The subquery parameter is defined using the TABLE OF ROW clause describing f's schema. The result type of the function is derived from the SELECT list of the query and in this case is the same as that of the input parameter f. This type can also be specified using the TABLE OF ROW clause in the RETURN subclause. The subclause RETURN MULTISET LANGUAGE SQL indicates that the function produces a row-set and is implemented in SQL.

The function designer may not know the data types of the parameters in advance, and for this case we provide weakly typed functions, where a reserved type ANYTYPE delays type checking until invocation time. For example, region_sales_2002 can be weakly defined as

CREATE FUNCTION region_sales_2002
  (f TABLE OF ROW (r ANYTYPE, p ANYTYPE, t ANYTYPE, s ANYTYPE),
   region ANYTYPE)
RETURN MULTISET LANGUAGE SQL AS <...>

SQL-language functions are invoked by placing them in the FROM clauses of queries. For example, the following query

SELECT r, p, t, s
FROM region_sales_2002 ( (SELECT reg, prod, time, sale FROM t),
                         'west' )
WHERE p = 'tv';

invokes the region_sales_2002 function and passes it a subquery and a scalar.
The entire query is expanded by our view expansion to

SELECT r, p, t, s
FROM ( SELECT r, p, t, s
       FROM (SELECT reg r, prod p, time t, sale s FROM t)
       WHERE r = 'west'
       SPREADSHEET PBY(r) DBY (p, t) MEA (s)
       ( s['vcr',2002] = s['vcr', 1998] + s['vcr', 1999],
         s['dvd',2002] = avg(s)['dvd', 1990 < t < 2001],
         s[*, 2003]    = s[cv(p), 2002] * 1.2 ) )
WHERE p = 'tv';

Following that, the dynamic optimizations described in Section 5 are applied, pruning the first two rules, rewriting the third one, pushing the predicate p = 'tv' inside, and pushing the predicate t IN (2003, 2002), derived from the bounding rectangle analysis, into the query block. This results in

SELECT r, p, t, s FROM f
WHERE r = 'west' AND p = 'tv' AND t IN (2003, 2002)
SPREADSHEET PBY(r) DBY (p, t) MEA (s)
( s['tv',2003] = s[cv(p), 2002] * 1.2 )

Observe that the resulting query benefited greatly from interquery optimization, a feature not available for functions implemented procedurally, such as in C or PL/SQL.

7.2 Parameterization of the SQL Spreadsheet Clause

Parameterization of SQL-language functions allows us to build SQL models that preserve spreadsheet optimizations without knowing the object names and schemas of user applications. However, it does not provide a framework for building user-defined functions using SQL Spreadsheet's most potent constructs: representing a relation as an array and defining formulas on it. For this, we extend the concept of functions whose bodies are SQL queries to procedures whose bodies contain SQL Spreadsheet clauses. This is useful for implementing functions present in classic spreadsheets, like net present value (NPV), that are not present in ANSI SQL.

The SQL Spreadsheet procedure is a function whose body contains SQL Spreadsheet formulas and whose parameters are scalars and multidimensional, multimeasure arrays.
The arrays can be declared as input, output, or input/output parameters, denoted by IN, OUT, or INOUT following the Oracle PL/SQL convention. The subscript of the array is always an IN parameter. As with SQL-language functions, the declaration of arrays allows for strong and weak types. For example, the SQL Spreadsheet procedure

CREATE PROCEDURE net_present_value
  (ARRAY DBY (i IN INTEGER)
         MEA (amount IN NUMBER, npv OUT NUMBER),
   rate NUMBER)
LANGUAGE SQL SPREADSHEET AS
RULES IGNORE NAV
( npv[1] = amount[1],
  npv[i > 1] ORDER BY i =
    amount[CV(i)] / POWER(1+rate, CV(i)) + npv[CV(i) - 1] )

calculates the net present value npv of amount over sequential time periods i, based on

  npv = sum_i amount_i / (1 + rate)^i

Observe that, in the SQL formulation, the summation operator is replaced by looping over all values in the array in order of i (npv[i > 1] ORDER BY i) and adding the previously calculated NPV value (npv[CV(i) - 1]) to the term currently computed. The procedure accepts two parameters: an array dimensioned by an integer i, with two measures, amount (an IN parameter) and npv (an OUT parameter), and a scalar parameter rate. The body of the procedure is the RULES subclause of SQL Spreadsheet, which implements the net present value formula given above.

The SQL Spreadsheet procedure is invoked from the SQL Spreadsheet clause. The invoker maps rectangular regions of the main or reference spreadsheet to the actual array parameters of the procedure. These regions are defined using predicates on the DBY columns of the spreadsheet and are then mapped to arrays, indicating which columns of the regions form the indexes and measures of the array. We explain this with an example. Consider a relational table cash_flow(year, period, prod, amount) expressing the cash flow for electronic products in the years 1999-2002. Years are assigned sequential time periods 1-4 (see Table IV). This analysis is made from the time perspective of the first day of 1999.
For each product, there is an initial negative cash flow at the end of 1999, representing the investment in the product. The later years have positive cash flows, representing the sales of the product.

Table IV. Cash Flow Table

  Year  Period i  Prod  Amount   Npv Result
  1999  1         vcr   -100.00  -100.00
  2000  2         vcr     12.00   -90.70
  2001  3         vcr     10.00   -84.01
  2002  4         vcr     20.00   -72.17
  1999  1         dvd   -200.00  -200.00
  2000  2         dvd     22.00  -183.07
  2001  3         dvd     12.00  -174.97
  2002  4         dvd     14.00  -166.68

Assume that prod and i form the DBY clause of this SQL Spreadsheet:

SPREADSHEET DBY (prod, i) MEA (year, amount, 0 npv) ()

Then (amount, npv)['vcr', *] designates a rectangular region with two measures, amount and npv, within that spreadsheet. The first dimension in this rectangle is qualified by a constant. Hence, we can map the region to a one-dimensional array dimensioned by i, with two measures, using the SQL CAST operator:

CAST ((amount, npv)['vcr', *] AS
      ARRAY DBY (i IN INTEGER)
            MEA (amount IN NUMBER, npv OUT NUMBER))

A default casting is also provided: if the region fits the shape of the array, the CAST operator is not needed. In the (amount, npv)['vcr', *] region, the prod dimension is qualified to be a constant while the i dimension is unqualified, so the region can, by default, be mapped to a one-dimensional array.

Casting operations may be expensive if the array shape is not compatible with the spreadsheet frame; in this case, we build another random-access hash structure for the array at runtime. If the array shape is compatible with the spreadsheet frame, we reuse the hash access structure of the spreadsheet. In our example, the (amount, npv)['vcr', *] region can reuse the spreadsheet access structure, increasing the efficiency of the computation.
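The recurrence implemented by the net_present_value procedure can be checked with a short Python sketch. The amounts below are the dvd rows of Table IV; note that, per the first rule, npv[1] = amount[1] is not discounted.

```python
# Sketch of the procedure's two rules: npv[1] = amount[1], and for i > 1,
# evaluated in order of i: npv[i] = amount[i] / (1 + rate)**i + npv[i-1].
# The amounts are the dvd rows of Table IV; rate = 0.14 as in the example.

def net_present_value(amounts, rate):
    """amounts: values for periods i = 1..n; returns the running npv list."""
    npv = [amounts[0]]                      # npv[1] = amount[1]
    for i, amt in enumerate(amounts[1:], start=2):
        npv.append(amt / (1 + rate) ** i + npv[-1])
    return npv

dvd = net_present_value([-200.0, 22.0, 12.0, 14.0], rate=0.14)
```

Rounded to cents, the computed sequence reproduces the dvd entries of the table's final column (-200.00, -183.07, -174.97, -166.68).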
To calculate the net present value of the 'vcr' and 'dvd' products, one would then write

SELECT year, i, prod, amount, npv FROM cash_flow
SPREADSHEET DBY (prod, i) MEA (year, amount, NULL npv)
( net_present_value((amount, npv)['vcr', *], 0.14),
  net_present_value((amount, npv)['dvd', *], 0.14) )

This is then expanded to the equivalent form

SELECT year, i, prod, amount, npv FROM cash_flow
SPREADSHEET DBY (prod, i) MEA (year, amount, NULL npv) IGNORE NAV
( npv['vcr', 1] = amount['vcr', 1],
  npv['vcr', i > 1] ORDER BY i =
    amount[CV(prod), CV(i)] / POWER(1+rate, CV(i)) +
    npv[CV(prod), CV(i) - 1],
  npv['dvd', 1] = amount['dvd', 1],
  npv['dvd', i > 1] ORDER BY i =
    amount[CV(prod), CV(i)] / POWER(1+rate, CV(i)) +
    npv[CV(prod), CV(i) - 1] )

The equivalent form is then subject to all the optimizations described in Section 5. In this case, the bounding rectangle analysis pushes the predicate prod IN ('vcr', 'dvd') into the WHERE clause of the query block. This would not be possible (or would be too hard) if the net_present_value function were implemented in a procedural language. The output of the query, with the amounts of Table IV and an annual interest rate of 14%, is shown in the "Npv Result" column of Table IV.

8. EXPERIMENTAL RESULTS

We conducted our experiments on the APB benchmark database² populated with 0.1 density data. The APB schema has a fact table with 4 hierarchical dimensions: channel with two levels, time with three levels, customer with three levels, and product with seven levels. We constructed a cube over the fact table and materialized it in the apb_cube table. Like the fact table, the cube has four dimensions, t(ime), p(roduct), c(ustomer), and h(channel), each represented as a single column with all hierarchical levels encoded into a single value. The cube had bitmap indexes on the dimension columns and contained 22,721,998 rows.
The experiments were conducted on a 12-CPU, 336-MHz, shared-memory machine with a total of 12 GB of memory. The experiments report units of time rather than absolute measures like seconds, as they were done on a commercial prototype still undergoing tuning.

8.1 Pushing Predicates Experiment

We used a spreadsheet query calculating the ratio of sales for every product level to its first, second, and third parents in the product hierarchy. The APB product hierarchy has seven levels: prod, class, group, family, line, division, and top. Thus, for a product in the prod level, we calculated the share of its sales relative to its corresponding class, group, and family levels. Assuming that the parent information of a product was stored in a dimension table product_dt with columns p, parent1, parent2, parent3 (product, its parent, grandparent, and great-grandparent, respectively), the query had the form

    Q3: SELECT s, share_1, share_2, share_3, p, c, h, t
        FROM apb_cube
        SPREADSHEET
          REFERENCE ON (SELECT p, parent1, parent2, parent3 FROM product_dt)
            DBY (p) MEA (parent1, parent2, parent3)
          PBY (c, h, t) DBY (p) MEA (s, 0 share_1, 0 share_2, 0 share_3)
          RULES UPDATE
          (
            F1: share_1[*] = s[cv(p)] / s[parent1[cv(p)]],
            F2: share_2[*] = s[cv(p)] / s[parent2[cv(p)]],
            F3: share_3[*] = s[cv(p)] / s[parent3[cv(p)]]
          )

A hypothetical user indicates products of interest via a predicate on p in the outer query. We studied three algorithms (sub-query, extended-pushing, and formula-unfolding) for pushing predicates, varying the selectivity (fraction of rows selected) of the predicate.

Fig. 3. Pushing predicates.

As shown in Figure 3, we observed a 5 to 20 times improvement in query response time (serial execution) by pushing predicates, as compared to not pushing them at all.

2 APB benchmark specifications. Go online to http://www.olapcouncil.org/research/APB1R2 spec.pdf.
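The rules F1 through F3 simply divide each product's sales by the sales of its kth ancestor, with ancestors looked up in the reference spreadsheet built from product_dt. A minimal sketch (the toy data and Python names are ours, not the paper's):

```python
# Sketch of Q3's rules: within each (c, h, t) partition,
# share_k[p] = s[p] / s[parent_k[p]], parents taken from product_dt.
product_dt = {  # p -> (parent1, parent2, parent3), i.e., class, group, family
    "p1": ("class1", "group1", "family1"),
}
s = {"p1": 10.0, "class1": 50.0, "group1": 200.0, "family1": 400.0}

shares = {
    p: tuple(s[p] / s[parent] for parent in parents)
    for p, parents in product_dt.items()
}
print(shares["p1"])  # (share_1, share_2, share_3) for product p1
```

A predicate on p restricts which keys of product_dt need to be touched at all, which is exactly what pushing the predicate achieves inside the engine.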
In general, the improvement can be arbitrarily large. The extended-pushing and formula-unfolding algorithms performed almost identically, as expected, and their response times were predictable. The sub-query pushing algorithm offered a surprise, as its response-time curve was not smooth. For low selectivity of the predicates (up to 0.006), the optimizer chose a nested-loop join between the subquery and apb_cube (see the sub-query-nested-loop curve). This was not the optimal choice and caused linear degradation in performance, up to three times worse than the extended-pushing method. Beyond 0.006 selectivity, the optimizer chose the more efficient hash join. However, even when we forced the optimizer to always choose a hash join between the subquery and apb_cube (see the sub-query-forced-hash curve), the response time of the subquery method remained about 20% worse than that of extended-pushing over the entire range of investigated selectivity values.

8.2 Optimization of Aggregates Experiment

We evaluated the performance of transforming relative aggregates into their corresponding window aggregates, using an example query that computes a moving average over the past 100 months:

    SPREADSHEET PBY(h, c, p) DBY(t) MEA(s, 0 r)
    ( r[*] = avg(s)[CV() - 100 <= t <= CV() - 1] )

The aggregate above can be transformed to

    AVG(s) OVER (ORDER BY t RANGE BETWEEN 100 PRECEDING AND 1 PRECEDING)

Fig. 4. Optimization of aggregates.

The average aggregate operated within a partition based on channel, customer, and product. We kept the size of the input data constant but varied the number of months per partition. We compared the performance of the transformed formula (the solid line in Figure 4) to that of the untransformed formula, which used naive execution.
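The two evaluation strategies can be contrasted in a few lines of Python (an illustration under our own names, not the engine's code): the naive form re-scans the window for every cell, while the window-function rewrite allows a sliding evaluation that adds one value and drops one value per step.

```python
# Naive: re-aggregate the w values strictly before position i, per cell.
def naive(series, w):
    out = []
    for i in range(len(series)):
        window = series[max(0, i - w):i]
        out.append(sum(window) / len(window) if window else None)
    return out

# Sliding window: maintain a running sum, O(1) work per cell.
def windowed(series, w):
    out, running = [], 0.0
    for i, x in enumerate(series):
        out.append(running / min(i, w) if i > 0 else None)
        running += x               # value entering the window
        if i >= w:
            running -= series[i - w]  # value leaving the window
    return out

series = [float(v) for v in range(1, 7)]
naive_out, win_out = naive(series, 3), windowed(series, 3)
```

Both produce the same averages; the naive form does work proportional to the window size at every cell, which is the linear degradation observed in the experiment.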
The naive execution evaluated the aggregate as many times as the cardinality of the partition (the dashed line in Figure 4). As expected, the performance of untransformed aggregation degraded linearly with increasing partition cardinality.

8.3 Qualified Aggregates Experiment

We support two ways of computing aggregates over discrete dimensions: the set of cells to be aggregated can be explicitly enumerated, or it can be expressed as a condition on the dimension. The first formulation, called qualified aggregates, involves direct access to the spreadsheet cells; the second involves a scan of the partition. Here we show the tradeoffs between the two formulations.

Fig. 5. Aggregate using scan versus qualified aggregates.

Consider an average over N time periods. Using the qualified-aggregates formulation, the computation can be expressed as

    SPREADSHEET PBY(h, c, p) DBY(t) MEA(s, 0 r)
    ( r[1] = avg(s)[FOR t FROM CV() TO CV() + N] )

In the second formulation (which involves a scan), it can be expressed as

    SPREADSHEET PBY(h, c, p) DBY(t) MEA(s, 0 r)
    ( r[1] = avg(s)[CV() <= t <= CV() + N] )

The first formulation performed better when the number of cells accessed was a small fraction of the partition, as shown in Figure 5. We kept the size of the partition constant and varied N, which in the figure is expressed as a percentage of the partition. Observe that there was a significant range, up to 18% of the partition size, in which qualified aggregates outperformed aggregates computed with a scan. This shows the need for an optimization that, for discrete dimensions, automatically chooses the more efficient form of aggregate computation. We plan to address this in future work.

8.4 Hash Join Versus SQL Spreadsheet Experiment

Many SQL Spreadsheet operations can be expressed in standard ANSI SQL using joins and UNIONs.
For example, query Q3 discussed earlier can be expressed using three self-joins of apb_cube and a join to product_dt:

    SELECT s, a1.s/a2.s AS share_1, a1.s/a3.s AS share_2, a1.s/a4.s AS share_3,
           p, c, h, t
    FROM apb_cube a1, apb_cube a2 (+), apb_cube a3 (+), apb_cube a4 (+),
         product_dt p (+)
    WHERE a1.p = p.p
      AND a2.p = p.parent1 AND a2.c = a1.c AND a2.h = a1.h AND a2.t = a1.t
      AND a3.p = p.parent2 AND a3.c = a1.c AND a3.h = a1.h AND a3.t = a1.t
      AND a4.p = p.parent3 AND a4.c = a1.c AND a4.h = a1.h AND a4.t = a1.t

The number of self-joins equals the number of formulas (say N), and all joins to the original apb_cube (a1) are right outer joins (the right side of each outer join is marked with (+) in the FROM list). For hash joins, this requires construction of N hash tables, while our SQL Spreadsheet needs only one hash access structure per spreadsheet. Consequently, there is a breakeven point Ni at which the cost of the spreadsheet access structure is amortized and SQL Spreadsheet outperforms the ANSI hash-join formulation, as shown in Figure 6. For the above query, Ni is 3 (i.e., three rules). Above 14 rules, spreadsheet execution is twice as fast as execution using joins. In this experiment, joins and spreadsheets were processed serially, and the access structures for both fit in memory.

Fig. 6. Hash join versus SQL Spreadsheet, as a function of #rules.

Fig. 7. Scalability with number of formulas.

8.5 Access Method—Hash Table

We tested the scalability of our execution methods as a function of the number of formulas and of the memory available for the hash structure. Figure 7 shows almost linear scaling of spreadsheet response time with the number of formulas. Each formula came from query Q3, discussed earlier, and simulated a double join apb_cube ⋈ product_dt ⋈ apb_cube.
In the experiment, the physical memory was large enough to accommodate every individual partition of apb_cube, which in our case was a maximum of 15 MB—about 20% of the cube.

Figure 8 shows the performance of our access structure as a function of available memory. The memory size is expressed as a percentage of the size required to fit the largest partition of data in the hash access structure in physical memory.

Fig. 8. Scalability with size of physical memory.

Recall from Section 6 that we first partition the data on the PBY columns and process one partition at a time to execute the formulas. In the experiment, we executed a single formula, F1, from query Q3:

    F1: share_1[*] = s[cv(p)] / s[parent1[cv(p)]]

The formula accesses, within each PBY (c, h, t) partition, the sales for a product and its parent. If a partition does not fit in memory, we incur an I/O whenever a referenced cell is not cached. In a severe case of memory shortage, each reference may be a cache miss, reducing our access method to an uncached nested-loop join. For formula F1, which references a product and its parent, this occurs when the available memory is less than 30% of the largest partition (see Figure 8). Thus our method works very well and outperforms equivalent simulations of the formulas with joins (for hash, sort, and nested-loop join methods) when the PBY partitions fit in memory, since in those cases we reduce the number of required joins. Note that the equivalent simulations must perform apb_cube ⋈ product_dt ⋈ apb_cube, while with Spreadsheet we effectively build an access structure for only one join, apb_cube ⋈ product_dt. In extreme cases of memory shortage, we degrade to the performance of the nested-loop-join simulation. Observe that in these cases hash-join simulations would not perform better, as they would need to spill all of their data to disk.
9. CONCLUSIONS

This article extends SQL with a computational clause that allows us to treat a relation as a multidimensional array and to specify a set of formulas over it. The formulas replace the multiple joins and UNION operations that must be performed for equivalent computations in current ANSI SQL. This not only eases programming, but also offers RDBMSs an opportunity to perform better optimizations, as there are fewer complex query blocks to optimize—an Achilles heel of many RDBMSs. We also create a single runtime access structure, which replaces the multiple hash or sort structures needed for equivalent joins and UNIONs. Our intent is an eventual migration of certain classes of computations from classical spreadsheets into the RDBMS. Such a migration would offer an unprecedented integration of business models, which are currently distributed among thousands of incompatible and incomparable spreadsheets. In our model, the result of an SQL Spreadsheet is a relation with well-defined semantics, and it can easily be compared to other SQL Spreadsheets via joins, unions, and other relational operations. An SQL Spreadsheet can be stored in a relational view and, hence, become known to tools through the RDBMS catalog, thereby enhancing their cooperation.

ELECTRONIC APPENDIX

An electronic appendix explaining parallel execution of SQL Spreadsheets, with experimental results, is available in the ACM Digital Library.
Received November 2003; revised May 2004; accepted August 2004

TinyDB: An Acquisitional Query Processing System for Sensor Networks

SAMUEL R. MADDEN, Massachusetts Institute of Technology
MICHAEL J. FRANKLIN and JOSEPH M. HELLERSTEIN, University of California, Berkeley
WEI HONG, Intel Research

We discuss the design of an acquisitional query processor for data collection in sensor networks. Acquisitional issues are those that pertain to where, when, and how often data is physically acquired (sampled) and delivered to query processing operators.
By focusing on the locations and costs of acquiring data, we are able to significantly reduce power consumption over traditional passive systems that assume the a priori existence of data. We discuss simple extensions to SQL for controlling data acquisition, and show how acquisitional issues influence query optimization, dissemination, and execution. We evaluate these issues in the context of TinyDB, a distributed query processor for smart sensor devices, and show how acquisitional techniques can provide significant reductions in power consumption on our sensor devices.

Categories and Subject Descriptors: H.2.3 [Database Management]: Languages—Query languages; H.2.4 [Database Management]: Systems—Distributed databases; query processing

General Terms: Experimentation, Performance

Additional Key Words and Phrases: Query processing, sensor networks, data acquisition

Authors' addresses: S. R. Madden, Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Room 32-G938, 32 Vassar Street, Cambridge, MA 02139; email: madden@csail.mit.edu; M. J. Franklin and J. M. Hellerstein, Soda Hall, University of California, Berkeley, Berkeley, CA 94720; email: {franklin,jmh}@cs.berkeley.edu; W. Hong, Intel Research, 2150 Shattuck Avenue, Penthouse Suite, Berkeley, CA 94704; email: wei.hong@intel.com.

1. INTRODUCTION

In the past few years, smart sensor devices have matured to the point that it is now feasible to deploy large, distributed networks of such devices [Pottie and Kaiser 2000; Hill et al. 2000; Mainwaring et al. 2002; Cerpa et al. 2001]. Sensor networks are differentiated from other wireless, battery-powered environments in that they consist of tens or hundreds of autonomous nodes that operate without human interaction (e.g., configuration of network routes, recharging
of batteries, or tuning of parameters) for weeks or months at a time. Furthermore, sensor networks are often embedded into some (possibly remote) physical environment from which they must monitor and collect data. The long-term, low-power nature of sensor networks, coupled with their proximity to physical phenomena, leads to a significantly altered view of software systems compared to more traditional mobile or distributed environments.

In this article, we are concerned with query processing in sensor networks. Researchers have noted the benefits of a query-processor-like interface to sensor networks and the need for sensitivity to limited power and computational resources [Intanagonwiwat et al. 2000; Madden and Franklin 2002; Bonnet et al. 2001; Yao and Gehrke 2002; Madden et al. 2002a]. Prior systems, however, tend to view query processing in sensor networks simply as a power-constrained version of traditional query processing: given some set of data, they strive to process that data as energy-efficiently as possible. Typical strategies include minimizing expensive communication by applying aggregation and filtering operations inside the sensor network—strategies that are similar to push-down techniques from distributed query processing that emphasize moving queries to data.
In contrast, we advocate acquisitional query processing (ACQP), where we focus not only on traditional techniques but also on the significant new query processing opportunity that arises in sensor networks: the fact that smart sensors have control over where, when, and how often data is physically acquired (i.e., sampled) and delivered to query processing operators. By focusing on the locations and costs of acquiring data, we are able to significantly reduce power consumption compared to traditional passive systems that assume the a priori existence of data. Acquisitional issues arise at all levels of query processing: in query optimization, due to the significant costs of sampling sensors; in query dissemination, due to the physical colocation of sampling and processing; and, most importantly, in query execution, where choices of when to sample and which samples to process are made. We will see how techniques proposed in other research on sensor and power-constrained query processing, such as pushing down predicates and minimizing communication, are also important alongside ACQP and fit comfortably within its model.

We have designed and implemented a query processor for sensor networks that incorporates acquisitional techniques, called TinyDB (for more information on TinyDB, see the TinyDB Home Page [Madden et al. 2003]). TinyDB is a distributed query processor that runs on each of the nodes in a sensor network. TinyDB runs on the Berkeley mote platform, on top of the TinyOS [Hill et al. 2000] operating system. We chose this platform because the hardware is readily available from commercial sources1 and the operating system is relatively mature. TinyDB has many of the features of a traditional query processor (e.g., the ability to select, join, project, and aggregate data) but, as we will discuss in this article, also incorporates a number of other features designed to minimize power consumption via acquisitional techniques.
These techniques, taken in aggregate, can lead to orders-of-magnitude improvements in power consumption and increased accuracy of query results over nonacquisitional systems that do not actively control when and where data is collected.

1 Crossbow, Inc. Wireless sensor networks (Mica Motes). Go online to http://www.xbow.com/Products/Wireless_Sensor_Networks.htm.

We address a number of questions related to query processing on sensor networks, focusing in particular on ACQP issues such as the following:
(1) When should samples for a particular query be taken?
(2) What sensor nodes have data relevant to a particular query?
(3) In what order should samples for this query be taken, and how should sampling be interleaved with other operations?
(4) Is it worth expending computational power or bandwidth to process and relay a particular sample?

Of these issues, question (1) is uniquely acquisitional. We show how the remaining questions can be answered by adapting techniques similar to those found in traditional query processing. Notions of indexing and optimization, in particular, can be applied to answer questions (2) and (3), and question (4) bears some similarity to issues that arise in stream processing and approximate query answering. We will address each of these questions, noting the unusual kinds of indices, optimizations, and approximations that are required under the specific constraints posed by sensor networks.

Fig. 1. A query and results propagating through the network.

Figure 1 illustrates the basic architecture that we follow throughout this article: queries are submitted at a powered PC (the basestation), parsed, optimized, and sent into the sensor network, where they are disseminated and processed, with results flowing back up the routing tree that was formed as the queries propagated.
After a brief introduction to sensor networks in Section 2, the remainder of the article discusses each of these phases of ACQP: Section 3 covers our query language, Section 4 highlights optimization issues in power-sensitive environments, Section 5 discusses query dissemination, and, finally, Section 6 discusses our adaptive, power-sensitive model for query execution and result collection.

2. SENSOR NETWORK OVERVIEW

We begin with an overview of some recent sensor network deployments, and then discuss properties of sensor nodes and sensor networks in general, providing specific numbers from our experience with TinyOS motes where possible. A number of recent deployments of sensors have been undertaken by the sensor network research community for environmental monitoring purposes: on Great Duck Island [Mainwaring et al. 2002], off the coast of Maine; at James Reserve [Cerpa et al. 2001], in Southern California; at a vineyard in British Columbia [Brooke and Burrell 2003]; and in the Coastal Redwood Forests of California [Madden 2003]. In these scenarios, motes collect light, temperature, humidity, and other environmental properties. On Great Duck Island, during the summer of 2003, about 200 motes were placed in and around the burrows of Storm Petrels, a kind of endangered sea bird. Scientists used them to monitor burrow occupancy and the conditions surrounding burrows that are correlated with birds coming or going. Other notable deployments that are underway include a network for earthquake monitoring [UC Berkeley 2001] and a network for building infrastructure monitoring and control [Lin et al. 2002].2 Each of these scenarios involves a large number of devices that need to last as long as possible with little or no human intervention.
Placing new devices, or replacing or recharging the batteries of devices in bird nests, earthquake test sites, and heating and cooling ducts, is time consuming and expensive. Aside from the obvious advantages that a simple, declarative language provides over hand-coded, embedded C, researchers are particularly interested in TinyDB's ability to acquire and deliver desired data while conserving as much power as possible and satisfying desired lifetime goals. We have deployed TinyDB in the redwood monitoring project [Madden 2003] described above, and are in the process of deploying it in Intel fabrication plants to collect vibration signals that can be used for early detection of equipment failures. Early deployments have been quite successful, producing months of lifetime from tiny batteries with about one-half the capacity of a single AA cell.

2.1 Properties of Sensor Devices

A sensor node is a battery-powered, wireless computer. Typically, these nodes are physically small (a few cubic centimeters) and extremely low power (a few tens of milliwatts versus tens of watts for a typical laptop computer).3 Power is of utmost importance. If used naively, individual nodes will deplete their energy supplies in only a few days.4 In contrast, if sensor nodes are very spartan about power consumption, months or years of lifetime are possible.

2 Even in indoor infrastructure-monitoring settings, there is great interest in battery-powered devices, as running power wire can cost many dollars per device.
3 Recall that 1 W (a unit of power) corresponds to power consumption of 1 J (a unit of energy) per second. We sometimes refer to the current load of a device, because current is easy to measure directly; note that power (in watts) = current (in amps) * voltage (in volts), and that motes run at 3 V.
4 At full power, a Berkeley Mica mote (see Figure 2) draws about 15 mA of current. A pair of AA batteries provides approximately 2200 mAh of energy. Thus, the lifetime of a Mica2 mote will be approximately 2200/15 = 146 h, or 6 days.

Fig. 2. Annotated motes.
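The arithmetic in footnote 4 can be checked directly; the helper below is our own sketch, not TinyDB code:

```python
# Back-of-envelope lifetime model: battery capacity (mAh) divided by
# average current draw (mA) gives lifetime in hours.
def lifetime_hours(capacity_mah, draw_ma):
    return capacity_mah / draw_ma

full_power = lifetime_hours(2200, 15)  # about 146 h, i.e., roughly 6 days
# Naively scaling the draw by a 2% duty cycle ignores sleep-mode current,
# so it overestimates; the paper reports roughly 6 months at 2% duty cycle.
duty_cycled = lifetime_hours(2200, 15 * 0.02)
print(round(full_power, 1), round(duty_cycled / 24, 1))
```

The gap between the naive duty-cycle extrapolation and the observed 6-month lifetime is exactly why sleep-mode power matters in the analysis that follows.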
Mica motes, for example, when operating at a 2% duty cycle (between active and sleep modes), can achieve lifetimes in the 6-month range on a pair of AA batteries. This duty cycle limits the active time to 1.2 s/min.

There have been several generations of motes produced. Older Mica motes have a 4-MHz, 8-bit Atmel microprocessor. Their RFM TR10005 radios run at 40 kbits/s over a single shared CSMA/CA (carrier-sense multiple-access, collision-avoidance) channel. Newer Mica2 nodes use a 7-MHz processor and a radio from ChipCon6 that runs at 38.4 kbits/s. Radio messages are variable size; typically, about twenty 50-byte messages (the default size in TinyDB) can be delivered per second. Like all wireless radios (but unlike shared Ethernet, which uses the collision-detection (CD) variant of CSMA), both the RFM and ChipCon radios are half-duplex, which means that they cannot detect collisions because they cannot listen to their own traffic. Instead, they try to avoid collisions by listening to the channel before transmitting and backing off for a random time period when the channel is in use. A third mote, the Mica2Dot, has similar hardware to the Mica2 mote but uses a slower, 4-MHz processor. A Mica mote and a Mica2Dot mote are shown in Figure 2; Mica motes are visually very similar to Mica2 motes and have exactly the same form factor.

Motes have an external 32-kHz clock that the TinyOS operating system can use to synchronize with neighboring motes to approximately +/− 1 ms. Time synchronization is important in a variety of contexts, for example: to ensure that readings can be correlated, to schedule communication, or to coordinate the waking and sleeping of devices.

5 RFM Corporation. RFM TR1000 data sheet.
Go online to http://www.rfm.com/products/data tr1000.pdf.
6 ChipCon Corporation. CC1000 single-chip very low power RF transceiver data sheet. Go online to http://www.chipcon.com.

2.1.1 Power Consumption in Sensor Networks. Power consumption in sensor nodes can be roughly decomposed into phases, which we illustrate in Figure 3 via an annotated capture of an oscilloscope display showing current draw (which is proportional to power consumption) on a Mica mote running TinyDB.

Fig. 3. Phases of power consumption in TinyDB.

In "Snoozing" mode, where the node spends most of its time, the processor and radio are idle, waiting for a timer to expire or an external event to wake the device. When the device wakes, it enters the "Processing" mode, which consumes an order of magnitude more power than snooze mode, and in which query results are generated locally. The mote then switches to a "Processing and Receiving" mode, in which results are collected from neighbors over the radio. Finally, in the "Transmitting" mode, results for the query are delivered by the local mote; the noisy signal during this period reflects switching as the receiver goes off and the transmitter comes on, and then cycles back to a receiver-on, transmitter-off state.

These oscilloscope measurements do not distinguish how power is used during the active phase of processing. To explore this breakdown, we conducted an analytical study of the power utilization of the major elements of sensor network query processing; the results of this study are given in Appendix A. In short, we found that in a typical data collection scenario, with relatively power-hungry sensing hardware, about 41% of energy goes to communicating or running the CPU while communicating, with another 58% going to the sensors or to the CPU while sensing. The remaining 1% goes to idle-time energy consumption.
2.2 TinyOS

TinyOS consists of a set of components for managing and accessing the mote hardware, and a "C-like" programming language called nesC. TinyOS has been ported to a variety of hardware platforms, including UC Berkeley's Rene, Dot, Mica, Mica2, and Mica2Dot motes, the Blue Mote from Dust Inc.,7 and the MIT Cricket [Priyantha et al. 2000].

7 Dust Inc. Go online to the company's Web site. http://www.dust-inc.com.

The major features of TinyOS are the following:
(1) a suite of software designed to simplify access to the lowest levels of hardware in an energy-efficient and contention-free way, and
(2) a programming model and the nesC language, designed to promote extensibility and composition of software while maintaining a high degree of concurrency and energy efficiency; interested readers should refer to Gay et al. [2003].

It is interesting to note that TinyOS does not provide the traditional operating-system features of process isolation or scheduling (there is only one application running at a time), and does not have a kernel, protection domains, a memory manager, or multithreading. Indeed, in many ways, TinyOS is simply a library that provides a number of convenient software abstractions, including components to modulate packets over the radio, read sensor values from different sensor hardware, synchronize clocks between a sender and a receiver, and put the hardware into a low-power state.

Thus, TinyOS and nesC provide a useful set of abstractions on top of the bare hardware. Unfortunately, they do not make it particularly easy to author software for the kinds of data collection applications considered at the beginning of Section 2.
For example, the initial deployment of the Great Duck Island software, where the only behavior was to periodically broadcast readings from the same set of sensors over a single radio hop, consisted of more than 1000 lines of embedded C code, excluding any of the custom software components written to integrate the new kinds of sensing hardware used in the deployment. Features such as reconfigurability, in-network processing, and multihop routing, which are needed for long-term, energy-efficient deployments, would require thousands of lines of additional code. Sensor networks will never be widely adopted if every application requires this level of engineering effort. The declarative model we advocate reduces these applications to a few short statements in a simple language; the acquisitional techniques discussed allow these queries to be executed efficiently.8
2.3 Communication in Sensor Networks
Typical communication distances for low-power wireless radios such as those used in motes and Bluetooth devices range from a few feet to around 100 ft, depending on transmission power and environmental conditions. Such short ranges mean that almost all real deployments must make use of multihop communication, where intermediate nodes relay information for their peers. On Mica motes, all communication is broadcast. The operating system provides a software filter so that messages can be addressed to a particular node, though if neighbors are awake, they can still snoop on such messages (at no additional energy cost, since they have already transferred the decoded message from the air).
8 The implementation of TinyDB consists of about 20,000 lines of C code, approximately 10,000 of which are for the low-level drivers that acquire and condition readings from sensors, none of which the end-user is expected to modify or even look at. Compiled, this uses 58K of the 128K of available code space on current-generation motes.
Nodes receive per-message, link-level acknowledgments indicating whether a message was received by the intended neighbor node. No end-to-end acknowledgments are provided.
The requirement that sensor networks be low maintenance and easy to deploy means that communication topologies must be automatically discovered (i.e., ad hoc) by the devices rather than fixed at the time of network deployment. Typically, devices keep a short list of neighbors that they have heard transmit recently, as well as some routing information about the connectivity of those neighbors to the rest of the network. To assist in making intelligent routing decisions, nodes associate a link quality with each of their neighbors.
We describe the process of disseminating queries and collecting results in Section 5 below. As a basic primitive in these protocols, we use a routing tree that allows a basestation at the root of the network to disseminate a query and collect query results. This routing tree is formed by forwarding a routing request (a query in TinyDB) from every node in the network: the root sends a request, all child nodes that hear this request process it and forward it on to their children, and so on, until the entire network has heard the request. Each request contains a hop-count, or level, indicating the distance from the broadcaster to the root. To determine their own level, nodes pick a parent node that is (by definition) one level closer to the root than they are. This parent will be responsible for forwarding the node's (and its children's) query results to the basestation. We note that it is possible to have several routing trees if nodes keep track of multiple parents. This can be used to support several simultaneous queries with different roots. This type of communication topology is common within the sensor network community [Woo and Culler 2001].
3.
ACQUISITIONAL QUERY LANGUAGE
In this section, we introduce our query language for ACQP, focusing on issues related to when and how often samples are acquired. Appendix B gives a complete syntactic specification of the language; here, we rely primarily on example queries to illustrate the different language features.
3.1 Data Model
In TinyDB, sensor tuples belong to a table sensors which, logically, has one row per node per instant in time, with one column per attribute (e.g., light, temperature, etc.) that the device can produce. In the spirit of acquisitional processing, records in this table are materialized (i.e., acquired) only as needed to satisfy the query, and are usually stored only for a short period of time or delivered directly out of the network. Projections and/or transformations of tuples from the sensors table may be stored in materialization points (discussed below).
Although we impose the same schema on the data produced by every device in the network, we allow for the possibility of certain devices lacking certain physical sensors by allowing nodes to insert NULLs for attributes corresponding to missing sensors. Thus, devices missing sensors requested in a query will produce data for that query anyway, unless NULLs are explicitly filtered out in the WHERE clause.
Physically, the sensors table is partitioned across all of the devices in the network, with each device producing and storing its own readings. Thus, in TinyDB, to compare readings from different sensors, those readings must be collected at some common node, for example, the root of the network.
3.2 Basic Language Features
Queries in TinyDB, as in SQL, consist of SELECT-FROM-WHERE-GROUP BY clauses supporting selection, join, projection, and aggregation. The semantics of the SELECT, FROM, WHERE, and GROUP BY clauses are as in SQL.
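The NULL convention for missing sensors can be sketched as follows. This is our own illustration, not TinyDB code; the schema and function names are assumptions chosen for the example.

```python
# Sketch of the NULL convention described above: every node produces a
# tuple over the full schema, inserting None (NULL) for sensors it
# lacks, and a WHERE clause can filter such tuples out explicitly.
SCHEMA = ["nodeid", "light", "temp"]

def produce_tuple(nodeid, readings):
    """readings: attribute -> value for the sensors this node has."""
    return {attr: readings.get(attr) if attr != "nodeid" else nodeid
            for attr in SCHEMA}

# A node without a temperature sensor still answers the query:
t = produce_tuple(7, {"light": 300})          # temp becomes None (NULL)
kept = [t] if t["temp"] is not None else []   # WHERE temp IS NOT NULL
```

The node participates in the query regardless of its hardware; only an explicit NULL filter excludes it.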
The FROM clause may refer to both the sensors table as well as stored tables, which we call materialization points. Materialization points are created through special logging queries, which we describe below. They provide basic support for subqueries and windowed stream operations.
Tuples are produced at well-defined sample intervals that are a parameter of the query. The period of time between the start of each sample period is known as an epoch. Epochs provide a convenient mechanism for structuring computation to minimize power consumption. Consider the following query:

SELECT nodeid, light, temp
FROM sensors
SAMPLE PERIOD 1s FOR 10s

This query specifies that each device should report its own identifier (id), light, and temperature readings (contained in the virtual table sensors) once per second for 10 s. Results of this query stream to the root of the network in an online fashion, via the multihop topology, where they may be logged or output to the user. The output consists of a stream of tuples, clustered into 1-s time intervals. Each tuple includes a time stamp corresponding to the time it was produced.
Nodes initiate data collection at the beginning of each epoch, as specified in the SAMPLE PERIOD clause. Nodes in TinyDB run a simple time synchronization protocol to agree on a global time base that allows them to start and end each epoch at the same time.9
When a query is issued in TinyDB, it is assigned an id that is returned to the issuer. This identifier can be used to explicitly stop a query via a "STOP QUERY id" command. Alternatively, queries can be limited to run for a specific time period via a FOR clause (shown above), or can include a stopping condition as an event (see below).
Note that because the sensors table is an unbounded, continuous data stream of values, certain blocking operations (such as sort and symmetric join) are not allowed over such streams unless a bounded subset of the stream, or window, is specified.
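The epoch-driven evaluation of the query above can be sketched on a single node as follows. This is an illustrative simplification, not the TinyDB implementation; the function names, the stubbed sensor, and the idealized global time base are our assumptions.

```python
# Sketch: epoch-driven evaluation of "SELECT nodeid, light, temp
# FROM sensors SAMPLE PERIOD 1s FOR 10s" on one node. One tuple is
# acquired per epoch; sensors are read only when the epoch fires.
def run_query(nodeid, read_sensor, sample_period_s=1, for_s=10):
    tuples = []
    for epoch in range(for_s // sample_period_s):
        timestamp = epoch * sample_period_s  # idealized global time base
        tuples.append({
            "epoch": epoch, "time": timestamp, "nodeid": nodeid,
            "light": read_sensor("light"), "temp": read_sensor("temp"),
        })
    return tuples

# Stub sensor reader; on a real mote this would trigger an acquisition.
readings = {"light": 420, "temp": 25}
out = run_query(5, lambda attr: readings[attr])
```

Between epochs the device can return to its low-power snooze state, which is precisely why epochs help minimize power consumption.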
Windows in TinyDB are defined via materialization points over the sensor streams. Such materialization points accumulate a small buffer of data that may be used in other queries.
9 We use a time-synchronization protocol that is quite similar to the one described in work by Ganeriwal et al. [2003]; typical time-synchronization error in TinyDB is about 10 ms.
Consider, as an example:

CREATE STORAGE POINT recentlight SIZE 8
AS (SELECT nodeid, light
    FROM sensors
    SAMPLE PERIOD 10s)

This statement provides a local (i.e., single-node) location to store a streaming view of recent data, similar to materialization points in other streaming systems like Aurora, TelegraphCQ, or STREAM [Carney et al. 2002; Chandrasekaran et al. 2003; Motwani et al. 2003], or materialized views in conventional databases. Multiple queries may read a materialization point.
Joins are allowed between two storage points on the same node, or between a storage point and the sensors relation, in which case sensors is used as the outer relation in a nested-loops join. That is, when a sensors tuple arrives, it is joined with the tuples in the storage point at its time of arrival. This is effectively a landmark query [Gehrke et al. 2001] common in streaming systems. Consider, as an example:

SELECT COUNT(*)
FROM sensors AS s, recentLight AS rl
WHERE rl.nodeid = s.nodeid
  AND s.light < rl.light
SAMPLE PERIOD 10s

This query outputs a stream of counts indicating the number of recent light readings (from zero to eight samples in the past) that were brighter than the current reading. In the event that a storage point and an outer query deliver data at different rates, a simple rate-matching construct is provided that allows interpolation between successive samples (if the outer query is faster), via the LINEAR INTERPOLATE clause shown in Appendix B.
Alternatively, if the inner query is faster, the user may specify an aggregation function to combine multiple rows via the COMBINE clause shown in Appendix B.
3.3 Aggregation Queries
TinyDB also includes support for grouped aggregation queries. Aggregation has the attractive property that it reduces the quantity of data that must be transmitted through the network; other sensor network research has noted that aggregation is perhaps the most common operation in the domain [Intanagonwiwat et al. 2000; Yao and Gehrke 2002]. TinyDB includes a mechanism for user-defined aggregates and a metadata management system that supports optimizations over them, which we discuss in Section 4.1.
The basic approach of aggregate query processing in TinyDB is as follows: as data from an aggregation query flows up the tree, it is aggregated in-network according to the aggregation function and value-based partitioning specified in the query.
3.3.1 Aggregate Query Syntax and Semantics. Consider a user who wishes to monitor the occupancy of the conference rooms on a particular floor of a building. She chooses to do this by using microphone sensors attached to motes, and looking for rooms where the average volume is over some threshold (assuming that rooms can have multiple sensors). Her query could be expressed as:

SELECT AVG(volume), room
FROM sensors
WHERE floor = 6
GROUP BY room
HAVING AVG(volume) > threshold
SAMPLE PERIOD 30s

This query partitions motes on the sixth floor according to the room where they are located (which may be a hard-coded constant in each device, or may be determined via some localization component available to the devices). The query then reports all rooms where the average volume is over a specified threshold. Updates are delivered every 30 s. The query runs until the user deregisters it from the system.
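The per-epoch grouped semantics of the query above can be sketched as follows. This is our own illustration, not TinyDB code; the function name and tuple layout are assumptions, and the threshold value is arbitrary.

```python
# Sketch of one epoch of the grouped query above: filter by floor,
# group by room, compute AVG(volume) per group, and keep only groups
# whose average exceeds the threshold (the HAVING clause).
def occupied_rooms(tuples, threshold, floor=6):
    sums = {}                                  # room -> (SUM, COUNT)
    for t in tuples:
        if t["floor"] != floor:                # WHERE floor = 6
            continue
        s, c = sums.get(t["room"], (0, 0))
        sums[t["room"]] = (s + t["volume"], c + 1)
    return {room: s / c                        # HAVING AVG(volume) > thr
            for room, (s, c) in sums.items() if s / c > threshold}

epoch = [{"floor": 6, "room": "A", "volume": 80},
         {"floor": 6, "room": "A", "volume": 60},
         {"floor": 6, "room": "B", "volume": 10},
         {"floor": 5, "room": "C", "volume": 99}]
loud = occupied_rooms(epoch, threshold=50)     # {"A": 70.0}
```

A new result of this form is emitted every 30 s, one per qualifying group.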
As in our earlier discussion of TinyDB's query language, except for the SAMPLE PERIOD clause, the semantics of this statement are similar to SQL aggregate queries. Recall that the primary semantic difference between TinyDB queries and SQL queries is that the output of a TinyDB query is a stream of values, rather than a single aggregate value (or batched result). For these streaming queries, each aggregate record consists of one <group id, aggregate value> pair per group. Each group is time-stamped with an epoch number, and the readings used to compute an aggregate record all belong to the same epoch.
3.3.2 Structure of Aggregates. TinyDB structures aggregates similarly to shared-nothing parallel database engines (e.g., Bancilhon et al. [1987]; Dewitt et al. [1990]; Shatdal and Naughton [1995]). The approach used in such systems (and followed in TinyDB) is to implement an aggregate via three functions: a merging function f, an initializer i, and an evaluator e. In general, f has the following structure:

<z> = f(<x>, <y>),

where <x> and <y> are multivalued partial state records, computed over one or more sensor values, representing the intermediate state over those values that will be required to compute an aggregate. <z> is the partial state record resulting from the application of function f to <x> and <y>.
For example, if f is the merging function for AVERAGE, each partial state record will consist of a pair of values, SUM and COUNT, and f is specified as follows, given two state records <S1, C1> and <S2, C2>:

f(<S1, C1>, <S2, C2>) = <S1 + S2, C1 + C2>.

The initializer i is needed to specify how to instantiate a state record for a single sensor value; for an AVERAGE over a sensor value of x, the initializer i(x) returns the tuple <x, 1>. Finally, the evaluator e takes a partial state record and computes the actual value of the aggregate. For AVERAGE, the evaluator e(<S, C>) simply returns S/C.
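The AVERAGE triple above can be written down directly; the tree-folding helper is our own illustration of how partial state records merge as results flow up the routing tree, not TinyDB's actual in-network code.

```python
# The <i, f, e> triple for AVERAGE, as specified above.
def i(x):                 # initializer: one sensor value -> <SUM, COUNT>
    return (x, 1)

def f(a, b):              # merge: commutative and associative
    return (a[0] + b[0], a[1] + b[1])

def e(state):             # evaluator: <S, C> -> S / C
    s, c = state
    return s / c

def aggregate_up(node):
    """node = (local_reading, [child subtrees]); merge child partial
    state records into the local record as results flow up the tree."""
    reading, children = node
    state = i(reading)
    for child in children:
        state = f(state, aggregate_up(child))
    return state

# A 3-node routing tree: the root reads 10, its children read 20 and 30.
tree = (10, [(20, []), (30, [])])
avg = e(aggregate_up(tree))   # partial state (60, 3) -> 20.0
```

Because f is commutative and associative, the order in which child records arrive does not change the result, which is what makes in-network aggregation safe.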
These three functions can easily be derived for the basic SQL aggregates; in general, the only constraint is that the merging function be commutative and associative.
TinyDB includes a simple facility for allowing programmers to extend the system with new aggregates by authoring software modules that implement these three functions.
3.4 Temporal Aggregates
In addition to aggregates over values produced during the same sample interval (for example, as in the COUNT query above), users want to be able to perform temporal operations. For example, in a building monitoring system for conference rooms, users may detect occupancy by measuring maximum sound volume over time and reporting that volume periodically; for example, the query

SELECT WINAVG(volume, 30s, 5s)
FROM sensors
SAMPLE PERIOD 1s

will report the average volume over the last 30 s once every 5 s, sampling once per second. This is an example of a sliding-window query common in many streaming systems [Motwani et al. 2003; Chandrasekaran et al. 2003; Gehrke et al. 2001]. We note that the same semantics are available by running an aggregate query with SAMPLE PERIOD 5 s over a 30-s materialization point; temporal aggregates simply provide a more concise way of expressing these common operations.
3.5 Event-Based Queries
As a variation on the continuous, polling-based mechanisms for data acquisition, TinyDB supports events as a mechanism for initiating data collection. Events in TinyDB are generated explicitly, either by another query or by a lower-level part of the operating system (in which case the code that generates the event must have been compiled into the sensor node10).
For example, the query:

ON EVENT bird-detect(loc):
  SELECT AVG(light), AVG(temp), event.loc
  FROM sensors AS s
  WHERE dist(s.loc, event.loc) < 10m
  SAMPLE PERIOD 2 s FOR 30 s

could be used to report the average light and temperature level at sensors near a bird nest where a bird has just been detected. Every time a bird-detect event occurs, the query is issued from the detecting node and the average light and temperature are collected from nearby nodes once every 2 s for 30 s. In this case, we expect that bird-detection is done via some low-level operating system facility, e.g., a switch that is triggered when a bird enters its nest.
Such events are central in ACQP, as they allow the system to be dormant until some external condition occurs, instead of continually polling or blocking on an iterator waiting for some data to arrive. Since most microprocessors include external interrupt lines that can wake a sleeping device to begin processing, events can provide significant reductions in power consumption, as shown in Figure 4. This figure shows an oscilloscope plot of current draw from a device running an event-based query triggered by toggling a switch connected to an external interrupt line that causes the device to wake from sleep. Compare this to the plot at the bottom of Figure 4, which shows an event-based query triggered by a second query that polls for some condition to be true.
10 TinyDB provides a special API for generating events; it is described in the TinyOS/TinyDB distribution as a part of the TinySchema package. As far as TinyDB is concerned, this API allows TinyDB to treat OS-defined events as black boxes that occur at any time; for example, events may periodically sample sensors using low-level OS APIs (instead of TinyDB) to determine if some condition is true.
Fig. 4. External interrupt driven event-based query (top) versus polling driven event-based query (bottom).
Obviously, the situation in the top plot is vastly preferable, as much less energy is spent polling. TinyDB supports such externally triggered queries via events, and such support is integral to its ability to provide low-power processing.
Events can also serve as stopping conditions for queries. Appending a clause of the form

STOP ON EVENT(param) WHERE cond(param)

will stop a continuous query when the specified event arrives and the condition holds.
Besides the low-level API which can be used to allow software components to signal events (such as the bird-detect event above), queries may also signal events. For example, suppose we wanted to signal an event whenever the temperature went above some threshold; we can write the following query:

SELECT nodeid, temp
WHERE temp > thresh
OUTPUT ACTION SIGNAL hot(nodeid, temp)
SAMPLE PERIOD 10s

Clearly, we lose the power-saving advantages of having an event fired directly in response to a low-level interrupt, but we still retain the programmatic advantages of linking queries to the signaling of events. We describe the OUTPUT ACTION clause in more detail in Section 3.7 below.
In the current implementation of TinyDB, events are only signaled on the local node; we do not currently provide a fully distributed event propagation system. Note, however, that queries started in response to a local event may be disseminated to other nodes (as in the example above).
3.6 Lifetime-Based Queries
In lieu of an explicit SAMPLE PERIOD clause, users may request a specific query lifetime via a QUERY LIFETIME <x> clause, where <x> is a duration in days, weeks, or months. Specifying lifetime is a much more intuitive way for users to reason about power consumption.
Especially in environmental monitoring scenarios, scientific users are not particularly concerned with small adjustments to the sample rate, nor do they understand how such adjustments influence power consumption. Such users, however, are very concerned with the lifetime of the network executing the queries. Consider the query:

SELECT nodeid, accel
FROM sensors
LIFETIME 30 days

This query specifies that the network should run for at least 30 days, sampling light and acceleration sensors at a rate that is as quick as possible and still satisfies this goal.
To satisfy a lifetime clause, TinyDB performs lifetime estimation. The goal of lifetime estimation is to compute a sampling and transmission rate given a number of Joules of energy remaining. We begin by considering how a single node at the root of the sensor network can compute these rates, and then discuss how other nodes coordinate with the root to compute their delivery rates. For now, we also assume that sampling and delivery rates are the same. On a single node, these rates can be computed via a simple cost-based formula, taking into account the costs of accessing sensors, selectivities of operators, expected communication rates, and current battery voltage.11 We show below a lifetime computation for simple queries of the form:

SELECT a_1, ..., a_numSensors
FROM sensors
WHERE p
LIFETIME l hours

To simplify the equations in this example, we present a query with a single selection predicate that is applied after attributes have been acquired. The ordering of multiple predicates and the interleaving of sampling and selection are discussed in detail in Section 4. Table I shows the parameters we use in this computation (we do not show processor costs since they will be negligible for the simple selection predicates we support, and have been subsumed into the costs of sampling and delivering results).
The first step is to determine the available power ph per hour:

ph = crem / l.
11 Throughout this section, we will use battery voltage as a proxy for remaining battery capacity, as voltage is an easy quantity to measure.

Table I. Parameters Used in Lifetime Estimation

  Parameter  Description                          Units
  l          Query lifetime goal                  hours
  crem       Remaining battery capacity           Joules
  En         Energy to sample sensor n            Joules
  Etrans     Energy to transmit a single sample   Joules
  Ercv       Energy to receive a message          Joules
  σ          Selectivity of selection predicate
  C          # of children routing through node

Fig. 5. Predicted versus actual lifetime for a requested lifetime of 24 weeks (168 days).

We then need to compute the energy es to collect and transmit one sample, including the costs to forward data for its children:

es = Σ (n = 1 .. numSensors) En + (Ercv + Etrans) × C + Etrans × σ.

The energy for a sample is the cost to read all of the sensors at the node, plus the cost to receive results from children, plus the cost to transmit satisfying local and child results. Finally, we can compute the maximum transmission rate T (in samples per hour) as

T = ph / es.

To illustrate the effectiveness of this simple estimation, we inserted a lifetime-based query (SELECT voltage, light FROM sensors LIFETIME x) into a sensor (with a fresh pair of AA batteries) and asked it to run for 24 weeks, which resulted in a sample rate of 15.2 s per sample. We measured the voltage on the device nine times over 12 days. The first two readings were outside the range of the voltage detector on the mote (e.g., they read "1024", the maximum value), so they are not shown. Based on experiments with our test mote connected to a power supply, we expect it to stop functioning when its voltage reaches 350. Figure 5 shows the measured voltage at each point in time, with a linear fit of the data, versus the "expected voltage," which was computed using the cost
model above. The resulting linear fit of voltage is quite close to the expected voltage. The linear fit reaches V = 350 about 5 days after the expected voltage line.
Given that it is possible to estimate lifetime on a single node, we now discuss coordinating the transmission rate across all nodes in the routing tree. Since sensors need to sleep between relaying of samples, it is important that senders and receivers synchronize their wake cycles. To do this, we allow nodes to transmit only when their parents in the routing tree are awake and listening (which is usually the same time they are transmitting). By transitivity, this limits the maximum rate of the entire network to the transmission rate of the root of the routing tree. If a node must transmit slower than the root to meet the lifetime clause, it may transmit at an integral divisor of the root's rate.12 To propagate this rate through the network, each parent node (including the root) includes its transmission rate in queries that it forwards to its children.
The previous analysis left the user with no control over the sample rate, which could be a problem because some applications require the ability to monitor physical phenomena at a particular granularity. To remedy this, we allow an optional MIN SAMPLE RATE r clause to be supplied. If the computed sample rate for the specified lifetime is greater than this rate, sampling proceeds at the computed rate (since the alternative is expressible by replacing the LIFETIME clause with a SAMPLE PERIOD clause). Otherwise, sampling is fixed at a rate of r and the prior computation for transmission rate is done assuming a different rate for sampling and transmission.
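The single-node lifetime computation above can be sketched directly from the formulas for ph, es, and T. The constants below are hypothetical, chosen only to exercise the arithmetic; they are not measurements from the paper.

```python
# Sketch of single-node lifetime estimation: available power per hour,
# energy per sample (read all sensors, forward child results, transmit
# the local result with probability σ), and the resulting max rate T.
def max_sample_rate(c_rem, lifetime_h, e_sensors, e_trans, e_rcv,
                    selectivity, num_children):
    ph = c_rem / lifetime_h                    # available Joules/hour
    es = (sum(e_sensors)                       # Σ En: read every sensor
          + (e_rcv + e_trans) * num_children   # receive + forward children
          + e_trans * selectivity)             # transmit local result
    return ph / es                             # T, in samples per hour

# Hypothetical node: 20,000 J battery, 720 h (30-day) lifetime goal,
# two sensors at 0.5 mJ each, 10 mJ to transmit, 8 mJ to receive,
# selectivity 0.5, two children routing through this node.
rate = max_sample_rate(20_000, 720, [0.0005, 0.0005],
                       0.010, 0.008, 0.5, 2)
```

In practice the cost constants vary from node to node and over time, which is why TinyDB periodically reestimates them, as discussed at the end of this section.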
To provide the requested lifetime and sampling rate, the system may not be able to actually transmit all of the readings; it may be forced to combine (aggregate) or discard some samples. We discuss this situation (as well as other contexts where it may arise) in Section 6.3.
Finally, we note that since estimation of power consumption was done using simple selectivity estimation as well as cost constants that can vary from node to node (see Section 4.1) and parameters that vary over time (such as the number of children, C), we need to periodically reestimate power consumption. Section 6.4.1 discusses this runtime reestimation in more detail.
12 One possible optimization, which we do not explore, would involve selecting or reassigning the root to maximize the transmission rate.
3.7 Types of Queries in Sensor Networks
We conclude this section with a brief overview of some of the other types of queries supported by TinyDB.
— Monitoring queries: Queries that request the value of one or more attributes continuously and periodically, for example, reporting the temperature in bird nests every 30 s; these are similar to the queries shown above.
— Network health queries: Metaqueries over the network itself. Examples include selecting parents and neighbors in the network topology, or nodes with battery life less than some threshold. These queries are particularly important in sensor networks due to their dynamic and volatile nature. For example, the following query reports all sensors whose current battery voltage is less than k:

SELECT nodeid, voltage
WHERE voltage < k
FROM sensors
SAMPLE PERIOD 10 minutes

— Exploratory queries: One-shot queries examining the status of a particular node or set of nodes at a point in time. In lieu of the SAMPLE PERIOD clause, users may specify the keyword ONCE.
For example:

SELECT light, temp, volume
WHERE nodeid = 5
FROM sensors
ONCE

— Nested queries: Both events and materialization points provide a form of nested queries. The TinyDB language does not currently support SQL-style nested queries, because the semantics of such queries are somewhat ill-defined in a streaming environment: it is not clear when the outer query should be evaluated, given that the inner query may be a streaming query that continuously accumulates results. Queries over materialization points allow users to choose when the query is evaluated. Using the FOR clause, users can build a materialization point that contains a single buffer's worth of data, and can then run a query over that buffer, emulating the same effect as a nested query over a static inner relation. Of course, this approach eliminates the possibility of query-rewrite-based optimizations for nested queries [Pirahesh et al. 1992], potentially limiting query performance.
— Actuation queries: Users want to be able to take some physical action in response to a query. We include a special OUTPUT ACTION clause for this purpose. For example, users in building monitoring scenarios might want to turn on a fan in response to temperature rising above some level:

SELECT nodeid, temp
FROM sensors
WHERE temp > threshold
OUTPUT ACTION power-on(nodeid)
SAMPLE PERIOD 10s

The OUTPUT ACTION clause specifies an external command that should be invoked in response to a tuple satisfying the query. In this case, the power-on command is a low-level piece of code that pulls an output pin on the microprocessor high, closing a relay circuit and giving power to some externally connected device. Note that a separate query could be issued to power-off the fan when the temperature fell below some other threshold. The OUTPUT ACTION clause suppresses the delivery of messages to the basestation.
— Offline delivery: There are times when users want to log some phenomenon that happens faster than the data can be transmitted over the radio. TinyDB supports the logging of results to EEPROM for offline, non-real-time delivery. This is implemented through the materialization point mechanism described above.
Together, these query types provide users of TinyDB with the mechanisms they need to build data collection applications on top of sensor networks.
4. POWER-BASED QUERY OPTIMIZATION
Given our query language for ACQP environments, with special features for event-based processing and lifetime queries, we now turn to query processing issues. We begin with a discussion of optimization, and then cover query dissemination and execution.
We note that, based on the applications deployed so far, single-table queries with aggregations seem to be the most pressing workload for sensor networks, and hence we focus primarily in this section on optimizations for acquisition, selection, and aggregation.
Queries in TinyDB are parsed at the basestation and disseminated in a simple binary format into the sensor network, where they are instantiated and executed. Before queries are disseminated, the basestation performs a simple query optimization phase to choose the correct ordering of sampling, selections, and joins.
We use a simple cost-based optimizer to choose a query plan that will yield the lowest overall power consumption. Optimizing for power allows us to subsume issues of processing cost and radio communication, which both contribute to power consumption and so will be taken into account.
One of the most interesting aspects of power-based optimization, and a key theme of acquisitional query processing, is that the cost of a particular plan is often dominated by the cost of sampling the physical sensors and transmitting query results, rather than the cost of applying individual operators. For this reason, we focus in this section on optimizations that reduce the number and costs of data acquisitions. We begin by looking at the types of metadata stored by the optimizer. Our optimizer focuses on ordering joins, selections, and sampling operations that run on individual nodes.
4.1 Metadata Management
Each node in TinyDB maintains a catalog of metadata that describes its local attributes, events, and user-defined functions. This metadata is periodically copied to the root of the network for use by the optimizer. Metadata is registered with the system via static linking done at compile time using the TinyOS C-like programming language. Events and attributes pertaining to various operating system and TinyDB components are made available to queries by declaring them in an interface file and providing a small handler function. For example, in order to expose network topology to the query processor, the TinyOS Network component defines the attribute parent of type integer and registers a handler that returns the id of the node's parent in the current routing tree.
Event metadata consists of a name, a signature, and a frequency estimate that is used in query optimization (see Section 4.3 below). User-defined predicates also have a name and a signature, along with a selectivity estimate which is provided by the author of the function.
Table II summarizes the metadata associated with each attribute, along with a brief description. Attribute metadata is used primarily in two contexts:
Table II.
Metadata Fields Kept with Each Attribute

  Metadata        Description
  Power           Cost to sample this attribute (in J)
  Sample time     Time to sample this attribute (in s)
  Constant?       Is this attribute constant-valued (e.g., id)?
  Rate of change  How fast the attribute changes (units/s)
  Range           Dynamic range of attribute values (pair of units)

Table III. Summary of Power Requirements of Various Sensors Available for Motes

  Sensor                                  Time per      Startup    Current  Energy per
                                          Sample (ms)   Time (ms)  (mA)     Sample (mJ)
  Weather board sensors
    Solar radiation [TAOS, Inc. 2002]     500           800        0.350    0.525
    Barometric pressure [Intersema 2002]  35            35         0.025    0.003
    Humidity [Sensirion 2002]             333           11         0.500    0.5
    Surface temp. [Melexis, Inc. 2002]    0.333         2          5.6      0.0056
    Ambient temp. [Melexis, Inc. 2002]    0.333         2          5.6      0.0056
  Standard mica mote sensors
    Accelerometer(a)                      0.9           17         0.6      0.0048
    (Passive) Thermistor(b)               0.9           0          0.033    0.00009
    Magnetometer(c) [Honeywell, Inc.]     0.9           17         5        0.2595
  Other sensors
    Organic byproducts                    0.9           >1000      5        >5

(a) Analog Devices, Inc. Adxl202e: Low-cost 2 g dual-axis accelerometer. Tech. rep. Go online to http://products.analog.com/products/info.asp?product=ADXL202.
(b) Atmel Corporation. Atmel ATMega 128 Microcontroller datasheet. Go online to http://www.atmel.com/atmel/acrobat/doc2467.pdf.
(c) Honeywell, Inc. Magnetic Sensor Specs HMC1002. Tech. rep. Go online to http://www.ssec.honeywell.com/magnetic/spec_sheets/specs_1002.html.

information about the cost, time to fetch, and range of an attribute is used in query optimization, while information about the semantic properties of attributes is used in query dissemination and result processing. Table III gives examples of power and sample time values for some actual sensors; notice that the power consumption and time to sample can differ across sensors by several orders of magnitude. The catalog also contains metadata about TinyDB's extensible aggregate system.
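The Energy per Sample column of Table III can be roughly reproduced from the Current and Time per Sample columns. The sketch below assumes a 3 V battery supply; the supply voltage is an assumption of ours, not stated in the table, and the check ignores startup energy (which dominates for sensors with long startup times, such as the accelerometer and magnetometer).

```python
# Rough check of Table III: energy (mJ) ~ current (mA) x sample time (ms)
# x supply voltage (V) / 1000. A 3 V supply is assumed (not given in the
# table); startup energy is ignored.
V_SUPPLY = 3.0

def energy_mj(current_ma, sample_ms):
    return current_ma * sample_ms * V_SUPPLY / 1000

print(energy_mj(0.350, 500))   # solar radiation: ~0.525 mJ
print(energy_mj(0.500, 333))   # humidity: ~0.5 mJ
print(energy_mj(0.025, 35))    # barometric pressure: ~0.003 mJ
```

The weather-board rows match closely under this assumption; the mica mote sensor rows do not, because their listed energies fold in part of the startup cost.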
As with other extensible database systems [Stonebraker and Kemnitz 1991], the catalog includes names of aggregates and pointers to their code. Each aggregate consists of a triplet of functions that initialize, merge, and update the final value of partial aggregate records as they flow through the system. As in the TAG [Madden et al. 2002a] article, aggregate authors must provide information about functional properties. In TinyDB, we currently require two: whether the aggregate is monotonic and whether it is exemplary or summary. COUNT is a monotonic aggregate, as its value can only get larger as more values are aggregated. MIN is an exemplary aggregate, as it returns a single value from the set of aggregate values, while AVERAGE is a summary aggregate because it computes some property over the entire set of values.

TinyDB also stores metadata information about the costs of processing and delivering data, which is used in query-lifetime estimation. The costs of these phases in TinyDB were shown in Figure 3; they range from 2 mA while sleeping to over 20 mA while transmitting and processing. Note that actual costs vary from mote to mote; for example, with a small sample of five motes (using the same batteries), we found that the average current with the processor active varied from 13.9 to 17.6 mA (with the average being 15.66 mA).

4.2 Technique 1: Ordering of Sampling and Predicates

Having described the metadata maintained by TinyDB, we now describe how it is used in query optimization. As shown in Section 2, sampling is often an expensive operation in terms of power. However, a sample from a sensor s must be taken to evaluate any predicate over the attribute sensors.s.
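The initialize/merge/finalize triplet described above can be sketched as follows; TinyDB's real interface is TinyOS C, so the names and types here are illustrative only.

```python
# Illustrative sketch of the triplet of functions behind a partial
# aggregate (AVERAGE here). TinyDB's actual API is in TinyOS C; these
# function names are hypothetical.

def avg_init(value):
    # Partial state for AVERAGE is a (sum, count) pair.
    return (value, 1)

def avg_merge(a, b):
    # Combine two partial states as records flow up the routing tree.
    return (a[0] + b[0], a[1] + b[1])

def avg_finalize(state):
    # Produce the final value from the merged partial state.
    return state[0] / state[1]

readings = [10, 20, 30, 40]
state = avg_init(readings[0])
for r in readings[1:]:
    state = avg_merge(state, avg_init(r))
print(avg_finalize(state))  # 25.0
```

AVERAGE is a summary aggregate in the paper's terminology: the final value depends on the whole set, which is why the (sum, count) state must be carried rather than a single exemplary value.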
If a predicate discards a tuple of the sensors table, then subsequent predicates need not examine the tuple, and hence the expense of sampling any attributes referenced in those subsequent predicates can be avoided. Thus these predicates are "expensive," and need to be ordered carefully. The predicate ordering problem here is somewhat different than in the earlier literature (e.g., Hellerstein [1998]) because (a) an attribute may be referenced in multiple predicates, and (b) expensive predicates are only on a single table, sensors. The first point introduces some subtlety, as it is not clear which predicate should be "charged" the cost of the sample.

To model this issue, we treat the sampling of a sensor t as a separate "job" τ to be scheduled along with the predicates. Hence a set of predicates P = {p1, ..., pm} is rewritten as a set of operations S = {s1, ..., sn}, where P ⊂ S, and S − P = {τ1, ..., τn−m} contains one sampling operator for each distinct attribute referenced in P. The selectivity of sampling operators is always 1. The selectivity of selection operators is derived by assuming that attributes have a uniform distribution over their range (which is available in the catalog).13 Relaxing this assumption by, for example, storing histograms or time-dependent functions per attribute remains an area of future work. The cost of an operator (predicate or sample) can be determined by consulting the metadata, as described in the previous section. In the cases we discuss here, selections and joins are essentially "free" compared to sampling, but this is not a requirement of our technique.

We also introduce a partial order on S, where τi must precede pj if pj references the attribute sampled by τi. The combination of sampling operators and the dependency of predicates on samples captures the costs of sampling operators and the sharing of operators across predicates.
The partial order induced on S forms a graph with edges from sampling operators to predicates. This is a simple series-parallel graph. An optimal ordering of jobs with series-parallel constraints is a topic treated in the Operations Research literature that inspired earlier optimization work [Ibaraki and Kameda 1984; Krishnamurthy et al. 1986; Hellerstein 1998]; Monma and Sidney [1979] presented the series-parallel algorithm using parallel chains, which gives an optimal ordering of the jobs in O(|S| log |S|) time.

Besides predicates in the WHERE clause, expensive sampling operators must also be ordered appropriately with respect to the SELECT, GROUP BY, and HAVING clauses. As with selection predicates, we enhance the partial order such that τi must precede any aggregation, GROUP BY, or HAVING operator that uses the attribute i sampled by τi. Note that projections do not require access to the value of i, and thus do not need to be included in the partial order. Thus, the complete partial order is as follows:

(1) acquisition of attribute a ≺ any operator that references a,
(2) selection ≺ aggregation, GROUP BY, and HAVING,
(3) GROUP BY ≺ aggregation and HAVING,
(4) aggregation ≺ HAVING.

Of course, the last three rules are also present in standard SQL. We also need to add the operators representing these clauses to S with the appropriate costs and selectivities; the process of estimating these values has been well studied in the database query optimization and cost estimation literature.

13 Scientists are particularly interested in monitoring the micro-climates created by plants and their biological processes. See Delin and Jackson [2000] and Cerpa et al. [2001]. An example of such a sensor is Figaro Inc.'s H2S sensor (Figaro, Inc. Tgs-825: special sensor for hydrogen sulfide. Tech. rep. Go online to www.figarosensor.com).
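The cost model underlying this ordering problem can be sketched compactly: an operator only runs on tuples that survived all earlier predicates, so its cost is scaled by the product of preceding selectivities, subject to the sampling-before-predicate constraints. The sketch below brute-forces the optimum for a tiny instance (costs from Table III, selectivities illustrative); TinyDB's actual optimizer uses the series-parallel chain algorithm rather than enumeration.

```python
from itertools import permutations

# Exhaustive search for the cheapest valid ordering of sampling jobs and
# predicates. Sampling costs (mJ) are from Table III; the predicate
# selectivities are illustrative, not from a real catalog.
ops = {
    "tau_mag":   {"cost": 0.2595, "sel": 1.0, "after": []},            # sample mag
    "tau_accel": {"cost": 0.0048, "sel": 1.0, "after": []},            # sample accel
    "p_mag":     {"cost": 0.0, "sel": 0.5, "after": ["tau_mag"]},      # mag > c2
    "p_accel":   {"cost": 0.0, "sel": 0.1, "after": ["tau_accel"]},    # accel > c1
}

def expected_cost(order):
    # Each operator's cost is scaled by the fraction of tuples that
    # survive the predicates placed before it.
    total, frac = 0.0, 1.0
    for name in order:
        total += frac * ops[name]["cost"]
        frac *= ops[name]["sel"]
    return total

def valid(order):
    # A sampling job must precede every predicate that references it.
    return all(order.index(d) < order.index(n)
               for n in order for d in ops[n]["after"])

best = min((o for o in permutations(ops) if valid(o)), key=expected_cost)
print(best, expected_cost(best))
```

The optimum samples the cheap accelerometer first and applies its selective predicate before paying for the magnetometer, matching the intuition developed in the example that follows.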
As an example of this process, consider the query

SELECT accel,mag
FROM sensors
WHERE accel > c1 AND mag > c2
SAMPLE PERIOD .1s

The order of magnitude difference in per-sample costs shown in Table III for the accelerometer and magnetometer suggests that the power costs of plans for this query having different sampling and selection orders will vary substantially. We consider three possible plans: in the first, the magnetometer and accelerometer are sampled before either selection is applied. In the second, the magnetometer is sampled and the selection over its reading (which we call Smag) is applied before the accelerometer is sampled or filtered. In the third plan, the accelerometer is sampled first and its selection (Saccel) is applied before the magnetometer is sampled.

This interleaving of sampling and processing introduces an additional issue with temporal semantics: in this case, for example, the magnetometer and accelerometer samples are not acquired at the same time. This may be problematic for some queries, for example, if one is trying to temporally correlate high-frequency portions of signals from these two sensors. To address this concern, we include in our language specification a NO INTERLEAVE clause, which forces all sensors to be turned on and sampled simultaneously at the beginning of each epoch (obviating the benefit of the acquisitional techniques discussed in this section). We note that this clause may not lead to perfect synchronization of sampling, as different sensors take different amounts of time to power up and acquire readings, but will substantially improve temporal synchronization.

Figure 6 shows the relative power costs of the latter two approaches, in terms of power costs to sample the sensors (we assume the CPU cost is the same for the two plans, so do not include it in our cost ratios) for different selectivity
factors of the two selection predicates Saccel and Smag. The selectivities of these two predicates are shown on the x and y axes, respectively. Regions of the graph are shaded corresponding to the ratio of costs between the plan where the magnetometer is sampled first (mag-first) versus the plan where the accelerometer is sampled first (accel-first). As expected, these results show that the mag-first plan is almost always more expensive than accel-first. In fact, it can be an order of magnitude more expensive, when Saccel is much more selective than Smag. When Smag is highly selective, however, it can be cheaper to sample the magnetometer first, although only by a small factor.

Fig. 6. Ratio of costs of two acquisitional plans over differing-cost sensors.

The maximum difference in relative costs represents an absolute difference of 255 µJ per sample, or 2.5 mW at a sample rate of 10 samples per second, putting the additional power consumption from sampling in the incorrect order on par with the power costs of running the radio or CPU for an entire second.

4.2.1 Exemplary Aggregate Pushdown. There are certain kinds of aggregate functions where the same kind of interleaving of sampling and processing can also lead to a performance savings. Consider the query

SELECT WINMAX(light,8s,8s)
FROM sensors
WHERE mag > x
SAMPLE PERIOD 1s

In this query, the maximum of 8 s worth of light readings will be computed, but only light readings from sensors whose magnetometers read greater than x will be considered. Interestingly, it turns out that, unless the mag > x predicate is very selective, it will be cheaper to evaluate this query by checking to see if each new light reading is greater than the previous reading and then applying the selection predicate over mag, rather than first sampling mag.
This sort of reordering, which we call exemplary aggregate pushdown, can be applied to any exemplary aggregate (e.g., MIN, MAX). Similar ideas have been explored in the deductive database community by Sudarshan and Ramakrishnan [1991].

The same technique can be used with nonwindowed aggregates when performing in-network aggregation. Suppose we are applying an exemplary aggregate at an intermediate node in the routing tree; if there is an expensive acquisition required to evaluate a predicate (as in the query above), then it may make sense to see if the local value affects the value of the aggregate before acquiring the attribute used in the predicate.

To add support for exemplary aggregate pushdown, we need a way to evaluate the selectivity of exemplary aggregates. In the absence of statistics that reflect how a predicate changes over time, we simply assume that the attributes involved in an exemplary aggregate (such as light in the query above) are sampled from the same distribution. Thus, for MIN and MAX aggregates, the likelihood that the second of two samples is less than (or greater than) the first is 0.5. For n samples, the likelihood that the nth is the value reported by the aggregate is thus 1/2^(n−1). By the same reasoning, for bottom (or top)-k aggregates, assuming k < n, the nth sample will be reported with probability 1/2^(n−k−1).

Given this selectivity estimate for an exemplary aggregate, S(a), over attribute a with acquisition cost C(a), we can compute the benefit of exemplary aggregate pushdown. We assume the query contains some set of conjunctive predicates with aggregate selectivity P over several expensive acquisitional attributes with aggregate acquisition cost K. We assume the values of S(a), C(a), K, and P are available in the catalog. Then, the cost of evaluating the query without exemplary aggregate pushdown is

K + P ∗ C(a)    (1)

and with pushdown it becomes

C(a) + S(a) ∗ K.
(2)

When (2) is less than (1), there will be an expected benefit to exemplary aggregate pushdown, and it should be applied.

4.3 Technique 2: Event Query Batching to Conserve Power

As a second example of the benefit of power-aware optimization, we consider the optimization of the query

ON EVENT e(nodeid)
SELECT a1
FROM sensors AS s
WHERE s.nodeid = e.nodeid
SAMPLE PERIOD d FOR k

This query will cause an instance of the internal query (SELECT ...) to be started every time the event e occurs. The internal query samples results every d seconds for a duration of k seconds, at which point it stops running.

Fig. 7. The cost of processing event-based queries as asynchronous events versus joins.

Note that, according to this specification of how an ON EVENT query is processed, it is possible for multiple instances of the internal query to be running at the same time. If enough such queries are running simultaneously, the benefit of event-based queries (e.g., not having to poll for results) will be outweighed by the fact that each instance of the query consumes significant energy sampling and delivering (independent) results. To alleviate the burden of running multiple copies of the same identical query, we employ a multiquery optimization technique based on rewriting. To do this, we convert external events (of type e) into a stream of events, and rewrite the entire set of independent internal queries as a sliding window join between events and sensors, with a window size of k seconds on the event stream, and no window on the sensor stream. For example:

SELECT s.a1
FROM sensors AS s, events AS e
WHERE s.nodeid = e.nodeid
  AND e.type = e
  AND s.time - e.time <= k
  AND s.time > e.time
SAMPLE PERIOD d

We execute this query by treating it as a join between a materialization point of size k on events and the sensors stream.
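The materialization-point join can be sketched in a few lines; the sketch below is in Python rather than TinyDB's C runtime, and the field names simply follow the rewritten query above.

```python
# Sketch of the sliding-window join for the rewritten event query: a
# buffer ("materialization point") of events no older than k seconds is
# joined against each arriving sensor tuple.

K = 5.0  # window size in seconds (the FOR clause duration); illustrative

event_buffer = []  # list of (nodeid, time) event tuples

def on_event(nodeid, time):
    event_buffer.append((nodeid, time))

def on_sensor_tuple(nodeid, time, a1):
    # Drop events older than k seconds, then join s with the remainder.
    event_buffer[:] = [e for e in event_buffer if time - e[1] <= K]
    return [a1 for (e_node, e_time) in event_buffer
            if e_node == nodeid and time > e_time]

on_event(nodeid=1, time=0.0)
print(on_sensor_tuple(nodeid=1, time=1.0, a1=42))  # event in window: [42]
print(on_sensor_tuple(nodeid=1, time=6.0, a1=43))  # event expired: []
```

Only one such join runs regardless of how often e fires, which is exactly the property the rewrite is after.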
When an event tuple arrives, it is added to the buffer of events. When a sensor tuple s arrives, events older than k seconds are dropped from the buffer and s is joined with the remaining events. The advantage of this approach is that only one query runs at a time no matter how frequently the events of type e are triggered. This offers a large potential savings in sampling and transmission cost. At first it might seem as though requiring the sensors to be sampled every d seconds irrespective of the contents of the event buffer would be prohibitively expensive. However, the check to see if the event buffer is empty can be pushed before the sampling of the sensors, and can be done relatively quickly.

Figure 7 shows the power tradeoff for event-based queries that have and have not been rewritten. Rewritten queries are labeled as stream join and nonrewritten queries as async events. We measure the cost in mW of the two approaches using a numerical model of power costs for idling, sampling, and processing (including the cost to check if the event queue is nonempty in the event-join case), but excluding transmission costs to avoid complications of modeling differences in cardinalities between the two approaches. The expectation was that the asynchronous approach would generally transmit many more results. We varied the sample rate and duration of the inner query, and the frequency of events. We chose the specific parameters in this plot to demonstrate query optimization tradeoffs; for much faster or slower event rates, one approach tends to always be preferable. In this case, the stream-join rewrite is beneficial when events occur frequently; this might be the case if, for example, an event is triggered whenever a signal goes above or below a threshold with a signal that is sampled tens or hundreds of times per second; vibration monitoring applications tend to have this kind of behavior. Table IV summarizes the parameters used in this experiment; "derived" values are computed by the model below.

Table IV. Parameters Used in Asynchronous Events Versus Stream-Join Study

  Parameter     Description                                              Value
  tsample       Length of sample period                                  1/8 s
  nevents       Number of events per second                              0−5 (x axis)
  durevent      Time for which events are active (FOR clause)            1, 3, or 5 s
  mWproc        Processor power consumption                              12 mW
  mssample      Time to acquire a sample, including processing and ADC   0.35 ms
  mWsample      Power used while sampling, including processor           13 mW
  mJsample      Energy per sample                                        Derived
  mWidle        Milliwatts used while idling                             Derived
  tidle         Time spent idling per sample period (in seconds)         Derived
  mJidle        Energy spent idling                                      Derived
  mscheck       Time to check for enqueued event                         0.02 ms (80 instrs)
  mJcheck       Energy to check if an event has been enqueued            Derived
  mWevents      Total power used in asynchronous event mode              Derived
  mWstreamJoin  Total power used in stream-join mode                     Derived
Power consumption numbers and sensor timings are drawn from Table III and the Atmel 128 data sheet (see the Atmel Corporation reference cited in the footnotes to Table III). The cost in milliwatts of the asynchronous events approach, mWevents, is modeled via the following equations:

tidle = tsample − nevents × durevent × mssample/1000,
mJidle = mWidle × tidle,
mJsample = mWsample × mssample/1000,
mWevents = (nevents × durevent × mJsample + mJidle)/tsample.

The cost in milliwatts of the stream-join approach, mWstreamJoin, is then

tidle = tsample − (mscheck + mssample)/1000,
mJidle = mWidle × tidle,
mJcheck = mWproc × mscheck/1000,
mJsample = mWsample × mssample/1000,
mWstreamJoin = (mJcheck + mJsample + mJidle)/tsample.

For very low event rates (fewer than one per second), the asynchronous events approach is sometimes preferable due to the extra overhead of empty-checks on the event queue in the stream-join case. However, for faster event rates, the power cost of this approach increases rapidly as independent samples are acquired for each event every few seconds. Increasing the duration of the inner query increases the cost of the asynchronous approach as more queries will be running simultaneously. The maximum absolute difference (of about 0.8 mW) is roughly comparable to one-quarter the power cost of the CPU or radio.

Finally, we note that there is a subtle semantic change introduced by this rewriting. The initial formulation of the query caused samples in each of the internal queries to be produced relative to the time that the event fired: for example, if event e1 fired at time t, samples would appear at time t + d, t + 2d, .... If a later event e2 fired at time t + i, it would produce a different set of samples at time t + i + d, t + i + 2d, ....
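These equations translate directly into code. In the sketch below, mWidle is a derived quantity not given numerically in Table IV, so an assumed value of 3.0 mW stands in for it; the crossover behavior (rare events favor async, frequent events favor the rewrite) holds for any reasonable positive idle power.

```python
# Sketch of the Table IV cost model. MW_IDLE is an assumed value; the
# paper derives it from the Atmel data sheet rather than listing it.
T_SAMPLE = 1.0 / 8   # s, sample period
MW_PROC = 12.0       # mW, processor power
MS_SAMPLE = 0.35     # ms, time to acquire one sample
MW_SAMPLE = 13.0     # mW, power while sampling
MS_CHECK = 0.02      # ms, time to check for an enqueued event
MW_IDLE = 3.0        # mW, idle power (assumed)

def mw_events(n_events, dur_event):
    # Asynchronous events: one sample per active query instance per
    # sample period; n_events/s x dur_event s queries are active.
    active = n_events * dur_event
    t_idle = T_SAMPLE - active * MS_SAMPLE / 1000
    mj_idle = MW_IDLE * t_idle
    mj_sample = MW_SAMPLE * MS_SAMPLE / 1000
    return (active * mj_sample + mj_idle) / T_SAMPLE

def mw_stream_join():
    # Stream join: one empty-check plus one sample per period,
    # independent of the event rate.
    t_idle = T_SAMPLE - (MS_CHECK + MS_SAMPLE) / 1000
    mj_idle = MW_IDLE * t_idle
    mj_check = MW_PROC * MS_CHECK / 1000
    mj_sample = MW_SAMPLE * MS_SAMPLE / 1000
    return (mj_check + mj_sample + mj_idle) / T_SAMPLE

print(mw_events(5, 3), mw_stream_join())    # frequent events: join wins
print(mw_events(0.1, 1), mw_stream_join())  # rare events: async wins
```

The model reproduces the qualitative shape of Figure 7: the stream-join cost is flat in the event rate, while the asynchronous cost grows with nevents × durevent.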
Thus, unless i were equal to d (i.e., the events were in phase), samples for the two queries would be offset from each other by up to d seconds. In the rewritten version of the query, there is only one stream of sensor tuples which is shared by all events.

In many cases, users may not care that tuples are out of phase with events. In some situations, however, phase may be very important. In such situations, one way the system could improve the phase accuracy of samples while still rewriting multiple event queries into a single join is via oversampling, or acquiring some number of (additional) samples every d seconds. The increased phase accuracy of oversampling comes at an increased cost of acquiring additional samples (which may still be less than running multiple queries simultaneously). For now, we simply allow the user to specify that a query must be phase-aligned by specifying ON ALIGNED EVENT in the event clause.

Thus, we have shown that there are several interesting optimization issues in ACQP systems: first, the system must properly order sampling, selection, and aggregation to be truly low power. Second, for frequent event-based queries, rewriting them as a join between an event stream and the sensors stream can significantly reduce the rate at which a sensor must acquire samples.

5. POWER-SENSITIVE DISSEMINATION AND ROUTING

After the query has been optimized, it is disseminated into the network; dissemination begins with a broadcast of the query from the root of the network. As each node hears the query, it must decide if the query applies locally and/or needs to be broadcast to its children in the routing tree. We say a query q applies to a node n if there is a nonzero probability that n will produce results for q. Deciding where a particular query should run is an important ACQP-related decision.
Although such decisions occur in other distributed query processing environments, the costs of incorrectly initiating queries in ACQP environments like TinyDB can be unusually high, as we will show. If a query does not apply at a particular node, and the node does not have any children for which the query applies, then the entire subtree rooted at that node can be excluded from the query, saving the costs of disseminating, executing, and forwarding results for the query across several nodes, significantly extending the node's lifetime.

Given the potential benefits of limiting the scope of queries, the challenge is to determine when a node or its children need not participate in a particular query. One situation arises with constant-valued attributes (e.g., nodeid or location in a fixed-location network) with a selection predicate that indicates the node need not participate. We expect that such queries will be very common, especially in interactive workloads where users are exploring different parts of the network to see how it is behaving. Similarly, if a node knows that none of its children currently satisfy the value of some selection predicate, perhaps because they have constant (and known) attribute values outside the predicate's range, it need not forward the query down the routing tree. To maintain information about child attribute values (both constant and changing), we propose a data structure called a semantic routing tree (SRT). We describe the properties of SRTs in the next section, and briefly outline how they are created and maintained.

5.1 Semantic Routing Trees

An SRT is a routing tree (similar to the tree discussed in Section 2.3 above) designed to allow each node to efficiently determine if any of the nodes below it will need to participate in a given query over some constant attribute A.
Traditionally, in sensor networks, routing tree construction is done by having nodes pick a parent with the most reliable connection to the root (highest link quality). With SRTs, we argue that the choice of parent should include some consideration of semantic properties as well. In general, SRTs are most applicable when there are several parents of comparable link quality. A link-quality-based parent selection algorithm, such as the one described in Woo and Culler [2001], should be used in conjunction with the SRT to prefilter the parents made available to the SRT.

Conceptually, an SRT is an index over A that can be used to locate nodes that have data relevant to the query. Unlike traditional indices, however, the SRT is an overlay on the network. Each node stores a single unidimensional interval representing the range of A values beneath each of its children. When a query q with a predicate over A arrives at a node n, n checks to see if any child's value of A overlaps the query range of A in q. If so, it prepares to receive results and forwards the query. If no child overlaps, the query is not forwarded. Also, if the query applies locally (whether or not it also applies to any children), n begins executing the query itself. If the query does not apply at n or at any of its children, it is simply forgotten.

Building an SRT is a two-phase process: first the SRT build request is flooded (retransmitted by every mote until all motes have heard the request) down the network. This request includes the name of the attribute A over which the tree should be built. As a request floods down the network, a node n may have several possible choices of parent, since, in general, many nodes in radio range may be closer to the root. If n has children, it forwards the request on to them and waits until they reply. If n has no children, it chooses a node p from available parents to be its parent, and then reports the value of A to p in a parent selection message. If n does have children, it records each child's value of A along with its id. When it has heard from all of its children, it chooses a parent and sends a selection message indicating the range of values of A which it and its descendents cover. The parent records this interval with the id of the child node and proceeds to choose its own parent in the same manner, until the root has heard from all of its children.

Because children can fail or move away, nodes also have a timeout which is the maximum time they will wait to hear from a child; after this period has elapsed, the child is removed from the child list. If the child reports after this timeout, it is incorporated into the SRT as if it were a new node (see Section 5.2 below).

Fig. 8. A semantic routing tree in use for a query. Gray arrows indicate flow of the query down the tree; gray nodes must produce or forward results in the query.

Figure 8 shows an SRT over the X coordinate of each node on a Cartesian grid. The query arrives at the root, is forwarded down the tree, and then only the gray nodes are required to participate in the query (note that node 3 must forward results for node 4, despite the fact that its own location precludes it from participation). SRTs are analogous to indices in traditional database systems; to create one in TinyDB, the CREATE SRT command can be used. Its syntax is essentially similar to the CREATE INDEX command in SQL:

CREATE SRT loc ON sensors (xloc,yloc) ROOT 0

where the ROOT annotation indicates the nodeid where the SRT should be rooted; by default, the value will be 0, but users may wish to create SRTs rooted at other nodes to facilitate event-based queries that frequently radiate from a particular node.

5.2 Maintaining SRTs

Even though SRTs are limited to constant attributes, some SRT maintenance must occur.
In particular, new nodes can appear, link qualities can change, and existing nodes can fail.

Both node appearances and changes in link quality can require a node to switch parents. To do this, the node sends a parent selection message to its new parent, n. If this message changes the range of n's interval, it notifies its parent; in this way, updates can propagate to the root of the tree.

To handle the disappearance of a child node, parents associate an active query id and last epoch with every child in the SRT (recall that an epoch is the period of time between successive samples). When a parent p forwards a query q to a child c, it sets c's active query id to the id of q and sets its last epoch entry to 0. Every time p forwards or aggregates a result for q from c, it updates c's last epoch with the epoch on which the result was received. If p does not hear from c for some number of epochs t, it assumes c has moved away, and removes its SRT entry. Then, p sends a request asking its remaining children to retransmit their ranges. It uses this information to construct a new interval. If this new interval differs in size from the previous interval, p sends a parent selection message up the routing tree to reflect this change. We study the costs of SRT maintenance in Section 5.4 below.

Finally, we note that, by using these maintenance rules, it is possible to support SRTs over nonconstant attributes, although if those attributes change quickly, the cost of propagating interval-range changes could be prohibitive.

5.3 Evaluation of Benefit of SRTs

The benefit that an SRT provides is dependent on the quality of the clustering of children beneath parents. If the descendents of some node n are clustered around the value of the index attribute at n, then a query that applies to n will likely also apply to its descendents.
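The interval-based forwarding decision of Section 5.1 can be sketched compactly; the class structure and names below are illustrative rather than TinyDB's implementation, and the tiny example tree mirrors the Figure 8 situation where node 3 forwards a query toward node 4 without running it itself.

```python
# Sketch of SRT query forwarding: each node keeps, per child, the
# interval of index-attribute (A) values covered by that child's
# subtree, and forwards a query only toward overlapping children.
# (Names and structure are illustrative, not TinyDB's C code.)

class SRTNode:
    def __init__(self, node_id, value):
        self.node_id = node_id
        self.value = value      # this node's constant attribute A
        self.children = []      # (child, (lo, hi)) interval entries

    def add_child(self, child):
        self.children.append((child, child.covered_interval()))

    def covered_interval(self):
        # Interval spanning this node's value and all child intervals.
        los = [self.value] + [lo for _, (lo, _) in self.children]
        his = [self.value] + [hi for _, (_, hi) in self.children]
        return min(los), max(his)

    def route_query(self, lo, hi, active=None):
        # Returns the set of node ids that run the query locally.
        if active is None:
            active = set()
        if lo <= self.value <= hi:
            active.add(self.node_id)
        for child, (clo, chi) in self.children:
            if clo <= hi and lo <= chi:  # interval overlap test
                child.route_query(lo, hi, active)
        return active

leaf = SRTNode(4, 450)
mid = SRTNode(3, 900)   # out of range itself, but its subtree covers 450
mid.add_child(leaf)
root = SRTNode(0, 100)
root.add_child(mid)
print(sorted(root.route_query(400, 500)))  # only node 4 runs the query
```

Node 3 still receives and forwards the query (its child interval overlaps) even though its own value of 900 excludes it from executing, which is exactly the forward-without-participating case noted for Figure 8.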
This can be expected for location attributes, for example, since network topology is correlated with geography. We simulate the benefits of an SRT because large networks of the type where we expect these data structures to be useful are just beginning to come online, so only a small number of fixed real-world topologies are available. We include in our simulation experiments a connectivity data file collected from one such real-world deployment. We evaluate the benefit of SRTs in terms of the number of active nodes; inactive nodes incur no cost for a given query, expending energy only to keep their processors in an idle state and to listen to their radios for the arrival of new queries.

We study three policies for SRT parent selection. In the first, random approach, each node picks a random parent from the nodes with which it can communicate reliably. In the second, closest-parent approach, each parent reports the value of its index attribute with the SRT-build request, and children pick the parent whose attribute value is closest to their own. In the clustered approach, nodes select a parent as in the closest-parent approach, except, if a node hears a sibling node send a parent selection message, it snoops on the message to determine its sibling's parent and value. It then picks its own parent (which could be the same as one of its siblings') to minimize the spread of attribute values underneath all of its available parents.

We studied these policies in a simple simulation environment; nodes were arranged on an n × n grid and were asked to choose a constant attribute value
from some distribution (which we varied between experiments). We used a perfect (lossless) connectivity model where each node could talk to its immediate neighbors in the grid (so routing trees were n nodes deep), and each node had eight neighbors (with three choices of parent, on average). We compared the total number of nodes involved in range queries of different sizes for the three SRT parent selection policies to the best-case approach and the no-SRT approach. The best-case approach would result only if exactly those nodes that overlapped the range predicate were activated, which is not possible in our topologies but provides a convenient lower bound. In the no-SRT approach, all nodes participate in each query.

We experimented with several sensor value distributions. In the random distribution, each constant attribute value was randomly and uniformly selected from the interval [0, 1000]. In the geographic distribution, (one-dimensional) sensor values were computed based on a function of a node's x and y position in the grid, such that a node's value tended to be highly correlated to the values of its neighbors. Finally, for the real distribution, we used a network topology based on data collected from a network of 54 motes deployed throughout the Intel-Research, Berkeley lab. The SRT was built over the node's location in the lab, and the network connectivity was derived by identifying pairs of motes with a high probability of being able to successfully communicate with each other.14

Figure 9 shows the number of nodes that participate in queries over variably-sized query intervals (where the interval size is shown on the x axis) of the attribute space in a 20 × 20 grid. The interval for queries was randomly selected from the uniform distribution.
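The "active nodes" metric used in these experiments counts both the nodes whose values match the query range and every ancestor that must forward their results. A minimal sketch of that count, over a hypothetical tree and value assignment:

```python
# Sketch: counting active nodes for a range query in a routing tree.
# A node is active if its own value is in range, or if it must forward
# results for an active descendant. Tree shape and values are
# illustrative, not from the paper's simulations.

parent = {1: 0, 2: 0, 3: 1, 4: 3}          # child -> parent (root is 0)
value = {0: 100, 1: 420, 2: 800, 3: 900, 4: 450}

def active_nodes(lo, hi):
    active = set()
    for n, v in value.items():
        if lo <= v <= hi:
            # n and all of its ancestors must participate.
            while n is not None:
                active.add(n)
                n = parent.get(n)
    return active

# Nodes 1 (420) and 4 (450) match; 3 and 0 must forward for node 4.
print(sorted(active_nodes(400, 500)))
```

The gap between this count and the number of matching nodes alone is precisely what good parent-selection policies (clustered, closest-parent) shrink: colocating similar values under common parents means fewer non-matching forwarders.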
Each point in the graph was obtained by averaging over five trials for each of the three parent selection policies in each of the sensor value distributions (for a total of 30 experiments). For each interval size s, 100 queries were randomly constructed, and the average number of nodes involved in each query was measured. For all three distributions, the clustered approach was superior to the other SRT algorithms, beating the random approach by about 25% and the closest-parent approach by about 10% on average. With the geographic and real distributions, the performance of the clustered approach is close to optimal: for most ranges, all of the nodes in the range tend to be colocated, so few intermediate nodes are required to relay information for queries in which they themselves are not participating. The fact that the results from the real topology closely match the geographic distribution, where sensors' values and topology are perfectly correlated, is encouraging and suggests that SRTs will work well in practice.

14 The probability threshold in this case was 25%, which is the same as the probability the TinyOS/TinyDB routing layer uses to determine whether a neighboring node is of sufficiently high quality to be considered as a candidate parent.

152 • S. R. Madden et al.

Fig. 9. Number of nodes participating in range queries of different sizes for different parent selection policies in a semantic routing tree (20 × 20 grid, 400 nodes, each point the average of 500 queries of the appropriate size). The three graphs represent three different sensor-value distributions; see the text for a description of each of these distribution types.

Figure 10 shows several visualizations of the topologies generated by the clustered (Figure 10(a)) and random (Figure 10(b)) SRT generation approaches for an 8 × 8 network. Each node represents a sensor, labeled with its ID and the distribution of the SRT subtree rooted underneath it. Edges represent the routing tree. The gray nodes represent the nodes that would participate in the query 400 < A < 500. On this small grid, the two approaches perform similarly, but the variation in structure which results is quite evident: the random approach tends to be of more uniform depth, whereas the clustered approach leads to longer sequences of nodes with nearby values. Note that the labels in this figure are not intended to be readable; the important point is the overall pattern of nodes that are explored by the two approaches.

5.4 Maintenance Costs of SRTs

As the previous results show, the benefit of using an SRT can be substantial. There are, however, maintenance and construction costs associated with SRTs, as discussed above. Construction costs are comparable to those in conventional sensor networks (which also have a routing tree), but slightly higher due to the fact that parent selection messages are explicitly sent, whereas parents do not always require confirmation from their children in other sensor network environments.

Fig. 10. Visualizations of the (a) clustered and (b) random topologies, with a query region overlaid on top of them. Node 0, the root in Figures 10(a) and 10(b), is at the center of the graph.

We conducted an experiment to measure the cost of selecting a new parent, which requires a node to notify its old parent of its decision to move and send its attribute value to its new parent. Both the new and old parent must then update their attribute interval information and propagate any changes up the tree to the root of the network. In this experiment, we varied the probability with which any node switches parents on any given epoch from 0.001 to 0.2. We did not constrain the extent of the query in this case: all nodes were assumed to participate.
Nodes were allowed to move from their current parent to an arbitrary new parent, and multiple nodes could move on a given epoch. The experimental parameters were the same as above. We measured the average number of maintenance messages generated by movement across the whole network. The results are shown in Figure 11. Each point represents the average of five trials, and each trial consists of 100 epochs. The three lines represent the three policies; the amount of movement varies along the x axis, and the number of maintenance messages per epoch is shown on the y axis. Without maintenance, each active node (within the query range) sends one message per epoch, instead of every node being required to transmit. Figure 11 suggests that for low movement rates, the maintenance costs of the SRT approach are small enough that it remains attractive: if 1% of the nodes move on a given epoch, the cost is about 30 messages, which is substantially less than the number of messages saved by using an SRT for most query ranges. If 10% of the nodes move, the maintenance cost grows to about 300, making the benefit of SRTs less clear. To measure the amount of movement expected in practice, we measured movement rates in traces collected from two real-world monitoring deployments; in both cases, the nodes were stationary but employed a routing algorithm that attempted to select the best parent over time. In the 3-month, 200-node Great Duck Island deployment, nodes switched parents between successive result reports with a 0.9% (σ = 0.9%) chance, on average.

154 • S. R. Madden et al.

Fig. 11. Maintenance costs (in measured network messages) for different SRT parent selection policies with varying probabilities of node movement. Probabilities and costs are per epoch. Each point is the average of five runs, and each run is 100 epochs long.
In the 54-node Intel-Berkeley lab dataset, nodes switched with a 4.3% (σ = 3.0%) chance. Thus, the amount of parent switching varies markedly from deployment to deployment. One reason for the variation is that the two deployments use different routing algorithms. In the case of the Intel-Berkeley deployment, the algorithm was apparently not optimized to minimize the likelihood of switching. Figure 11 also shows that the different schemes for building SRTs result in different maintenance costs. This is because the average depth of nodes in the topologies varies from one approach to the other (7.67 in Random, 10.47 in Closest, and 9.2 in Clustered) and because the spread of values underneath a particular subtree varies depending on the approach used to build the tree. A deeper tree generally results in more messages being sent up the tree as path lengths increase. The closest-parent scheme results in deep topologies because no preference is given towards parents with a wide spread of values, unlike the clustered approach, which tends to favor selecting a parent that is a member of a pre-existing, wide interval. The random approach is shallower still because nodes simply select the first parent that broadcasts, resulting in minimally deep trees. Finally, we note that the cost of joining the network is strictly dominated by the cost of moving parents, as there is no old parent to notify. Similarly, a node disappearing is dominated by this movement cost, as there is no new parent to notify.

5.5 SRT Observations

SRTs provide an efficient mechanism for disseminating queries and collecting query results for queries over constant attributes.
For attributes that are highly correlated amongst neighbors in the routing tree (e.g., location), SRTs can reduce the number of nodes that must disseminate queries and forward the continuous stream of results from children by nearly an order of magnitude. SRTs have a substantial advantage over a centralized index structure in that they do not require complete topology and sensor value information to be collected at the root of the network, which would be quite expensive to collect and difficult to keep consistent as connectivity and sensor values change. SRT maintenance costs appear to be reasonable for at least some real-world deployments. Interestingly, unlike traditional routing trees in sensor networks, there is a substantial cost (in terms of network messages) for switching parents in an SRT. This suggests that one metric by which routing layer designers might evaluate their implementations is the rate of parent switching. For real-world deployments, we expect that SRTs will offer substantial benefits. Although there are no benchmarks or definitive workloads for sensor network databases, we anticipate that many queries will be over narrow geographic areas: looking, for example, at single rooms or floors in a building, or nests, trees, or regions in outdoor environments as on Great Duck Island; other researchers have noted the same need for constrained querying [Yao and Gehrke 2002; Mainwaring et al. 2002]. In a deployment like the Intel-Berkeley lab, if queries are over individual rooms or regions of the lab, Figure 9 shows that substantial performance gains can be had. For example, 2 of the 54 motes are in the main conference room and 7 of the 54 are in the seminar area; both of these queries can be evaluated using less than 30% of the network. We note two promising future extensions to SRTs. First, rather than storing just a single interval at every subtree, a variable number of intervals could be kept.
This would allow nodes to more accurately summarize the range of values beneath them, and increase the benefit of the approach. Second, when selecting a parent, even in the clustered approach, nodes do not currently have access to complete information about the subtree underneath a potential parent, particularly as nodes move in the network or come and go. It would be interesting to explore a continuous SRT construction process, where parents periodically broadcast updated intervals, giving current and potential children an option to move to a better subtree and improve the quality of the SRT.

6. PROCESSING QUERIES

Once queries have been disseminated and optimized, the query processor begins executing them. Query execution is straightforward, so we describe it only briefly. The remainder of the section is devoted to the ACQP-related issues of prioritizing results and adapting sampling and delivery rates. We present simple schemes for prioritizing data in selection queries, briefly discuss prioritizing data in aggregation queries, and then turn to adaptation. We discuss two situations in which adaptation is necessary: when the radio is highly contended and when power consumption is more rapid than expected.

6.1 Query Execution

Query execution consists of a simple sequence of operations at each node during every epoch: first, nodes sleep for most of an epoch; then they wake, sample sensors, apply operators to data generated locally and received from neighbors, and then deliver results to their parent. We (briefly) describe ACQP-relevant issues in each of these phases. Nodes sleep for as much of each epoch as possible to minimize power consumption. They wake up only to sample sensors and relay and deliver results.
Because nodes are time synchronized, parents can ensure that they awake to receive results when a child tries to propagate a message.15 The amount of time, tawake, that a sensor node must be awake to successfully accomplish the latter three steps above is largely dependent on the number of other nodes transmitting in the same radio cell, since only a small number of messages per second can be transmitted over the single shared radio channel. We discuss the communication scheduling approach in more detail in the next section. TinyDB uses a simple algorithm to scale tawake based on the neighborhood size, which is measured by snooping on traffic from neighboring nodes. Note, however, that there are situations in which a node will be forced to drop or combine results as a result of either tawake or the sample interval being too short to perform all needed computation and communication. We discuss policies for choosing how to aggregate data and which results to drop in Section 6.3. Once a node is awake, it begins sampling and filtering results according to the plan provided by the optimizer. Samples are taken at the appropriate (current) sample rate for the query, based on lifetime computations and information about radio contention and power consumption (see Section 6.4 for more information on how TinyDB adapts sampling in response to variations during execution). Filters are applied and results are routed to join and aggregation operators further up the query plan. Finally, we note that in event-based queries, the ON EVENT clause must be handled specially. When an event fires on a node, that node disseminates the query, specifying itself as the query root. This node collects query results and delivers them to the basestation or a local materialization point.

6.1.1 Communication Scheduling and Aggregate Queries.
When processing aggregate queries, some care must be taken to coordinate the times when parents and children are awake, so that parent nodes have access to their children's readings before aggregating. The basic idea is to subdivide the epoch into a number of intervals, and assign nodes to intervals based on their position in the routing tree. Because this mechanism makes relatively efficient use of the radio channel and has good power consumption characteristics, TinyDB uses this scheduling approach for all queries (not just aggregates). In this slotted approach, each epoch is divided into a number of fixed-length time intervals. These intervals are numbered in reverse order, such that interval 1 is the last interval in the epoch. Then, each node is assigned to the interval equal to its level, or number of hops from the root, in the routing tree. In the interval preceding their own, nodes listen to their radios, collecting results from any child nodes (which are one level below them in the tree, and thus communicating in this interval). During a node's interval, if it is aggregating, it computes the partial state record consisting of the combination of any child values it heard with its own local readings. After this computation, it transmits either its partial state record or raw sensor readings up the network. In this way, information travels up the tree in a staggered fashion, eventually reaching the root of the network during interval 1.

15 Of course, there is some imprecision in time synchronization between devices. In general, we can tolerate a fair amount of imprecision by introducing a buffer period, such that parents wake up several milliseconds before and stay awake several milliseconds longer than their children.

Fig. 12. Partial state records flowing up the tree during an epoch using interval-based communication.
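The slot arithmetic of this scheme, and the flow of partial state records it produces for a COUNT query, can be sketched as follows. The window computation follows the reverse-numbered-interval rule directly; the 10-node example topology (depth 4, like the network in the surrounding discussion) is our own illustrative construction, not the paper's exact tree.

```python
def comm_windows(level, num_intervals, epoch_ms):
    """Slotted scheduling: intervals are numbered in REVERSE, so interval 1
    is the last slice of the epoch. A node `level` hops from the root
    transmits during interval `level` and listens during interval
    `level + 1`, the slot just before its own, when its children transmit.
    Returns ((listen_start, listen_end), (tx_start, tx_end)) in ms."""
    slot = epoch_ms / num_intervals

    def window(interval):
        # interval i occupies the i-th slot counted back from the epoch end
        start = epoch_ms - interval * slot
        return (start, start + slot)

    return window(level + 1), window(level)

def count_up(children, node):
    """In-network COUNT: each node merges the counts heard from its
    children with its own reading (count 1) and forwards one partial state
    record to its parent; the value at the root is the final COUNT.
    `children` maps a node id to the list of its children."""
    return 1 + sum(count_up(children, c) for c in children.get(node, []))
```

For a 500 ms epoch split into 5 intervals, a level-4 node transmits in (100, 200) ms and its level-3 parent listens in exactly that window, so records arrive just before the parent's own transmit slot.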
Figure 12 illustrates this in-network aggregation scheme for a simple COUNT query that reports the number of nodes in the network. In the figure, time advances from left to right, and different nodes in the communication topology are shown along the y axis. Nodes transmit during the interval corresponding to their depth in the tree, so H, I, and J transmit first, during interval 4, because they are at level 4. Transmissions are indicated by arrows from sender to receiver, and the numbers in circles on the arrows represent COUNTs contained within each partial state record. Readings from these three nodes are combined, via the COUNT merging function, at nodes G and F, both of which transmit new partial state records during interval 3. Readings flow up the tree in this manner until they reach node A, which then computes the final count of 10. Notice that motes are idle for a significant portion of each epoch, so they can enter a low-power sleeping state. A detailed analysis of the accuracy and benefit of this approach in TinyDB can be found in Madden [2003].

6.2 Multiple Queries

We note that, although TinyDB supports multiple queries running simultaneously, we have not focused on multiquery optimization. This means that, for example, SRTs are shared between queries, but sample acquisition is not: if two queries need a reading within a few milliseconds of each other, this will cause both to acquire that reading. Similarly, there is no effort to optimize communication scheduling between queries: transmissions of one query are scheduled independently from any other query. We hope to explore these issues as a part of our long-term sensor network research agenda.

6.3 Prioritizing Data Delivery

Once results have been sampled and all local operators have been applied, they are enqueued onto a radio queue for delivery to the node's parent.
This queue contains both tuples from the local node as well as tuples that are being forwarded on behalf of other nodes in the network. When network contention and data rates are low, this queue can be drained faster than results arrive. However, because the number of messages produced during a single epoch can vary dramatically, depending on the number of queries running, the cardinality of joins, and the number of groups and aggregates, there are situations when the queue will overflow. In these situations, the system must decide if it should discard the overflow tuple, discard some other tuple already in the queue, or combine two tuples via some aggregation policy. The ability to make runtime decisions about the value of an individual data item is central to ACQP systems, because the cost of acquiring and delivering data is high, and because of these situations where the rate of data items arriving at a node will exceed the maximum delivery rate. A simple conceptual approach for making such runtime decisions is as follows: whenever the system is ready to deliver a tuple, send the result that will most improve the "quality" of the answer that the user sees. Clearly, the proper metric for quality will depend on the application: for a raw signal, root-mean-square (RMS) error is a typical metric. For aggregation queries, minimizing the confidence intervals of the values of group records could be the goal [Raman et al. 2002]. In other applications, users may be concerned with preserving frequencies, receiving statistical summaries (average, variance, or histograms), or maintaining more tenuous qualities such as signal "shape." Our goal is not to fully explore the spectrum of techniques available in this space. Instead, we have implemented several policies in TinyDB to illustrate that substantial quality improvements are possible given a particular workload and quality metric.
Generalizing concepts of quality and implementing and exploring more sophisticated prioritization schemes remains an area of future work. There is a large body of related work on approximation and compression schemes for streams in the database literature (e.g., Garofalakis and Gibbons [2001]; Chakrabarti et al. [2001]), although these approaches typically focus on the problem of building histograms or summary structures over the streams rather than trying to preserve the (in-order) signal as best as possible, which is the goal we tackle first. Algorithms from signal processing, such as Fourier analysis and wavelets, are likely applicable, although the extreme memory and processor limitations of our devices and the online nature of our problem (e.g., choosing which tuple in an overflowing queue to evict) make them tricky to apply. We have begun to explore the use of wavelets in this context; see Hellerstein et al. [2003] for more information on our initial efforts.

6.3.1 Policies for Selection Queries. We begin with a comparison of three simple prioritization schemes, naive, winavg, and delta, for simple selection queries, turning our attention to aggregate queries in the next section. In the naive scheme, no tuple is considered more valuable than any other, so the queue is drained in a FIFO manner and tuples are dropped if they do not fit in the queue. The winavg scheme works similarly, except that instead of dropping results when the queue fills, the two results at the head of the queue are averaged to make room for new results. Since the head of the queue is now an average of multiple records, we associate a count with it.
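The winavg behavior can be captured in a few lines. This is a sketch of the scheme as described above, not TinyDB's implementation; the count-weighted merge of the two head records is the one detail the text pins down.

```python
class WinAvgQueue:
    """'winavg' prioritization: FIFO delivery, but when the queue is full
    the two entries at the head are merged into their count-weighted
    average to make room, and a count is kept with each merged entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []  # each entry is [value, count]

    def enqueue(self, value):
        if len(self.entries) >= self.capacity:
            (v1, c1), (v2, c2) = self.entries[0], self.entries[1]
            # weight by how many raw readings each record already represents
            merged = [(v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2]
            self.entries[:2] = [merged]
        self.entries.append([value, 1])

    def deliver(self):
        return self.entries.pop(0)  # plain FIFO on the way out
```

With capacity 3, enqueueing 1, 2, 3, 4 merges the heads 1 and 2 into a single record (1.5, count 2) before admitting 4.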
In the delta scheme, a tuple is assigned an initial score relative to its difference from the most recent (in time) value successfully transmitted from this node, and at each point in time, the tuple with the highest score is delivered. The tuple with the lowest score is evicted when the queue overflows. Out-of-order delivery (in time) is allowed. This scheme relies on the intuition that the largest changes are probably interesting. It works as follows: when a tuple t with timestamp T is initially enqueued and scored, we mark it with the timestamp R of the most recently delivered tuple r. Since tuples can be delivered out of order, it is possible that a tuple with a timestamp between R and T could be delivered next (indicating that r was delivered out of order), in which case the score we computed for t, as well as its R timestamp, are now incorrect. Thus, in general, we must rescore some enqueued tuples after every delivery. The delta scheme is similar to the value-deviation metric used in Garofalakis and Gibbons [2001] for minimizing deviation between a source and a cache, although value-deviation does not include the possibility of out-of-order delivery. We compared these three approaches on a single mote running TinyDB. To measure their effect in a controlled setting, we set the sample rate to be a fixed factor K faster than the maximum delivery rate (such that 1 of every K tuples was delivered, on average) and compared their performance against several predefined sets of sensor readings (stored in the EEPROM of the device). In this case, delta had a buffer of 5 tuples; we performed reordering of out-of-order tuples at the basestation. To illustrate the effect of winavg and delta, Figure 13 shows how delta and winavg approximate a high-periodicity trace of sensor readings generated by a shaking accelerometer.
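A minimal sketch of the delta scheme follows. It simplifies the rescoring bookkeeping described above by recomputing every score against the latest delivered value on demand, rather than caching per-tuple R timestamps; the eviction and delivery rules are as the text describes.

```python
class DeltaQueue:
    """'delta' prioritization: tuples are scored by their difference from
    the most recently DELIVERED value; the highest-scoring tuple is sent
    next, and the lowest-scoring one is evicted on overflow."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []          # (timestamp, value) pairs
        self.last_delivered = 0.0

    def _score(self, item):
        return abs(item[1] - self.last_delivered)

    def enqueue(self, ts, value):
        self.items.append((ts, value))
        if len(self.items) > self.capacity:
            # evict the tuple representing the smallest change
            self.items.remove(min(self.items, key=self._score))

    def deliver(self):
        best = max(self.items, key=self._score)   # biggest change wins
        self.items.remove(best)
        self.last_delivered = best[1]             # future scores use this
        return best
```

Note that delivery order follows score, not time: after a large jump is sent, an older tuple may be delivered next, which is exactly the out-of-order behavior the basestation must reorder.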
Notice that delta is considerably closer in shape to the original signal in this case, as it tends to emphasize extremes, whereas average tends to dampen them. We also measured RMS error for this signal as well as two others: a square-wave-like signal from a light sensor being covered and uncovered, and a slow sinusoidal signal generated by moving a magnet around a magnetometer. The error for each of these signals and techniques is shown in Table V.

Fig. 13. An acceleration signal (top) approximated by a delta (middle) and an average (bottom), K = 4.

Table V. RMS Error for Different Prioritization Schemes and Signals (1000 Samples, Sample Interval = 64 ms)

         Accel.   Light (Step)   Magnetometer (Sinusoid)
Winavg     64         129                  54
Delta      63          81                  48
Naive      77         143                  63

Although delta appears to match the shape of the acceleration signal better, its RMS value is about the same as average's (due to the few peaks that delta incorrectly merges together). Delta outperforms both other approaches for the fast-changing step functions in the light signal because it does not smooth edges as much as average. We now turn our attention to result prioritization for aggregate queries.

6.3.2 Policies for Aggregate Queries. The previous section focused on prioritizing result collection in simple selection queries. In this section, we look instead at aggregate queries, illustrating a class of snooping-based techniques first described in the TAG system [Madden et al. 2002a] that we have implemented for TinyDB. We consider aggregate queries of the form

SELECT f_agg(a1)
FROM sensors
GROUP BY a2
SAMPLE PERIOD x

Recall that this query computes the value of f_agg applied to the value of a1 produced by each device every x seconds. Interestingly, for queries with few or no groups, there is a simple technique that can be used to prioritize results for several types of aggregates.
This technique, called snooping, allows nodes to locally suppress their own aggregate values by listening to the answers that neighboring nodes report and exploiting the semantics of aggregate functions; it is also used in Madden et al. [2002a]. Note that this snooping can be done for free due to the broadcast nature of the radio channel. Consider, for example, a MAX query over some attribute a: if a node n hears a value of a greater than its own locally computed partial MAX, it knows that its local record is low priority, and assigns it a low score or suppresses it altogether. Conversely, if n hears many neighboring partial MAXs over a that are less than its own partial aggregate value, it knows that its local record is more likely to be a maximum, and assigns it a higher score. Figure 14 shows a simple example of snooping for a MAX query: node 2 can score its own MAX value very low when it hears a MAX from node 3 that is larger than its own. This basic technique applies to all monotonic, exemplary aggregates: MIN, MAX, TOP-N, etc., since it is possible to deterministically decide whether a particular local result could appear in the final answer output at the top of the network. For dense network topologies where there is ample opportunity for snooping, this technique produces a dramatic reduction in communication, since at every intermediate point in the routing tree, only a small number of nodes' values will actually need to be transmitted. It is also possible to glean some information from snooping in other aggregates as well. For example, in an AVERAGE query, nodes may rank their own results lower if they hear many siblings with similar sensor readings.

Fig. 14. Snooping reduces the data nodes must send in aggregate queries. Here node 2's value can be suppressed if it is less than the maximum value snooped from nodes 3, 4, and 5.
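The MAX case can be sketched as a scoring function. The suppression condition (some overheard value already dominates ours) follows directly from the text; the particular numeric score for non-suppressed records is our own illustrative choice, since the paper only requires "low" versus "higher" scores.

```python
def score_partial_max(local_max, snooped):
    """Snooping for a MAX aggregate. If any neighbor's overheard partial
    MAX is at least our own, our record cannot affect the final answer, so
    it is suppressed (score 0). Otherwise, score it by 1 plus the number of
    snooped values it beats, so records likely to be the true maximum are
    delivered first. `snooped` is the list of partial MAXs overheard from
    neighbors during this epoch."""
    if any(v >= local_max for v in snooped):
        return 0          # dominated: safe to suppress entirely
    return 1 + sum(1 for v in snooped if v < local_max)
```

In the Figure 14 scenario, node 2 hearing a larger MAX from node 3 scores its own record 0 and need not transmit it at all.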
For this approach to work, parents must cache a count of recently heard children and assume children who do not send a value for an average have the same value as the average of their siblings' values, since otherwise outliers will be weighted disproportionately. This technique of assuming that missing values are the same as the average of other reported values can be used for many summary statistics: variance, sum, and so on. Exploring more sophisticated prioritization schemes for aggregate queries is an important area of future work. In the previous sections, we demonstrated how prioritization of results can be used to improve the overall quality of the data that are transmitted to the root when some results must be dropped or aggregated. Choosing the proper policies to apply in general, and understanding how various existing approximation and prioritization schemes map into ACQP, is an important future direction.

6.4 Adapting Rates and Power Consumption

We saw in the previous sections how TinyDB can exploit query semantics to transmit the most relevant results when limited bandwidth or power is available. In this section, we discuss selecting and adjusting sampling and transmission rates to limit the frequency of network-related losses and the fill rates of queues. This adaptation is the other half of the runtime techniques in ACQP: because the system can adjust rates, significant reductions can be made in the frequency with which data prioritization decisions must be made. These techniques are simply not available in non-acquisitional query processing systems. When initially optimizing a query, TinyDB's optimizer chooses a transmission and sample rate based on current network load conditions and requested sample rates and lifetimes.

Fig. 15. Per-mote sample rate versus aggregate delivery rate.
However, static decisions made at the start of query processing may not be valid after many days of running the same continuous query. Just as adaptive query processing techniques like eddies [Avnur and Hellerstein 2000], Tukwila [Ives et al. 1999], and Query Scrambling [Urhan et al. 1998] dynamically reorder operators as the execution environment changes, TinyDB must react to changing conditions. However, unlike in previous adaptive query processing systems, failure to adapt in TinyDB can cripple the system, reducing data flow to a trickle or causing the system to severely miss power budget goals. We study the need for adaptivity in two contexts: network contention and power consumption. We first examine network contention. Rather than simply assuming that a specific transmission rate will result in a relatively uncontested network channel, TinyDB monitors channel contention and adaptively reduces the number of packets transmitted as contention rises. This backoff is very important: as the four motes line of Figure 15 shows, if several nodes try to transmit at high rates, the total number of packets delivered is substantially less than if each of those nodes tries to transmit at a lower rate. Compare this line with the performance of a single node: with no contention, it does not exhibit the same falling off (although the percentage of successfully delivered packets does decline). Finally, the four motes adaptive line does not show the same precipitous drop in performance because it is able to monitor the network channel and adapt to contention. Note that the performance of the adaptive approach is slightly less than the nonadaptive approach at four and eight samples per second as backoff begins to throttle communication in this regime. However, when we compared the percentage of successful transmission attempts at eight packets per second, the adaptive scheme achieves twice the success rate of the nonadaptive scheme, suggesting that adaptation is still effective in reducing wasted communication effort, despite the lower utilization.

Fig. 16. Comparison of delivered values (bottom) versus actual readings for two motes (left and right) sampling at 16 packets per second and sending simultaneously. Four motes were communicating simultaneously when this data was collected.

The problem with reducing the transmission rate is that it will rapidly cause the network queue to fill, forcing TinyDB to discard tuples using the semantic techniques for victim selection presented in Section 6.3 above. We note, however, that had TinyDB not chosen to slow its transmission rate, fewer total packets would have been delivered. Furthermore, by choosing which packets to drop using semantic information derived from the queries (rather than losing some random sample of them), TinyDB is able to substantially improve the quality of results delivered to the end user. To illustrate this in practice, we ran a selection query over four motes running TinyDB, asking them each to sample data at 16 samples per second, and compared the quality of the delivered results using an adaptive-backoff version of our delta approach to results over the same dataset without adaptation or result prioritization. We show traces from two of the nodes on the left and right of Figure 16. The top plots show the performance of the adaptive delta, the middle plots show the nonadaptive case, and the bottom plots show the original signals (which were stored in EEPROM to allow repeatable trials). Notice that the delta scheme does substantially better in both cases.

6.4.1 Measuring Power Consumption.
We now turn to the problem of adapting tuple delivery rates to meet specific lifetime requirements in response to incorrect sample rates computed at query optimization time (see Section 3.6). We first note that, using the computations shown in Section 3.6, it is possible to compute a predicted battery voltage for a time t seconds into processing a query. The system can then compare its current voltage to this predicted voltage. By assuming that voltage decays linearly, we can reestimate the power consumption characteristics of the device (e.g., the costs of sampling, transmitting, and receiving) and then rerun our lifetime calculation. By reestimating these parameters, the system can ensure that this new lifetime calculation tracks the actual lifetime more closely.

Although this calculation and reoptimization are straightforward, they serve an important role by allowing TinyDB motes to satisfy occasional ad hoc queries and relay results for other nodes without compromising the lifetime goals of long-running monitoring queries. Finally, we note that incorrect measurements of power consumption may also be due to incorrect estimates of the cost of various phases of query processing, or may be a result of incorrect selectivity estimation. We cover both by tuning the sample rate. As future work, we intend to explore adaptation of optimizer estimates and ordering decisions (in the spirit of other adaptive work [Hellerstein et al. 2000]) and the effect of frequency of reestimation on lifetime.

7. SUMMARY OF ACQP TECHNIQUES

This completes our discussion of the novel issues and techniques that arise when taking an acquisitional perspective on query processing. In summary, we first discussed important aspects of an acquisitional query language, introducing event and lifetime clauses for controlling when and how often sampling occurs.
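The voltage-based reestimation loop just described can be sketched as follows. The linear decay model comes from the text; the battery endpoints, lifetime, and per-phase cost numbers are made-up values for illustration, and the function names are hypothetical, not TinyDB's API.

```python
# Sketch of lifetime reestimation: compare measured battery voltage to the
# voltage predicted by a linear decay model, and scale per-phase cost
# estimates by the ratio of actual to predicted depletion. Constants are
# illustrative assumptions.

V_FULL, V_EMPTY = 3.0, 2.0            # assumed battery endpoints (volts)

def predicted_voltage(t, lifetime_s=100 * 3600):
    """Predicted voltage t seconds into the query, assuming linear decay
    from V_FULL to V_EMPTY over the target lifetime."""
    return V_FULL - (V_FULL - V_EMPTY) * (t / lifetime_s)

def reestimate_costs(v_actual, t, est_costs, lifetime_s=100 * 3600):
    """If the battery is draining faster than predicted, scale up the
    per-phase cost estimates; the caller can then rerun the lifetime
    calculation of Section 3.6 with the corrected costs."""
    v_pred = predicted_voltage(t, lifetime_s)
    depleted_actual = V_FULL - v_actual
    depleted_pred = V_FULL - v_pred
    scale = depleted_actual / depleted_pred if depleted_pred > 0 else 1.0
    return {phase: cost * scale for phase, cost in est_costs.items()}

costs = {"sample": 1.0, "transmit": 2.0, "receive": 1.5}   # mJ, assumed
# Halfway through a 100-hour lifetime the model predicts 2.5 V; if we
# actually measure 2.4 V (20% more depletion), every cost scales by 1.2:
new_costs = reestimate_costs(v_actual=2.4, t=50 * 3600, est_costs=costs)
```

Rerunning the Section 3.6 lifetime computation with the inflated costs lowers the sample rate, which is how the system keeps the actual lifetime tracking the requested one.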
We then discussed query optimization with the associated issues of modeling sampling costs and ordering of sampling operators. We showed how event-based queries can be rewritten as joins between streams of events and sensor samples. Once queries have been optimized, we demonstrated the use of semantic routing trees as a mechanism for efficiently disseminating queries and collecting results. Finally, we showed the importance of prioritizing data according to quality and discussed the need for techniques to adapt the transmission and sampling rates of an ACQP system. Table VI lists the key new techniques we introduced, summarizing which queries they apply to and when they are most useful.

8. RELATED WORK

There have been several recent publications in the database and systems communities on query processing in sensor networks [Intanagonwiwat et al. 2000; Madden et al. 2002a; Bonnet et al. 2001; Madden and Franklin 2002; Yao and Gehrke 2002]. These articles noted the importance of power sensitivity. Their predominant focus to date has been on in-network processing, that is, the pushing of operations, particularly selections and aggregations, into the network to reduce communication. We too endorse in-network processing, but believe that, for a sensor network system to be truly power sensitive, acquisitional issues of when, where, and in what order to sample, and which samples to process, must be considered. To our knowledge, no prior work addresses these issues.

Table VI.
Summary of Acquisitional Query Processing Techniques in TinyDB

Technique (Section): Summary
Event-based queries (3.5): Avoid polling overhead
Lifetime queries (3.6): Satisfy user-specified longevity constraints
Interleaving acquisition/predicates (4.2): Avoid unnecessary sampling costs in selection queries
Exemplary aggregate pushdown (4.2.1): Avoid unnecessary sampling costs in aggregate queries
Event batching (4.3): Avoid execution costs when a number of event queries fire
SRT (5.1): Avoid query dissemination costs or the inclusion of unneeded nodes in queries with predicates over constant attributes
Communication scheduling (6.1.1): Disable a node's processor and radio during times of inactivity
Data prioritization (6.3): Choose the most important samples to deliver according to a user-specified prioritization function
Snooping (6.3.2): Avoid unnecessary transmissions during aggregate queries
Rate adaptation (6.4): Intentionally drop tuples to avoid saturating the radio channel, allowing the most important tuples to be delivered

There is a small body of work related to query processing in mobile environments [Imielinski and Badrinath 1992; Alonso and Korth 1993]. This work has been concerned with laptop-like devices that are carried with the user, can be readily recharged every few hours, and, with the exception of a wireless network interface, basically have the capabilities of a wired, powered PC. Lifetime-based queries, notions of sampling and its associated costs, and runtime issues regarding rates and contention were not considered. Many of the proposed techniques, as well as more recent work on moving object databases (such as Wolfson et al. [1999]), focus on the highly mobile nature of devices, a situation we are not (yet) dealing with, but one that could certainly arise in sensor networks.
Power-sensitive query optimization was proposed in Alonso and Ganguly [1993], although, as with the previous work, the focus was on optimizing costs in traditional mobile devices (e.g., laptops and palmtops), so concerns about the cost and ordering of sampling did not appear. Furthermore, laptop-style devices typically do not offer the same degree of rapid power-cycling that is available on embedded platforms like motes. Even if they did, their interactive, user-oriented nature makes it undesirable to turn off displays, network interfaces, etc., because they are doing more than simply collecting and processing data, so there are many fewer power optimizations that can be applied.

Building an SRT is analogous to building an index in a conventional database system. Due to the resource limitations of sensor networks, however, the actual indexing implementations are quite different. See Kossman [2000] for a survey of relevant research on distributed indexing in conventional database systems. There is also some similarity to indexing in peer-to-peer systems [Crespo and Garcia-Molina 2002]. However, peer-to-peer systems differ in that they are inexact and not subject to the same paucity of communication or storage infrastructure as sensor networks, so algorithms tend to be storage and communication heavy. Similar indexing issues also appear in highly mobile environments (like Wolfson et al. [1999] or Imielinski and Badrinath [1992]), but this work relies on centralized location servers for tracking the recent positions of objects.

The observation that it can be beneficial to interleave the fetching of attributes with the application of operators also arises in the context of compressed databases [Chen et al. 2001]: decompression effectively imposes a penalty for fetching an individual attribute, so it is beneficial to apply selections and joins on attributes that are already decompressed or easy to decompress.
The ON EVENT and OUTPUT ACTION clauses in our query language are similar to constructs present in event-condition-action/active databases [Chakravarthy et al. 1994]. There is a long tradition of such work in the database community, and our techniques are much simpler in comparison, as we have not focused on any of the difficult issues associated with the semantics of event composition or with building a complete language for expressing and efficiently evaluating the triggering of composite events. Work on systems for efficiently determining when an event has fired, such as Hanson [1996], could be useful in TinyDB. More recent work on continuous query systems [Liu et al. 1999; Chen et al. 2000] has described languages that provide for query processing in response to events or at regular intervals over time. This earlier work, as well as our own work on continuous query processing [Madden et al. 2002b], inspired the periodic and event-driven features of TinyDB.

Approximate and best-effort caches [Olston and Widom 2002], as well as systems for online aggregation [Raman et al. 2002] and stream query processing [Motwani et al. 2003; Carney et al. 2002], include some notion of data quality. Most of this other work has been focused on quality with respect to summaries, aggregates, or staleness of individual objects, whereas we focus on quality as a measure of fidelity to the underlying continuous signal. Aurora [Carney et al. 2002] mentioned a need for this kind of metric, but proposed no specific approaches. Work on approximate query processing [Garofalakis and Gibbons 2001] has included a scheme similar to our delta approach, as well as a substantially more thorough evaluation of its merits, but did not consider out-of-order delivery.

9. CONCLUSIONS AND FUTURE WORK

Acquisitional query processing provides a framework for addressing issues of when, where, and how often data is sampled and which data is delivered in distributed, embedded sensing environments.
Although other research has identified the opportunities for query processing in sensor networks, this work is the first to discuss these fundamental issues in an acquisitional framework.

We identified several opportunities for future research. We are currently actively pursuing two of these: first, we are exploring how query optimizer statistics change in acquisitional environments and studying the role of online reoptimization of sample rates and operator orderings in response to bursts of data or unexpected power consumption. Second, we are pursuing more sophisticated prioritization schemes, like wavelet analysis, that can capture salient properties of signals other than large changes (as our delta mechanism does), as well as mechanisms to allow users to express their prioritization preferences.

We believe that ACQP notions are of critical importance for preserving the longevity and usefulness of any deployment of battery-powered sensing devices, such as those that are now appearing in biological preserves, roads, businesses, and homes. Without appropriate query languages, optimization models, and query dissemination and data delivery schemes that are cognizant of semantics and of the costs and capabilities of the underlying hardware, the success of such deployments will be limited.

APPENDIX

A. POWER CONSUMPTION STUDY

This appendix details an analytical study of power consumption on a mote running a typical data collection query. In this study, we assume that each mote runs a very simple query that transmits one sample of (light, humidity) readings every minute. We assume each mote also listens to its radio for 2 s per 1-min period to receive results from neighboring devices and obtain access to the radio channel.
We assume the following hardware characteristics: a supply voltage of 3 V, an Atmega128 processor (see the footnote to Table III for data on the processor) that can be set into power-down mode and runs off the internal oscillator at 4 MHz, the use of the Taos Photosynthetically Active Light Sensor [TAOS, Inc. 2002] and the Sensirion Humidity Sensor [Sensirion 2002], and a ChipCon CC1000 radio (see text footnote 6 for data on this radio) transmitting at 433 MHz with 0-dBm output power and −110-dBm receive sensitivity. We further assume the radio can make use of its low-power sampling mode[16] to reduce reception power when no other radios are communicating, and that, on average, each node has 10 neighbors, or other motes, within radio range, with one of those neighbors being a child in the routing tree. Radio packets are 50 bytes each, with a 20-byte preamble for synchronization. This hardware configuration represents real-world settings of motes similar to values used in deployments of TinyDB in various environmental monitoring applications.

The percentage of total energy used by various components is shown in Table VII. These results show that the processor and radio together consume the majority of energy for this particular data collection task. Obviously, these numbers change as the number of messages transmitted per period increases; doubling the number of messages sent increases the total power utilization by about 19% as a result of the radio spending less time sampling the channel and more time actively receiving. Similarly, if a node must send five packets per sample period instead of one, its total power utilization rises by about 10%.

[16] This mode works by sampling the radio at a low frequency (say, once every k bit-times, where k is on the order of 100) and extending the synchronization header, or preamble, on radio packets to be at least k bits, such that a radio using this low-power listening approach will still detect every packet.
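The tradeoff behind the low-power sampling mode of footnote 16 can be made concrete with a back-of-the-envelope model: sampling the channel once every k bit-times cuts idle-listening cost roughly by a factor of k, but forces every transmitter to send a preamble at least k bits long. The radio rate and packet size below come from this appendix; the current draws are assumptions chosen only to illustrate the shape of the tradeoff.

```python
# Illustrative model of the low-power listening tradeoff in footnote 16.
# The 38.4 kbit/s rate and 50-byte packets come from the appendix text;
# the current draws are assumed values, not measurements.

BIT_RATE = 38_400          # bits per second
I_RX = 9.3e-3              # amps while actively receiving (assumed)
I_TX = 10.4e-3             # amps while transmitting (assumed)
V = 3.0                    # supply voltage

def listen_energy_per_second(k):
    """Receiver samples the channel once every k bit-times; each sample
    is modeled as costing one bit-time of active reception."""
    samples_per_sec = BIT_RATE / k
    return samples_per_sec * (1.0 / BIT_RATE) * I_RX * V   # joules/second

def tx_energy_per_packet(k, payload_bits=50 * 8):
    """Transmitter must extend the preamble to at least k bits so a
    sampling receiver cannot miss the packet."""
    return (payload_bits + k) / BIT_RATE * I_TX * V        # joules/packet

# Larger k: cheaper idle listening, more expensive transmissions.
cheap_listen = listen_energy_per_second(100)
costly_tx = tx_energy_per_packet(100)
```

Under this model the listening cost is simply I_RX * V / k, so k on the order of 100 (as the footnote suggests) buys roughly a hundredfold reduction in idle-listening energy, which is why the technique pays off when traffic is light.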
Once a packet is detected, the receiver begins packet reception at the normal rate. The cost of this technique is that it increases transmission costs significantly.

Table VII. Expected Power Consumption for Major Hardware Components for a Query Reporting Light and Humidity Readings Once Every Minute

Hardware                                   Current (mA)   Active Time (s)   % Total Energy
Sensing, humidity                               0.50           0.34              1.43
Sensing, light                                  0.35           1.30              3.67
Communication, sending                         10.40           0.03              2.43
  (70 bytes @ 38.4 kbps x 2 packets)
Communication, receiving packets                9.30           0.15             11.00
  (70 bytes @ 38.4 kbps x 10 packets)
Communication, sampling channel                 0.07           0.86              0.31
Processor, active                               5.00           2.00             80.68
Processor, idle                                 0.001         58.00              0.47
Average current draw per second: 0.21 mA

This table does not tell the entire story, however, because the processor must be active during sensing and communication, even though it has very little computation to perform.[17] For example, in Table VII, 1.3 s are spent waiting for the light sensor to start and produce a sample,[18] and another 0.029 s are spent transmitting. Furthermore, the media access control (MAC) layer on the radio introduces a delay proportional to the number of devices transmitting. To measure this delay, we examined the average delay between 1700 packet arrivals on a network of 10 time-synchronized motes attempting to send at the same time. The minimum interpacket arrival time was about 0.06 s; subtracting the expected transmit time of a packet (0.007 s) suggests that, with 10 nodes, the average MAC delay will be at least (0.06 − 0.007) × 5 = 0.265 s. Thus, of the 2 s each mote is awake, about 1.6 s of that time is spent waiting for the sensors or radio. The total 2-s waking period is selected to allow for variation in MAC delays on individual sensors.
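The energy accounting behind Table VII can be sketched directly: each component's energy over the 60-s period is its current draw times its active time (times the supply voltage), and its share of the budget is that energy divided by the total. The snippet below uses the table's own numbers; small rounding differences from the published percentages are expected, since the table was presumably computed from unrounded inputs.

```python
# Sketch reproducing the energy accounting behind Table VII.
# Currents (mA) and active times (s) are taken from the table itself.

V = 3.0  # supply voltage (volts)

components = {                 # name: (current in mA, active time in s)
    "sensing, humidity":      (0.50, 0.34),
    "sensing, light":         (0.35, 1.30),
    "comm, sending":          (10.40, 0.03),
    "comm, receiving":        (9.30, 0.15),
    "comm, sampling channel": (0.07, 0.86),
    "processor, active":      (5.00, 2.00),
    "processor, idle":        (0.001, 58.00),
}

# Energy per 60-s period in millijoules: I (mA) x t (s) x V.
energy = {name: i_ma * t * V for name, (i_ma, t) in components.items()}
total = sum(energy.values())

for name, e in sorted(energy.items(), key=lambda kv: -kv[1]):
    print(f"{name:25s} {100 * e / total:6.2f} %")

# Average current draw over the 60-s period, in mA (about 0.21 mA):
avg_current = sum(i_ma * t for i_ma, t in components.values()) / 60.0
```

Running this recovers the table's headline numbers: the processor's active time dominates (roughly 80% of the budget) and the average current draw comes out to about 0.21 mA.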
Application computation is almost negligible for basic data collection scenarios: we measured application processing time by running a simple TinyDB query that collects three data fields from the RAM of the processor (incurring no sensing delay) and transmits them over an uncontested radio channel (incurring little MAC delay). We inserted into the query result a measure of the elapsed time from the start of processing until the moment the result begins to be transmitted. The average delay was less than 1/32 (0.03125) s, which is the minimum resolution we could measure. Thus, of the 81% of energy spent on the processor, no more than 1% of its cycles are spent in application processing.

For the example given here, at least 65% of this 81% is spent waiting for sensors, and another 8% waiting for the radio to send or receive. The remaining 26% of processing time allows for multihop forwarding of messages and serves as slack in the event that MAC delays exceed the measured minimums given above. Summing the processor time spent waiting to send or sending with the percent energy used by the radio itself, we get

(0.26 + 0.08) × 0.80 + 0.02 + 0.11 + 0.003 ≈ 0.41

This indicates that about 41% of power consumption in this simple data collection task is due to communication.

[17] The requirement that the processor be active during these times is an artifact of the mote hardware. Bluetooth radios, for example, can negotiate channel access independently of the processor. These radios, however, have significantly higher power consumption than the mote radio; see Leopold et al. [2003] for a discussion of Bluetooth as a radio for sensor networks.

[18] On motes, it is possible to start and sample several sensors simultaneously, so the delays for the light and humidity sensors are not additive.
Similarly, in this example, the percentage of energy devoted to sensing can be computed by summing the energy spent waiting for samples with the energy costs of sampling: 0.65 × 0.81 + 0.01 + 0.04 ≈ 0.58. Thus, about 58% of the energy in this case is spent sensing. Obviously, the total percentage of time spent in sensing could be less if sensors that powered up more rapidly were used. When we discussed query optimization in TinyDB in Section 4, we saw a range of sensors with varying costs that would alter the percentages shown here.

B. QUERY LANGUAGE

This appendix provides a complete specification of the syntax of the TinyDB query language, as well as pointers to the parts of the text where these constructs are defined. We use {} to denote a set, [] to denote optional clauses, <> to denote an expression, and italicized text to denote user-specified tokens such as aggregate names, commands, and arithmetic operators. The separator "|" indicates that one or the other of the surrounding tokens may appear, but not both. Ellipses ("...") indicate a repeating set of tokens, such as fields in the SELECT clause or tables in the FROM clause.

B.1 Query Syntax

The syntax of queries in the TinyDB query language is as follows:

  [ON [ALIGNED] EVENT event-type[{paramlist}]
      [boolop event-type{paramlist} ... ]]
  SELECT [NO INTERLEAVE] <expr> | agg(<expr>) | temporal agg(<expr>), ...
  FROM [sensors | storage-point], ...
  [WHERE {<pred>}]
  [GROUP BY {<expr>}]
  [HAVING {<pred>}]
  [OUTPUT ACTION [ command |
                   SIGNAL event({paramlist}) |
                   (SELECT ... ) ] |
   [INTO STORAGE POINT bufname]]
  [SAMPLE PERIOD seconds
      [[FOR n rounds] |
       [STOP ON event-type [WHERE <pred>]]]
      [COMBINE { agg(<expr>)}]
      [INTERPOLATE LINEAR]] |
   [ONCE] |
   [LIFETIME seconds [MIN SAMPLE RATE seconds]]

Table VIII.
References to Sections in the Main Text Where Query Language Constructs Are Introduced

Language Construct        Section
ON EVENT                  Section 3.5
SELECT-FROM-WHERE         Section 3
GROUP BY, HAVING          Section 3.3.1
OUTPUT ACTION             Section 3.7
SIGNAL <event>            Section 3.5
INTO STORAGE POINT        Section 3.2
SAMPLE PERIOD             Section 3
FOR                       Section 3.2
STOP ON                   Section 3.5
COMBINE                   Section 3.2
ONCE                      Section 3.7
LIFETIME                  Section 3.6

Each of these constructs is described in more detail in the sections shown in Table VIII.

B.2 Storage Point Creation and Deletion Syntax

The syntax for storage point creation is

  CREATE [CIRCULAR] STORAGE POINT name
      SIZE [ntuples | nseconds]
      [(fieldname type [, ... , fieldname type])] |
      [AS SELECT ... ]
      [SAMPLE PERIOD nseconds]

and for deletion is

  DROP STORAGE POINT name

Both of these constructs are described in Section 3.2.

REFERENCES

ALONSO, R. AND GANGULY, S. 1993. Query optimization in mobile environments. In Proceedings of the Workshop on Foundations of Models and Languages for Data and Objects. 1–17.
ALONSO, R. AND KORTH, H. F. 1993. Database system issues in nomadic computing. In Proceedings of ACM SIGMOD (Washington, DC).
AVNUR, R. AND HELLERSTEIN, J. M. 2000. Eddies: Continuously adaptive query processing. In Proceedings of ACM SIGMOD (Dallas, TX). 261–272.
BANCILHON, F., BRIGGS, T., KHOSHAFIAN, S., AND VALDURIEZ, P. 1987. FAD, a powerful and simple database language. In Proceedings of VLDB.
BONNET, P., GEHRKE, J., AND SESHADRI, P. 2001. Towards sensor database systems. In Proceedings of the Conference on Mobile Data Management.
BROOKE, T. AND BURRELL, J. 2003. From ethnography to design in a vineyard. In Proceedings of the Design User Experiences (DUX) Conference. Case study.
CARNEY, D., CETINTEMEL, U., CHERNIACK, M., CONVEY, C., LEE, S., SEIDMAN, G., STONEBRAKER, M., TATBUL, N., AND ZDONIK, S. 2002. Monitoring streams: A new class of data management applications. In Proceedings of VLDB.
CERPA, A., ELSON, J., ESTRIN, D., GIROD, L., HAMILTON, M., AND ZHAO, J. 2001. Habitat monitoring: Application driver for wireless communications technology. In Proceedings of the ACM SIGCOMM Workshop on Data Communications in Latin America and the Caribbean.
CHAKRABARTI, K., GAROFALAKIS, M., RASTOGI, R., AND SHIM, K. 2001. Approximate query processing using wavelets. VLDB J. 10, 2-3 (Sep.), 199–223.
CHAKRAVARTHY, S., KRISHNAPRASAD, V., ANWAR, E., AND KIM, S. K. 1994. Composite events for active databases: Semantics, contexts and detection. In Proceedings of VLDB.
CHANDRASEKARAN, S., COOPER, O., DESHPANDE, A., FRANKLIN, M. J., HELLERSTEIN, J. M., HONG, W., KRISHNAMURTHY, S., MADDEN, S. R., RAMAN, V., REISS, F., AND SHAH, M. A. 2003. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proceedings of the First Annual Conference on Innovative Database Research (CIDR).
CHEN, J., DEWITT, D., TIAN, F., AND WANG, Y. 2000. NiagaraCQ: A scalable continuous query system for internet databases. In Proceedings of ACM SIGMOD.
CHEN, Z., GEHRKE, J., AND KORN, F. 2001. Query optimization in compressed database systems. In Proceedings of ACM SIGMOD.
CRESPO, A. AND GARCIA-MOLINA, H. 2002. Routing indices for peer-to-peer systems. In Proceedings of ICDCS.
DELIN, K. A. AND JACKSON, S. P. 2000. Sensor web for in situ exploration of gaseous biosignatures. In Proceedings of the IEEE Aerospace Conference.
DEWITT, D. J., GHANDEHARIZADEH, S., SCHNEIDER, D. A., BRICKER, A., HSIAO, H. I., AND RASMUSSEN, R. 1990. The Gamma database machine project. IEEE Trans. Knowl. Data Eng. 2, 1, 44–62.
GANERIWAL, S., KUMAR, R., ADLAKHA, S., AND SRIVASTAVA, M. 2003. Timing-sync protocol for sensor networks. In Proceedings of ACM SenSys.
GAROFALAKIS, M. AND GIBBONS, P. 2001. Approximate query processing: Taming the terabytes! (tutorial). In Proceedings of VLDB.
GAY, D., LEVIS, P., VON BEHREN, R., WELSH, M., BREWER, E., AND CULLER, D.
2003. The nesC language: A holistic approach to network embedded systems. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI).
GEHRKE, J., KORN, F., AND SRIVASTAVA, D. 2001. On computing correlated aggregates over continual data streams. In Proceedings of the ACM SIGMOD Conference on Management of Data (Santa Barbara, CA).
HANSON, E. N. 1996. The design and implementation of the Ariel active database rule system. IEEE Trans. Knowl. Data Eng. 8, 1 (Feb.), 157–172.
HELLERSTEIN, J., HONG, W., MADDEN, S., AND STANEK, K. 2003. Beyond average: Towards sophisticated sensing with queries. In Proceedings of the First Workshop on Information Processing in Sensor Networks (IPSN).
HELLERSTEIN, J. M. 1998. Optimization techniques for queries with expensive methods. ACM Trans. Database Syst. 23, 2, 113–157.
HELLERSTEIN, J. M., FRANKLIN, M. J., CHANDRASEKARAN, S., DESHPANDE, A., HILDRUM, K., MADDEN, S., RAMAN, V., AND SHAH, M. 2000. Adaptive query processing: Technology in evolution. IEEE Data Eng. Bull. 23, 2, 7–18.
HILL, J., SZEWCZYK, R., WOO, A., HOLLAR, S., CULLER, D., AND PISTER, K. 2000. System architecture directions for networked sensors. In Proceedings of ASPLOS.
IBARAKI, T. AND KAMEDA, T. 1984. On the optimal nesting order for computing n-relational joins. ACM Trans. Database Syst. 9, 3, 482–502.
IMIELINSKI, T. AND BADRINATH, B. 1992. Querying in highly mobile distributed environments. In Proceedings of VLDB (Vancouver, B.C., Canada).
INTANAGONWIWAT, C., GOVINDAN, R., AND ESTRIN, D. 2000. Directed diffusion: A scalable and robust communication paradigm for sensor networks. In Proceedings of MobiCOM (Boston, MA).
INTERSEMA. 2002. MS5534A barometer module. Tech. rep. (Oct.). Go online to http://www.intersema.com/pro/module/file/da5534.pdf.
IVES, Z. G., FLORESCU, D., FRIEDMAN, M., LEVY, A., AND WELD, D. S. 1999. An adaptive query execution system for data integration. In Proceedings of ACM SIGMOD.
KOSSMAN, D. 2000.
The state of the art in distributed query processing. ACM Comput. Surv. 32, 4 (Dec.), 422–469.
KRISHNAMURTHY, R., BORAL, H., AND ZANIOLO, C. 1986. Optimization of nonrecursive queries. In Proceedings of VLDB. 128–137.
LEOPOLD, M., DYDENSBORG, M., AND BONNET, P. 2003. Bluetooth and sensor networks: A reality check. In Proceedings of the ACM Conference on Sensor Networks (SenSys).
LIN, C., FEDERSPIEL, C., AND AUSLANDER, D. 2002. Multi-sensor single actuator control of HVAC systems. In Proceedings of the International Conference for Enhanced Building Operations (Austin, TX, Oct. 14–18).
LIU, L., PU, C., AND TANG, W. 1999. Continual queries for internet-scale event-driven information delivery. IEEE Trans. Knowl. Data Eng. (Special Issue on Web Technology) 11, 4 (July), 610–628.
MADDEN, S. 2003. The design and evaluation of a query processing architecture for sensor networks. Ph.D. dissertation. University of California, Berkeley, Berkeley, CA.
MADDEN, S. AND FRANKLIN, M. J. 2002. Fjording the stream: An architecture for queries over streaming sensor data. In Proceedings of ICDE.
MADDEN, S., FRANKLIN, M. J., HELLERSTEIN, J. M., AND HONG, W. 2002a. TAG: A Tiny AGgregation service for ad-hoc sensor networks. In Proceedings of OSDI.
MADDEN, S., HONG, W., FRANKLIN, M., AND HELLERSTEIN, J. M. 2003. TinyDB Web page. Go online to http://telegraph.cs.berkeley.edu/tinydb.
MADDEN, S., SHAH, M. A., HELLERSTEIN, J. M., AND RAMAN, V. 2002b. Continuously adaptive continuous queries over data streams. In Proceedings of ACM SIGMOD (Madison, WI).
MAINWARING, A., POLASTRE, J., SZEWCZYK, R., AND CULLER, D. 2002. Wireless sensor networks for habitat monitoring. In Proceedings of the ACM Workshop on Sensor Networks and Applications.
MELEXIS, INC. 2002. MLX90601 infrared thermopile module. Tech. rep. (Aug.). Go online to http://www.melexis.com/prodfiles/mlx90601.pdf.
MONMA, C. L. AND SIDNEY, J. 1979.
Sequencing with series-parallel precedence constraints. Math. Oper. Res. 4, 215–224.
MOTWANI, R., WIDOM, J., ARASU, A., BABCOCK, B., BABU, S., DATAR, M., OLSTON, C., ROSENSTEIN, J., AND VARMA, R. 2003. Query processing, approximation and resource management in a data stream management system. In Proceedings of the First Annual Conference on Innovative Database Research (CIDR).
OLSTON, C. AND WIDOM, J. 2002. Best-effort cache synchronization with source cooperation. In Proceedings of SIGMOD.
PIRAHESH, H., HELLERSTEIN, J. M., AND HASAN, W. 1992. Extensible/rule based query rewrite optimization in Starburst. In Proceedings of ACM SIGMOD. 39–48.
POTTIE, G. AND KAISER, W. 2000. Wireless integrated network sensors. Commun. ACM 43, 5 (May), 51–58.
PRIYANTHA, N. B., CHAKRABORTY, A., AND BALAKRISHNAN, H. 2000. The Cricket location-support system. In Proceedings of MOBICOM.
RAMAN, V., RAMAN, B., AND HELLERSTEIN, J. M. 2002. Online dynamic reordering. VLDB J. 9, 3.
SENSIRION. 2002. SHT11/15 relative humidity sensor. Tech. rep. (June). Go online to http://www.sensirion.com/en/pdf/Datasheet_SHT1x_SHT7x_0206.pdf.
SHATDAL, A. AND NAUGHTON, J. 1995. Adaptive parallel aggregation algorithms. In Proceedings of ACM SIGMOD.
STONEBRAKER, M. AND KEMNITZ, G. 1991. The POSTGRES next-generation database management system. Commun. ACM 34, 10, 78–92.
SUDARSHAN, S. AND RAMAKRISHNAN, R. 1991. Aggregation and relevance in deductive databases. In Proceedings of VLDB. 501–511.
TAOS, INC. 2002. TSL2550 ambient light sensor. Tech. rep. (Sep.). Go online to http://www.taosinc.com/images/product/document/tsl2550.pdf.
UC BERKELEY. 2001. Smart buildings admit their faults. Web page. Lab notes: Research from the College of Engineering, UC Berkeley. Go online to http://coe.berkeley.edu/labnotes/1101.smartbuildings.html.
URHAN, T., FRANKLIN, M. J., AND AMSALEG, L. 1998. Cost-based query scrambling for initial delays. In Proceedings of ACM SIGMOD.
WOLFSON, O., SISTLA, A.
P., XU, B., ZHOU, J., AND CHAMBERLAIN, S. 1999. DOMINO: Databases fOr MovINg Objects tracking. In Proceedings of ACM SIGMOD (Philadelphia, PA).
WOO, A. AND CULLER, D. 2001. A transmission control scheme for media access in sensor networks. In Proceedings of ACM MobiCom.
YAO, Y. AND GEHRKE, J. 2002. The Cougar approach to in-network query processing in sensor networks. SIGMOD Rec. 31, 3 (Sept.), 9–18.

Received October 2003; revised June 2004; accepted September 2004

Data Exchange: Getting to the Core

RONALD FAGIN, PHOKION G. KOLAITIS, and LUCIAN POPA
IBM Almaden Research Center

Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema that reflects the source data as accurately as possible. Given a source instance, there may be many solutions to the data exchange problem, that is, many target instances that satisfy the constraints of the data exchange problem. In an earlier article, we identified a special class of solutions that we call universal. A universal solution has homomorphisms into every possible solution, and hence is a "most general possible" solution. Nonetheless, given a source instance, there may be many universal solutions. This naturally raises the question of whether there is a "best" universal solution, and hence a best solution for data exchange. We answer this question by considering the well-known notion of the core of a structure, a notion that was first studied in graph theory and has also played a role in conjunctive-query processing. The core of a structure is the smallest substructure that is also a homomorphic image of the structure.
All universal solutions have the same core (up to isomorphism); we show that this core is also a universal solution, and hence the smallest universal solution. The uniqueness of the core of a universal solution together with its minimality make the core an ideal solution for data exchange. We investigate the computational complexity of producing the core. Well-known results by Chandra and Merlin imply that, unless P = NP, there is no polynomial-time algorithm that, given a structure as input, returns the core of that structure as output. In contrast, in the context of data exchange, we identify natural and fairly broad conditions under which there are polynomial-time algorithms for computing the core of a universal solution. We also analyze the computational complexity of the following decision problem that underlies the computation of cores: given two graphs G and H, is H the core of G? Earlier results imply that this problem is both NP-hard and coNP-hard. Here, we pinpoint its exact complexity by establishing that it is a DP-complete problem. Finally, we show that the core is the best among all universal solutions for answering existential queries, and we propose an alternative semantics for answering queries in data exchange settings.

Categories and Subject Descriptors: H.2.5 [Heterogeneous Databases]: Data Translation; H.2.4 [Systems]: Relational Databases; H.2.4 [Systems]: Query Processing

General Terms: Algorithms, Theory

Additional Key Words and Phrases: Certain answers, conjunctive queries, core, universal solutions, dependencies, chase, data exchange, data integration, computational complexity, query answering

P. G. Kolaitis is on leave from the University of California, Santa Cruz, Santa Cruz, CA; he is partially supported by NSF Grant IIS-9907419. A preliminary version of this article appeared on pages 90–101 of Proceedings of the ACM Symposium on Principles of Database Systems (San Diego, CA).
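The graph-theoretic notion of a core used in the abstract can be illustrated with a brute-force sketch: the core of a graph G is a smallest induced subgraph onto which G retracts (maps homomorphically while fixing the subgraph's own vertices). The exponential search below is an illustration for tiny graphs only, consistent with the abstract's point that computing cores is intractable in general; the function names are ours, not the paper's.

```python
# Brute-force sketch of the "core" of an undirected graph: the smallest
# induced subgraph that is a homomorphic image of the whole graph.
# Exponential time; for tiny illustrative graphs only.

from itertools import combinations, product

def has_retraction(nodes, edges, sub):
    """Is there a homomorphism from G onto G[sub] fixing every node of sub?"""
    sub = list(sub)
    free = [v for v in nodes if v not in sub]
    # Undirected adjacency inside the candidate subgraph:
    sub_edges = {(u, v) for (u, v) in edges if u in sub and v in sub}
    sub_edges |= {(v, u) for (u, v) in sub_edges}
    for choice in product(sub, repeat=len(free)):
        h = {v: v for v in sub}          # identity on the subgraph...
        h.update(zip(free, choice))      # ...plus a guess for the rest
        if all((h[u], h[v]) in sub_edges for (u, v) in edges):
            return True                  # every edge maps to an edge
    return False

def core(nodes, edges):
    """Smallest vertex set whose induced subgraph G retracts onto."""
    for k in range(1, len(nodes) + 1):
        for sub in combinations(nodes, k):
            if has_retraction(nodes, edges, sub):
                return set(sub)
    return set(nodes)

# The path a-b-c is bipartite with an edge, so its core is a single edge;
# a triangle, by contrast, is its own core.
path_core = core(["a", "b", "c"], [("a", "b"), ("b", "c")])
triangle_core = core(["x", "y", "z"], [("x", "y"), ("y", "z"), ("x", "z")])
```

The two examples show both behaviors the abstract alludes to: graphs that collapse to a much smaller core, and "rigid" graphs (like odd cycles) that are already their own cores.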
Authors' addresses: Foundation of Computer Science, IBM Almaden Research Center, Department K53/B2, 650 Harry Road, San Jose, CA 95120; email: {fagin,kolaitis,lucian}@almaden.ibm.com.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2005 ACM 0362-5915/05/0300-0174 $5.00
ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 174–210.

1. INTRODUCTION AND SUMMARY OF RESULTS

1.1 The Data Exchange Problem

Data exchange is the problem of materializing an instance that adheres to a target schema, given an instance of a source schema and a specification of the relationship between the source schema and the target schema. This problem arises in many tasks requiring data to be transferred between independent applications that do not necessarily adhere to the same data format (or schema). The importance of data exchange was recognized a long time ago; in fact, an early data exchange system was EXPRESS [Shu et al. 1977] from the 1970s, whose main functionality was to convert data between hierarchical schemas. The need for data exchange has steadily increased over the years and, actually, has become more pronounced in recent years, with the proliferation of Web data in various formats and with the emergence of e-business applications that need to communicate data yet remain autonomous. The data exchange problem is related to the data integration problem in the sense that both problems are concerned with management of data stored in heterogeneous formats.
The two problems, however, are different for the following reasons. In data exchange, the main focus is on actually materializing a target instance that reflects the source data as accurately as possible; this can be a serious challenge, due to the inherent underspecification of the relationship between the source and the target. In contrast, a target instance need not be materialized in data integration; the main focus there is on answering queries posed over the target schema using views that express the relationship between the target and source schemas. In a previous paper [Fagin et al. 2003], we formalized the data exchange problem and embarked on an in-depth investigation of the foundational and algorithmic issues that surround it. Our work has been motivated by practical considerations arising in the development of Clio [Miller et al. 2000; Popa et al. 2002] at the IBM Almaden Research Center. Clio is a prototype system for schema mapping and data exchange between autonomous applications. A data exchange setting is a quadruple (S, T, Σst, Σt), where S is the source schema, T is the target schema, Σst is a set of source-to-target dependencies that express the relationship between S and T, and Σt is a set of dependencies that express constraints on T. Such a setting gives rise to the following data exchange problem: given an instance I over the source schema S, find an instance J over the target schema T such that I together with J satisfy the source-to-target dependencies Σst, and J satisfies the target dependencies Σt. Such an instance J is called a solution for I in the data exchange setting. In general, many different solutions for an instance I may exist. Thus, the question is: which solution should one choose to materialize, so that it reflects the source data as accurately as possible? Moreover, can such a solution be efficiently computed? In Fagin et al.
[2003], we investigated these issues for data exchange settings in which S and T are relational schemas, Σst is a set of tuple-generating dependencies (tgds) between S and T, and Σt is a set of tgds and equality-generating dependencies (egds) on T. We isolated a class of solutions, called universal solutions, possessing good properties that justify selecting them as the semantics of the data exchange problem. Specifically, universal solutions have homomorphisms into every possible solution; in particular, they have homomorphisms into each other, and thus are homomorphically equivalent. Universal solutions are the most general among all solutions and, in a precise sense, they represent the entire space of solutions. Moreover, as we shall explain shortly, universal solutions can be used to compute the "certain answers" of queries q that are unions of conjunctive queries over the target schema. The set certain(q, I) of certain answers of a query q over the target schema, with respect to a source instance I, consists of all tuples that are in the intersection of all q(J)'s, as J varies over all solutions for I (here, q(J) denotes the result of evaluating q on J). The notion of the certain answers originated in the context of incomplete databases (see van der Meyden [1998] for a survey). Moreover, the certain answers have been used for query answering in data integration [Lenzerini 2002]. In the same data integration context, Abiteboul and Duschka [1998] studied the complexity of computing the certain answers. We showed [Fagin et al. 2003] that the certain answers of unions of conjunctive queries can be obtained by simply evaluating these queries on some arbitrarily chosen universal solution. We also showed that, under fairly general, yet practical, conditions, a universal solution exists whenever a solution exists.
Furthermore, we showed that when these conditions are satisfied, there is a polynomial-time algorithm for computing a canonical universal solution; this algorithm is based on the classical chase procedure [Beeri and Vardi 1984; Maier et al. 1979].

1.2 Data Exchange with Cores

Even though they are homomorphically equivalent to each other, universal solutions need not be unique. In other words, in a data exchange setting, there may be many universal solutions for a given source instance I. Thus, it is natural to ask: what makes a universal solution "better" than another universal solution? Is there a "best" universal solution and, of course, what does "best" really mean? If there is a "best" universal solution, can it be efficiently computed? The present article addresses these questions and offers answers that are based on using minimality as a key criterion for what constitutes the "best" universal solution. Although universal solutions come in different sizes, they all share a unique (up to isomorphism) common "part," which is nothing else but the core of each of them, when they are viewed as relational structures. By definition, the core of a structure is the smallest substructure that is also a homomorphic image of the structure. The concept of the core originated in graph theory, where a number of results about its properties have been established (see, for instance, Hell and Nešetřil [1992]). Moreover, in the early days of database theory, Chandra and Merlin [1977] realized that the core of a structure is useful in conjunctive-query processing. Indeed, since evaluating joins is the most expensive among the basic relational algebra operations, one of the most fundamental problems in query processing is the join-minimization problem: given a conjunctive query q, find an equivalent conjunctive query involving the smallest possible number of joins. In turn, this problem amounts to computing
the core of the relational instance Dq that is obtained from q by putting a fact into Dq for each conjunct of q (see Abiteboul et al. [1995]; Chandra and Merlin [1977]; Kanellakis [1990]). Consider a data exchange setting (S, T, Σst, Σt) in which Σst is a set of source-to-target tgds and Σt is a set of target tgds and target egds. Since all universal solutions for a source instance I are homomorphically equivalent, it is easy to see that their cores are isomorphic. Moreover, we show in this article that the core of a universal solution for I is itself a solution for I. It follows that the core of the universal solutions for I is the smallest universal solution for I, and thus an ideal candidate for the "best" universal solution, at least in terms of the space required to materialize it. After this, we address the issue of how hard it is to compute the core of a universal solution. Chandra and Merlin [1977] showed that join minimization is an NP-hard problem by pointing out that a graph G is 3-colorable if and only if the 3-element clique K3 is the core of the disjoint sum G ⊕ K3 of G with K3. From this, it follows that, unless P = NP, there is no polynomial-time algorithm that, given a structure as input, outputs its core. At first sight, this result casts doubts on the tractability of computing the core of a universal solution. For data exchange, however, we give natural and fairly broad conditions under which there are polynomial-time algorithms for computing the cores of universal solutions. Specifically, we show that there are polynomial-time algorithms for computing the core of universal solutions in data exchange settings in which Σst is a set of source-to-target tgds and Σt is a set of target egds. It remains an open problem to determine whether this result can be extended to data exchange settings in which the target constraints Σt consist of both egds and tgds.
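The Chandra–Merlin reduction above rests on the fact that a graph G is 3-colorable exactly when there is a homomorphism from G to K3. The following brute-force sketch (ours, not the article's; exhaustive search, for illustration only) tests for such a homomorphism:

```python
from itertools import product

def homomorphism_exists(edges_g, nodes_g, edges_h, nodes_h):
    """Brute-force search for a graph homomorphism from G to H."""
    for mapping in product(nodes_h, repeat=len(nodes_g)):
        h = dict(zip(nodes_g, mapping))
        if all((h[u], h[v]) in edges_h for (u, v) in edges_g):
            return True
    return False

# K3: the 3-element clique (edges in both directions, for an undirected graph)
k3_nodes = [0, 1, 2]
k3_edges = {(a, b) for a in k3_nodes for b in k3_nodes if a != b}

# A 5-cycle is 3-colorable, so it maps homomorphically into K3 ...
c5_nodes = list(range(5))
c5_edges = {(i, (i + 1) % 5) for i in c5_nodes} | {((i + 1) % 5, i) for i in c5_nodes}
print(homomorphism_exists(c5_edges, c5_nodes, k3_edges, k3_nodes))  # True

# ... while K4 is not 3-colorable, so K3 is not the core of K4 + K3.
k4_nodes = [0, 1, 2, 3]
k4_edges = {(a, b) for a in k4_nodes for b in k4_nodes if a != b}
print(homomorphism_exists(k4_edges, k4_nodes, k3_edges, k3_nodes))  # False
```

Since the 5-cycle maps into K3, K3 is the core of the disjoint sum of the 5-cycle with K3; K4 does not, so for G = K4 the core of G ⊕ K3 is larger than K3.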
We also analyze the computational complexity of the following decision problem, called CORE IDENTIFICATION, which underlies the computation of cores: given two graphs G and H, is H the core of G? As seen above, the results by Chandra and Merlin [1977] imply that this problem is NP-hard. Later on, Hell and Nešetřil [1992] showed that deciding whether a graph G is its own core is a coNP-complete problem; in turn, this implies that CORE IDENTIFICATION is a coNP-hard problem. Here, we pinpoint the exact computational complexity of CORE IDENTIFICATION by showing that it is a DP-complete problem, where DP is the class of decision problems that can be written as the intersection of an NP-problem and a coNP-problem. In the last part of the article, we further justify the selection of the core as the "best" universal solution by establishing its usefulness in answering queries over the target schema T. An existential query q(x) is a formula of the form ∃y φ(x, y), where φ(x, y) is a quantifier-free formula.¹ Perhaps the most important examples of existential queries are the conjunctive queries with inequalities ≠. Another useful example of existential queries is the set-difference query, which asks whether there is a member of the set difference A − B. Let J0 be the core of all universal solutions for a source instance I. As discussed earlier, since J0 is itself a universal solution for I, the certain answers of conjunctive queries over T can be obtained by simply evaluating them on J0. In Fagin et al. [2003], however, it was shown that there are simple conjunctive queries with inequalities ≠ such that evaluating them on a universal solution always produces a proper superset of the set of certain answers for I.

¹ We shall also give a safety condition on φ.
Nonetheless, here we show that evaluating existential queries on the core J0 of the universal solutions yields the best approximation (that is, the smallest superset) of the set of the certain answers, among all universal solutions. Analogous to the definition of certain answers, let us define the certain answers on universal solutions of a query q over the target schema, with respect to a source instance I, to be the set of all tuples that are in the intersection of all q(J)'s, as J varies over all universal solutions for I; we write u-certain(q, I) to denote the certain answers of q on universal solutions for I. Since we consider universal solutions to be the preferred solutions to the data exchange problem, this suggests the naturalness of this notion of certain answers on universal solutions as an alternative semantics for query answering in data exchange settings. We show that if q is an existential query and J0 is the core of the universal solutions for I, then the set of those tuples in q(J0) whose entries are elements from the source instance I is equal to the set u-certain(q, I) of the certain answers of q on universal solutions. We also show that in the LAV setting (an important scenario in data integration) there is an interesting contrast between the complexity of computing certain answers and of computing certain answers on universal solutions. Specifically, Abiteboul and Duschka [1998] showed that there is a data exchange setting with Σt = ∅ and a conjunctive query with inequalities ≠ such that computing the certain answers of this query is a coNP-complete problem.
In contrast to this, we establish here that in an even more general data exchange setting (S, T, Σst, Σt) in which Σst is an arbitrary set of tgds and Σt is an arbitrary set of egds, for every existential query q (and in particular, for every conjunctive query q with inequalities ≠), there is a polynomial-time algorithm for computing the set u-certain(q, I) of the certain answers of q on universal solutions.

2. PRELIMINARIES

This section contains the main definitions related to data exchange and a minimum amount of background material. The presentation follows closely our earlier paper [Fagin et al. 2003].

2.1 The Data Exchange Problem

A schema is a finite sequence R = R1, . . . , Rk of relation symbols, each of a fixed arity. An instance I (over the schema R) is a sequence R1^I, . . . , Rk^I that associates each relation symbol Ri with a relation Ri^I of the same arity as Ri. We shall often abuse the notation and use Ri to denote both the relation symbol and the relation Ri^I that interprets it. We may refer to Ri^I as the Ri relation of I. Given a tuple t occurring in a relation R, we denote by R(t) the association between t and R, and call it a fact. An instance I can be identified with the set of all facts arising from the relations Ri^I of I. If R is a schema, then a dependency over R is a sentence in some logical formalism over R. Let S = S1, . . . , Sn and T = T1, . . . , Tm be two schemas with no relation symbols in common. We refer to S as the source schema and to the Si's as the source relation symbols. We refer to T as the target schema and to the Tj's as the target relation symbols. We denote by ⟨S, T⟩ the schema ⟨S1, . . . , Sn, T1, . . . , Tm⟩. Instances over S will be called source instances, while instances over T will be called target instances.
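As a small side illustration (a sketch of ours, not from the article), an instance can be represented as a mapping from relation symbols to finite relations, and then identified with its set of facts exactly as described above; the relation names are borrowed from the running example introduced later (Example 2.2):

```python
# A source instance over a schema with two binary relation symbols,
# represented as relation name -> set of tuples.
instance = {
    "EmpCity": {("Alice", "SJ"), ("Bob", "SD")},
    "LivesIn": {("Alice", "SF"), ("Bob", "LA")},
}

def facts(inst):
    """Identify an instance with the set of all its facts R(t)."""
    return {(rel, tup) for rel, tuples in inst.items() for tup in tuples}

for rel, tup in sorted(facts(instance)):
    print(f"{rel}{tup}")
```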
If I is a source instance and J is a target instance, then we write ⟨I, J⟩ for the instance K over the schema ⟨S, T⟩ such that Si^K = Si^I and Tj^K = Tj^J, when 1 ≤ i ≤ n and 1 ≤ j ≤ m. A source-to-target dependency is, in general, a dependency over ⟨S, T⟩ of the form ∀x(φS(x) → χT(x)), where φS(x) is a formula, with free variables x, of some logical formalism over S, and χT(x) is a formula, with free variables x, of some logical formalism over T (these two logical formalisms may be different). We use the notation x for a vector of variables x1, . . . , xk. We assume that all the variables in x appear free in φS(x). A target dependency is, in general, a dependency over the target schema T (the formalism used to express a target dependency may be different from those used for the source-to-target dependencies). The source schema may also have dependencies that we assume are satisfied by every source instance. While the source dependencies may play an important role in deriving source-to-target dependencies [Popa et al. 2002], they do not play any direct role in data exchange, because we take the source instance to be given.

Definition 2.1. A data exchange setting (S, T, Σst, Σt) consists of a source schema S, a target schema T, a set Σst of source-to-target dependencies, and a set Σt of target dependencies. The data exchange problem associated with this setting is the following: given a finite source instance I, find a finite target instance J such that ⟨I, J⟩ satisfies Σst and J satisfies Σt. Such a J is called a solution for I or, simply, a solution if the source instance I is understood from the context.

For most practical purposes, and for most of the results of this article (all results except for Proposition 2.7), each source-to-target dependency in Σst is a tuple-generating dependency (tgd) [Beeri and Vardi 1984] of the form ∀x(φS(x) → ∃y ψT(x, y)), where φS(x) is a conjunction of atomic formulas over S and ψT(x, y) is a conjunction of atomic formulas over T.
We assume that all the variables in x appear in φS(x). Moreover, each target dependency in Σt is either a tgd, of the form ∀x(φT(x) → ∃y ψT(x, y)), or an equality-generating dependency (egd) [Beeri and Vardi 1984], of the form ∀x(φT(x) → (x1 = x2)). In these dependencies, φT(x) and ψT(x, y) are conjunctions of atomic formulas over T, where all the variables in x appear in φT(x), and x1, x2 are among the variables in x. The tgds and egds together comprise Fagin's (embedded) implicational dependencies [Fagin 1982]. As in Fagin et al. [2003], we will drop the universal quantifiers in front of a dependency, and implicitly assume such quantification. However, we will write down all the existential quantifiers. Source-to-target tgds are a natural and powerful language for expressing the relationship between a source schema and a target schema. Such dependencies are automatically derived and used as representation of a schema mapping in the Clio system [Popa et al. 2002]. Furthermore, data exchange settings with tgds as source-to-target dependencies include as special cases both local-as-view (LAV) and global-as-view (GAV) data integration systems in which the views are sound and defined by conjunctive queries (see Lenzerini's tutorial [Lenzerini 2002] for a detailed discussion of LAV and GAV data integration systems and sound views). A LAV data integration system with sound views defined by conjunctive queries is a special case of a data exchange setting (S, T, Σst, Σt), in which S is the source schema (consisting of the views, in LAV terminology), T is the target schema (or global schema, in LAV terminology), the set Σt of target dependencies is empty, and each source-to-target tgd in Σst is of the form S(x) → ∃y ψT(x, y), where S is a single relation symbol of the source schema S (a view, in LAV terminology) and ψT is a conjunction of atomic formulas over the target schema T.
A GAV setting is similar, but the tgds in Σst are of the form φS(x) → T(x), where T is a single relation symbol over the target schema T (a view, in GAV terminology), and φS is a conjunction of atomic formulas over the source schema S. Since, in general, a source-to-target tgd relates a conjunctive query over the source schema to a conjunctive query over the target schema, a data exchange setting is strictly more expressive than LAV or GAV, and in fact it can be thought of as a GLAV (global-and-local-as-view) system [Friedman et al. 1999; Lenzerini 2002]. These similarities between data integration and data exchange notwithstanding, the main difference between the two is that in data exchange we have to actually materialize a finite target instance that best reflects the given source instance. In data integration no such exchange of data is required; the target can remain virtual. In general there may be multiple solutions for a given data exchange problem. The following example illustrates this issue and raises the question of which solution to choose to materialize.

Example 2.2. Consider a data exchange problem in which the source schema consists of two binary relation symbols as follows: EmpCity, associating employees with cities they work in, and LivesIn, associating employees with cities they live in. Assume that the target schema consists of three binary relation symbols as follows: Home, associating employees with their home cities, EmpDept, associating employees with departments, and DeptCity, associating departments with their cities. We assume that Σt = ∅. The source-to-target tgds and the source instance are as follows, where (d1), (d2), (d3), and (d4) are labels for convenient reference later:

Σst: (d1) EmpCity(e, c) → ∃H Home(e, H),
     (d2) EmpCity(e, c) → ∃D(EmpDept(e, D) ∧ DeptCity(D, c)),
     (d3) LivesIn(e, h) → Home(e, h),
     (d4) LivesIn(e, h) → ∃D∃C(EmpDept(e, D) ∧ DeptCity(D, C)),

I = {EmpCity(Alice, SJ), EmpCity(Bob, SD), LivesIn(Alice, SF), LivesIn(Bob, LA)}.

We shall use this example as a running example throughout this article. Since the tgds in Σst do not completely specify the target instance, there are multiple solutions that are consistent with the specification. One solution is

J0 = {Home(Alice, SF), Home(Bob, SD), EmpDept(Alice, D1), EmpDept(Bob, D2), DeptCity(D1, SJ), DeptCity(D2, SD)},

where D1 and D2 represent "unknown" values, that is, values that do not occur in the source instance. Such values are called labeled nulls and are to be distinguished from the values occurring in the source instance, which are called constants. Instances with constants and labeled nulls are not specific to data exchange. They have long been considered, in various forms, in the context of incomplete or indefinite databases (see van der Meyden [1998]) as well as in the context of data integration (see Halevy [2001]; Lenzerini [2002]). Intuitively, in the above instance, D1 and D2 are used to "give values" for the existentially quantified variable D of (d2), in order to satisfy (d2) for the two source tuples EmpCity(Alice, SJ) and EmpCity(Bob, SD). In contrast, two constants (SF and SD) are used to "give values" for the existentially quantified variable H of (d1), in order to satisfy (d1) for the same two source tuples. The following instances are solutions as well:

J = {Home(Alice, SF), Home(Bob, SD), Home(Alice, H1), Home(Bob, H2), EmpDept(Alice, D1), EmpDept(Bob, D2), DeptCity(D1, SJ), DeptCity(D2, SD)},

J0′ = {Home(Alice, SF), Home(Bob, SD), EmpDept(Alice, D), EmpDept(Bob, D), DeptCity(D, SJ), DeptCity(D, SD)}.

The instance J differs from J0 by having two extra Home tuples where the home cities of Alice and Bob are two nulls, H1 and H2, respectively.
The second instance J0′ differs from J0 by using the same null (namely D) to denote the "unknown" department of both Alice and Bob. Next, we review the notion of universal solutions, proposed in Fagin et al. [2003] as the most general solutions.

2.2 Universal Solutions

We denote by Const the set (possibly infinite) of all values that occur in source instances, and as before we call them constants. We also assume an infinite set Var of values, called labeled nulls, such that Var ∩ Const = ∅. We reserve the symbols I, I′, I1, I2, . . . for instances over the source schema S and with values in Const. We reserve the symbols J, J′, J1, J2, . . . for instances over the target schema T and with values in Const ∪ Var. Moreover, we require that solutions of a data exchange problem have their values drawn from Const ∪ Var. If R = R1, . . . , Rk is a schema and K is an instance over R with values in Const ∪ Var, then Const(K) denotes the set of all constants occurring in relations in K, and Var(K) denotes the set of labeled nulls occurring in relations in K.

Definition 2.3. Let K1 and K2 be two instances over R with values in Const ∪ Var.

1. A homomorphism h: K1 → K2 is a mapping from Const(K1) ∪ Var(K1) to Const(K2) ∪ Var(K2) such that (1) h(c) = c, for every c ∈ Const(K1); (2) for every fact Ri(t) of K1, we have that Ri(h(t)) is a fact of K2 (where, if t = (a1, . . . , as), then h(t) = (h(a1), . . . , h(as))).

2. K1 is homomorphically equivalent to K2 if there are homomorphisms h: K1 → K2 and h′: K2 → K1.

Definition 2.4 (Universal Solution). Consider a data exchange setting (S, T, Σst, Σt). If I is a source instance, then a universal solution for I is a solution J for I such that for every solution J′ for I, there exists a homomorphism h: J → J′.

Example 2.5. The instance J0′ in Example 2.2 is not universal.
In particular, there is no homomorphism from J0′ to J0. Hence, the solution J0′ contains "extra" information that was not required by the specification; in particular, J0′ "assumes" that the departments of Alice and Bob are the same. In contrast, it can easily be shown that J0 and J have homomorphisms to every solution (and to each other). Thus, J0 and J are universal solutions.

Universal solutions possess good properties that justify selecting them (as opposed to arbitrary solutions) for the semantics of the data exchange problem. A universal solution is more general than an arbitrary solution because, by definition, it can be homomorphically mapped into that solution. Universal solutions have, also by their definition, homomorphisms to each other and, thus, are homomorphically equivalent.

2.2.1 Computing Universal Solutions. In Fagin et al. [2003], we addressed the question of how to check the existence of a universal solution and how to compute one, if one exists. In particular, we identified fairly general, yet practical, conditions that guarantee that universal solutions exist whenever solutions exist. Moreover, we showed that there is a polynomial-time algorithm for computing a canonical universal solution, if a solution exists; this algorithm is based on the classical chase procedure. The following result summarizes these findings.

THEOREM 2.6 [FAGIN ET AL. 2003]. Assume a data exchange setting where Σst is a set of tgds, and Σt is the union of a weakly acyclic set of tgds with a set of egds.

(1) The existence of a solution can be checked in polynomial time.
(2) A universal solution exists if and only if a solution exists.
(3) If a solution exists, then a universal solution can be produced in polynomial time using the chase.

The notion of a weakly acyclic set of tgds first arose in a conversation between the third author and A. Deutsch in 2001.
It was then independently used in Deutsch and Tannen [2003] and in Fagin et al. [2003] (in the former article, under the term constraints with stratified-witness). This class guarantees the termination of the chase and is quite broad, as it includes both sets of full tgds [Beeri and Vardi 1984] and sets of acyclic inclusion dependencies [Cosmadakis and Kanellakis 1986]. We note that, when the set Σt of target constraints is empty, a universal solution always exists and a canonical one is constructible in polynomial time by chasing ⟨I, ∅⟩ with Σst. In Example 2.2, the instance J is such a canonical universal solution. If the set Σt of target constraints contains egds, then it is possible that no universal solution exists (and hence no solution exists, either, by the above theorem). This occurs (see Fagin et al. [2003]) when the chase fails by attempting to identify two constants while trying to apply some egd of Σt. If the chase does not fail, then the result of chasing ⟨I, ∅⟩ with Σst ∪ Σt is a canonical universal solution.

2.2.2 Certain Answers. In a data exchange setting, there may be many different solutions for a given source instance. Hence, given a source instance, the question arises as to what the result of answering queries over the target schema is. Following earlier work on information integration, in Fagin et al. [2003] we adopted the notion of the certain answers as the semantics of query answering in data exchange settings. As stated in Section 1, the set certain(q, I) of the certain answers of q with respect to a source instance I is the set of tuples that appear in q(J) for every solution J; in symbols,

certain(q, I) = ⋂{q(J) : J is a solution for I}.

Before stating the connection between the certain answers and universal solutions, let us recall the definitions of conjunctive queries (with inequalities) and unions of conjunctive queries (with inequalities).
A conjunctive query q(x) over a schema R is a formula of the form ∃y φ(x, y) where φ(x, y) is a conjunction of atomic formulas over R. If, in addition to atomic formulas, the conjunction φ(x, y) is allowed to contain inequalities of the form zi ≠ zj, where zi, zj are variables among x and y, we call q(x) a conjunctive query with inequalities. We also impose a safety condition, that every variable in x and y must appear in an atomic formula, not just in an inequality. A union of conjunctive queries (with inequalities) is a disjunction q(x) = q1(x) ∨ · · · ∨ qn(x) where q1(x), . . . , qn(x) are conjunctive queries (with inequalities). If J is an arbitrary solution, let us denote by q(J)↓ the set of all "null-free" tuples in q(J), that is, the set of all tuples in q(J) that are formed entirely of constants. The next proposition from Fagin et al. [2003] asserts that null-free evaluation of conjunctive queries on an arbitrarily chosen universal solution gives precisely the set of certain answers. Moreover, universal solutions are the only solutions that have this property.

PROPOSITION 2.7 [FAGIN ET AL. 2003]. Consider a data exchange setting with S as the source schema, T as the target schema, and such that the dependencies in the sets Σst and Σt are arbitrary.

(1) Let q be a union of conjunctive queries over the target schema T. If I is a source instance and J is a universal solution, then certain(q, I) = q(J)↓.
(2) Let I be a source instance and J be a solution such that, for every conjunctive query q over T, we have that certain(q, I) = q(J)↓. Then J is a universal solution.

3. DATA EXCHANGE WITH CORES

3.1 Multiple Universal Solutions

Even if we restrict attention to universal solutions instead of arbitrary solutions, there may still exist multiple, nonisomorphic universal solutions for a given instance of a data exchange problem.
Moreover, although these universal solutions are homomorphically equivalent to each other, they may have different sizes (where the size is the number of tuples). The following example illustrates this state of affairs.

Example 3.1. We again revisit our running example from Example 2.2. As we noted earlier, of the three target instances given there, two of them (namely, J0 and J) are universal solutions for I. These are nonisomorphic universal solutions (since they have different sizes). We now give an infinite family of nonisomorphic universal solutions, that we shall make use of later. For every m ≥ 0, let Jm be the target instance

Jm = {Home(Alice, SF), Home(Bob, SD), EmpDept(Alice, X0), EmpDept(Bob, Y0), DeptCity(X0, SJ), DeptCity(Y0, SD), . . . , EmpDept(Alice, Xm), EmpDept(Bob, Ym), DeptCity(Xm, SJ), DeptCity(Ym, SD)},

where X0, Y0, . . . , Xm, Ym are distinct labeled nulls. (In the case of m = 0, the resulting instance J0 is the same, modulo renaming of nulls, as the earlier J0 from Example 2.2. We take the liberty of using the same name, since the choice of nulls really does not matter.) It is easy to verify that each target instance Jm, for m ≥ 0, is a universal solution for I; thus, there are infinitely many nonisomorphic universal solutions for I. It is also easy to see that every universal solution must contain at least four tuples EmpDept(Alice, X), EmpDept(Bob, Y), DeptCity(X, SJ), and DeptCity(Y, SD), for some labeled nulls X and Y, as well as the tuples Home(Alice, SF) and Home(Bob, SD). Consequently, the instance J0 has the smallest size among all universal solutions for I and actually is the unique (up to isomorphism) universal solution of smallest size. Thus, J0 is a rather special universal solution and, from a size point of view, a preferred candidate to materialize in data exchange.
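To see Definition 2.3 at work on the instances of Example 3.1, the following brute-force sketch (ours, not the article's; it encodes constants as strings and labeled nulls as integers) searches for a homomorphism between two target instances, requiring every constant to be mapped to itself:

```python
from itertools import product

def is_null(v):
    # Convention of this sketch: labeled nulls are ints, constants are strings.
    return isinstance(v, int)

def homomorphism_exists(src, dst):
    """Brute-force: is there a homomorphism src -> dst fixing all constants?"""
    nulls = sorted({v for rel in src.values() for t in rel for v in t if is_null(v)})
    values = sorted({v for rel in dst.values() for t in rel for v in t}, key=str)
    for choice in product(values, repeat=len(nulls)):
        h = dict(zip(nulls, choice))
        img = lambda t: tuple(h[v] if is_null(v) else v for v in t)
        if all(img(t) in dst[r] for r, rel in src.items() for t in rel):
            return True
    return False

# J0 and J1 from Example 3.1 (nulls X0, Y0, X1, Y1 encoded as 0, 1, 2, 3)
J0 = {"Home": {("Alice", "SF"), ("Bob", "SD")},
      "EmpDept": {("Alice", 0), ("Bob", 1)},
      "DeptCity": {(0, "SJ"), (1, "SD")}}
J1 = {"Home": {("Alice", "SF"), ("Bob", "SD")},
      "EmpDept": {("Alice", 0), ("Bob", 1), ("Alice", 2), ("Bob", 3)},
      "DeptCity": {(0, "SJ"), (1, "SD"), (2, "SJ"), (3, "SD")}}

print(homomorphism_exists(J0, J1))  # True
print(homomorphism_exists(J1, J0))  # True
```

Both checks succeed: J0 and J1 are homomorphically equivalent despite having different sizes, which is precisely the situation that motivates singling out the smallest universal solution.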
Data Exchange: Getting to the Core • 185

Motivated by the preceding example, in the sequel we introduce and study the concept of the core of a universal solution. We show that the core of a universal solution is the unique (up to isomorphism) smallest universal solution. We then address the problem of computing the core and also investigate the use of cores in answering queries over the target schemas. The results that we will establish make a compelling case that cores are the preferred solutions to materialize in data exchange.

3.2 Cores and Universal Solutions

In addition to the notion of an instance over a schema (which we defined earlier), we find it convenient to define the closely related notion of a structure over a schema. The difference is that a structure is defined with a universe, whereas the universe of an instance is implicitly taken to be the "active domain," that is, the set of elements that appear in tuples of the instance. Furthermore, unlike target instances in data exchange settings, structures do not necessarily have distinguished elements ("constants") that have to be mapped onto themselves by homomorphisms. More formally, a structure A (over the schema R = ⟨R_1, . . . , R_k⟩) is a sequence ⟨A, R_1^A, . . . , R_k^A⟩, where A is a nonempty set, called the universe, and each R_i^A is a relation on A of the same arity as the relation symbol R_i. As with instances, we shall often abuse the notation and use R_i to denote both the relation symbol and the relation R_i^A that interprets it. We may refer to R_i^A as the R_i relation of A. If A is finite, then we say that the structure is finite. A structure B = ⟨B, R_1^B, . . . , R_k^B⟩ is a substructure of A if B ⊆ A and R_i^B ⊆ R_i^A, for 1 ≤ i ≤ k. We say that B is a proper substructure of A if it is a substructure of A and at least one of the containments R_i^B ⊆ R_i^A, for 1 ≤ i ≤ k, is a proper one. A structure B = ⟨B, R_1^B, . . . , R_k^B⟩ is an induced substructure of A if B ⊆ A and, for every 1 ≤ i ≤ k, we have that R_i^B = {(x_1, . . . , x_n) | R_i^A(x_1, . . . , x_n) and x_1, . . . , x_n are in B}.

Definition 3.2. A substructure C of structure A is called a core of A if there is a homomorphism from A to C, but there is no homomorphism from A to a proper substructure of C. A structure C is called a core if it is a core of itself, that is, if there is no homomorphism from C to a proper substructure of C.

Note that C is a core of A if and only if C is a core, C is a substructure of A, and there is a homomorphism from A to C. The concept of the core of a graph has been studied extensively in graph theory (see Hell and Nešetřil [1992]). The next proposition summarizes some basic facts about cores; a proof can be found in Hell and Nešetřil [1992].

PROPOSITION 3.3. The following statements hold:
—Every finite structure has a core; moreover, all cores of the same finite structure are isomorphic.
—Every finite structure is homomorphically equivalent to its core. Consequently, two finite structures are homomorphically equivalent if and only if their cores are isomorphic.
—If C is the core of a finite structure A, then there is a homomorphism h: A → C such that h(v) = v for every member v of the universe of C.
—If C is the core of a finite structure A, then C is an induced substructure of A.

In view of Proposition 3.3, if A is a finite structure, there is a unique (up to isomorphism) core of A, which we denote by core(A). We can similarly define the notions of a subinstance of an instance and of a core of an instance. We identify the instance with the corresponding structure, where the universe of the structure is taken to be the active domain of the instance, and where we distinguish the constants.
That is, we require that if h is a homomorphism and c is a constant, then h(c) = c (as already defined in Section 2.2). The results about cores of structures will then carry over to cores of instances.

Universal solutions for I are unique up to homomorphic equivalence, but as we saw in Example 3.1, they need not be unique up to isomorphism. Proposition 3.3, however, implies that their cores are isomorphic; in other words, all universal solutions for I have the same core up to isomorphism. Moreover, if J is a universal solution for I and core(J) is a solution for I, then core(J) is also a universal solution for I, since J and core(J) are homomorphically equivalent. In general, if the dependencies Σ_st and Σ_t are arbitrary, then the core of a solution to an instance of the data exchange problem need not be a solution. The next result shows, however, that this cannot happen if Σ_st is a set of tgds and Σ_t is a set of tgds and egds.

PROPOSITION 3.4. Let (S, T, Σ_st, Σ_t) be a data exchange setting in which Σ_st is a set of tgds and Σ_t is a set of tgds and egds. If I is a source instance and J is a solution for I, then core(J) is a solution for I. Consequently, if J is a universal solution for I, then core(J) is also a universal solution for I.

PROOF. Let φ_S(x) → ∃y ψ_T(x, y) be a tgd in Σ_st and a = (a_1, . . . , a_n) a tuple of constants such that I |= φ_S(a). Since J is a solution for I, there is a tuple b = (b_1, . . . , b_s) of elements of J such that ⟨I, J⟩ |= ψ_T(a, b). Let h be a homomorphism from J to core(J). Then h(a_i) = a_i, since each a_i is a constant, for 1 ≤ i ≤ n. Consequently, ⟨I, core(J)⟩ |= ψ_T(a, h(b)), where h(b) = (h(b_1), . . . , h(b_s)). Thus, ⟨I, core(J)⟩ satisfies the tgd.

Next, let φ_T(x) → ∃y ψ_T(x, y) be a tgd in Σ_t and a = (a_1, . . . , a_n) a tuple of elements in core(J) such that core(J) |= φ_T(a). Since core(J) is a subinstance of J, it follows that J |= φ_T(a), and since J is a solution, it follows that there is a tuple b = (b_1, . . . , b_s) of elements of J such that J |= ψ_T(a, b). According to the last part of Proposition 3.3, there is a homomorphism h from J to core(J) such that h(v) = v, for every v in core(J). In particular, h(a_i) = a_i, for 1 ≤ i ≤ n. It follows that core(J) |= ψ_T(a, h(b)), where h(b) = (h(b_1), . . . , h(b_s)). Thus, core(J) satisfies the tgd.

Finally, let φ_T(x) → (x_1 = x_2) be an egd in Σ_t. If a = (a_1, . . . , a_s) is a tuple of elements in core(J) such that core(J) |= φ_T(a), then J |= φ_T(a), because core(J) is a subinstance of J. Since J is a solution, it follows that a_1 = a_2. Thus, core(J) satisfies every egd in Σ_t.

COROLLARY 3.5. Let (S, T, Σ_st, Σ_t) be a data exchange setting in which Σ_st is a set of tgds and Σ_t is a set of tgds and egds. If I is a source instance for which a universal solution exists, then there is a unique (up to isomorphism) universal solution J_0 for I having the following properties:
—J_0 is a core and is isomorphic to the core of every universal solution J for I.
—If J is a universal solution for I, there is a one-to-one homomorphism h from J_0 to J. Hence, |J_0| ≤ |J|, where |J_0| and |J| are the sizes of J_0 and J.

We refer to J_0 as the core of the universal solutions for I. As an illustration of the concepts discussed in this subsection, recall the data exchange problem of Example 3.1. Then J_0 is indeed the core of the universal solutions for I.

The core of the universal solutions is the preferred universal solution to materialize in data exchange, since it is the unique most compact universal solution. In turn, this raises the question of how to compute cores of universal solutions. As mentioned earlier, universal solutions can be canonically computed by using the chase. However, the result of such a chase, while a universal solution, need not be the core.
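For toy graphs, the core can be computed by brute force over all endomorphisms, which also illustrates the Chandra–Merlin reduction discussed in Section 4 (G is 3-colorable if and only if core(G ⊕ K_3) = K_3). The following sketch is our own illustration and is exponential in the number of nodes, so it only works on very small inputs:

```python
# Sketch (our illustration): brute-force core computation for small undirected
# graphs. The core's node set is the smallest image over all endomorphisms.
from itertools import product

def is_hom(h, edges):
    # h preserves every (undirected) edge; a collapsed edge would need a
    # self-loop in the graph, so non-edges are rejected automatically.
    return all((h[u], h[v]) in edges or (h[v], h[u]) in edges
               for (u, v) in edges)

def core_nodes(nodes, edges):
    """Return the node set of a core: the smallest endomorphic image."""
    nodes = sorted(nodes)
    best = set(nodes)
    for vals in product(nodes, repeat=len(nodes)):
        h = dict(zip(nodes, vals))
        if is_hom(h, edges) and len(set(vals)) < len(best):
            best = set(vals)
    return best

k3 = {("a", "b"), ("b", "c"), ("a", "c")}   # the triangle K3
g  = {("x", "y")}                            # a single edge: 2-colorable
print(core_nodes({"a", "b", "c", "x", "y"}, k3 | g))   # the triangle's nodes
```

Here the disjoint sum of a 2-colorable graph with K_3 retracts onto K_3, so the core is the triangle; had g contained a clique K_4, no such retraction would exist and the core would be larger.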
In general, an algorithm other than the chase is needed for computing cores of universal solutions. In the next two sections, we study what it takes to compute cores. We begin by analyzing the complexity of computing cores of arbitrary instances and then focus on the computation of cores of universal solutions in data exchange.

4. COMPLEXITY OF CORE IDENTIFICATION

Chandra and Merlin [1977] were the first to realize that computing the core of a relational structure is an important problem in conjunctive query processing and optimization. Unfortunately, in its full generality this problem is intractable. Note that computing the core is a function problem, not a decision problem. One way to gauge the difficulty of a function problem is to analyze the computational complexity of its underlying decision problem.

Definition 4.1. CORE IDENTIFICATION is the following decision problem: given two structures A and B over some schema R such that B is a substructure of A, is core(A) = B?

It is easy to see that CORE IDENTIFICATION is an NP-hard problem. Indeed, consider the following polynomial-time reduction from 3-COLORABILITY: a graph G is 3-colorable if and only if core(G ⊕ K_3) = K_3, where K_3 is the complete graph with 3 nodes and ⊕ is the disjoint sum operation on graphs. This reduction was already given by Chandra and Merlin [1977]. Later on, Hell and Nešetřil [1992] studied the complexity of recognizing whether a graph is a core. In precise terms, CORE RECOGNITION is the following decision problem: given a structure A over some schema R, is A a core? Clearly, this problem is in coNP. Hell and Nešetřil's [1992] main result is that CORE RECOGNITION is a coNP-complete problem, even if the inputs are undirected graphs. This is established by exhibiting a rather sophisticated polynomial-time reduction from NON-3-COLORABILITY on graphs of girth at least 7; the "gadgets" used in this reduction are pairwise incomparable cores with certain additional properties.
It follows that CORE IDENTIFICATION is a coNP-hard problem. Nonetheless, it appears that the exact complexity of CORE IDENTIFICATION has not been pinpointed in the literature until now. In the sequel, we will establish that CORE IDENTIFICATION is a DP-complete problem.

We present first some background material about the complexity class DP. The class DP consists of all decision problems that can be written as the intersection of an NP-problem and a coNP-problem; equivalently, DP consists of all decision problems that can be written as the difference of two NP-problems. This class was introduced by Papadimitriou and Yannakakis [1982], who discovered several DP-complete problems. The prototypical DP-complete problem is SAT/UNSAT: given two Boolean formulas φ and ψ, is φ satisfiable and ψ unsatisfiable? Several problems that express some "critical" property turn out to be DP-complete (see Papadimitriou [1994]). For instance, CRITICAL SAT is DP-complete, where an instance of this problem is a CNF-formula φ and the question is to determine whether φ is unsatisfiable, but if any one of its clauses is removed, then the resulting formula is satisfiable. Moreover, Cosmadakis [1983] showed that certain problems related to database query evaluation are DP-complete. Note that DP contains both NP and coNP as subclasses; furthermore, each DP-complete problem is both NP-hard and coNP-hard. The prevailing belief in computational complexity is that the above containments are proper, but proving this remains an outstanding open problem. In any case, establishing that a certain problem is DP-complete is interpreted as signifying that this problem is intractable and, in fact, "more intractable" than an NP-complete problem.

Here, we establish that CORE IDENTIFICATION is a DP-complete problem by exhibiting a reduction from 3-COLORABILITY/NON-3-COLORABILITY on graphs of girth at least 7.
This reduction is directly inspired by the reduction of NON-3-COLORABILITY on graphs of girth at least 7 to CORE RECOGNITION, given in Hell and Nešetřil [1992].

THEOREM 4.2. CORE IDENTIFICATION is DP-complete, even if the inputs are undirected graphs.

In proving the above theorem, we make essential use of the following result, which is a special case of Theorem 6 in Hell and Nešetřil [1992]. Recall that the girth of a graph is the length of the shortest cycle in the graph.

THEOREM 4.3 (HELL AND NEŠETŘIL 1992). For each positive integer N, there is a sequence A_1, . . . , A_N of connected graphs such that
(1) each A_i is 3-colorable, has girth 5, and each edge of A_i is on a 5-cycle;
(2) each A_i is a core; moreover, for every i, j with i ≤ N, j ≤ N and i ≠ j, there is no homomorphism from A_i to A_j;
(3) each A_i has at most 15(N + 4) nodes; and
(4) there is a polynomial-time algorithm that, given N, constructs the sequence A_1, . . . , A_N.

We now have the machinery needed to prove Theorem 4.2.

PROOF OF THEOREM 4.2. CORE IDENTIFICATION is in DP, because, given two structures A and B over some schema R such that B is a substructure of A, to determine whether core(A) = B one has to check whether there is a homomorphism from A to B (which is in NP) and whether B is a core (which is in coNP). We will show that CORE IDENTIFICATION is DP-hard, even if the inputs are undirected graphs, via a polynomial-time reduction from 3-COLORABILITY/NON-3-COLORABILITY. As a stepping stone in this reduction, we will define CORE HOMOMORPHISM, which is the following variant of CORE IDENTIFICATION: given two structures A and B, is there a homomorphism from A to B, and is B a core? There is a simple polynomial-time reduction of CORE HOMOMORPHISM to CORE IDENTIFICATION, where the instance (A, B) is mapped onto (A ⊕ B, B).
This is a reduction, since there is a homomorphism from A to B with B as a core if and only if core(A ⊕ B) = B. Thus, it remains to show that there is a polynomial-time reduction of 3-COLORABILITY/NON-3-COLORABILITY to CORE HOMOMORPHISM.

Hell and Nešetřil [1992] showed that 3-COLORABILITY is NP-complete even if the input graphs have girth at least 7 (this follows from Theorem 7 in Hell and Nešetřil [1992] by taking A to be a self-loop and B to be K_3). Hence, 3-COLORABILITY/NON-3-COLORABILITY is DP-complete, even if the input graphs G and H have girth at least 7. So, assume that we are given two graphs G and H each having girth at least 7. Let v_1, . . . , v_m be an enumeration of the nodes of G, let w_1, . . . , w_n be an enumeration of the nodes of H, and let N = m + n. Let A_1, . . . , A_N be a sequence of connected graphs having the properties listed in Theorem 4.3. This sequence can be constructed in time polynomial in N; moreover, we can assume that these graphs have pairwise disjoint sets of nodes. Let G* be the graph obtained by identifying each node v_i of G with some arbitrarily chosen node of A_i, for 1 ≤ i ≤ m (and keeping the edges between nodes of G intact). Thus, the nodes of G* are the nodes that appear in the A_i's, and the edges are the edges in the A_i's, along with the edges of G under our identification. Similarly, let H* be the graph obtained by identifying each node w_j of H with some arbitrarily chosen node of A_j, for m + 1 ≤ j ≤ N = m + n (and keeping the edges between nodes of H intact). We now claim that G is 3-colorable and H is not 3-colorable if and only if there is a homomorphism from G* ⊕ K_3 to H* ⊕ K_3, and H* ⊕ K_3 is a core. Hell and Nešetřil [1992] showed that CORE RECOGNITION is coNP-complete by showing that a graph H of girth at least 7 is not 3-colorable if and only if the graph H* ⊕ K_3 is a core. We will use this property in order to establish the above claim.
Assume first that G is 3-colorable and H is not 3-colorable. Since each A_i is a 3-colorable graph, G* ⊕ K_3 is 3-colorable and so there is a homomorphism from G* ⊕ K_3 to H* ⊕ K_3 (in fact, to K_3). Moreover, as shown in Hell and Nešetřil [1992], H* ⊕ K_3 is a core, since H is not 3-colorable. For the other direction, assume that there is a homomorphism from G* ⊕ K_3 to H* ⊕ K_3, and H* ⊕ K_3 is a core. Using again the results in Hell and Nešetřil [1992], we infer that H is not 3-colorable. It remains to prove that G is 3-colorable. Let h be a homomorphism from G* ⊕ K_3 to H* ⊕ K_3. We claim that h actually maps G* to K_3; hence, G is 3-colorable. Let us consider the image of each graph A_i, with 1 ≤ i ≤ m, under the homomorphism h. Observe that A_i cannot be mapped to some A_j, when m + 1 ≤ j ≤ N = m + n, since, for every i and j such that 1 ≤ i ≤ m and m + 1 ≤ j ≤ N = m + n, there is no homomorphism from A_i to A_j. Observe also that the image of a cycle C under a homomorphism is a cycle C' of length less than or equal to the length of C. Since H has girth at least 7 and since each edge of A_i is on a 5-cycle, the image of A_i under h cannot be contained in H. For the same reason, the image of A_i under h cannot contain nodes from H and some A_j, for m + 1 ≤ j ≤ N = m + n; moreover, it cannot contain nodes from two different A_j's, for m + 1 ≤ j ≤ N = m + n (here, we also use the fact that each A_j has girth 5). Consequently, the homomorphism h must map each A_i, 1 ≤ i ≤ m, to K_3. Hence, h maps G* to K_3, and so G is 3-colorable.

It should be noted that problems equivalent to CORE RECOGNITION and CORE IDENTIFICATION have been investigated in logic programming and artificial intelligence.
Specifically, Gottlob and Fermüller [1993] studied the problem of removing redundant literals from a clause, and analyzed the computational complexity of two related decision problems: the problem of determining whether a given clause is condensed and the problem of determining whether, given two clauses, one is a condensation of the other. Gottlob and Fermüller showed that the first problem is coNP-complete and the second is DP-complete. As it turns out, determining whether a given clause is condensed is equivalent to CORE RECOGNITION, while determining whether a clause is a condensation of another clause is equivalent to CORE IDENTIFICATION. Thus, the complexity of CORE RECOGNITION and CORE IDENTIFICATION for relational structures (but not for undirected graphs) can also be derived from the results in Gottlob and Fermüller [1993]. As a matter of fact, the reductions in Gottlob and Fermüller [1993] give easier proofs for the coNP-hardness and DP-hardness of CORE RECOGNITION and CORE IDENTIFICATION, respectively, for undirected graphs with constants, that is, undirected graphs in which certain nodes are distinguished so that every homomorphism maps each such constant to itself (alternatively, graphs with constants can be viewed as relational structures with a binary relation for the edges and unary relations each of which consists of one of the constants). For instance, the coNP-hardness of CORE RECOGNITION for graphs with constants can be established via the following reduction from the CLIQUE problem. Given an undirected graph G and a positive integer k, consider the disjoint sum G ⊕ K_k, where K_k is the complete graph with k elements. If every node in G is viewed as a constant, then G ⊕ K_k is a core if and only if G does not contain a clique with k elements.

We now consider the implications of the intractability of CORE RECOGNITION for the problem of computing the core of a structure.
As stated earlier, Chandra and Merlin [1977] observed that a graph G is 3-colorable if and only if core(G ⊕ K_3) = K_3. It follows that, unless P = NP, there is no polynomial-time algorithm for computing the core of a given structure. Indeed, if such an algorithm existed, then we could determine in polynomial time whether a graph G is 3-colorable by first running the algorithm to compute the core of G ⊕ K_3 and then checking if the answer is equal to K_3. Note, however, that in data exchange we are interested in computing the core of a universal solution, rather than the core of an arbitrary instance. Consequently, we cannot assume a priori that the above intractability carries over to the data exchange setting, since polynomial-time algorithms for computing the core of universal solutions may exist. We address this next.

5. COMPUTING THE CORE IN DATA EXCHANGE

In contrast with the case of computing the core of an arbitrary instance, computing the core of a universal solution in data exchange does have polynomial-time algorithms, in certain natural data exchange settings. Specifically, in this section we show that the core of a universal solution can be computed in polynomial time in data exchange settings in which Σ_st is an arbitrary set of tgds and Σ_t is a set of egds.

We give two rather different polynomial-time algorithms for the task of computing the core in data exchange settings in which Σ_st is an arbitrary set of tgds and Σ_t is a set of egds: a greedy algorithm and an algorithm we call the blocks algorithm. Section 5.1 is devoted to the greedy algorithm. In Section 5.2 we present the blocks algorithm for data exchange settings with no target constraints (i.e., Σ_t = ∅). We then show in Section 5.3 that essentially the same blocks algorithm works if we remove the emptiness condition on Σ_t and allow it to contain egds.
Although the blocks algorithm is more complicated than the greedy algorithm (and its proof of correctness much more involved), it has certain advantages for data exchange that we will describe later on.

In what follows, we assume that (S, T, Σ_st, Σ_t) is a data exchange setting such that Σ_st is a set of tgds and Σ_t is a set of egds. Given a source instance I, we let J be the target instance obtained by chasing ⟨I, ∅⟩ with Σ_st. We call J a canonical preuniversal instance for I. Note that J is a canonical universal solution for I with respect to the data exchange setting (S, T, Σ_st, ∅) (that is, no target constraints).

5.1 Greedy Algorithm

Intuitively, given a source instance I, the greedy algorithm first determines whether solutions for I exist, and then, if solutions exist, computes the core of the universal solutions for I by successively removing tuples from a canonical universal solution for I, as long as I and the instance resulting in each step satisfy the tgds in Σ_st. Recall that a fact is an expression of the form R(t) indicating that the tuple t belongs to the relation R; moreover, every instance can be identified with the set of all facts arising from the relations of that instance.

Algorithm 5.1 (Greedy Algorithm).
Input: source instance I.
Output: the core of the universal solutions for I, if solutions exist; "failure," otherwise.
(1) Chase I with Σ_st to produce a canonical preuniversal instance J.
(2) Chase J with Σ_t; if the chase fails, then stop and return "failure"; otherwise, let J' be the canonical universal solution for I produced by the chase.
(3) Initialize J* to be J'.
(4) While there is a fact R(t) in J* such that ⟨I, J* − {R(t)}⟩ satisfies Σ_st, set J* to be J* − {R(t)}.
(5) Return J*.

THEOREM 5.2. Assume that (S, T, Σ_st, Σ_t) is a data exchange setting such that Σ_st is a set of tgds and Σ_t is a set of egds.
Then Algorithm 5.1 is a correct, polynomial-time algorithm for computing the core of the universal solutions.

PROOF. As shown in Fagin et al. [2003] (see also Theorem 2.6), the chase is a correct, polynomial-time algorithm for determining whether, given a source instance I, a solution exists and, if so, producing the canonical universal solution J'. Assume that for a given source instance I, a canonical universal solution J' for I has been produced in Step (2) of the greedy algorithm. We claim that each target instance J* produced during the iterations of the while loop in Step (4) is a universal solution for I. To begin with, ⟨I, J*⟩ satisfies the tgds in Σ_st by construction. Furthermore, J* satisfies the egds in Σ_t, because J* is a subinstance of J', and J' satisfies the egds in Σ_t. Consequently, J* is a solution for I; moreover, it is a universal solution, since it is a subinstance of the canonical universal solution J' for I and thus it can be mapped homomorphically into every solution for I. Let C be the target instance returned by the algorithm. Then C is a universal solution for I and hence it contains an isomorphic copy J_0 of the core of the universal solutions as a subinstance. We claim that C = J_0. Indeed, if there is a fact R(t) in C − J_0, then ⟨I, C − {R(t)}⟩ satisfies the tgds in Σ_st, since ⟨I, J_0⟩ satisfies the tgds in Σ_st and J_0 is a subinstance of C − {R(t)}; thus, the algorithm could not have returned C as output.

In order to analyze the running time of the algorithm, we consider the following parameters: m is the size of the source instance I (number of tuples in I); a is the maximum number of universally quantified variables over all tgds in Σ_st; b is the maximum number of existentially quantified variables over all tgds in Σ_st; finally, a' is the maximum number of universally quantified variables over all egds in Σ_t. Since the data exchange setting is fixed, the quantities a, b, and a' are constants.
Given a source instance I of size m, the size of the canonical preuniversal instance J is O(m^a) and the time needed to produce it is O(m^{a+ab}). Indeed, the canonical preuniversal instance is constructed by considering each tgd (∀x)(φ_S(x) → (∃y)ψ_T(x, y)) in Σ_st, instantiating the universally quantified variables x with elements from I in every possible way, and, for each such instantiation, checking whether the existentially quantified variables y can be instantiated by existing elements so that the formula ψ_T(x, y) is satisfied, and, if not, adding null values and facts to satisfy it. Since Σ_st is fixed, at most a constant number of facts are added at each step, which accounts for the O(m^a) bound on the size of the canonical preuniversal instance. There are O(m^a) possible instantiations of the universally quantified variables, and for each such instantiation O((m^a)^b) steps are needed to check whether the existentially quantified variables can be instantiated by existing elements; hence the total time required to construct the canonical preuniversal instance is O(m^{a+ab}).

The size of the canonical universal solution J' is also O(m^a) (since it is at most the size of J) and the time needed to produce J' from J is O(m^{aa'+2a}). Indeed, chasing with the egds in Σ_t requires at most O((m^a)^2) = O(m^{2a}) chase steps, since in the worst case every two values will be set equal to each other. Moreover, each chase step takes time O((m^a)^{a'}), since at each step we need to instantiate the universally quantified variables in the egds in every possible way. The while loop in Step (4) requires at most O(m^a) iterations, each of which takes O(m^{a+ab}) steps to verify that Σ_st is satisfied by ⟨I, J* − {R(t)}⟩. Thus, Step (4) takes time O(m^{2a+ab}). It follows that the running time of the greedy algorithm is O(m^{2a+ab} + m^{2a+aa'}).

Several remarks are in order now.
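On a toy setting, Algorithm 5.1 can be exercised end to end. The following Python sketch is our own illustration (the two source-to-target tgds, the naive chase firing order, and all names are our choices; Σ_t is taken to be empty, so the chase with egds in Step (2) is vacuous):

```python
# Sketch (our toy instantiation, not the paper's code) of Algorithm 5.1 with
# Σ_st = { E(x,y) → ∃z F(x,z),  E(x,y) → F(x,y) } and Σ_t = ∅ (no egds).
# Facts are pairs (relation_name, tuple); nulls are strings starting with "_".
from itertools import count

fresh = count()  # supply of fresh labeled nulls

def satisfies_st(I, J):
    """Does ⟨I, J⟩ satisfy both source-to-target tgds?"""
    F = {t for (r, t) in J if r == "F"}
    for (x, y) in I["E"]:
        if not any(u == x for (u, v) in F):   # E(x,y) → ∃z F(x,z)
            return False
        if (x, y) not in F:                   # E(x,y) → F(x,y)
            return False
    return True

def chase(I):
    """Naive chase with Σ_st; the existential tgd is fired first, so a
    labeled null is introduced even though the copy tgd later covers it."""
    J = set()
    for (x, y) in I["E"]:
        if not satisfies_st({"E": {(x, y)}}, J):
            J.add(("F", (x, f"_n{next(fresh)}")))   # fire existential tgd
        J.add(("F", (x, y)))                         # fire full-copy tgd
    return J

def greedy_core(I):
    """Steps (3)-(5): greedily drop facts while ⟨I, J*⟩ still satisfies Σ_st."""
    Jstar = chase(I)
    changed = True
    while changed:
        changed = False
        for fact in sorted(Jstar):
            if fact in Jstar and satisfies_st(I, Jstar - {fact}):
                Jstar = Jstar - {fact}
                changed = True
    return Jstar

I = {"E": {("a", "b")}}
print(greedy_core(I))   # {('F', ('a', 'b'))} — the fact with the null is dropped
```

The chase produces both F(a, _n0) and F(a, b); the greedy loop removes the redundant null-carrying fact because the remaining instance still satisfies both tgds, leaving exactly the core.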
First, it should be noted that the correctness of the greedy algorithm depends crucially on the assumption that Σ_t consists of egds only. The crucial property that holds for egds, but fails for tgds, is that if an instance satisfies an egd, then every subinstance of it also satisfies that egd. Thus, if the greedy algorithm is applied to data exchange settings in which Σ_t contains at least one tgd, then the output of the algorithm may fail to be a solution for the input instance. One can consider a variant of the greedy algorithm in which the test in the while loop is that ⟨I, J* − {R(t)}⟩ satisfies both Σ_st and Σ_t. This modified greedy algorithm outputs a universal solution for I, but it is not too hard to construct examples in which the output is not the core of the universal solutions for I. Note that Step (4) of the greedy algorithm can also be construed as a polynomial-time algorithm for producing the core of the universal solutions, given a source instance I and some arbitrary universal solution J for I. The first two steps of the greedy algorithm produce a universal solution for I in time polynomial in the size of the source instance I or determine that no solution for I exists, so that the entire greedy algorithm runs in time polynomial in the size of I.

Although the greedy algorithm is conceptually simple and its proof of correctness transparent, it requires that the source instance I be available throughout the execution of the algorithm. There are situations, however, in which the original source I becomes unavailable after a canonical universal solution J for I has been produced. In particular, the Clio system [Popa et al. 2002] uses a specialized engine to produce a canonical universal solution, when there are no target constraints, or a canonical preuniversal instance, when there are target constraints.
Any further processing, such as chasing with target egds or producing the core, will have to be done by another engine or application that may not have access to the original source instance. This state of affairs raises the question of whether the core of the universal solutions can be produced in polynomial time using only a canonical universal solution or only a canonical preuniversal instance. In what follows, we describe such an algorithm, called the blocks algorithm, which has the feature that it can start from either a canonical universal solution or a canonical preuniversal instance, and has no further need for the source instance. We present the blocks algorithm in two stages: first, for the case in which there are no target constraints (Σ_t = ∅), and then for the case in which Σ_t is a set of egds.

5.2 Blocks Algorithm: No Target Constraints

We first define some notions that are needed in order to state the algorithm as well as to prove its correctness and polynomial-time bound. For the next two definitions, we assume K to be an arbitrary instance whose elements consist of constants from Const and nulls from Var. We say that two elements of K are adjacent if there exists some tuple in some relation of K in which both elements occur.

Definition 5.3. The Gaifman graph of the nulls of K is an undirected graph in which (1) the nodes are all the nulls of K, and (2) there exists an edge between two nulls whenever the nulls are adjacent in K. A block of nulls is the set of nulls in a connected component of the Gaifman graph of the nulls. If y is a null of K, then we may refer to the block of nulls that contains y as the block of y.

Note that, by the definition of blocks, the set Var(K) of all nulls of K is partitioned into disjoint blocks. Let K and K' be two instances with elements in Const ∪ Var.
Recall that K is a subinstance of K if every tuple of a relation of K is a tuple of the corresponding relation of K . Deﬁnition 5.4. Let h be a homomorphism of K . Denote the result of ap- plying h to K by h(K ). If h(K ) is a subinstance of K , then we call h an endo- morphism of K . An endomorphism h of K is useful if h(K ) = K (i.e., h(K ) is a proper subinstance of K ). The following lemma is a simple characterization of useful endomorphisms that we will make use of in proving the main results of this subsection and of Section 5.3. LEMMA 5.5. Let K be an instance, and let h be an endomorphism of K . Then h is useful if and only if h is not one-to-one. PROOF. Assume that h is not one-to-one. Then there is some x that is in the domain of h but not in the range of h (here we use the fact that the instance is ﬁnite.) So no tuple containing x is in h(K ). Therefore, h(K ) = K , and so h is useful. Now assume that h is one-to-one. So h is simply a renaming of the members of K , and so an isomorphism of K . Thus, h(K ) has the same number of tuples as K . Since h(K ) is a subinstance of K , it follows that h(K ) = K (here again we use the fact that the instance K is ﬁnite). So h is not useful. For the rest of this subsection, we assume that we are given a data exchange setting (S, T, st , ∅) and a source instance I . Moreover, we assume that J is a canonical universal solution for this data exchange problem. That is, J is such that I, J is the result of chasing I, ∅ with st . Our goal is to compute core(J ), that is, a subinstance C of J such that (1) C = h(J ) for some endomorphism h of J , and (2) there is no proper subinstance of C with the same property (condition (2) is equivalent to there being no endomorphism of C onto a proper subinstance of C). The central idea of the algorithm, as we shall see, is to show that the above mentioned endomorphism h of J can be found as the composition ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005. 
Data Exchange: Getting to the Core • 195

of a polynomial-length sequence of "local" (or "small") endomorphisms, each of which can be found in polynomial time. We next define what "local" means.

Definition 5.6. Let K and K′ be two instances such that the nulls of K′ form a subset of the nulls of K, that is, Var(K′) ⊆ Var(K). Let h be some endomorphism of K′, and let B be a block of nulls of K. We say that h is K-local for B if h(x) = x whenever x ∉ B. (Since all the nulls of K′ are among the nulls of K, it makes sense to consider whether or not a null x of K′ belongs to the block B of K.) We say that h is K-local if it is K-local for B, for some block B of K.

The next lemma is crucial for the existence of the polynomial-time algorithm for computing the core of a universal solution.

LEMMA 5.7. Assume a data exchange setting where Σst is a set of tgds and Σt = ∅. Let J′ be a subinstance of the canonical universal solution J. If there exists a useful endomorphism of J′, then there exists a useful J-local endomorphism of J′.

PROOF. Let h be a useful endomorphism of J′. By Lemma 5.5, we know that h is not one-to-one. So there is a null y that appears in J′ but does not appear in h(J′). Let B be the block of y (in J). Define h′ on J′ by letting h′(x) = h(x) if x ∈ B, and h′(x) = x otherwise. We show that h′ is an endomorphism of J′. Let (u1, ..., us) be a tuple of the R relation of J′; we must show that (h′(u1), ..., h′(us)) is a tuple of the R relation of J′. Since J′ is a subinstance of J, the tuple (u1, ..., us) is also a tuple of the R relation of J. Hence, by definition of a block of J, all the nulls among u1, ..., us are in the same block B′. There are two cases, depending on whether or not B′ = B. Assume first that B′ = B. Then, by definition of h′, for every ui among u1, ..., us, we have that h′(ui) = h(ui) if ui is a null, and h′(ui) = ui = h(ui) if ui is a constant. Hence (h′(u1), ..., h′(us)) = (h(u1), ..., h(us)).
Since h is an endomorphism of J′, we know that (h(u1), ..., h(us)) is a tuple of the R relation of J′. Thus, (h′(u1), ..., h′(us)) is a tuple of the R relation of J′. Now assume that B′ ≠ B. So for every ui among u1, ..., us, we have that h′(ui) = ui. Hence (h′(u1), ..., h′(us)) = (u1, ..., us). Therefore, once again, (h′(u1), ..., h′(us)) is a tuple of the R relation of J′, as desired. Hence, h′ is an endomorphism of J′. It is J-local by construction, and it is useful since y appears in J′ but not in the range of h′.

We now present the blocks algorithm for computing the core of the universal solutions, when Σt = ∅.

Algorithm 5.8 (Blocks Algorithm: No Target Constraints).
Input: source instance I.
Output: the core of the universal solutions for I.
(1) Compute J, the canonical universal solution, from ⟨I, ∅⟩ by chasing with Σst.
(2) Compute the blocks of J, and initialize J′ to be J.
(3) Check whether there exists a useful J-local endomorphism h of J′. If not, then stop with result J′.
(4) Update J′ to be h(J′), and return to Step (3).

THEOREM 5.9. Assume that (S, T, Σst, Σt) is a data exchange setting such that Σst is a set of tgds and Σt = ∅. Then Algorithm 5.8 is a correct, polynomial-time algorithm for computing the core of the universal solutions.

PROOF. We first show that Algorithm 5.8 is correct, that is, that the final instance C at the conclusion of the algorithm is the core of the universal solutions. Every time we apply Step (4) of the algorithm, we are replacing the instance by a homomorphic image. Therefore, the final instance C is the result of applying a composition of homomorphisms to the input instance, and hence is a homomorphic image of the canonical universal solution J. Also, since each of the homomorphisms found in Step (3) is an endomorphism, we have that C is a subinstance of J. Assume now that C is not the core; we shall derive a contradiction.
Since C is not the core, there is an endomorphism h such that when h is applied to C, the resulting instance is a proper subinstance of C. Hence, h is a useful endomorphism of C. Therefore, by Lemma 5.7, there must exist a useful J-local endomorphism of C. But then Algorithm 5.8 should not have stopped in Step (3) with C. This is the desired contradiction. Hence, C is the core of J.

We now show that Algorithm 5.8 runs in polynomial time. To do so, we need to consider certain parameters. As in the analysis of the greedy algorithm, the first parameter, denoted by b, is the maximum number of existentially quantified variables over all tgds in Σst. Since we are taking Σst to be fixed, the quantity b is a constant. It follows easily from the construction of the canonical universal solution J (by chasing with Σst) that b is an upper bound on the size of a block in J. The second parameter, denoted by n, is the size of the canonical universal solution J (number of tuples in J); as seen in the analysis of the greedy algorithm, n is O(m^a), where a is the maximum number of the universally quantified variables over all tgds in Σst and m is the size of I.

Let J′ be the instance in some execution of Step (3). For each block B, to check if there is a useful endomorphism of J′ that is J-local for B, we can exhaustively check each of the possible functions h on the domain of J′ such that h(x) = x whenever x ∉ B: there are at most n^b such functions. To check that such a function is actually a useful endomorphism requires time O(n). Since there are at most n blocks, the time to determine if there is a block with a useful J-local endomorphism is O(n^(b+2)). The updating time in Step (4) is O(n). By Lemma 5.5, after Step (4) is executed, there is at least one less null in J′ than there was before. Since there are initially at most n nulls in the instance, it follows that the number of loops that Algorithm 5.8 performs is at most n.
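Steps (3) and (4) of Algorithm 5.8 can be sketched directly in Python. The sketch below is ours, not the paper's: it assumes instances as dicts from relation names to sets of tuples, takes the precomputed blocks of J as input, and performs the brute-force search over at most n^b candidate maps per block described above.

```python
from itertools import product

def apply_hom(h, instance):
    """Apply a mapping h (identity where undefined) to every tuple."""
    return {R: {tuple(h.get(v, v) for v in tup) for tup in tuples}
            for R, tuples in instance.items()}

def is_subinstance(K1, K2):
    return all(K1.get(R, set()) <= K2.get(R, set()) for R in K1)

def blocks_algorithm(J, blocks):
    """Shrink J' by useful J-local endomorphisms until none exists;
    with no target constraints, the result is the core."""
    Jp = J  # J' is initialized to the canonical universal solution J
    while True:
        dom = sorted({v for tuples in Jp.values() for t in tuples for v in t})
        found = None
        for B in blocks:
            Bp = [x for x in sorted(B) if x in dom]  # block nulls still present
            if not Bp:
                continue
            for values in product(dom, repeat=len(Bp)):
                h = dict(zip(Bp, values))  # identity off B, so J-local for B
                img = apply_hom(h, Jp)
                # endomorphism: image is a subinstance; useful: a proper one
                if is_subinstance(img, Jp) and img != Jp:
                    found = img
                    break
            if found:
                break
        if found is None:
            return Jp  # no useful J-local endomorphism remains
        Jp = found
```

On an instance with two redundant copies of each fact (as in the running example with nulls X0, Y0, X1, Y1, each its own block), the loop collapses one copy onto the other, leaving a core with one EmpDept and one DeptCity tuple per employee.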
Therefore, the running time of the algorithm (except for Step (1) and Step (2), which are executed only once) is at most n (the number of loops) times O(n^(b+2)), that is, O(n^(b+3)). Since Step (1) and Step (2) take polynomial time as well, it follows that the entire algorithm executes in polynomial time.

The crucial observation behind the polynomial-time bound is that the total number of endomorphisms that the algorithm explores in Step (3) is at most n^b for each block of J. This is in strong contrast with the case of minimizing arbitrary instances with constants and nulls, for which we may need to explore a much larger number of endomorphisms (up to n^n, in general) in one minimization step.

5.3 Blocks Algorithm: Target Egds

In this subsection, we extend Theorem 5.9 by showing that there is a polynomial-time algorithm for finding the core even when Σt is a set of egds. Thus, we assume next that we are given a data exchange setting (S, T, Σst, Σt) where Σt is a set of egds. We are also given a source instance I. As with the greedy algorithm, let J be a canonical preuniversal instance, that is, J is the result of chasing I with Σst. Let J′ be the canonical universal solution obtained by chasing J with Σt. Our goal is to compute core(J′), that is, a subinstance C of J′ such that C = h(J′) for some endomorphism h of J′, and such that there is no proper subinstance of C with the same property. As in the case when Σt = ∅, the central idea of the algorithm is to show that the above-mentioned endomorphism h of J′ can be found as the composition of a polynomial-length sequence of "small" endomorphisms, each findable in polynomial time. As in the case when Σt = ∅, "small" will mean J-local. We make this precise in the next lemma. This lemma, crucial for the existence of the polynomial-time algorithm for computing core(J′), is a nontrivial generalization of Lemma 5.7.
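To make the role of the egd chase concrete: each chase step equates two elements, replacing a null by the other element, and equating two distinct constants means the chase fails (there is no solution). A minimal sketch of this bookkeeping follows; the helper names are ours, and a real chase engine must additionally find matches for the egd premises.

```python
class ChaseFailure(Exception):
    """Raised when the egd chase equates two distinct constants."""

def find(parent, x):
    """Current representative of x; after the chase this plays the
    role of [x], the element x is eventually collapsed to."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def equate(parent, is_const, x, y):
    """One egd chase step: enforce x = y by replacing a null."""
    rx, ry = find(parent, x), find(parent, y)
    if rx == ry:
        return
    if is_const(rx) and is_const(ry):
        raise ChaseFailure(f"cannot equate constants {rx} and {ry}")
    if is_const(rx):
        parent[ry] = rx  # the null ry is replaced by the constant rx
    else:
        parent[rx] = ry  # the null rx is replaced by ry
```

After all steps, find(parent, u) stands for [u], and u ∼ v holds exactly when find(parent, u) == find(parent, v).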
LEMMA 5.10. Assume a data exchange setting where Σst is a set of tgds and Σt is a set of egds. Let J be the canonical preuniversal instance, and let J∗ be an endomorphic image of the canonical universal solution J′. If there exists a useful endomorphism of J∗, then there exists a useful J-local endomorphism of J∗.

The proof of Lemma 5.10 requires additional definitions as well as two additional lemmas. We start with the required definitions. Let J be the canonical preuniversal instance, and let J′ be the canonical universal solution produced from J by chasing with the set Σt of egds. We define a directed graph whose nodes are the members of J, both nulls and constants. If during the chase process a null u gets replaced by v (either a null or a constant), then there is an edge from u to v in the graph. Let ≤ be the reflexive, transitive closure of this graph. It is easy to see that ≤ is a reflexive partial order. For each node u, define [u] to be the maximal (under ≤) node v such that u ≤ v. Intuitively, u eventually gets replaced by [u] as a result of the chase. It is clear that every member of J′ is of the form [u]. It is also clear that if u is a constant, then u = [u]. Let us write u ∼ v if [u] = [v]. Intuitively, u ∼ v means that u and v eventually collapse to the same element as a result of the chase.

Definition 5.11. Let K be an instance whose elements are constants and nulls. Let y be some element of K. We say that y is rigid if h(y) = y for every homomorphism h of K. (In particular, all constants occurring in K are rigid.)

A key step in the proof of Lemma 5.10 is the following surprising result, which says that if two nulls in different blocks of J both collapse onto the same element z of J′ as a result of the chase, then z is rigid, that is, h(z) = z for every endomorphism h of J′.

LEMMA 5.12 (RIGIDITY LEMMA).
Assume a data exchange setting where Σst is a set of tgds and Σt is a set of egds. Let J be the canonical preuniversal instance, and let J′ be the result of chasing J with the set Σt of egds. Let x and y be nulls of J such that x ∼ y, and such that [x] is a nonrigid null of J′. Then x and y are in the same block of J.

PROOF. Assume that x and y are nulls in different blocks of J with x ∼ y. We must show that [x] is rigid in J′. Let φ be the diagram of the instance J, that is, the conjunction of all expressions S(u1, ..., us) where (u1, ..., us) is a tuple of the S relation of J. (We are treating members of J, both constants and nulls, as variables.) Let τ be the egd φ → (x = y). Since x ∼ y, it follows that Σt ⊨ τ. This is because the chase sets variables equal only when it is logically forced to (the result appears in papers that characterize the implication problem for dependencies; see, for instance, Beeri and Vardi [1984]; Maier et al. [1979]). Since J′ satisfies Σt, it follows that J′ satisfies τ.

We wish to show that [x] is rigid in J′. Let h be a homomorphism of J′; we must show that h([x]) = [x]. Let B be the block of x in J. Let V be the assignment to the variables of τ obtained by letting V(u) = h([u]) if u ∈ B, and V(u) = [u] otherwise. We now show that V is a valid assignment for φ in J′, that is, that for each conjunct S(u1, ..., us) of φ, necessarily (V(u1), ..., V(us)) is a tuple of the S relation of J′. Let S(u1, ..., us) be a conjunct of φ. By the construction of the chase, we know that ([u1], ..., [us]) is a tuple of the S relation of J′, since (u1, ..., us) is a tuple of the S relation of J. There are two cases, depending on whether or not some ui (with 1 ≤ i ≤ s) is in B. If no ui is in B, then V(ui) = [ui] for each i, and so (V(u1), ..., V(us)) is a tuple of the S relation of J′, as desired. If some ui is in B, then every ui is either a null in B or a constant (this is because (u1, ...
, us) is a tuple of the S relation of J). If ui is a null in B, then V(ui) = h([ui]). If ui is a constant, then ui = [ui], and so V(ui) = [ui] = ui = h(ui) = h([ui]), where the third equality holds since h is a homomorphism and ui is a constant. Thus, in both cases, we have V(ui) = h([ui]). Since ([u1], ..., [us]) is a tuple of the S relation of J′ and h is a homomorphism of J′, we know that (h([u1]), ..., h([us])) is a tuple of the S relation of J′. So again, (V(u1), ..., V(us)) is a tuple of the S relation of J′, as desired. Hence, V is a valid assignment for φ in J′.

Therefore, since J′ satisfies τ, it follows that in J′, we have V(x) = V(y). Now V(x) = h([x]), since x ∈ B. Further, V(y) = [y], since y ∉ B (because y is in a different block than x). So h([x]) = [y]. Since x ∼ y, that is, [x] = [y], we have h([x]) = [y] = [x], which shows that h([x]) = [x], as desired.

The contrapositive of Lemma 5.12 says that if x and y are nulls in different blocks of J that are set equal (perhaps transitively) during the chase, then [x] is rigid in J′.

LEMMA 5.13. Let h be an endomorphism of J′. Then every rigid element of J′ is a rigid element of h(J′).

PROOF. Let u be a rigid element of J′. Then h(u) is an element of h(J′), and so u is an element of h(J′), since h(u) = u by rigidity. Let ĥ be a homomorphism of h(J′); we must show that ĥ(u) = u. But ĥ(u) = ĥh(u), since h(u) = u. Now ĥh is also a homomorphism of J′, since the composition of homomorphisms is a homomorphism. By rigidity of u in J′, it follows that ĥh(u) = u. So ĥ(u) = ĥh(u) = u, as desired.

We are now ready to give the proof of Lemma 5.10, after which we will present the blocks algorithm for the case of target egds.

PROOF OF LEMMA 5.10. Let h be an endomorphism of J′ such that J∗ = h(J′), and let h′ be a useful endomorphism of h(J′).
By Lemma 5.5, there is a null y that appears in h(J′) but does not appear in h′(h(J′)). Let B be the block in J that contains y. Define h′′ on h(J′) by letting h′′(x) = h′(x) if x ∈ B, and h′′(x) = x otherwise. We shall show that h′′ is a useful J-local endomorphism of h(J′).

We now show that h′′ is an endomorphism of h(J′). Let (u1, ..., us) be a tuple of the R relation of h(J′); we must show that (h′′(u1), ..., h′′(us)) is a tuple of the R relation of h(J′). We first show that every nonrigid null among u1, ..., us is in the same block of J. Let up and uq be nonrigid nulls among u1, ..., us; we show that up and uq are in the same block of J. Since (u1, ..., us) is a tuple of the R relation of h(J′), and h(J′) is a subinstance of J′, we know that (u1, ..., us) is a tuple of the R relation of J′. By construction of J′ from J using the chase, we know that there are u′i with ui ∼ u′i for 1 ≤ i ≤ s, such that (u′1, ..., u′s) is a tuple of the R relation of J. Since up and uq are nonrigid nulls of h(J′), it follows from Lemma 5.13 that up and uq are nonrigid nulls of J′. Now u′p is not a constant, since u′p ∼ up and up is a nonrigid null. Similarly, u′q is not a constant. So u′p and u′q are in the same block B′ of J. Now [u′p] = up, since up is in J′. Since u′p ∼ up and [u′p] = up is nonrigid, it follows from Lemma 5.12 that u′p and up are in the same block of J, and so up ∈ B′. Similarly, uq ∈ B′. So up and uq are in the same block B′ of J, as desired.

There are now two cases, depending on whether or not B′ = B. Assume first that B′ = B. For those ui's that are nonrigid, we showed that ui ∈ B′ = B, and so h′′(ui) = h′(ui). For those uj's that are rigid (including nulls and constants), we have h′′(uj) = uj = h′(uj). So for every ui among u1, ..., us, we have h′′(ui) = h′(ui). Since h′ is a homomorphism of h(J′), and since (u1, ..., us) is a tuple of the R relation of h(J′), we know that (h′(u1), ...
, h′(us)) is a tuple of the R relation of h(J′). Hence (h′′(u1), ..., h′′(us)) is a tuple of the R relation of h(J′), as desired. Now assume that B′ ≠ B. For those ui's that are nonrigid, we showed that ui ∈ B′, and so ui ∉ B. Hence, for those ui's that are nonrigid, we have h′′(ui) = ui. But also h′′(ui) = ui for the rigid ui's. Thus, (h′′(u1), ..., h′′(us)) = (u1, ..., us). Hence, once again, (h′′(u1), ..., h′′(us)) is a tuple of the R relation of h(J′), as desired. So h′′ is an endomorphism of h(J′). By definition, h′′ is J-local.

We now show that h′′ is useful. Since y appears in h(J′), Lemma 5.5 tells us that we need only show that the range of h′′ does not contain y. If x ∈ B, then h′′(x) = h′(x) ≠ y, since the range of h′ does not include y. If x ∉ B, then h′′(x) = x ≠ y, since y ∈ B. So the range of h′′ does not contain y, and hence h′′ is useful. Therefore, h′′ is a useful J-local endomorphism of h(J′).

We now present the blocks algorithm for computing the core when Σt is a set of egds. (As mentioned earlier, when the target constraints include egds, it may be possible that there are no solutions and hence no universal solutions. This case is detected by our algorithm, and "failure" is returned.)

Algorithm 5.14 (Blocks Algorithm: Target egds).
Input: source instance I.
Output: the core of the universal solutions for I, if solutions exist, and "failure", otherwise.
(1) Compute J, the canonical preuniversal instance, from ⟨I, ∅⟩ by chasing with Σst.
(2) Compute the blocks of J, and then chase J with Σt to produce the canonical universal solution J′. If the chase fails, then stop with "failure." Otherwise, initialize J∗ to be J′.
(3) Check whether there exists a useful J-local endomorphism h of J∗. If not, then stop with result J∗.
(4) Update J∗ to be h(J∗), and return to Step (3).

THEOREM 5.15.
Assume that (S, T, Σst, Σt) is a data exchange setting such that Σst is a set of tgds and Σt is a set of egds. Then Algorithm 5.14 is a correct, polynomial-time algorithm for computing the core of the universal solutions.

PROOF. The proof is essentially the same as that of Theorem 5.9, except that we make use of Lemma 5.10 instead of Lemma 5.7. For the correctness of the algorithm, we use the fact that each h(J′) is both a homomorphic image and a subinstance of the canonical universal solution J′; hence it satisfies both the tgds in Σst and the egds in Σt. For the running time of the algorithm, we also use the fact that chasing with egds (used in Step (2)) is a polynomial-time procedure.

We note that it is essential for the polynomial-time upper bound that the endomorphisms explored by Algorithm 5.14 are J-local and not merely J′-local. While, as argued earlier in the case Σt = ∅, the blocks of J are bounded in size by the constant b (the maximal number of existentially quantified variables over all tgds in Σst), the same is not true, in general, for the blocks of J′. The chase with egds, used to obtain J′, may generate blocks of unbounded size. Intuitively, if an egd equates the nulls x and y that are in different blocks of J, then this creates a new, larger block out of the union of the blocks of x and y.

5.4 Can We Obtain the Core Via the Chase?

A universal solution can be obtained via the chase [Fagin et al. 2003]. What about the core? In this section, we show by example that the core may not be obtainable via the chase. We begin with a preliminary example.

Example 5.16. We again consider our running example from Example 2.2. If we chase the source instance I of Example 2.2 by first chasing with the dependencies (d2) and (d3), and then by the dependencies (d1) and (d4), neither of which adds any tuples, then the result is the core J0, as given in Example 2.2.
If, however, we chase first with the dependency (d1), then with the dependencies (d2) and (d3), and finally with the dependency (d4), which does not add any tuples, then the result is the target instance J, as given in Example 2.2, rather than the core J0.

In Example 5.16, the result of the chase may or may not be the core, depending on the order of the chase steps. We now give an example where there is no chase (that is, no order of doing the chase steps) that produces the core.

Example 5.17. Assume that the source schema consists of one 4-ary relation symbol R and the target schema consists of one 5-ary relation symbol S. There are two source-to-target tgds d1 and d2, where d1 is

R(a, b, c, d) → ∃x1∃x2∃x3∃x4∃x5 (S(x5, b, x1, x2, a) ∧ S(x5, c, x3, x4, a) ∧ S(d, c, x3, x4, b))

and where d2 is

R(a, b, c, d) → ∃x1∃x2∃x3∃x4∃x5 (S(d, a, a, x1, b) ∧ S(x5, a, a, x1, a) ∧ S(x5, c, x2, x3, x4)).

The source instance I is {R(1, 1, 2, 3)}. The result of chasing I with d1 only is

{S(N5, 1, N1, N2, 1), S(N5, 2, N3, N4, 1), S(3, 2, N3, N4, 1)},     (1)

where N1, N2, N3, N4, N5 are nulls. The result of chasing I with d2 only is

{S(3, 1, 1, N′1, 1), S(N′5, 1, 1, N′1, 1), S(N′5, 2, N′2, N′3, N′4)},     (2)

where N′1, N′2, N′3, N′4, N′5 are nulls. Let J be the universal solution that is the union of (1) and (2). We now show that the core of J is given by the following instance J0, which consists of the third tuple of (1) and the first tuple of (2):

{S(3, 2, N3, N4, 1), S(3, 1, 1, N′1, 1)}.

First, it is straightforward to verify that J0 is the image of the universal solution J under the following endomorphism h: h(N1) = 1; h(N2) = N′1; h(N3) = N3; h(N4) = N4; h(N5) = 3; h(N′1) = N′1; h(N′2) = N3; h(N′3) = N4; h(N′4) = 1; and h(N′5) = 3.
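The claim that h maps J onto J0 can be checked mechanically. In the Python sketch below (our representation, not the paper's), the primed nulls N′1, ..., N′5 of (2) are written "N1p", ..., "N5p":

```python
def image_of(h, tuples):
    """Apply the endomorphism h tuple-wise (identity on constants)."""
    return {tuple(h.get(v, v) for v in t) for t in tuples}

# J is the union of (1) and (2) from Example 5.17
J = {
    ("N5", 1, "N1", "N2", 1), ("N5", 2, "N3", "N4", 1), (3, 2, "N3", "N4", 1),
    (3, 1, 1, "N1p", 1), ("N5p", 1, 1, "N1p", 1), ("N5p", 2, "N2p", "N3p", "N4p"),
}
# J0: the third tuple of (1) and the first tuple of (2)
J0 = {(3, 2, "N3", "N4", 1), (3, 1, 1, "N1p", 1)}

# the endomorphism h from the example
h = {"N1": 1, "N2": "N1p", "N3": "N3", "N4": "N4", "N5": 3,
     "N1p": "N1p", "N2p": "N3", "N3p": "N4", "N4p": 1, "N5p": 3}
```

Applying h to each tuple of J yields exactly the two tuples of J0, confirming that J0 = h(J).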
Second, it is easy to see that there is no endomorphism of J0 into a proper substructure of J0. From these two facts, it follows immediately that J0 is the core.

Since the result of chasing first with d1 has three tuples, and since the core has only two tuples, it follows that the result of chasing first with d1 and then d2 does not give the core. Similarly, the result of chasing first with d2 and then d1 does not give the core. Thus, no chase gives the core, which was to be shown.

This example has several other features built into it. First, it is not possible to remove a conjunct from the right-hand side of d1 and still maintain a dependency equivalent to d1. A similar comment applies to d2. Therefore, the fact that no chase gives the core is not caused by the right-hand side of a source-to-target tgd having a redundant conjunct.

Second, the Gaifman graph of the nulls as determined by (1) is connected. Intuitively, this tells us that the tgd d1 cannot be "decomposed" into multiple tgds with the same left-hand side. A similar comment applies to d2. Therefore, the fact that no chase gives the core is not caused by the tgds being "decomposable."

Third, not only does the set (1) of tuples not appear in the core, but even the core of (1), which consists of the first and third tuples of (1), does not appear in the core. A similar comment applies to (2), whose core consists of the first and third tuples of (2). So even if we were to modify the chase by inserting, at each chase step, only the core of the set of tuples generated by applying a given tgd, we still would not obtain the core as the result of a chase.

6. QUERY ANSWERING WITH CORES

Up to this point, we have shown that there are two reasons for using cores in data exchange: first, they are the smallest universal solutions, and second, they are polynomial-time computable in many natural data exchange settings.
In this section, we provide further justification for using cores in data exchange by establishing that they have clear advantages over other universal solutions in answering target queries.

Assume that (S, T, Σst, Σt) is a data exchange setting, I is a source instance, and J0 is the core of the universal solutions for I. If q is a union of conjunctive queries over the target schema T, then, by Proposition 2.7, for every universal solution J for I, we have that certain(q, I) = q(J)↓. In particular, certain(q, I) = q(J0)↓, since J0 is a universal solution. Suppose now that q is a conjunctive query with inequalities ≠ over the target schema. In general, if J is a universal solution, then q(J)↓ may properly contain certain(q, I). We illustrate this point with the following example.

Example 6.1. Let us revisit our running example from Example 2.2. We saw earlier in Example 3.1 that, for every m ≥ 0, the target instance

Jm = {Home(Alice, SF), Home(Bob, SD),
EmpDept(Alice, X0), EmpDept(Bob, Y0), DeptCity(X0, SJ), DeptCity(Y0, SD),
...
EmpDept(Alice, Xm), EmpDept(Bob, Ym), DeptCity(Xm, SJ), DeptCity(Ym, SD)}

is a universal solution for I; moreover, J0 is the core of the universal solutions for I. Consider now the following conjunctive query q with one inequality:

∃D1∃D2 (EmpDept(e, D1) ∧ EmpDept(e, D2) ∧ (D1 ≠ D2)).

Clearly, q(J0) = ∅, while if m ≥ 1, then q(Jm) = {Alice, Bob}. This implies that certain(q, I) = ∅, and thus evaluating the above query q on the universal solution Jm, for arbitrary m ≥ 1, produces a strict superset of the set of the certain answers. In contrast, evaluating q on the core J0 coincides with the set of the certain answers, since q(J0) = ∅ = certain(q, I).

This example can also be used to illustrate another difference between conjunctive queries and conjunctive queries with inequalities.
Specifically, if J and J′ are universal solutions for I, and q∗ is a conjunctive query over the target schema, then q∗(J)↓ = q∗(J′)↓. In contrast, this does not hold for the above conjunctive query q with one inequality. Indeed, q(J0) = ∅ while q(Jm) = {Alice, Bob}, for every m ≥ 1.

In the preceding example, the certain answers of a particular conjunctive query with inequalities could be obtained by evaluating the query on the core of the universal solutions. As shown in the next example, however, this does not hold true for arbitrary conjunctive queries with inequalities.

Example 6.2. Referring to our running example, consider again the universal solutions Jm, for m ≥ 0, from Example 6.1. In particular, recall the instance J0, which is the core of the universal solutions for I, and which has two distinct labeled nulls X0 and Y0, denoting unknown departments. Besides their role as placeholders for department values, the role of such nulls is also to "link" employees to the cities they work in, as specified by the tgd (d2) in Σst. For data exchange, it is important that such nulls be different from constants and different from each other. Universal solutions such as J0 naturally satisfy this requirement. In contrast, the target instance

J′0 = {Home(Alice, SF), Home(Bob, SD),
EmpDept(Alice, X0), EmpDept(Bob, X0), DeptCity(X0, SJ), DeptCity(X0, SD)}

is a solution² for I, but not a universal solution for I, because it uses the same null for both source tuples (Alice, SJ) and (Bob, SD); hence, there is no homomorphism from J′0 to J0. In this solution, the association between Alice and SJ as well as the association between Bob and SD have been lost. Let q be the following conjunctive query with one inequality:

∃D∃D′ (EmpDept(e, D) ∧ DeptCity(D′, c) ∧ (D ≠ D′)).

It is easy to see that q(J0) = {(Alice, SD), (Bob, SJ)}. In contrast, q(J′0) = ∅, since in J′0 both Alice and Bob are linked with both SJ and SD.
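The two evaluations just described are easy to reproduce. The sketch below (our representation and helper names, not the paper's) evaluates q(e, c) = ∃D∃D′ (EmpDept(e, D) ∧ DeptCity(D′, c) ∧ (D ≠ D′)) and then applies the ↓ step of discarding answer tuples that contain nulls:

```python
from itertools import product

def eval_q(emp_dept, dept_city, is_null):
    """Answers of ∃D∃D′ (EmpDept(e,D) ∧ DeptCity(D′,c) ∧ D ≠ D′),
    keeping only null-free answer tuples (the ↓ step)."""
    answers = set()
    for (e, d), (dp, c) in product(emp_dept, dept_city):
        if d != dp and not is_null(e) and not is_null(c):
            answers.add((e, c))
    return answers
```

On the core J0 (distinct nulls X0, Y0) this yields {(Alice, SD), (Bob, SJ)}, while on J′0 (a single shared null X0) it yields ∅, matching the example.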
Consequently, certain(q, I) = ∅, and thus certain(q, I) is properly contained in q(J0)↓.

² This is the same instance, modulo renaming of nulls, as the earlier instance J0 of Example 2.2.

Let J be a universal solution for I. Since J0 is (up to a renaming of the nulls) the core of J, it follows that q(J0) ⊆ q(J)↓. (We are using the fact that q(J0) = q(J0)↓ here.) Since also we have the strict inclusion certain(q, I) ⊂ q(J0), we have that certain(q, I) ⊂ q(J)↓, for every universal solution J. This also means that there is no universal solution J for I such that certain(q, I) = q(J)↓.

Finally, consider the target instance:

J′ = {Home(Alice, SF), Home(Bob, SD),
EmpDept(Alice, X0), EmpDept(Bob, Y0),
DeptCity(X0, SJ), DeptCity(Y0, SD), DeptCity(X′, SJ)}.

It is easy to verify that J′ is a universal solution and that q(J′) = {(Alice, SJ), (Alice, SD), (Bob, SJ)}. Thus, the following strict inclusions hold:

certain(q, I) ⊂ q(J0)↓ ⊂ q(J′)↓.

This shows that a strict inclusion hierarchy can exist among the set of the certain answers, the result of the null-free query evaluation on the core, and the result of the null-free query evaluation on some other universal solution. We will argue in the next section that instead of computing certain(q, I), a better answer to the query may be given by taking q(J0)↓ itself!

6.1 Certain Answers on Universal Solutions

Although the certain answers of conjunctive queries with inequalities cannot always be obtained by evaluating these queries on the core of the universal solutions, it turns out that this evaluation produces a "best approximation" to the certain answers among all evaluations on universal solutions. Moreover, as we shall show, this property characterizes the core, and also extends to existential queries. We now define existential queries, including a safety condition.
An existential query q(x) is a formula of the form ∃y φ(x, y), where φ(x, y) is a quantifier-free formula in disjunctive normal form. Let φ be ∨i ∧j γij, where each γij is an atomic formula, the negation of an atomic formula, an equality, or the negation of an equality. As a safety condition, we assume that for each conjunction ∧j γij and each variable z (in x or y) that appears in this conjunction, one of the conjuncts γij is an atomic formula that contains z. The safety condition guarantees that φ is domain independent [Fagin 1982] (so that its truth does not depend on any underlying domain, but only on the "active domain" of elements that appear in tuples in the instance). We now introduce the following concept, which we shall argue is fundamental.

Definition 6.3. Let (S, T, Σst, Σt) be a data exchange setting and let I be a source instance. For every query q over the target schema T, the set of the certain answers of q on universal solutions with respect to the source instance I, denoted by u-certain(q, I), is the set of all tuples that appear in q(J) for every universal solution J for I; in symbols,

u-certain(q, I) = ⋂{q(J) : J is a universal solution for I}.

Clearly, certain(q, I) ⊆ u-certain(q, I). Moreover, if q is a union of conjunctive queries, then Proposition 2.7 implies that certain(q, I) = u-certain(q, I). In contrast, if q is a conjunctive query with inequalities, it is possible that certain(q, I) is properly contained in u-certain(q, I). Concretely, this holds true for the query q and the source instance I in Example 6.2, since certain(q, I) = ∅, while u-certain(q, I) = {(Alice, SD), (Bob, SJ)}. In such cases, there is no universal solution J for I such that certain(q, I) = q(J)↓. Nonetheless, the next result asserts that if J0 is the core of the universal solutions for I, then u-certain(q, I) = q(J0)↓.
Therefore, q(J0)↓ is the best approximation (that is, the least superset) of the certain answers for I among all choices of q(J)↓ where J is a universal solution for I.

Before we prove the next result, we need to recall some definitions from Fagin et al. [2003]. Let q be a Boolean (that is, 0-ary) query over the target schema T and I a source instance. If we let true denote the set with one 0-ary tuple and false denote the empty set, then each of the statements q(J) = true and q(J) = false has its usual meaning for Boolean queries q. It follows from the definitions that certain(q, I) = true means that for every solution J of this instance of the data exchange problem, we have that q(J) = true; moreover, certain(q, I) = false means that there is a solution J such that q(J) = false.

PROPOSITION 6.4. Let (S, T, Σst, Σt) be a data exchange setting in which Σst is a set of tgds and Σt is a set of tgds and egds. Let I be a source instance such that a universal solution for I exists, and let J0 be the core of the universal solutions for I.

(1) If q is an existential query over the target schema T, then u-certain(q, I) = q(J0)↓.
(2) If J∗ is a universal solution for I such that for every existential query q over the target schema T we have that u-certain(q, I) = q(J∗)↓, then J∗ is isomorphic to the core J0 of the universal solutions for I. In fact, it is enough for the above property to hold for every conjunctive query q with inequalities ≠.

PROOF. Let J be a universal solution, and let J0 be the core of J. By Proposition 3.3, we know that J0 is an induced substructure of J. Let q be an existential query over the target schema T. Since q is an existential query and J0 is an induced substructure of J, it is straightforward to verify that q(J0) ⊆ q(J) (this is a well-known preservation property of existential first-order formulas).
Since J0 is the core of every universal solution for I up to a renaming of the nulls, it follows that q(J0)↓ ⊆ ∩{q(J) : J universal for I}. We now show the reverse inclusion. Define J0′ by renaming each null of J0 in such a way that J0 and J0′ have no nulls in common. Then ∩{q(J) : J universal for I} ⊆ q(J0) ∩ q(J0′).

206 • R. Fagin et al.

But it is easy to see that q(J0) ∩ q(J0′) = q(J0)↓. This proves the reverse inclusion and so u-certain(q, I) = ∩{q(J) : J universal for I} = q(J0)↓.

For the second part, assume that J∗ is a universal solution for I such that for every conjunctive query q with inequalities ≠ over the target schema,

q(J∗)↓ = ∩{q(J) : J is a universal solution for I}.    (3)

Let q∗ be the canonical conjunctive query with inequalities associated with J∗, that is, q∗ is a Boolean conjunctive query with inequalities that asserts that there exist at least n∗ distinct elements, where n∗ is the number of elements of J∗, and describes which tuples from J∗ occur in which relations in the target schema T. It is clear that q∗(J∗) = true. Since q∗ is a Boolean query, we have q∗(J∗)↓ = q∗(J∗). So from (3), where q∗ plays the role of q, we have

q∗(J∗) = ∩{q∗(J) : J is a universal solution for I}.    (4)

Since q∗(J∗) = true, it follows from (4) that q∗(J0) = true. In turn, q∗(J0) = true implies that there is a one-to-one homomorphism h∗ from J∗ to J0. At the same time, there is a one-to-one homomorphism from J0 to J∗, by Corollary 3.5. Consequently, J∗ is isomorphic to J0.

Let us take a closer look at the concept of the certain answers of a query q on universal solutions. In Fagin et al. [2003], we made a case that the universal solutions are the preferred solutions to the data exchange problem, since in a precise sense they are the most general possible solutions and, thus, they represent the space of all solutions.
This suggests that, in the context of data exchange, the notion of the certain answers on universal solutions may be more fundamental and more meaningful than that of the certain answers. In other words, we propose here that u-certain(q, I) should be used as the semantics of query answering in data exchange settings, instead of certain(q, I), because we believe that this notion should be viewed as the "right" semantics for query answering in data exchange. As pointed out earlier, certain(q, I) and u-certain(q, I) coincide when q is a union of conjunctive queries, but they may very well be different when q is a conjunctive query with inequalities. The preceding Example 6.2 illustrates this difference between the two semantics, since certain(q, I) = ∅ and u-certain(q, I) = {(Alice, SD), (Bob, SJ)}, where q is the query ∃D∃D′(EmpDept(e, D) ∧ DeptCity(D′, c) ∧ (D ≠ D′)). We argue that a user should not expect the empty set ∅ as the answer to the query q, after the data exchange between the source and the target (unless, of course, further constraints are added to specify that the nulls must be equal). Thus, u-certain(q, I) = {(Alice, SD), (Bob, SJ)} is a more intuitive answer to q than certain(q, I) = ∅. Furthermore, this answer can be computed as q(J0)↓.

We now show that for conjunctive queries with inequalities, it may be easier to compute the certain answers on universal solutions than to compute the certain answers. Abiteboul and Duschka [1998] proved the following result.

THEOREM 6.5 [ABITEBOUL AND DUSCHKA 1998]. There is a LAV setting and a Boolean conjunctive query q with inequalities ≠ such that computing the set certain(q, I) of the certain answers of q is a coNP-complete problem.

By contrast, we prove the following result, which covers not only LAV settings but even broader settings.

THEOREM 6.6.
Let (S, T, Σst, Σt) be a data exchange setting in which Σst is a set of tgds and Σt is a set of egds. For every existential query q over the target schema T, there is a polynomial-time algorithm for computing, given a source instance I, the set u-certain(q, I) of the certain answers of q on the universal solutions for I.

PROOF. Let q be an existential query, and let J0 be the core of the universal solutions. We see from Proposition 6.4 that u-certain(q, I) = q(J0)↓. By Theorem 5.2 or Theorem 5.15, there is a polynomial-time algorithm for computing J0, and hence for computing q(J0)↓.

Theorems 6.5 and 6.6 show a computational advantage of the certain answers on universal solutions over the certain answers. Note that the core is used in the proof of Theorem 6.6 but does not appear in the statement of the theorem and does not enter into the definitions of the concepts used in the theorem. It is not at all clear how one would prove this theorem directly, without making use of our results about the core.

We close this section by pointing out that Proposition 6.4 is very dependent on the assumption that q is an existential query. A universal query is taken to be the negation of an existential query. It is a query of the form ∀x φ(x), where φ(x) is a quantifier-free formula, with a safety condition that is inherited from existential queries. Note that each egd and full tgd is a universal query (and in particular, satisfies the safety condition). For example, the egd ∀x(A1 ∧ A2 → (x1 = x2)) satisfies the safety condition, since its negation is ∃x(A1 ∧ A2 ∧ (x1 ≠ x2)), which satisfies the safety condition for existential queries since every variable in x appears in one of the atomic formulas A1 or A2. We now give a data exchange setting and a universal query q such that u-certain(q, I) cannot be obtained by evaluating q on the core of the universal solutions for I.

Example 6.6.
Referring to our running example, consider again the universal solutions Jm, for m ≥ 0, from Example 6.1. Among those universal solutions, the instance J0 is the core of the universal solutions for I. Let q be the following Boolean universal query (a functional dependency):

∀e∀d1∀d2(EmpDept(e, d1) ∧ EmpDept(e, d2) → (d1 = d2)).

It is easy to see that q(J0) = true and q(Jm) = false, for all m ≥ 1. Consequently, certain(q, I) = false = u-certain(q, I) ≠ q(J0).

7. CONCLUDING REMARKS

In a previous article [Fagin et al. 2003], we argued that universal solutions are the best solutions in a data exchange setting, in that they are the "most general possible" solutions. Unfortunately, there may be many universal solutions. In this article, we identified a particular universal solution, namely, the core of an arbitrary universal solution, and argued that it is the best universal solution (and hence the best of the best). The core is unique up to isomorphism, and is the universal solution of the smallest size, that is, with the fewest tuples. The core gives the best answer, among all universal solutions, for existential queries. By "best answer," we mean that the core provides the best approximation (among all universal solutions) to the set of the certain answers. In fact, we proposed an alternative semantics in which the set of "certain answers" is redefined to be those that occur in every universal solution. Under this alternative semantics, the core gives the exact answer for existential queries.

We considered the question of the complexity of computing the core. To this effect, we showed that the complexity of deciding whether a graph H is the core of a graph G is DP-complete. Thus, unless P = NP, there is no polynomial-time algorithm for producing the core of a given arbitrary structure.
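The graph-theoretic core discussed here can at least be computed by brute force on tiny examples. The sketch below is our own illustration (exponential time, as the DP-completeness result leads one to expect, and not one of the article's polynomial-time algorithms): it repeatedly retracts a directed graph onto a proper induced subgraph as long as a homomorphism into that subgraph exists.

```python
from itertools import product

def homomorphism(edges, dom, cod, cod_edges):
    # brute-force search for a graph homomorphism from dom into cod
    for values in product(cod, repeat=len(dom)):
        h = dict(zip(dom, values))
        if all((h[a], h[b]) in cod_edges for a, b in edges):
            return h
    return None

def core(nodes, edges):
    """Retract onto proper induced subgraphs until no retraction exists.
    Exponential time: for illustration on tiny graphs only."""
    nodes, edges = list(nodes), set(edges)
    shrinking = True
    while shrinking:
        shrinking = False
        for drop in list(nodes):
            sub = [n for n in nodes if n != drop]
            sub_edges = {(a, b) for (a, b) in edges if a in sub and b in sub}
            if homomorphism(edges, nodes, sub, sub_edges) is not None:
                nodes, edges = sub, sub_edges
                shrinking = True
                break
    return nodes, edges

# Edges 1 -> 2 and 1 -> 3: one target folds onto the other,
# leaving a single-edge core.
c_nodes, c_edges = core([1, 2, 3], {(1, 2), (1, 3)})
```

Dropping one node at a time suffices: if the whole graph maps homomorphically onto a smaller induced subgraph, some node is outside the image and can be dropped first.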
On the other hand, in our case of interest, namely, data exchange, we gave natural conditions under which there are polynomial-time algorithms for computing the core of universal solutions. Specifically, we showed that the core of the universal solutions is polynomial-time computable in data exchange settings in which Σst is a set of source-to-target tgds and Σt is a set of egds.

These results raise a number of questions. First, there are questions about the complexity of constructing the core. Even in the case where we prove that there is a polynomial-time algorithm for computing the core, the exponent may be somewhat large. Is there a more efficient algorithm for computing the core in this case and, if so, what is the most efficient such algorithm? There is also the question of extending the polynomial-time result to broader classes of target dependencies. To this effect, Gottlob [2005] recently showed that computing the core may be NP-hard in the case in which Σt consists of a single full tgd, provided a NULL "built-in" target predicate is available to tell labeled nulls from constants in target instances; note that, since NULL is a "built-in" predicate, it need not be preserved under homomorphisms. Since our formalization of data exchange does not allow for such a NULL predicate, it remains an open problem to determine the complexity of computing the core in data exchange settings in which the target constraints are egds and tgds.

On a slightly different note, and given the similarities between the two problems, it would be interesting to see whether our techniques for minimizing universal solutions can be applied to the problem of minimizing the chase-generated universal plans that arise in the comprehensive query optimization method introduced in Deutsch et al. [1999].

Finally, the work reported here addresses data exchange only between relational schemas.
In the future we hope to investigate to what extent the results presented in this article and in Fagin et al. [2003] can be extended to the more general case of XML/nested data exchange.

ACKNOWLEDGMENTS

Many thanks to Marcelo Arenas, Georg Gottlob, Renée J. Miller, Arnon Rosenthal, Wang-Chiew Tan, Val Tannen, and Moshe Y. Vardi for helpful suggestions, comments, and pointers to the literature.

REFERENCES

ABITEBOUL, S. AND DUSCHKA, O. M. 1998. Complexity of answering queries using materialized views. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS). 254–263.
ABITEBOUL, S., HULL, R., AND VIANU, V. 1995. Foundations of Databases. Addison-Wesley, Reading, MA.
BEERI, C. AND VARDI, M. Y. 1984. A proof procedure for data dependencies. J. Assoc. Comput. Mach. 31, 4, 718–741.
CHANDRA, A. K. AND MERLIN, P. M. 1977. Optimal implementation of conjunctive queries in relational data bases. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 77–90.
COSMADAKIS, S. 1983. The complexity of evaluating relational queries. Inform. Contr. 58, 101–112.
COSMADAKIS, S. S. AND KANELLAKIS, P. C. 1986. Functional and inclusion dependencies: A graph theoretic approach. In Advances in Computing Research, vol. 3. JAI Press, Greenwich, CT, 163–184.
DEUTSCH, A., POPA, L., AND TANNEN, V. 1999. Physical data independence, constraints and optimization with universal plans. In Proceedings of the International Conference on Very Large Data Bases (VLDB). 459–470.
DEUTSCH, A. AND TANNEN, V. 2003. Reformulation of XML queries and constraints. In Proceedings of the International Conference on Database Theory (ICDT). 225–241.
FAGIN, R. 1982. Horn clauses and database dependencies. J. Assoc. Comput. Mach. 29, 4 (Oct.), 952–985.
FAGIN, R., KOLAITIS, P. G., MILLER, R. J., AND POPA, L. 2003. Data exchange: Semantics and query answering. In Proceedings of the International Conference on Database Theory (ICDT). 207–224.
FRIEDMAN, M., LEVY, A.
Y., AND MILLSTEIN, T. D. 1999. Navigational plans for data integration. In Proceedings of the National Conference on Artificial Intelligence (AAAI). 67–73.
GOTTLOB, G. 2005. Cores for data exchange: Hard cases and practical solutions. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS).
GOTTLOB, G. AND FERMÜLLER, C. 1993. Removing redundancy from a clause. Artif. Intell. 61, 2, 263–289.
HALEVY, A. 2001. Answering queries using views: A survey. VLDB J. 10, 4, 270–294.
HELL, P. AND NEŠETŘIL, J. 1992. The core of a graph. Discr. Math. 109, 117–126.
KANELLAKIS, P. C. 1990. Elements of relational database theory. In Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics. Elsevier, Amsterdam, The Netherlands, and MIT Press, Cambridge, MA, 1073–1156.
LENZERINI, M. 2002. Data integration: A theoretical perspective. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS). 233–246.
MAIER, D., MENDELZON, A. O., AND SAGIV, Y. 1979. Testing implications of data dependencies. ACM Trans. Database Syst. 4, 4 (Dec.), 455–469.
MILLER, R. J., HAAS, L. M., AND HERNÁNDEZ, M. 2000. Schema mapping as query discovery. In Proceedings of the International Conference on Very Large Data Bases (VLDB). 77–88.
PAPADIMITRIOU, C. AND YANNAKAKIS, M. 1982. The complexity of facets and some facets of complexity. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 229–234.
PAPADIMITRIOU, C. H. 1994. Computational Complexity. Addison-Wesley, Reading, MA.
POPA, L., VELEGRAKIS, Y., MILLER, R. J., HERNÁNDEZ, M. A., AND FAGIN, R. 2002. Translating Web data. In Proceedings of the International Conference on Very Large Data Bases (VLDB). 598–609.
SHU, N. C., HOUSEL, B. C., TAYLOR, R. W., GHOSH, S. P., AND LUM, V. Y. 1977. EXPRESS: A data EXtraction, Processing, and REStructuring System. ACM Trans. Database Syst. 2, 2, 134–174.
VAN DER MEYDEN, R. 1998. Logical approaches to incomplete information: A survey. In Logics for Databases and Information Systems. Kluwer, Dordrecht, The Netherlands, 307–356.

Received October 2003; revised May 2004; accepted July 2004

Concise Descriptions of Subsets of Structured Sets

KEN Q. PU and ALBERTO O. MENDELZON
University of Toronto

We study the problem of economical representation of subsets of structured sets, which are sets equipped with a set cover or a family of preorders. Given a structured set U, and a language L whose expressions define subsets of U, the problem of minimum description length in L (L-MDL) is: "given a subset V of U, find a shortest string in L that defines V." Depending on the structure and the language, the MDL-problem is in general intractable. We study the complexity of the MDL-problem for various structures and show that certain specializations are tractable. The families of focus are hierarchy, linear order, and their multidimensional extensions; these are found in the context of statistical and OLAP databases. In the case of general OLAP databases, data organization is a mixture of multidimensionality, hierarchy, and ordering, which can also be viewed naturally as a cover-structured ordered set. Efficient algorithms are provided for the MDL-problem for hierarchical and linearly ordered structures, and we prove that the multidimensional extensions are NP-complete. Finally, we illustrate the application of the theory to summarization of large result sets and (multi) query optimization for ROLAP queries.

Categories and Subject Descriptors: H.2.1 [Database Management]: Logical Design—Data models; normal forms; H.2.3 [Database Management]: Languages

General Terms: Algorithms, Theory

Additional Key Words and Phrases: Minimal description length, OLAP, query optimization, summarization

1.
INTRODUCTION

This work was supported by the Natural Sciences and Engineering Research Council of Canada. Authors' address: Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ont., Canada M5S 3H5; email: {kenpu,mendel}@cs.toronto.edu. © 2005 ACM 0362-5915/05/0300-0211 $5.00. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 211–248.

Consider an OLAP or multidimensional database setting [Kimball 1996], where a user has requested to view a certain set of cells of the datacube, say in the form of a 100 × 20 matrix. Typically, the user interacts with a front-end query tool that ships SQL queries to a back-end database management system (DBMS). After perusing the output, the user clicks on some of the rows of the matrix, say 20 of them, and requests further details on these rows. Suppose each row represents data on a certain city. A typical query tool will translate the user request into a long SQL query with a WHERE clause of the form

city = city1 OR city = city2 ... OR city = city20.

However, if the set of cities happens to include every city in Ontario except Toronto, an equivalent but much shorter formulation would be

province = 'Ontario' AND city <> 'Toronto'.

Minimizing the length of the query that goes to the back end is advantageous for two reasons. First, many systems¹ have difficulty dealing with long queries, or even hard limits on query length.
Second, the shorter query can often be processed much faster than the longer one (even though an extra join may be required, e.g., if there is no Province attribute stored in the cube).

With this problem as motivation, we study concise representations of subsets of a structured set. By "structured" we simply mean that we are given a (finite) set, called the universe, and a (finite) set of symbols, called the alphabet, each of which represents some subset of the universe. We are also given a language L of expressions on the alphabet, and a semantics that maps expressions to subsets of the universe. Given a subset V of the universe, we want to find a shortest expression in the given language that describes V. We call this the L-MDL (minimum description length) problem. In the example above, the universe is the set of city names, the alphabet includes at least the city name Toronto plus a set of province names, and the semantics provides a mapping from province names to sets of cities. This is the simplest case, where the symbols in the alphabet induce a partition of the universe.

The most general language we consider, called L, is the language of arbitrary Boolean set expressions on symbols from the alphabet. In Section 2.1 we show that the L-MDL problem is solvable in polynomial time when the alphabet forms a partition of the universe. In particular, when the partition is granular, that is, when every element of the universe is represented by one of the symbols in the alphabet, we obtain a normal form for minimum-length expressions, leading to a polynomial-time algorithm.

Of course, in addition to cities grouped into provinces, we could have provinces grouped into regions, regions into countries, etc. That is, the subsets of the universe may form a hierarchy. We consider this case in Section 2.2 and show that the normal forms of the previous section can be generalized, leading again to a polynomial-time solution of the L-MDL problem.
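The introductory example can be quantified in a couple of lines of Python. The city list is hypothetical; the point is only that enumerating the selection costs one symbol per city, while the province-level description costs one province symbol plus one exception.

```python
# Hypothetical extension of the "province" symbol Ontario: 18 other
# cities plus Toronto.
ontario = {f"city{i}" for i in range(1, 19)} | {"Toronto"}
V = ontario - {"Toronto"}                  # the user's selection

enumeration_length = len(V)                # one symbol per selected city
structured_length = 1 + len(ontario - V)   # "Ontario - Toronto"
```

Here the enumeration has length 18, while the structured description has length 2, which is exactly the kind of saving the L-MDL problem formalizes.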
In the full OLAP context, elements of the universe can be grouped according to multiple independent criteria. If we think of a row in our initial example as a tuple <city, product, date, sales>, and the universe is the set of such tuples, then these tuples can be grouped by city into provinces, or by product into brands, or by date into years, etc. In Section 2.3 we consider the multidimensional case. In particular, we focus on the common situation in which each of the groupings is a hierarchy. We consider three increasingly powerful sublanguages of L, including L itself, and show that the MDL-problem is NP-complete for each of them.

In many cases, the universe is naturally ordered, such as the TIME dimension. In Section 3, we define order-structures to capture such ordering. A language L(≤) is defined to express subsets of the ordered universe. The MDL-problem is in general NP-complete, but in the case of one linear ordering, it can be solved in polynomial time.

Section 4 focuses on two areas of application of the theory: summarization of query answers and optimization of SELECT queries in OLAP. We consider the scenario of querying a relational OLAP database using simple SELECT queries, and show that it is advantageous to rewrite the queries into the corresponding compact expressions.

In Section 5.1, we describe some related MDL-problems and how they relate to the various languages presented in this article. We also present some existing OLAP query optimization techniques and how they relate to our approach. Finally, we summarize our findings and outline possibilities for future research in Section 6.

¹Many commercial relational OLAP engines naively translate user selections into simple SELECT SQL queries. It has been known that large enough user selections are executed as several SQL queries.

Fig. 1. A structured set.

2.
COVER STRUCTURES, LANGUAGES, AND THE MDL PROBLEM

In this section we introduce our model of structured sets and descriptive languages for subsets of them, and state the minimum description length problem.

Definition 1 (Cover Structured Set). A structured set is a pair of finite sets (U, Σ) together with an interpretation function [·] : Σ → Pwr(U) : σ ↦ [σ] which is injective, and is such that ∪σ∈Σ [σ] = U. The set U is referred to as the universe, and Σ as the alphabet.

Intuitively, the cover² structure of the set U is modeled by the grouping of its elements; each group is labeled by a symbol in the alphabet Σ. The interpretation of a symbol σ is the set of elements of U belonging to the group labeled by σ.

Example 1. Consider the cover structured set depicted in Figure 1. The universe is U = {1, 2, 3, 4, 5}. The alphabet is Σ = {A, B, C}. The interpretation function is [A] = {1, 2}, [B] = {2, 3, 5}, and [C] = {4, 5}.

Elements of the alphabet can be combined in expressions that describe other subsets of the universe. The most general language we will consider for these expressions is the propositional language that consists of all expressions composed of symbols from the alphabet and operators that stand for the usual set operations of union, intersection, and difference.

Definition 2 (Propositional Language). Given a structured set (U, Σ), its propositional language L(U, Σ) is defined as follows: the empty expression ε is in L(U, Σ); σ ∈ L(U, Σ) for all σ ∈ Σ; and if α, β ∈ L(U, Σ), then (α + β), (α − β), and (α · β) are all in L(U, Σ).

²The term cover refers to the fact that the universe U is covered by the interpretation of the alphabet Σ. Later, in Section 3, we introduce the order-structure, in which the universe is ordered.

Definition 3 (Semantics and Length).
The evaluation of L(U, Σ) is a function [·]∗ : L(U, Σ) → Pwr(U), defined as
— [ε]∗ = ∅,
— [σ]∗ = [σ] for any σ ∈ Σ, and
— [α + β]∗ = [α]∗ ∪ [β]∗, [α − β]∗ = [α]∗ − [β]∗, and [α · β]∗ = [α]∗ ∩ [β]∗.

The string length of L(U, Σ) is a function ‖·‖ : L(U, Σ) → N, given by
— ‖ε‖ = 0,
— ‖σ‖ = 1 for any σ ∈ Σ, and
— ‖α + β‖ = ‖α − β‖ = ‖α · β‖ = ‖α‖ + ‖β‖.

Remark. We abuse the definitions in a number of harmless ways. For instance, we may refer to U as a structured set, implying that it is equipped with an alphabet Σ and an interpretation function [·]. The language L(U, Σ) is sometimes written simply as L when the structured set (U, Σ) is understood from the context. The evaluation function [·]∗ supersedes the single-symbol interpretation function [·], so the latter is omitted from discussions and the simpler form [·] is used in place of [·]∗.

Two expressions s and t in L are equivalent if they evaluate to the same set, that is, [s] = [t]. (Note that this means equivalence with respect to a particular structured set (U, Σ) and thus does not coincide with propositional equivalence.) In case they are equivalent, we say that s is reducible to t if ‖s‖ ≥ ‖t‖. The expression s is strictly reducible to t if they are equivalent and ‖s‖ > ‖t‖. An expression is compact if it is not strictly reducible to any other expression in the language. Given a sublanguage K ⊆ L, an expression is K-compact if it belongs to K and is not strictly reducible to any other expression in K.

A language K ⊆ L(U, Σ) is granular if it can express every subset, or equivalently, every singleton, that is, (∀a ∈ U)(∃s ∈ K) [s] = {a}. We say that a structure is granular if the propositional language L(U, Σ) is granular. If L(U, Σ) is not granular, then certain subsets (specifically, singletons) of U cannot be expressed by any expression. The solution is then to augment the alphabet with sufficiently many additional symbols until it becomes granular.

Definition 4 (K-Descriptive Length).
Given a structured set (U, Σ), consider a sublanguage K ⊆ L(U, Σ) and a subset V ⊆ U. The language K(V) is the set of all expressions s ∈ K such that [s] = V, and the K-descriptive length of V, written ‖V‖_K, is defined as min{‖α‖ : α ∈ K(V)} if K(V) ≠ ∅, and ‖V‖_K = ∞ otherwise. In case K = L(U, Σ), we write ‖V‖_K simply as ‖V‖.

Fig. 2. A partition.

The K-descriptive length of a subset V is just the minimal length needed to express it in the language K.

Example 2. Continuing with the example of the structure shown in Figure 1, the language L(U, Σ) includes expressions like s1 = (A − B) − C, s2 = A − B, and s3 = (B − A) − C, with [s1] = [A − B] − [C] = ([A] − [B]) − [C] = {1} = [s2] and [s3] = [B − A] − [C] = ([B] − [A]) − [C] = {3}. The first two strings s1 and s2 are equivalent, but s2 is shorter in length; therefore s1 is strictly reducible to s2. It is not difficult to check that s2 is L(U, Σ)-compact, so ‖{1}‖ = 2.

Our first algorithmic problem is: what is the complexity of determining the minimum description length of a subset in the language K? We pose it as a decision problem.

Definition 5 (The K-MDL Decision Problem).
— INSTANCE: A structured set (U, Σ), a subset V ⊆ U, and a positive integer k > 0.
— QUESTION: Is ‖V‖_K ≤ k?

PROPOSITION 1. The L-MDL decision problem is NP-complete.

The proof of Proposition 1 rests on the simple observation that for any structured set (U, Σ), there is a naturally induced set cover on U, written U/Σ, given by U/Σ = {[σ] : σ ∈ Σ}. The general minimum set-cover problem [Garey and Johnson 1979] easily reduces to the general L-MDL problem. The next few sections will focus on some specific structures that are relevant to realistic databases.

2.1 Partition is in P

In this section we focus our attention on the simple case where the symbols in Σ form a partition of U.

Definition 6 (Partition).
A structured set (U, Σ) is a partition if the induced set cover U/Σ partitions U.

Example 3. Consider these streets: Grand, Canal, Broadway in the city NewYork; VanNess, Market, Mission in SanFrancisco; and Victoria, DeRivoli in Paris. The street names form the universe, which is partitioned by the alphabet consisting of the three city names, as shown in Figure 2.

PROPOSITION 2. The L-MDL decision problem for a partition (U, Σ) can be solved in O(|U| · log |U|) time.

The L-MDL decision problem for partitions is particularly easy because, given a subset V, ‖V‖_L is simply the number of cells that cover V exactly. Given the partition and V, computing the number of cells that cover V exactly can be done in O(|U| log |V|), and can in fact be further optimized to O(|V|) if special data structures are used.

Of course, in general not all subsets of street names can be expressed only by using city names; that is, the propositional language L(U, Σ) for a partition is not, in general, granular. We therefore extend the alphabet to be granular; this requires having additional symbols in Σ, one for each element of U.

Definition 7 (Granular Partition). A structured set (U, Σ) is a granular partition if Σ = Σ0 ∪ U (a disjoint union) where (U, Σ0) is a partition. The interpretation function [·] : Σ → Pwr(U) is extended such that [u] = {u} for any u ∈ U.

The L-MDL decision problem for granular partitions is also solvable in polynomial time. We first define a sublanguage Npar ⊆ L consisting of expressions which we refer to as normal, show that all expressions in L are reducible to ones in Npar, and use this to constructively show that the Npar-MDL decision problem is solvable in polynomial time.

Let A = {a1, a2, . . . , an} ⊆ Σ be a set of symbols. We write the expression a1 + a2 + · · · + an simply as A; that is, a set of symbols may stand in an expression for the sum of its elements.
The ordering of the symbols ai does not change the semantic evaluation or the length, so A may stand for any of the strings equivalent to a1 + a2 + · · · + an up to a permutation of {ai}, and [A] denotes the evaluation of any such string. For a set of expressions {si}, Σi si is the expression formed by concatenating the si with the + operator.

Definition 8 (Normal Form for Granular Partitions). Let (U, Σ0 ∪ U) be a granular partition, and let L be its propositional language. An expression s ∈ L is in normal form if it is of the form (Γ + A+) − A− where Γ ⊆ Σ0 and A+ and A− are sets of elements of U interpreted as symbols in Σ. The normal expression s is trim if A+ = [s] − [Γ] and A− = [Γ] − [s]. Let Npar(U, Σ) be the set of all normal expressions in L(U, Σ) that are trim.

Intuitively, a normal form expression consists of taking the union of some set of symbols from the alphabet, adding to it some elements from the universe, and subtracting some others. The expression is trim if we add and subtract exactly those symbols that we need to express a particular subset.

Note that all normal and trim expressions s ∈ Npar are uniquely determined by their semantics [s] and the high-level symbols Γ used. Therefore we can write π(V/Γ) to mean the normal and trim expression of the form Γ + A+ − A− where A+ = V − [Γ] and A− = [Γ] − V. In the interest of compact expressions, we only need to be concerned with normal expressions that are trim, for the following reasons.

PROPOSITION 3. A normal expression s = Γ + A+ − A− is L-compact only if A+ ∩ [Γ] = A− − [Γ] = ∅.

PROOF. If A+ ∩ [Γ] is nonempty, say a ∈ A+ ∩ [Γ], then define A′+ = A+ − {a} and s′ = Γ + A′+ − A−. It is clear that [s′] = [s] but ‖s′‖ < ‖s‖, so s cannot be L-compact. Similarly, if A− − [Γ] ≠ ∅, we can reduce s strictly as well.

PROPOSITION 4.
A normal expression for a granular partition is L-compact only if it is trim.

PROOF. Let s = Γ + A+ − A− be a normal expression, and say that it is not trim. Then either A+ ≠ [s] − [Γ] or A− ≠ [Γ] − [s]. We show that in either case the expression s can be strictly reduced. Say A+ ≠ [s] − [Γ]. There are two possibilities:

— A+ − ([s] − [Γ]) ≠ ∅: Since A+ ∩ [Γ] = ∅ by Proposition 3, it follows that A+ − [s] ≠ ∅. Let a ∈ A+ − [s]. Define s′ = Γ + (A+ − {a}) − (A− − {a}). It is easy to see that [s′] = [s] but ‖s′‖ < ‖s‖.
— ([s] − [Γ]) − A+ ≠ ∅: Recall that [s] = ([Γ] ∪ A+) − A−; since [Γ] ∪ A+ ⊇ [s], we have ([s] − [Γ]) − A+ = ∅ always, making this case impossible.

The second case, A− ≠ [Γ] − [s], implies that s is reducible by similar arguments.

LEMMA 1 (NORMALIZATION). Every expression in L is reducible to one in Npar.

PROOF. The proof is by induction on the construction of the expression s in L. The base cases of s = ε and s = σ are trivially reducible to Npar: the expression s = ε is reducible to ∅ + ∅ − ∅, which also has a length of zero, and the expression s = σ is reducible to {σ} + ∅ − ∅ if σ ∈ Σ0, and to ∅ + {σ} − ∅ if σ ∈ U.

The inductive step has three cases:

(i) Suppose that s = s1 + s2 where si ∈ Npar. We show that s is reducible to Npar. Write si = Γi + Ai+ − Ai−. Define Γ = Γ1 ∪ Γ2. Then, by Definition 8, we have

A+ = [s] − [Γ] = ([s1] ∪ [s2]) − [Γ] = ([s1] − [Γ]) ∪ ([s2] − [Γ]) ⊆ ([s1] − [Γ1]) ∪ ([s2] − [Γ2]) = A1+ ∪ A2+, and
A− = [Γ] − [s] = ([Γ1] − [s]) ∪ ([Γ2] − [s]) ⊆ ([Γ1] − [s1]) ∪ ([Γ2] − [s2]) = A1− ∪ A2−.

So the normal expression π([s]/Γ) = Γ + A+ − A− is equivalent to s, and its length satisfies

‖π([s]/Γ)‖ = ‖Γ + A+ − A−‖ = |Γ| + |A+| + |A−| ≤ |Γ1| + |A1+| + |A1−| + |Γ2| + |A2+| + |A2−| = ‖s1‖ + ‖s2‖ = ‖s‖.

(ii) Suppose that s = s1 · s2. Let si be as in (i), and define Γ = Γ1 ∩ Γ2.
By standard set manipulations similar to those in (i), we once again get $A^+ \subseteq A_1^+ \cup A_2^+$ and $A^- \subseteq A_1^- \cup A_2^-$. Hence $s$ is reducible to $\pi([s]/\Gamma)$.

218 • K. Q. Pu and A. O. Mendelzon

(iii) Finally, consider the case that $s = s_1 - s_2$ with $s_i$ in normal form as before. Let $\Gamma = \Gamma_1 - \Gamma_2$. Then one can show that $A^+ \subseteq A_1^+ \cup A_2^-$ and $A^- \subseteq A_1^- \cup A_2^+$. Again $s$ is reducible to $\pi([s]/\Gamma)$. This concludes the proof.

Lemma 1 immediately implies the following.

THEOREM 1. For all $V \subseteq U$, we have $\|V\|_{N_{par}} = \|V\|_L$.

By Theorem 1, one only needs to focus on the $N_{par}$-MDL problem for granular partitions. The necessary and sufficient condition for $N_{par}$-compactness can be easily stated in terms of the symbols used. Suppose $V \subseteq U$; let us denote

$\Sigma^+(V) = \{\sigma \in \Sigma : |[\sigma] \cap V| > |[\sigma] - V| + 1\}$,

and, very similarly,

$\Sigma^\#(V) = \{\sigma \in \Sigma : |[\sigma] \cap V| \geq |[\sigma] - V| + 1\}$.

Intuitively, the interpretation of a symbol in $\Sigma^+(V)$ includes more elements in $V$ than elements not in $V$, by a difference of at least two. Similarly, for a symbol in $\Sigma^\#(V)$, the difference is at least one. We say that symbols in $\Sigma^\#(V)$ are efficient with respect to $V$, and ones in $\Sigma^+(V)$ are strictly efficient. Symbols that are not in $\Sigma^\#(V)$ are inefficient with respect to $V$.

Example 4. Consider the partition in Figure 2. Let $V_1$ = {Victoria, DeRivoli} and $V_2$ = {Grand, Canal}. Then $\Sigma^\#(V_1) = \Sigma^+(V_1)$ = {Paris}, $\Sigma^\#(V_2)$ = {NewYork}, and $\Sigma^+(V_2) = \emptyset$.

LEMMA 2. Let $s = (\vec{\Gamma} + \vec{A^+}) - \vec{A^-}$ be an expression in $N_{par}$ representing $V$. It is $N_{par}$-compact if and only if $\Sigma^+(V) \subseteq \Gamma \subseteq \Sigma^\#(V)$.

PROOF (ONLY IF). We show by contradiction that $s$ being $N_{par}$-compact implies $\Sigma^+(V) \subseteq \Gamma \subseteq \Sigma^\#(V)$.

(i) Suppose $\Sigma^+(V) \not\subseteq \Gamma$; then there exists a symbol $\sigma \in \Sigma^+(V)$ with $\sigma \notin \Gamma$. Define $\Gamma' = \Gamma \,\dot\cup\, \{\sigma\}$ and $s' = \pi(V/\Gamma')$. We have that

$A'^+ = V - [\Gamma'] = V - ([\Gamma] \,\dot\cup\, [\sigma]) = (V - [\Gamma]) - [\sigma] = (V - [\Gamma]) - (V \cap [\sigma]) = A^+ - (V \cap [\sigma])$, and

$A'^- = [\Gamma'] - V = ([\Gamma] \,\dot\cup\, [\sigma]) - V = ([\Gamma] - V) \,\dot\cup\, ([\sigma] - V) = A^- \,\dot\cup\, ([\sigma] - V)$.
So $\|s'\| = |\Gamma'| + |A'^+| + |A'^-| = \|s\| + (|[\sigma] - V| + 1 - |V \cap [\sigma]|) < \|s\|$. This contradicts the assumption that $s$ is $N_{par}$-compact.

(ii) Say that $\Gamma \not\subseteq \Sigma^\#(V)$. Let $\omega \in \Gamma$ but $\omega \notin \Sigma^\#(V)$. Define $\Gamma' = \Gamma - \{\omega\}$ and $s' = \pi(V/\Gamma')$. We then have

$A'^+ = V - [\Gamma'] = V - ([\Gamma] - [\omega]) = A^+ \,\dot\cup\, (V \cap [\omega])$, and

$A'^- = [\Gamma'] - V = ([\Gamma] - [\omega]) - V = ([\Gamma] - V) - ([\omega] - V) = A^- - ([\omega] - V)$.

It follows then that $\|s'\| = |\Gamma'| + |A'^+| + |A'^-| = \|s\| + (|V \cap [\omega]| - |[\omega] - V| - 1) < \|s\|$. Again a contradiction.

(IF). It remains to be shown that $\Sigma^+(V) \subseteq \Gamma \subseteq \Sigma^\#(V)$ implies that $s$ is $N_{par}$-compact. Let $\Gamma_0 = \Sigma^+(V)$ and $s_0 = \pi(V/\Gamma_0)$. We are going to prove the following fact:

$(\forall \Gamma \subseteq \Sigma_0)\ \ \Sigma^+(V) \subseteq \Gamma \subseteq \Sigma^\#(V) \implies \|s_0\| = \|\pi(V/\Gamma)\|. \quad (*)$

By Equation (*), all expressions in $N_{par}$ with $\Sigma^+(V) \subseteq \Gamma \subseteq \Sigma^\#(V)$ have the same length; since, by the necessary condition and the guaranteed existence of an $N_{par}$-compact expression, one of them must be $N_{par}$-compact, all must be $N_{par}$-compact.

Now we prove (*). Consider any $s = \vec{\Gamma} + \vec{A^+} - \vec{A^-}$ with $\Sigma^+(V) \subseteq \Gamma \subseteq \Sigma^\#(V)$. Define $\Delta = \Gamma - \Sigma^+(V)$. Then $\Gamma = \Gamma_0 \,\dot\cup\, \Delta$, and

$A^+ = V - [\Gamma] = (V - [\Gamma_0]) - (V \cap [\Delta]) = A_0^+ - ([\Delta] \cap V)$, and

$A^- = [\Gamma] - V = ([\Gamma_0] - V) \,\dot\cup\, ([\Delta] - V) = A_0^- \,\dot\cup\, ([\Delta] - V)$.

It then follows that

$\|s_0\| = \|s\| + |V \cap [\Delta]| - |[\Delta] - V| - |\Delta|. \quad (**)$

Furthermore, since $\Delta \subseteq \Sigma^\#(V) - \Sigma^+(V)$ and $[\Delta] = \cup_{\gamma \in \Delta} [\gamma]$, we conclude that

$|V \cap [\Delta]| - |[\Delta] - V| = \sum_{\gamma \in \Delta} (|V \cap [\gamma]| - |[\gamma] - V|) = \sum_{\gamma \in \Delta} 1 = |\Delta|$.

Substituting into Equation (**), we have the desired result: $\|s_0\| = \|s\|$.

Intuitively, Lemma 2 tells us that an expression is $N_{par}$-compact if and only if it uses all strictly efficient symbols, and never uses any inefficient ones.

COROLLARY 1. Let $(U, \Sigma)$ be a granular partition. Given any $V \subseteq U$, $\pi(V/\Sigma^\#(V))$ is $L$-compact.

Computing $\pi(V/\Sigma^\#(V))$ is certainly in polynomial time.

THEOREM 2. The $L$-MDL problem for granular partitions can be solved in polynomial time.
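The procedure behind Theorem 2 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation; the partition of Figure 2 below is an assumption pieced together from Examples 4 and 5 (each city symbol mapping to a set of street names), and sets stand in for both the universe elements and the symbol interpretations.

```python
def efficient(interp, V, strict=False):
    """Sigma#(V) (or Sigma+(V) when strict): symbols whose block covers at
    least one (resp. two) more elements inside V than outside it."""
    margin = 1 if strict else 0
    return {s for s, block in interp.items()
            if len(block & V) > len(block - V) + margin}

def pi(interp, V, gamma):
    """The trim normal form pi(V/gamma): the chosen high-level symbols,
    the elements to add (A+), and the elements to subtract (A-)."""
    covered = set().union(*(interp[s] for s in gamma)) if gamma else set()
    return gamma, V - covered, covered - V

def compact(interp, V):
    """Corollary 1: pi(V / Sigma#(V)) is an L-compact description of V."""
    return pi(interp, V, efficient(interp, V))

# The partition of Figure 2, as far as it can be reconstructed from
# Examples 4 and 5 (an assumption): city symbols over street names.
interp = {"Paris": {"Victoria", "DeRivoli"},
          "NewYork": {"Grand", "Canal", "Broadway"}}

V = {"Victoria", "DeRivoli", "Grand", "Canal"}   # V1 u V2 of Example 5
gamma, plus, minus = compact(interp, V)
# gamma == {'Paris', 'NewYork'}, plus == set(), minus == {'Broadway'}:
# the expression (NewYork + Paris) - Broadway of length 3.
```

Per Lemma 2, NewYork (efficient but not strictly efficient here) may also be dropped from $\Gamma$, yielding the equally short Paris + Grand + Canal.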
Fig. 3. The STORE dimension as a tree.

Example 5. Consider $V_1$ and $V_2$ as defined in the previous example. By Lemma 2, both of the following expressions of $V_1 \cup V_2$ = {Victoria, DeRivoli, Grand, Canal} are compact: $s_1$ = (NewYork + Paris) − Broadway and $s_2$ = Paris + (Grand + Canal). Note that $\pi(V/\Sigma^\#(V))$ is $s_1$.

2.2 Hierarchy is in P

The partition has the nice property that its MDL problem is simple. However, it does not adequately express many realistic structures. We shall generalize the notion of (granular) partitions to (granular) multilevel hierarchies.

Definition 9 (Hierarchy). A structured set $(U, \Sigma)$ is a hierarchy if $\Sigma = \Sigma_1 \,\dot\cup\, \Sigma_2 \,\dot\cup\, \Sigma_3 \cdots \dot\cup\, \Sigma_N$, such that for any $i \leq N$, $(U, \Sigma_i)$ is a partition; furthermore, for any $i, j \leq N$, we have $i < j \implies U/\Sigma_i$ refines $U/\Sigma_j$. The integer $N$ is referred to as the number of levels or the height of the hierarchy, and $(U, \Sigma_i)$ is the $i$th level.

Example 6. We extend the partition in Figure 2 to form a hierarchy with three levels ($N = 3$), shown in Figure 3. The first level has $\Sigma_1$ being the street names, the second has $\Sigma_2$ being the city names, and finally the third level has $\Sigma_3$ having only one symbol, STORE.

Consider a hierarchy $(U, \Sigma_1 \,\dot\cup\, \Sigma_2 \cdots \dot\cup\, \Sigma_N)$. First note that it is granular if and only if in the first level $\Sigma_1 = U$, that is, $(U, \Sigma_1 \,\dot\cup\, \Sigma_2)$ is a granular partition. For $i < N$, we define $\Sigma^i = \bigcup_{k=i+1}^N \Sigma_k$. The alphabet $\Sigma^i$ contains all symbols in levels higher than the $i$th level of the hierarchy. We may view $\Sigma_i$ as a universe, and consider $(\Sigma_i, \Sigma^i)$ as a new hierarchy, with the interpretation function given by

$[\cdot]_i : \Sigma^i \to \mathrm{Pwr}(\Sigma_i) : \lambda \mapsto \{\sigma \in \Sigma_i : [\sigma] \subseteq [\lambda]\}$.

Let $L_i$ denote the propositional language $L(\Sigma_i, \Sigma^i)$. Much of the discussion regarding partitions naturally applies to hierarchies with some appropriate generalization.

Definition 10 (Normal Forms). An expression $s \in L_i$ is in normal form for the hierarchy if it is of the form $s = \hat{s} + \vec{A_i^+} - \vec{A_i^-}$, where $\hat{s} \in L_{i+1}$ is the leading subexpression of $s$, and $A_i^+, A_i^- \subseteq \Sigma_i$.
It is trim if $\hat{s}$ is $L_{i+1}$-compact, $A_i^+ = [s]_i - [\hat{s}]_i$, and $A_i^- = [\hat{s}]_i - [s]_i$. We denote by $(N_{hie})_i = N_{hie}(\Sigma_i, \Sigma^i)$ the set of all normal and trim expressions of the hierarchy $(\Sigma_i, \Sigma^i)$, and let $N_{hie} \equiv (N_{hie})_1$.

Fig. 4. The filled circles are the selected elements.

Here are some familiar results.

PROPOSITION 5. A normal expression in $L_i$ is $L_i$-compact only if it is trim.

The proof of Proposition 5 mirrors that of Proposition 4 exactly.

LEMMA 3 (NORMALIZATION). Every expression in $L_i$ can be reduced to one in $(N_{hie})_i$.

PROOF. We prove by induction on the construction of expressions in $(N_{hie})_i$. The base cases of $s = \epsilon$ and $s = \sigma$ are trivial.

Suppose that $s = s_1 + s_2 \in L_i$, where $s_k = \hat{s}_k + \vec{A_k^+} - \vec{A_k^-}$ for $k = 1, 2$. Let $t$ be an $L_{i+1}$-compact expression that $\hat{s}_1 + \hat{s}_2$ reduces to. Consider the normal expression $s' = t + \vec{A^+} - \vec{A^-}$ where $A^+ = [s]_i - [t]_i$ and $A^- = [t]_i - [s]_i$. Repeating the proof of Lemma 1, we have that $A^+ \subseteq A_1^+ \cup A_2^+$ and $A^- \subseteq A_1^- \cup A_2^-$. Therefore $s$ reduces to $s'$.

The cases for $s = s_1 \cdot s_2$ and $s = s_1 - s_2$ are handled similarly.

THEOREM 3. Let $(U, \Sigma)$ be a hierarchy; then for any $V \subseteq U$, $\|V\|_L = \|V\|_{N_{hie}}$.

Theorem 3 follows immediately from Lemma 3. As in the case for partitions, one only needs to focus on the expressions in $N_{hie}$, since $N_{hie}$-compactness implies $L$-compactness.

LEMMA 4 (NECESSARY CONDITION). Let $s \in (N_{hie})_i$, and $V = [s]_i$. It is $(N_{hie})_i$-compact only if $\Sigma_{i+1}^+(V) \subseteq [\hat{s}]_{i+1} \subseteq \Sigma_{i+1}^\#(V)$, where $\Sigma_{i+1}^+(V)$ and $\Sigma_{i+1}^\#(V)$ are, respectively, the strictly efficient and efficient alphabets in $\Sigma_{i+1}$ with respect to $V$.

The (only if) half of the proof of Lemma 2 applies with minimal modifications. Note that Lemma 4 mirrors Lemma 2. It states that the expression $s$ is compact only when $\hat{s}$ expresses all the strictly efficient symbols in $\Sigma_{i+1}$ with respect to $V$, and never any inefficient ones.
It is also worth noting that this condition is not sufficient, unlike the case in Lemma 2, as demonstrated in the following example.

Example 7. Consider the hierarchical structure shown in Figure 4. Let $V$ = {1, 2, 4, 5}. The expression $s = 1 + 2 + 4 + 5$ expressing $V$ is normal. Note that $\Sigma_2^+(V)$ is empty, so $s$ is also trim, but it is not compact, as it can be reduced to $s' = D - (3 + 6)$.

For any $i \leq N$, define a partial order $\sqsubseteq$ over $(N_{hie})_i$, such that for any two expressions $s, t \in (N_{hie})_i$,

$s \sqsubseteq t \iff [s]_i = [t]_i$ and $[\hat{s}]_{i+1} \supseteq [\hat{t}]_{i+1}$.

PROPOSITION 6. Let $s, t$ be two equivalent expressions in $N_{hie}$ which satisfy the necessary condition of Lemma 4. Then $s \sqsubseteq t \implies \|s\| \leq \|t\|$. In other words, $\|\cdot\| : (N_{hie}, \sqsubseteq) \to (\mathbb{N}, \leq)$ is order preserving.

PROOF. Write $s = \hat{s} + \vec{A^+} - \vec{A^-}$ and $t = \hat{t} + \vec{B^+} - \vec{B^-}$, and let $V = [s]_i = [t]_i$. By assumption, $[\hat{s}]_{i+1}$ and $[\hat{t}]_{i+1}$ are subsets of $\Sigma_{i+1}^\#(V)$. Define $\Delta = [\hat{s}]_{i+1} - [\hat{t}]_{i+1}$, which is also a subset of $\Sigma_{i+1}^\#(V)$. Recall that $A^+ = V - [\hat{s}]_i$ and $B^+ = V - [\hat{t}]_i$. Now,

$s \sqsubseteq t \implies [\hat{s}]_i \supseteq [\hat{t}]_i \implies A^+ \subseteq B^+$.

Furthermore,

$B^+ = V - [\hat{t}]_i = V - ([\hat{s}]_i - [\Delta]_i) = (V - [\hat{s}]_i) \,\dot\cup\, (V \cap [\Delta]_i) = A^+ \,\dot\cup\, (V \cap [\Delta]_i)$.

Similarly, we can show that $A^- = B^- \,\dot\cup\, ([\Delta]_i - V)$. Therefore,

$|B^+ - A^+| = |B^+| - |A^+| = |V \cap [\Delta]_i|$, and $|A^- - B^-| = |A^-| - |B^-| = |[\Delta]_i - V|$.

So

$\|s\| - \|t\| = (\|\hat{s}\| - \|\hat{t}\|) + (|A^+| - |B^+|) + (|A^-| - |B^-|) = (\|\hat{s}\| - \|\hat{t}\|) - |\Delta|$,

where the last step uses, as in the proof of Lemma 2, that each $\gamma \in \Delta \subseteq \Sigma_{i+1}^\#(V) - \Sigma_{i+1}^+(V)$ contributes $|V \cap [\gamma]_i| - |[\gamma]_i - V| = 1$. Observe that $\hat{s}$ is equivalent to $\hat{t} + \vec{\Delta}$, so $\|\hat{s}\| \leq \|\hat{t}\| + |\Delta|$. Therefore $\|s\| \leq \|t\|$.

Therefore, by minimizing with respect to $\sqsubseteq$, we are effectively minimizing the length. It is immediate from the definition of $\sqsubseteq$ that minimization over $(N_{hie})_i$ yields maximization of $[\hat{s}]_{i+1}$, which is bounded by $\Sigma_{i+1}^\#([s])$. This leads to the following recursive description of a minimal expression of a set $V$.

COROLLARY 2.
Let $\mathrm{minexp}_i : \mathrm{Pwr}(\Sigma_i) \to L_i$ be defined as

— $\mathrm{minexp}_N(V) = \vec{V}$,
— for $0 \leq i < N$, $\mathrm{minexp}_i(V) = \pi_i(V/\mathrm{minexp}_{i+1}(\Sigma_{i+1}^\#(V)))$, where $\pi_i(V/t)$ denotes the expression $t + \overrightarrow{(V - [t]_i)} - \overrightarrow{([t]_i - V)}$.

Then for any subset $V \subseteq U$, $\mathrm{minexp}_0(V)$ is an $N_{hie}$-compact expression for $V$.

Here is a bottom-up decomposition procedure to compute a minimal expression in $N_{hie}$ for a given subset $V \subseteq U$.

Definition 11 (Decomposition Operators). Define the following mappings for each $i \leq N$:

— $\Phi_i : \mathrm{Pwr}(\Sigma_i) \to \mathrm{Pwr}(\Sigma_{i+1}) : V \mapsto \Sigma_{i+1}^\#(V)$,
— $\Phi_i^+ : \mathrm{Pwr}(\Sigma_i) \to \mathrm{Pwr}(\Sigma_i) : V \mapsto V - [\Phi_i(V)]_i$, and
— $\Phi_i^- : \mathrm{Pwr}(\Sigma_i) \to \mathrm{Pwr}(\Sigma_i) : V \mapsto [\Phi_i(V)]_i - V$.

With these operators and Corollary 2, we can construct an $N_{hie}$-compact expression for a given set $V$ with respect to a hierarchy in an iterative fashion.

THEOREM 4. Suppose $V \subseteq U$. Let

— $V_1 = V$, and
— $V_{i+1} = \Phi_i(V_i)$, $W_i^+ = \Phi_i^+(V_i)$, and $W_i^- = \Phi_i^-(V_i)$, for $1 \leq i < N$.

Define the expressions

— $s_N = \vec{V_N}$,
— $s_{i-1} = s_i + \vec{W_{i-1}^+} - \vec{W_{i-1}^-}$ for $1 < i \leq N$.

Then $s_1$ is an $N_{hie}$-compact expression expressing $V$.

Corollary 2 follows from a simple induction on the number of levels of the hierarchy, showing that at each level the constructed expression satisfies the sufficient condition stated in Proposition 6. Clearly, the construction of $s_1$ can be carried out in polynomial time; in fact, it can be done in $O(|\Sigma| \cdot |V| \cdot \log|V|)$. The algorithm is illustrated in Figure 5.

Fig. 5. The decomposition algorithm for hierarchy.

Example 8. Consider the hierarchy in Figure 3. Let $V_1$ = {Victoria, DeRivoli, Grand, Broadway, Market}. The algorithm produces:

— $V_2$ = {Paris, NewYork}, with $W_1^+$ = {Market} and $W_1^-$ = {Canal},
— $V_3$ = {STORE}, with $W_2^+ = \emptyset$ and $W_2^-$ = {SanFrancisco}.

The expressions produced by the algorithm are

— $s_3$ = STORE,
— $s_2$ = STORE − SanFrancisco,
— $s_1$ = (STORE − SanFrancisco) + Market − Canal.
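The bottom-up sweep of Theorem 4 can be sketched directly in Python. This is only a sketch: the street-to-city assignment below is an assumption pieced together from Examples 4, 5, and 8 (in particular, "Mission" is a made-up second San Francisco street needed to keep SanFrancisco inefficient for $V_1$), and expressions are built as plain strings.

```python
def efficient(interp, V):
    """Sigma#(V): symbols covering more of V than of its complement."""
    return {s for s, block in interp.items() if len(block & V) > len(block - V)}

def decompose(levels, V):
    """Theorem 4: walk up the hierarchy recording what each level must
    add (Wi+) and remove (Wi-), then fold the expression back down.
    `levels` lists, per level i, a dict {symbol: interpretation}."""
    steps = []
    for interp in levels:                          # levels 1 .. N-1
        Vnext = efficient(interp, V)               # V(i+1) = Phi_i(Vi)
        covered = set().union(*(interp[s] for s in Vnext)) if Vnext else set()
        steps.append((V - covered, covered - V))   # (Wi+, Wi-)
        V = Vnext
    expr = " + ".join(sorted(V))                   # sN = vec(VN)
    for plus, minus in reversed(steps):            # s(i-1) = si + Wi+ - Wi-
        expr = ("(" + expr + ")") if (plus or minus) else expr
        expr += "".join(f" + {a}" for a in sorted(plus))
        expr += "".join(f" - {a}" for a in sorted(minus))
    return expr

# The STORE hierarchy of Figure 3 (assumed layout, see lead-in above).
level2 = {"Paris": {"Victoria", "DeRivoli"},
          "NewYork": {"Grand", "Canal", "Broadway"},
          "SanFrancisco": {"Market", "Mission"}}
level3 = {"STORE": {"Paris", "NewYork", "SanFrancisco"}}

V1 = {"Victoria", "DeRivoli", "Grand", "Broadway", "Market"}
print(decompose([level2, level3], V1))
# -> ((STORE) - SanFrancisco) + Market - Canal
```

Up to parenthesization, this is exactly the length-4 expression $s_1$ of Example 8.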
Since $s_1$ is guaranteed compact, $\|V_1\| = \|s_1\| = 4$. Note that $s_1$ is not the only compact expression; (NewYork − Canal) + Market + Paris, for instance, is another expression with length 4.

2.3 Multidimensional Partition and Hierarchy

An important family of structures is the multidimensional structures. The simplest is the multidimensional partition.

Definition 12 (Multidimensional Partition). A cover structure $(U, \Sigma)$ is a multidimensional partition if the alphabet $\Sigma = \Sigma_1 \,\dot\cup\, \Sigma_2 \cdots \dot\cup\, \Sigma_N$ where, for every $i$, $(U, \Sigma_i)$ is a partition as defined in Definition 6. The integer $N$ is the dimensionality of the structure. The partition $(U, \Sigma_i)$ is the $i$th dimension.

Note the subtle difference between a multidimensional partition and a hierarchy. A hierarchy has the additional constraint that the quotients $U/\Sigma_i$ are ordered by granularity, and is in fact a special case of the multidimensional partition; but, as one might expect, and as we shall show, the relaxed definition of the multidimensional partition leads to an NP-hard MDL problem. A simple extension of the multidimensional partition is the multidimensional hierarchy.

Definition 13 (Multidimensional Hierarchy). A cover structure $(U, \Sigma)$ is a multidimensional hierarchy if the alphabet $\Sigma = \Sigma_1 \,\dot\cup\, \Sigma_2 \cdots \dot\cup\, \Sigma_N$ where, for every $i$, $(U, \Sigma_i)$ is a hierarchy as defined in Definition 9. The integer $N$ is the dimensionality of the structure.

In this section, we will consider three languages which express subsets of the universe, with successively more grammatical freedom. It will be shown that the MDL decision problem is NP-complete for all three languages. In fact, we will show this on a specific kind of structure that we call product structures.
Intuitively, multidimensional partitions and multidimensional hierarchies make sense when the elements of the universe can be thought of as $N$-dimensional points, and each of the partitions or hierarchies operates along one dimension. Most of our discussion will focus on the two-dimensional (2D) case ($N = 2$), which is enough to yield the NP-completeness results. We next define product structures for the 2D case.

Definition 14 (2D Product Structure). We say that $(U, \Sigma)$ is a 2D product structure if the universe $U$ is the cartesian product of two disjoint sets $X$ and $Y$: $U = X \times Y$, and the alphabet is the union of $X$ and $Y$: $\Sigma = X \,\dot\cup\, Y$. The interpretation function is defined as, for any $z \in \Sigma$,

$[z] = \{z\} \times Y$ if $z \in X$, and $[z] = X \times \{z\}$ if $z \in Y$.

Note that the 2D product structure is granular, since the language $L(X \times Y, \Sigma)$ can express every singleton $\{(x, y)\} \in \mathrm{Pwr}(U)$ by the expression $(x \cdot y)$. The 2D product structure admits two natural expression languages, both requiring the notion of product expressions.

Definition 15 (Product Expressions). An expression $s \in L$ is a product expression if it is of the form $s = (\vec{A} \cdot \vec{B})$ where $A \subseteq X$ and $B \subseteq Y$.

We build up two languages using product expressions.

Definition 16 (Disjunctive Product Language). The disjunctive product language $L_{P+}$ is defined as

— $\epsilon \in L_{P+}$,
— any product expression $s$ belongs to $L_{P+}$,
— if $s, t \in L_{P+}$, then $(s + t) \in L_{P+}$.

It is immediate that any expression $s \in L_{P+}$ can be written in the form $\sum_{i \in I} s_i$ where, for any $i$, $s_i$ is a product expression. A generalization of the disjunctive product language is to allow other operators to connect the product expressions.

Definition 17 (Propositional Product Language). The propositional product language $L_P$ is defined as

— $\epsilon \in L_P$,
— any product expression $s$ belongs to $L_P$,
— if $s, t \in L_P$, then $(s + t), (s - t), (s \cdot t) \in L_P$.

Obviously $L_{P+} \subsetneq L_P \subsetneq L$.
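Definition 14 can be sketched directly with Python sets. This is a minimal illustration, not from the paper: expressions are encoded as nested tuples, and the symbol names x1, y1, etc. are placeholders.

```python
def interpretation(X, Y):
    """[z] for each symbol of a 2D product structure: a full row or column."""
    rows = {x: {(x, y) for y in Y} for x in X}
    cols = {y: {(x, y) for x in X} for y in Y}
    return {**rows, **cols}

def evaluate(expr, interp):
    """Evaluate a nested-tuple expression such as ('-', 'x1', 'y1')."""
    if isinstance(expr, str):
        return interp[expr]
    op, s, t = expr
    left, right = evaluate(s, interp), evaluate(t, interp)
    return {"+": left | right, "-": left - right, ".": left & right}[op]

X, Y = {"x1", "x2"}, {"y1", "y2", "y3"}
interp = interpretation(X, Y)

# x1 - y1 (a row minus a column) is an expression of L that is not in L_P;
# its value can still be written in L_P, e.g. as x1 . (y2 + y3).
s = ("-", "x1", "y1")
print(sorted(evaluate(s, interp)))
# -> [('x1', 'y2'), ('x1', 'y3')]
```

The length of an expression in this encoding is simply the number of string leaves in the tuple tree.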
Example 9. Consider a 2D product structure with CITY = {NewYork, SanFrancisco, Paris} and PRODUCT = {Clothing, Beverage, Automobile}. The universe $U$ = CITY × PRODUCT consists of the nine pairs of city name and product family:

$U$ = {(NewYork, Clothing), (NewYork, Beverage), (NewYork, Automobile), (SanFrancisco, Clothing), (SanFrancisco, Beverage), (SanFrancisco, Automobile), (Paris, Clothing), (Paris, Beverage), (Paris, Automobile)}.

The alphabet consists of six symbols:

$\Sigma$ = CITY $\dot\cup$ PRODUCT = {NewYork, SanFrancisco, Paris, Clothing, Beverage, Automobile}.

The interpretation of a symbol is the set of pairs in $U$ in which the symbol occurs. For instance, [Beverage] = {(NewYork, Beverage), (SanFrancisco, Beverage), (Paris, Beverage)}.

Consider the following expressions in $L(U, \Sigma)$:

— $s_1$ = ((NewYork + Paris) · Clothing) + (NewYork · Beverage),
— $s_2$ = ((NewYork + Paris) · (Clothing + Beverage)) − (NewYork · Clothing),
— $s_3$ = NewYork − Beverage.

The expressions satisfy $s_1 \in L_{P+}$, $s_2 \in L_P - L_{P+}$, and $s_3 \in L - L_P$. They are evaluated to

$[s_1]$ = {(NewYork, Clothing), (Paris, Clothing), (NewYork, Beverage)}, and

$[s_2]$ = {(NewYork, Beverage), (Paris, Clothing), (Paris, Beverage)}.

The last expression, $s_3$, is a bit tricky: it contains all tuples of NewYork that are not Beverage, so $[s_3]$ = {(NewYork, Clothing), (NewYork, Automobile)}.

We will see that the MDL decision problem for each of these languages is NP-complete.

2.4 The $L_P$-MDL Decision Problem is NP-Complete

In this section, we prove that the MDL problems for $L_{P+}$ and $L_P$ are NP-complete. It is obvious that they are both in NP. The proof of NP-hardness is by a reduction from the minimum three-set cover problem.

Fig. 6. A set cover with three cells.

Fig. 7. The transformed instance of the MDL problem of 2D product structure.

Recall that an instance of the minimum three-set cover problem consists of a set cover $C = \{C_1, C_2, \ldots, C_n\}$
where $(\forall C \in C)\ |C| = 3$, and an integer $k > 0$. The question is whether there exists a subcover $D \subseteq C$ such that $\cup D = \cup C$ and $|D| \leq k$. This is known to be NP-complete [Garey and Johnson 1979].

From this point on, we fix the instance $(C, k)$ of the minimum cover problem. Write $C = \{C_1, C_2, \ldots, C_n\}$. Define $X = \cup C$, and for each $i \leq n$, let $Y^i$ be a set such that $|Y^i| = m > 3$; the family $\{Y^i\}_n$ is made disjoint. Let $Y = (\dot\cup_{i \leq n} Y^i) \,\dot\cup\, \{y^*\}$, where $y^*$ does not belong to any $Y^i$. The structure is the 2D product structure of $X \times Y$. The subset to be represented is given by

$V = \dot\cup_{i \leq n} (C_i \times Y^i) \,\dot\cup\, (X \times \{y^*\})$.

It is not difficult to see that this is a polynomial-time reduction.

Example 10. Consider a set $X$ = {A, B, C, D, E} and a cover $C = \{C_1, C_2, C_3\}$ where $C_1$ = {A, B, C}, $C_2$ = {C, D, E}, and $C_3$ = {A, C, D}, as shown in Figure 6. It is transformed by first constructing $Y^1$, $Y^2$, and $Y^3$, all disjoint and each with four elements. Then let $Y = Y^1 \,\dot\cup\, Y^2 \,\dot\cup\, Y^3 \,\dot\cup\, \{y^*\}$. The structure is the 2D product structure of $X$ and $Y$. The subset is $V = (C_1 \times Y^1) \,\dot\cup\, (C_2 \times Y^2) \,\dot\cup\, (C_3 \times Y^3) \,\dot\cup\, (X \times \{y^*\})$. It is shown as the shaded boxes in Figure 7.

It turns out that for this very specific subset $V$, one can characterize the form of the compact expressions that express $V$ in $L_P$.

LEMMA 5. Let $V$ be a subset resulting from the reduction from a set cover problem (depicted in Figure 7). Then all $L_P$-compact expressions of $V$ are of the form

$s = \sum_{i \in I} (\vec{C_i} \cdot \vec{Y^i}) + \sum_{j \in J} (\vec{C_j} \cdot \vec{Y^{j*}})$,

where $Y^{j*} = Y^j \,\dot\cup\, \{y^*\}$, $I \cap J = \emptyset$, and $I \cup J = \{1, 2, \ldots, n\}$.

Note that, by Lemma 5, the $L_P$-compact expressions of $V$ do not make use of the negation "−" and conjunction "·" operators between product expressions; hence they belong to $L_{P+}$.

Example.
For the subset $V$ in Figure 7, the expression

$s = (\vec{C_1} \cdot \vec{Y^{1*}}) + (\vec{C_2} \cdot \vec{Y^{2*}}) + (\vec{C_3} \cdot \vec{Y^3})$

is both $L_P$- and $L_{P+}$-compact, since $C_1 \cup C_2$ covers $X$. Therefore $\|V\|_{L_P} = \|V\|_{L_{P+}} = (3 + 5) + (3 + 5) + (3 + 4) = 23$.

The proof of Lemma 5 is by ruling out all other possible forms. Before delving into the details of the proof of Lemma 5, let us use it to prove the NP-hardness of the $L_{P+}$-MDL and $L_P$-MDL problems.

THEOREM 5. $L_{P+}$-MDL and $L_P$-MDL are NP-complete for multidimensional partitions.

PROOF. This follows from Lemma 5. As we mentioned, $\|V\|_{L_{P+}} = \|V\|_{L_P}$. Let $s$ be an $L_P$-compact expression of $V$. Since

$s = \sum_{i \in I} (\vec{C_i} \cdot \vec{Y^i}) + \sum_{j \in J} (\vec{C_j} \cdot \vec{Y^{j*}})$,

its length is $\|s\| = \sum_{i \leq n} (|C_i| + |Y^i|) + |J| = (3 + m)n + |J|$. Since $[s] = V$, it is necessarily the case that $X \times \{y^*\} \subseteq [\sum_{j \in J} (\vec{C_j} \cdot \vec{Y^{j*}})]$, or that $\{C_j\}_{j \in J}$ covers $X$. Minimizing $\|s\|$ with $s$ in the given form is thus equivalent to minimizing $|J|$, that is, finding a minimal cover of $X$, which is of course the objective of the minimum set cover problem.

The proof of Lemma 5 makes use of the following results.

Definition 18 (Expression Rewriting). Let $\sigma$ be a symbol and $t$ an expression. The rewriting, written $\cdot : \sigma \to t$, is a function $L \to L : s \mapsto \langle s : \sigma \to t \rangle$, defined inductively as

— $\langle \epsilon : \sigma \to t \rangle = \epsilon$,
— for any symbol $\sigma' \in \Sigma$, $\langle \sigma' : \sigma \to t \rangle = t$ if $\sigma' = \sigma$, and $\sigma'$ otherwise,
— for any two strings $s, s' \in L$, $\langle s \circ s' : \sigma \to t \rangle = \langle s : \sigma \to t \rangle \circ \langle s' : \sigma \to t \rangle$, where $\circ$ can be $+$, $-$, or $\cdot$.

Basically, $\langle s : \sigma \to t \rangle$ replaces all occurrences of $\sigma$ in $s$ by the expression $t$.

Definition 19 (Extended Expression Rewriting). Given a set of symbols $\Sigma_0 \subseteq \Sigma$ and an expression $t$ that does not make use of symbols in $\Sigma_0$, $\langle s : \Sigma_0 \to t \rangle$ is the expression obtained by replacing every occurrence of a symbol in $\Sigma_0$ by the expression $t$.

PROPOSITION 7 (SYMBOL REMOVAL). For any expression $s \in L_P$ and any symbol $z \in X \,\dot\cup\, Y$, we have that $[\langle s : z \to \epsilon \rangle] = [s] - [z]$. In other words, $\langle s : z \to \epsilon \rangle \equiv s - z$.

PROOF.
We prove by induction on the number of product expressions in $s$. Suppose $s = \vec{A} \cdot \vec{B}$ where $A \subseteq X$ and $B \subseteq Y$. Without loss of generality, say $z \in A$; then

$[\langle s : z \to \epsilon \rangle] = (A - \{z\}) \times B = A \times B - [z] = [s] - [z]$.

The induction goes as follows:

$[\langle t + t' : z \to \epsilon \rangle] = [\langle t : z \to \epsilon \rangle + \langle t' : z \to \epsilon \rangle] = ([t] - [z]) \cup ([t'] - [z]) = ([t] \cup [t']) - [z] = [t + t'] - [z]$.

Similar arguments apply to the cases $t - t'$ and $t \cdot t'$.

We need to emphasize that Proposition 7 does not apply to expressions in $L$ in general. For instance, if $s = x$ and $z = y$, we have that $\langle x : y \to \epsilon \rangle = x \not\equiv x - y$.

PROPOSITION 8 (SYMBOL ADDITION). Let $s \in L_P$ and $x, x' \in X$, where $x'$ does not occur in $s$. Then

$[\langle s : x \to x + x' \rangle] = [s] \,\dot\cup\, (\{x'\} \times [s](x))$,

where $[s](x) = \{y \in Y : (x, y) \in [s]\}$. Similarly, $[\langle s : y \to y + y' \rangle] = [s] \,\dot\cup\, ([s](y) \times \{y'\})$.

PROOF. As a notational convenience, let us fix $x, x' \in X$ and write $\uparrow s = \langle s : x \to x + x' \rangle$ and $d(s) = \{x'\} \times [s](x)$. Let $\circ$ be $+$, $-$, or $\cdot$, and let $x'$ not occur in $s$ or $s'$; then, by simple arguments, $[s \circ s'](x) = [s](x) \circ [s'](x)$. It follows, then, that

$d(s \circ s') = \{x'\} \times [s \circ s'](x) = (\{x'\} \times [s](x)) \circ (\{x'\} \times [s'](x)) = d(s) \circ d(s')$.

So $d(\cdot)$ distributes over $+$, $-$, and $\cdot$.

We now prove Proposition 8 by induction on the number of product expressions in $s$. For $s = \epsilon$ or $s = \vec{A} \cdot \vec{B}$, it is obvious. Suppose that $s = t + t'$; then

$[\uparrow s] = [\uparrow t] \cup [\uparrow t'] = ([t] \,\dot\cup\, d(t)) \cup ([t'] \,\dot\cup\, d(t')) = ([t + t']) \cup (d(t + t'))$.

This is not sufficient yet, since we need to show that the union of $[t + t']$ and $d(t + t')$ is a disjoint one. It is not too difficult: recall that $d(t + t') = \{x'\} \times [t + t'](x)$, but $x'$ does not occur in $t$ nor in $t'$, and since $t, t' \in L_P$, $[t + t'] \cap [x'] = \emptyset$.

The cases for $s = t - t'$ and $s = t \cdot t'$ are handled similarly. We only wish to remark that, for these two cases, it is important to have the disjointness of $d(t)$ from both $[t]$ and $[t']$.

Again, Proposition 8 does not generalize to $L$. As a counterexample, say $s = x + y$. Then $\uparrow s = x + x' + y$, so $[s](x) = Y$.
Indeed $[\uparrow s] = [s] \cup d(s)$, but it is not a disjoint union, since $d(s) \cap [s] = \{(x', y)\}$.

We now prove Lemma 5, using the results in Propositions 7 and 8.

PROOF. Let us first define $\#z(s)$ to be the number of occurrences of the symbol $z$ in the expression $s$. Consider the reduction from an instance of the minimum three-set cover problem. We have the instance $C = \{C_i\}_{i \in I}$ where $|C_i| = 3$ for all $i \in I$. The reduction produces the universe $X \times Y$ where $X = \cup_i C_i$ and $Y = (\dot\cup_{i \in I} Y^i) \,\dot\cup\, \{y^*\}$, where the $Y^i$ are disjoint and $|Y^i| > 3$. The subset to be represented is $V = (\dot\cup_i (C_i \times Y^i)) \,\dot\cup\, (X \times \{y^*\})$. Let $s$ be an $L_P$-compact expression.

Claim I: $(\forall i \leq n)(\exists y \in Y^i)\ \#y(s) = 1$.

By contradiction, suppose that $(\exists i)(\forall y \in Y^i)\ \#y(s) > 1$; then let $s' = \langle s : Y^i \to \epsilon \rangle + (\vec{C_i} \cdot \vec{Y^i})$. By Proposition 7,

$[s'] = ([s] - [Y^i]) \cup (C_i \times Y^i) = ([s] - [Y^i]) \cup ([s] \cap [Y^i]) = [s]$.

So $s'$ is equivalent to $s$, but it is shorter in length:

$\|s'\| = \|s\| - \sum_{y \in Y^i} \#y(s) + |C_i| + |Y^i| \leq \|s\| - \sum_{y \in Y^i} 2 + |C_i| + |Y^i| = \|s\| - 2|Y^i| + |C_i| + |Y^i| = \|s\| + (|C_i| - |Y^i|) < \|s\|$.

Therefore $s$ strictly reduces to $s'$, which contradicts the compactness of $s$.

Claim II: $(\forall i)(\forall y \in Y^i)\ \#y(s) = 1$.

For contradiction, let us assume $(\exists i)(\exists y \in Y^i)\ \#y(s) \geq 2$. By Claim I, for this $i$ there exists at least one $z \in Y^i$ such that $\#z(s) = 1$. Define $s_1 = \langle s : y \to \epsilon \rangle$ and $s' = \langle s_1 : z \to z + y \rangle$. We show that $s$ reduces strictly to $s'$. First note that $[s_1] = [s] - [y]$ and $[s'] = [s_1] \,\dot\cup\, ([s_1](z) \times \{y\})$. However, $[s_1](z) = ([s] - [y])(z) = [s](z) - [y](z) = [s](z)$, since $[y](z) = \emptyset$; moreover, $[s](z) = [s](y) = C_i$, since $z$ and $y$ have the same cross-section $C_i$ in $V$. So

$[s'] = ([s] - [y]) \cup ([s](y) \times \{y\}) = ([s] - [y]) \cup ([s] \cap [y]) = [s]$.

In terms of its length, $\|s'\| = \|s_1\| + 1 = \|s\| - \#y(s) + 1 < \|s\|$. Again a contradiction.

Since each $y \in \cup_i Y^i$ must occur exactly once in $s$, $s$ must then be of the form as claimed.
This proof works for both $L_P$- and $L_{P+}$-compactness.

2.5 The General L-MDL Problem is NP-Complete

As mentioned, the symbol removal and addition rules do not hold in general for expressions in $L$, and, as a result, it is not guaranteed that the minimal expression for $V$ is of the form prescribed in Lemma 5. Here is an example.

Example. Consider once again the subset $V$ in Figure 7, and an expression in $L$ but not in $L_P$:

$s = (A + B) \cdot \vec{Y^1} + (D + E) \cdot \vec{Y^2} + (A + D) \cdot \vec{Y^3} + C + y^*$.

Note that $[s] = V$, but certainly $s$ is not of the form given in Lemma 5. Its length is

$\|s\| = (2 + 4) + (2 + 4) + (2 + 4) + 1 + 1 = 20$.

Therefore, in this case we have that $\|V\|_L < \|V\|_{L_P}$.

The richness of $L$ prevents us from using Lemma 5 to arrive at the NP-hardness of the $L$-MDL decision problem. We have to modify the reduction from the minimum three-set cover problem, and deal with the expressions in greater detail.

Definition 20 (Domain Dependency). Let $X_0 = X$ and $Y_0 = Y$ as defined in the reduction from a minimum cover problem. Define a sequence of sets $X_0, X_1, X_2, \ldots$ and $Y_0, Y_1, Y_2, \ldots$ such that for all $k \geq 0$, $X_{k+1} = X_k \,\dot\cup\, \{\alpha_k\}$ and $Y_{k+1} = Y_k \,\dot\cup\, \{\beta_k\}$, where $\alpha_k$ and $\beta_k$ are two symbols that do not belong to $X_k$ and $Y_k$, respectively. We therefore have a family of 2D product structures $\{X_k \times Y_k\}$ with the propositional languages $L^0 \subsetneq L^1 \subsetneq L^2 \subsetneq \cdots$.

Let $s \in L^k$, and for $k' \geq k$ write $[s]_{k'}$ for the evaluation of the expression $s$ in the language $L^{k'}$. For any $k \geq 0$, we say that $s \in L^k$ is domain independent if $\forall k' > k.\ [s]_{k'} = [s]_k$. If $s \in L^k$ is not domain independent, then it is domain dependent.

The notion of domain dependency naturally bipartitions the languages. Let $L^k_I = \{s \in L^k : s$ is domain independent$\}$ and $L^k_D = \{s \in L^k : s$ is domain dependent$\}$.
Given an expression $s$, whether it is domain dependent or not depends on its set of unbounded symbols, defined below.

Definition 21 (Bounded Expressions). Let $s$ be an expression in a propositional language. The set of unbounded symbols of $s$, $\mathcal{U}(s)$, is a set of symbols that appear in $s$, defined as

— $\mathcal{U}(\epsilon) = \emptyset$,
— $\mathcal{U}(\sigma) = \{\sigma\}$, and
— $\mathcal{U}(t + t') = \mathcal{U}(t) \cup \mathcal{U}(t')$, $\mathcal{U}(t - t') = \mathcal{U}(t) - \mathcal{U}(t')$, $\mathcal{U}(t \cdot t') = \mathcal{U}(t) \cap \mathcal{U}(t')$.

In case $\mathcal{U}(s) = \emptyset$, we say that $s$ is a bounded expression, or that it is bounded; otherwise $s$ is unbounded.

An expression $s \in L^k$ can be demoted to an expression in $L^{k-1}$ by erasing the symbols $\alpha_{k-1}$ and $\beta_{k-1}$, so that the resulting expression is one in $L^{k-1}$. Let us write $\downarrow^k_{k-1} s = \langle s : \alpha_{k-1} \to \epsilon : \beta_{k-1} \to \epsilon \rangle$. Therefore $\downarrow^k_{k-1} : L^k \to L^{k-1}$.

The following is a useful fact.

PROPOSITION 9. For $s \in L^k$, $[\downarrow^k_{k-1} s]_{k-1} = [s]_k \cap U_{k-1}$, where $U_{k-1} = X_{k-1} \times Y_{k-1}$.

The proof of Proposition 9 is by straightforward induction on $s$ in $L^k$. While $s \in L^k$ can be demoted to $L^{k-1}$, it can also be promoted to $L^{k+1}$ without any syntactic modification. Of course, when treated as an expression in $L^{k+1}$, it has a different evaluation.

PROPOSITION 10. For $s \in L^k$,

$[s]_{k+1} = [s]_k \,\dot\cup\, (U_X(s) \times \{\beta_k\}) \,\dot\cup\, (\{\alpha_k\} \times U_Y(s))$,

where $U_X(s) = \mathcal{U}(s) \cap X_k$ and $U_Y(s) = \mathcal{U}(s) \cap Y_k$.

PROOF. We write $\delta_k(s) = (U_X(s) \times \{\beta_k\}) \,\dot\cup\, (\{\alpha_k\} \times U_Y(s))$, so the result reads $[s]_{k+1} = [s]_k \,\dot\cup\, \delta_k(s)$. Note that $[s]_k = [\downarrow^{k+1}_k s]_k = [s]_{k+1} \cap U_k$, which is disjoint from $\delta_k(s)$; hence the union is disjoint. The union inside $\delta_k(s)$ is disjoint for the obvious reason that $\alpha_k \notin X_k$ and $\beta_k \notin Y_k$. By straightforward set manipulations, we can show that $\delta_k(t \circ t') = \delta_k(t) \circ \delta_k(t')$ for any $t, t' \in L^k$ and $\circ$ being $+$, $-$, or $\cdot$. The rest of the proof is by induction on the construction of $s$, mirroring exactly that of Proposition 8 with $\delta_k(s)$ in place of $d(s)$.

COROLLARY 3.
An expression is domain independent if and only if it is bounded; that is, $\forall k.\ s \in L^k_I \iff \mathcal{U}(s) = \emptyset$.

PROOF. $[s]_{k+1} = [s]_k \iff U_X(s) = U_Y(s) = \emptyset \iff \mathcal{U}(s) = \emptyset$.

Another result that follows from Proposition 9 and Corollary 3 is the following.

COROLLARY 4. If $s$ is domain independent in $L^k$, then for all $(x, y) \in [s]_k$, both $x$ and $y$ must appear in $s$.

PROOF. Let $s \in L^k_I$. We show the contrapositive statement: if $x$ or $y$ does not appear in $s$, then $(x, y) \notin [s]_k$. Let us say $x$ does not appear in $s$ (the case for $y$ is by symmetry). Since $s$ is domain independent, and $\mathcal{U}(\downarrow^k_{k-1} s) \subseteq \mathcal{U}(s) = \emptyset$, the expression $\downarrow^k_{k-1} s$ is also domain independent. We can take the arbitrarily removed symbols $\alpha_{k-1}$ and $\beta_{k-1}$ to be $x$ and some $z$ which does not appear in $s$ either, respectively. This means that $\downarrow^k_{k-1} s = s$, and $(x, y) \notin U_{k-1} \supseteq [\downarrow^k_{k-1} s]_{k-1} = [s]_k$ by Proposition 9.

The importance of domain dependency of expressions is demonstrated by the following results.

LEMMA 6. Let $V \subseteq X_0 \times Y_0$, and let $\|V\|_{L^k_I}$ and $\|V\|_{L^k_D}$ be the lengths of its shortest domain-independent and domain-dependent expressions in $L^k$, respectively. We have

$\forall k \geq 0.\ \|V\|_{L^k_I} \geq \|V\|_{L^{k+1}_I}$, and $\forall k \geq 0.\ \|V\|_{L^k_D} < \|V\|_{L^{k+1}_D}$.

PROOF. It is easy to see why $\|V\|_{L^k_I} \geq \|V\|_{L^{k+1}_I}$: let $s$ be an $L^k_I$-compact expression of $V$. Since $s \in L^{k+1}$ and $[s]_{k+1} = [s]_k = V$, it is also the case that $s \in L^{k+1}_I$; hence $\|V\|_{L^{k+1}_I} \leq \|s\| = \|V\|_{L^k_I}$.

To show $\|V\|_{L^k_D} < \|V\|_{L^{k+1}_D}$, let $s$ be an $L^{k+1}_D$-compact expression for $V$. By Proposition 9, $[\downarrow^{k+1}_k s]_k = [s]_{k+1} \cap U_k = V$. It is not difficult to see that $\downarrow^{k+1}_k s$ remains domain dependent, that is, $\mathcal{U}(\downarrow^{k+1}_k s) \neq \emptyset$; so $\downarrow^{k+1}_k s$ expresses $V$ and is domain dependent. Next we show that $\|\downarrow^{k+1}_k s\| < \|s\|$, by contradiction: if $\|\downarrow^{k+1}_k s\| = \|s\|$, then $\downarrow^{k+1}_k s = s$, since $\downarrow^{k+1}_k s$ is formed by removing symbols from $s$. Therefore we have that $[\downarrow^{k+1}_k s]_{k+1} = [s]_{k+1} = [\downarrow^{k+1}_k s]_k$. But, by Proposition 10, this means
that $\mathcal{U}(\downarrow^{k+1}_k s) = \emptyset$, which is a contradiction. Therefore,

$\|V\|_{L^k_D} \leq \|\downarrow^{k+1}_k s\| < \|s\| = \|V\|_{L^{k+1}_D}$.

This concludes the proof.

COROLLARY 5. For any $V \subseteq X_0 \times Y_0$, $\forall k > 2|V|.\ \|V\|_{L^k_I} < \|V\|_{L^k_D}$. In other words, $\forall k > 2|V|.\ \|V\|_{L^k} = \|V\|_{L^k_I}$.

Therefore, by enlarging the dimensions $X$ and $Y$ by adding $2|V|$ new symbols to each, we are guaranteed that all $L$-compact expressions are domain independent, and hence are bounded. The reason to force the compact expressions to be domain independent is so that we can reuse the symbol removal and addition rules of Propositions 7 and 8. From this point on, it is understood that the domain has been enlarged to $U_k$ for some $k > 2|V|$, and the subscript $k$ is dropped; for instance, we write $L_I$ for $L^k_I$.

PROPOSITION 11. Let $s \in L_I$. Then:

(1) $[\langle s : z \to \epsilon \rangle] = [s] - [z]$.
(2) If $z'$ does not occur in $s$, then $[\langle s : z \to z + z' \rangle] = [s] \,\dot\cup\, (\{z'\} \times [s](z))$.

PROOF. For (1), suppose $z$ is the symbol to be replaced with $\epsilon$. One can show, by induction on the subexpressions of $s$, that for all subexpressions $s'$ of $s$ and for all $x \in X$ and $y \in Y$, if $x \neq z$ and $y \neq z$ then $(x, y) \in [s'] \iff (x, y) \in [\langle s' : z \to \epsilon \rangle]$. Therefore we immediately have $[s] - [z] \subseteq [\langle s : z \to \epsilon \rangle]$. For the other containment, observe that $\mathcal{U}(\langle s : z \to \epsilon \rangle) \subseteq \mathcal{U}(s) = \emptyset$, so $\langle s : z \to \epsilon \rangle$ is also domain independent by Corollary 3. It follows, then, that every point $(x, y) \in [\langle s : z \to \epsilon \rangle]$ cannot be in $[z]$, by Corollary 4; so $x \neq z$ and $y \neq z$, and therefore $(x, y) \in [s] - [z]$.

For (2), $z$ is to be replaced with $z + z'$, where $z'$ does not occur in $s$. Without loss of generality, say $z \in X$. By induction, we can show that for all subexpressions $s'$ of $s$ and for all $y \in Y$, we have $(z, y) \in [s'] \iff (z', y) \in [\langle s' : z \to z + z' \rangle]$. It then follows that $[\langle s : z \to z + z' \rangle] = [s] \,\dot\cup\, (\{z'\} \times [s](z))$. The disjointness of the union comes from the fact that, since $s$ is domain independent and $z'$ does not appear in $s$, $[s] \cap [z'] = \emptyset$.

This allows us to repeat the arguments as in Lemma 5 to obtain the following.

LEMMA 7.
There exists an L^I-compact expression for V of the form (∗):

s = Σ_{i∈I} (C_i · Y⃗_i) + Σ_{j∈J} (C_j · Y⃗_j^∗).   (∗)

SKETCH OF PROOF. Let s be an L^I-compact expression for V. Following the arguments presented in the proof of Lemma 5, using the symbol addition and removal rules in Proposition 11, we obtain: (∀i)(∀y ∈ Y_i) #y(s) = 1. It is still possible that s is not in the form (∗), for L^I is flexible enough that there is no guarantee that, for all i, all y ∈ Y_i occur consecutively to form Y⃗_i.

Concise Descriptions of Subsets of Structured Sets • 233

But we can always rewrite s to that form. For each i, pick a y_i ∈ Y_i, and let Y′_i = Y_i − {y_i}. First rewrite s to s′ = s : Y′_i → ε, so that s′ results from s by replacing all occurrences of y ∈ Y_i with y ≠ y_i by the empty expression ε. Then construct s′′ = s′ : y_i → Y⃗_i. One can easily show that [s′′] = [s] and ‖s′′‖ = ‖s‖, and that in s′′ all occurrences of y occur in some Y⃗_i or Y⃗_i ∪ {y^∗}. Each Y⃗_i or Y⃗_i ∪ {y^∗} is necessarily individually bounded by C_i. Therefore s′′ is of the form (∗).

Finally we arrive at the more or less expected result:

THEOREM 6. The L-MDL decision problem is NP-complete for multidimensional partitions.

3. THE ORDER-STRUCTURE AND LANGUAGES

So far, all the aforementioned structures are cover structures, namely, structures characterized by a set cover of the universe. Another important family of structures is the order-structure, where structures are characterized by a family of partial orders on the universe.

Definition 22 (Order-Structured Set and Its Language). An order-structured set is a set equipped with partial order relations (U, ≤_1, ≤_2, …, ≤_N). The language L(U, ≤_1, …, ≤_N) is given by

— ε is an expression in L(U, ≤_1, …, ≤_N),
— for any a ∈ U, a is an expression in L(U, ≤_1, …, ≤_N),
— for any a, b ∈ U and 1 ≤ i ≤ N, (a →_i b) is an expression in L(U, ≤_1, …
, ≤_N),
— (s + t), (s − t), and (s · t) are all expressions in L(U, ≤_1, …, ≤_N), given that s, t ∈ L(U, ≤_1, …, ≤_N), and
— nothing else is in L(U, ≤_1, …, ≤_N).

When no ambiguity arises, we write L(U, ≤_1, …, ≤_N) as L. Similarly to the proposition language for cover-structured sets, we define the expression evaluation and length for the language L(U, ≤_1, …, ≤_N).

Definition 23 (Semantics and Length). The evaluation function [·] : L(U, ≤_1, …, ≤_N) → Pwr(U) is defined as

— [ε] = ∅,
— [a] = {a} for any a ∈ U,
— [a →_i b] = {c ∈ U : a ≤_i c and c ≤_i b},
— [s + t] = [s] ∪ [t], [s − t] = [s] − [t], and [s · t] = [s] ∩ [t].

The length ‖·‖ : L(U, ≤_1, …, ≤_N) → ℕ is given by ‖ε‖ = 0, ‖a‖ = 1, ‖a →_i b‖ = 2, and ‖s + t‖ = ‖s − t‖ = ‖s · t‖ = ‖s‖ + ‖t‖.

Example 11. Consider a universe of names for cities: Toronto (TO), San Francisco (SF), New York City (NYC), and Los Angeles (LA); U = {TO, SF, NYC, LA}. We consider three orders. First, they are ordered from east to west: NYC ≤_1 TO ≤_1 LA ≤_1 SF. Independently, they are also ordered from south to north: LA ≤_2 SF ≤_2 NYC ≤_2 TO. Finally, we know that San Francisco (SF) is much smaller in population than Toronto (TO) and Los Angeles (LA), which are comparable, and in turn New York City (NYC) has the largest population by far. Therefore, by population, we order them partially as SF ≤_3 TO, SF ≤_3 LA, TO ≤_3 NYC, and LA ≤_3 NYC, but TO and LA are incomparable with respect to ≤_3.

Fig. 8. A set cover.

The following are expressions in L(U, ≤_1, ≤_2, ≤_3):

— s_1 = LA →_2 TO; the cities north of LA and south of TO, inclusively, and [s_1] = U.
— s_2 = (SF →_3 NYC) − (SF + NYC); the cities larger than SF but smaller than NYC, so [s_2] = {TO, LA}.
— s_3 = ((NYC →_1 LA) · (LA →_2 NYC)) − (NYC + LA); the cities strictly between NYC and LA in both latitude and longitude, and [s_3] = ∅.
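To make Definition 23 concrete, the evaluation function [·] can be sketched directly over Example 11. This is our own illustrative code, not an implementation from the paper: expressions are modeled as nested tuples, and each order ≤_i is materialized as its reflexive-transitive closure.

```python
# Illustrative evaluator for Definition 23 over Example 11. The tuple
# encoding and helper names are our own assumptions, not the paper's.

U = {"TO", "SF", "NYC", "LA"}

def closure(pairs, universe):
    """Reflexive-transitive closure of a set of (a, b) pairs with a <= b."""
    rel = set(pairs) | {(u, u) for u in universe}
    changed = True
    while changed:
        changed = False
        for (a, b) in list(rel):
            for (c, d) in list(rel):
                if b == c and (a, d) not in rel:
                    rel.add((a, d))
                    changed = True
    return rel

LE = {
    1: closure([("NYC", "TO"), ("TO", "LA"), ("LA", "SF")], U),   # east to west
    2: closure([("LA", "SF"), ("SF", "NYC"), ("NYC", "TO")], U),  # south to north
    3: closure([("SF", "TO"), ("SF", "LA"),                       # population
                ("TO", "NYC"), ("LA", "NYC")], U),
}

def ev(expr):
    """Evaluate an expression to a subset of U, per Definition 23."""
    if expr == ():                      # the empty expression: [e] = {}
        return set()
    if isinstance(expr, str):           # a single symbol a: [a] = {a}
        return {expr}
    op = expr[0]
    if op == "range":                   # (a ->_i b) = {c : a <=_i c <=_i b}
        _, i, a, b = expr
        return {c for c in U if (a, c) in LE[i] and (c, b) in LE[i]}
    s, t = ev(expr[1]), ev(expr[2])
    return {"+": s | t, "-": s - t, ".": s & t}[op]

s1 = ("range", 2, "LA", "TO")
s2 = ("-", ("range", 3, "SF", "NYC"), ("+", "SF", "NYC"))
s3 = ("-", (".", ("range", 1, "NYC", "LA"), ("range", 2, "LA", "NYC")),
      ("+", "NYC", "LA"))
```

Evaluating these reproduces the example: [s_1] = U, [s_2] = {TO, LA}, and [s_3] = ∅.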
The notion of compactness and the MDL-problem naturally extend to expressions of order-structures. Unfortunately, the general L(U, ≤)-MDL problem is intractable even with one order relation.

PROPOSITION 12. Even with one partial order ≤, the L(U, ≤)-MDL decision problem is NP-complete.

SKETCH OF PROOF. We reduce from the minimum set cover problem. Let C = {C_i}_{i∈I} where, without loss of generality, we assume that each C_i has at least five elements that are not covered by the other sets {C_j : j ≠ i}. This can always be ensured by duplicating each element in the set into five distinct copies. The universe of our order-structured set is U = ∪̇_{i∈I} (C_i ∪ {⊤_i, ⊥_i}): for each cover set C_i, we introduce two new symbols ⊤_i and ⊥_i. The ordering ≤ is defined as (∀i ∈ I)(∀c ∈ C_i) ⊥_i < c and c < ⊤_i. Nothing else is comparable.

Consider the instance of a set-cover problem shown in Figure 8. We first duplicate each element into five copies, and obtain another instance shown in Figure 9. Finally, the resulting order-structure is shown in Figure 10.

The subset to be expressed is ∪_{i∈I} C_i, and its L(U, ≤)-compact expression is always of the form s = Σ_{j∈J} ((⊥_j → ⊤_j) − (⊥_j + ⊤_j)). It will not mention individual elements in any of the C_i, since by the symmetry of the problem, if one were mentioned, then its copies would be mentioned too, and that would use five symbols, which is longer than (⊥_i → ⊤_i) − (⊥_i + ⊤_i). Its length is then 4|J|, where |J| is the number of cover sets needed to cover ∪_{i∈I} C_i. Minimizing |J| is equivalent to minimizing ‖s‖.

Fig. 9. Each element is duplicated.

Fig. 10. The transformed order-structure.

3.1 Linear Ordering Is in P

We say that an order-structure (U, ≤) is linear if there is only one ordering and it is linear, that is, if every two elements u, u′ ∈ U are comparable.
Therefore, (U, ≤) forms a chain, and in this case, not surprisingly, the MDL-problem is solvable in polynomial time. The formal argument for this statement is analogous to that for partitions. In this section, we fix the structure (U, ≤) to be linear.

Definition 24 (Closure and Segments). Let A ⊆ U. Its closure Ā is defined as Ā = {u ∈ U : (∃a, b ∈ A) a ≤ u and u ≤ b}. A segment is a subset A of U such that A = Ā. The length of a segment is simply |A|.

Segments are particularly easy to express: if A is a segment with length greater than 2, then ‖A‖_{L(U,≤)} = 2 always, since A can be expressed by the expression (min A → max A) using only two symbols. A segment of V is simply a segment A such that A ⊆ V. We denote the set of maximal segments in V by SEG(V). Note that maximal segments are pairwise disjoint. The set SEG(V) also has a natural compact expression, Σ_{A∈SEG(V)} (min A → max A), which from now on we call SEG⃗(V).

Example. Consider a universe U with 10 elements linearly ordered by ≤. We simply call them 1 to 10, and ≤ is the ordering of the natural numbers. Let V be {2, 4, 5, 7, 8}, shown in Figure 11. The segments of V are {2}, {4, 5}, and {7, 8}, and SEG⃗(V) = (2 → 2) + (4 → 5) + (7 → 8).

PROPOSITION 13. For any two subsets A and B, we have, for ⊙ being any of ∪, ∩, or −, |SEG(A ⊙ B)| ≤ |SEG(A)| + |SEG(B)|. Therefore, ‖SEG⃗(A ⊙ B)‖ ≤ ‖SEG⃗(A)‖ + ‖SEG⃗(B)‖.

Fig. 11. A subset V of the universe: filled elements belong to V.

One might at first be tempted to just express a set V by its segment decomposition SEG⃗(V). But we can in general do better than that. For instance, consider the previous example with V shown in Figure 11. The expression SEG⃗(V) has a length of 6, but V can be expressed by only four symbols by s = ((4 → 8) + 2) − 6 or (2 → 8) − (3 + 6).
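The segment decomposition SEG⃗(V) of Definition 24 can be computed by a single left-to-right scan of the universe. The sketch below is ours (illustrative helper names, integer universe as in the example); it returns the maximal segments and the length 2·|SEG(V)| of the resulting expression.

```python
# Sketch of the segment decomposition of Definition 24 on a linearly
# ordered universe; function names are our own, and the universe is the
# integer chain 1..10 of the running example.

def maximal_segments(V, universe):
    """Maximal segments of V, as (min, max) pairs, in one scan."""
    segs, run = [], []
    for u in universe:                  # universe listed in increasing order
        if u in V:
            run.append(u)
        elif run:
            segs.append((run[0], run[-1]))
            run = []
    if run:
        segs.append((run[0], run[-1]))
    return segs

def seg_expression(V, universe):
    """The expression SEG(V) as text, plus its length 2*|SEG(V)|."""
    segs = maximal_segments(V, universe)
    expr = " + ".join(f"({a} -> {b})" for a, b in segs)
    return expr, 2 * len(segs)

U = list(range(1, 11))
V = {2, 4, 5, 7, 8}
```

On V = {2, 4, 5, 7, 8} this yields (2 → 2) + (4 → 5) + (7 → 8) of length 6; as noted above, a shorter four-symbol expression such as (2 → 8) − (3 + 6) exists, so SEG⃗(V) is a starting point rather than the compact expression.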
For the remainder of this section, we fix the subset V and assume that V does not contain the extrema max U, min U of U. This restriction on V relieves us from considering some trivial cases, and can be lifted without loss of generality.

Definition 25 (Normal Form for Linear Order-Structures). The normal form is the sublanguage N_lin of L(U, ≤) consisting of expressions of the form s = t + A⃗^+ − A⃗^−, where the subexpression t = Σ_i (a_i → a′_i) is a union of segments, A^+ = [s] − [t], and A^− = [t] − [s].

LEMMA 8. For the linear order-structure (U, ≤), every expression of L(U, ≤) can be reduced to an expression in N_lin.

OUTLINE OF PROOF. The proof is very similar to that of Lemma 1. It is by induction. The base cases of s = ε and s = u for u ∈ U are trivial. Let s_1 = t_1 + A⃗_1^+ − A⃗_1^− and s_2 = t_2 + A⃗_2^+ − A⃗_2^− be two expressions already in N_lin. We need to show that s_1 + s_2, s_1 − s_2, and s_1 · s_2 are all reducible to N_lin.

s = s_1 + s_2: Let t = SEG⃗([t_1] ∪ [t_2]), A^+ = [s] − [t], and A^− = [t] − [s]. Then we have that ‖t‖ ≤ ‖t_1‖ + ‖t_2‖, ‖A⃗^+‖ ≤ ‖A⃗_1^+‖ + ‖A⃗_2^+‖, and ‖A⃗^−‖ ≤ ‖A⃗_1^−‖ + ‖A⃗_2^−‖ (as was the case in the proof of Lemma 1). The other two cases are handled similarly.

COROLLARY 6. ‖V‖_L = ‖V‖_{N_lin}.

Therefore the L(U, ≤) MDL-problem reduces to the N_lin MDL-problem when the ordering is linear. We only need to show that the latter is tractable.

Definition 26 (Neighbors, Holes, Isolated, Interior, and Exterior Points). Consider an element u in the universe U. We define

u − 1 = max{u′ ∈ U : u′ < u} if u ≠ min U, and u − 1 is undefined if u = min U;
u + 1 = min{u′ ∈ U : u′ > u} if u ≠ max U, and u + 1 is undefined if u = max U;

to be the immediate predecessor and the immediate successor of u, respectively. We say that u ∈ U is a hole in V if u ∉ V but {u − 1, u + 1} ⊆ V. The set of all holes of V is denoted by Hol(V).
An element u ∈ U is an isolated point of V if u ∈ V but u − 1, u + 1 ∉ V. The set of all isolated points is denoted by Pnt(V). An interior point u of V is one such that u ∈ V and at least one of u − 1, u + 1 is also in V. The set of all interior points of V is Int(V). Conversely, an exterior point of V is an element u ∈ U such that u ∉ V and {u − 1, u + 1} − V ≠ ∅. Ext(V) is the set of all exterior points of V.

Example. Consider the subset V of the universe in Figure 11. Observe that Hol(V) = {3, 6}, Pnt(V) = {2}, Int(V) = {4, 5, 7, 8}, and Ext(V) = {1, 9, 10}. Note that the universe is partitioned into the holes, isolated, interior, and exterior points of V.

These concepts allow us to define extended segments of V, which are very useful in constructing a compact expression of V.

Definition 27 (Extended Segments). A subset A is an extended segment of V if A ⊆ V ∪ Hol(V), A = Ā, and A ∩ Int(V) ≠ ∅.

So an extended segment is a segment that can only contain elements of V and holes of V, and must contain at least one interior point of V. Observe that the maximal extended segments of V are pairwise disjoint. The set of the maximal extended segments is denoted by XSEG(V). The expression Σ_{A∈XSEG(V)} (min A → max A) is denoted by XSEG⃗(V).

Example. Again, consider V in Figure 11. The extended segments of V include {2, 3, 4, 5}, {4, 5, 6, 7, 8}, {5, 6, 7}, and so on. In general, there could be many maximal extended segments, but in this case there is only one: {2, 3, 4, 5, 6, 7, 8}. Therefore XSEG⃗(V) = (2 → 8).

THEOREM 7. The expression s∗ = t∗ + A⃗_∗^+ − A⃗_∗^−, where t∗ = XSEG⃗(V), A_∗^+ = V − [t∗], and A_∗^− = [t∗] − V, is compact for V in N_lin.

SKETCH OF PROOF. We show that any expression s ∈ N_lin for V can be reduced to s∗. The proof is by describing explicitly a set of rewrite procedures that take any expression of V in the normal form and reduce it to s∗.
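Before walking through the rewrite steps, the point classes of Definition 26 and the target expression s∗ of Theorem 7 can be computed directly on the running example. The sketch below is ours (it assumes an integer universe where u ± 1 are the neighbors) and illustrates the definitions rather than the paper's rewriting procedure.

```python
# Our own illustrative computation of Hol/Pnt/Int/Ext (Definition 26),
# XSEG(V) (Definition 27), and the expression s* of Theorem 7, assuming
# an integer universe where u-1 and u+1 are the neighbors of u.

def classify(V, universe):
    """Partition the universe into holes, isolated, interior, exterior points."""
    hol, pnt, inte, ext = set(), set(), set(), set()
    for u in universe:
        left, right = (u - 1) in V, (u + 1) in V
        if u in V:
            (inte if (left or right) else pnt).add(u)
        else:
            (hol if (left and right) else ext).add(u)
    return hol, pnt, inte, ext

def xseg(V, universe):
    """Maximal extended segments: maximal runs inside V | Hol(V) that
    contain at least one interior point of V."""
    hol, _, inte, _ = classify(V, universe)
    allowed = V | hol
    segs, run = [], []
    for u in sorted(universe) + [None]:   # None flushes the final run
        if u in allowed:
            run.append(u)
        else:
            if run and inte & set(run):
                segs.append((run[0], run[-1]))
            run = []
    return segs

def compact_expression(V, universe):
    """s* = XSEG(V) + (V - [t*]) - ([t*] - V), with its length."""
    segs = xseg(V, universe)
    t = {u for a, b in segs for u in universe if a <= u <= b}
    plus, minus = sorted(V - t), sorted(t - V)
    parts = [f"({a} -> {b})" for a, b in segs] + [str(u) for u in plus]
    expr = " + ".join(parts)
    if minus:
        expr += " - (" + " + ".join(str(u) for u in minus) + ")"
    return expr, 2 * len(segs) + len(plus) + len(minus)

U = set(range(1, 11))
V = {2, 4, 5, 7, 8}
```

On the running example this reproduces Hol(V) = {3, 6}, Pnt(V) = {2}, Int(V) = {4, 5, 7, 8}, Ext(V) = {1, 9, 10}, XSEG⃗(V) = (2 → 8), and the four-symbol expression (2 → 8) − (3 + 6).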
Without loss of generality, we assume that all segments in t are of length at least two.

(1) First we make sure that all the segments a → a′ in t are such that a, a′ ∈ V, and that all segments are disjoint; this can be done without increasing the length of the expression.

(2) Remove exterior points from [t]: if there is an exterior point u in [t], then it appears in some a → a′ in t. Since u ∈ Ext(V), at least one of its neighbors u′ ∈ {u − 1, u + 1} must also be exterior to V and appear in a → a′. Both u and u′ must then appear in A⃗^−. Rewrite a → a′ into at most two segments a → b and b′ → a′ such that u and its neighbor u′ are no longer included in t. This increases the length of t by at most 2. We then remove u, u′ from A⃗^−. The overall expression length is not increased.

(3) Add all interior points to [t]: if there is an interior point u that is not in [t], then it must appear in A⃗^+. Since u ∈ Int(V), there is a neighbor u′ ∈ {u − 1, u + 1} ∩ V. If u′ ∉ [t], then it is in A⃗^+ as well; in this case, create a new segment u → u′ (or u′ → u if u′ = u − 1) in t, and delete u, u′ from A⃗^+. If u′ ∈ [t], then it must appear in some segment a → a′ in t; extend the segment to include u, and delete u from A⃗^+.

(4) Remove all segments in [t] not containing interior points: if there is a segment a → a′ in t with [a → a′] ∩ Int(V) = ∅, then it must contain only isolated points and holes of V, but no exterior points (by step 2). Furthermore, since the endpoints a, a′ ∈ V (by step 1), there is one more isolated point than there are holes in [a → a′]. The holes appear in A⃗^−. Delete a → a′ from t, the holes from A⃗^−, and add the isolated points to A⃗^+. The overall expression length is then reduced by 1.

At this point, observe that all segments in t contain some interior points and no exterior points, and hence are extended segments of V. Therefore, [t] ⊆ ⋃ XSEG(V).
(5) Add ⋃ XSEG(V) − [t] to [t]: consider u ∈ ⋃ XSEG(V) − [t], and let A ∈ XSEG(V) with u ∈ A. The segment A must contain an interior point v, which must appear in some segment a → b in t. It is always possible to extend a → b (and possibly merge it with neighboring segments in t) to cover u. The extension will include some holes and isolated points, which need to be added to A⃗^− and removed from A⃗^+, respectively. This can always be done without increasing the length of the expression.

By the end of the rewriting, we have [t] = ⋃ XSEG(V), and clearly the minimal expression for [t] is XSEG⃗(V).

COROLLARY 7. The L(U, ≤) MDL-problem can be solved in linear time for linear order-structures.

3.2 Multilinear Ordering Is "Hard"

It is not terribly realistic to consider only a single ordering of the universe. There are often many: we may order people by age, by their names, or by some other attributes. In this section, we introduce multiorder structures and the corresponding language. In this case, the MDL-problem is hard even when we only have two linear orders.

Definition 28 (2-Linear Order-Structure). Consider the universe U = X × Y, where both X and Y are linearly ordered, by ≤_1 and ≤_2 respectively. We define two orderings ≤_X and ≤_Y over the universe U as the lexicographical orderings along X and Y, respectively. Formally,

(x, y) ≤_X (x′, y′) ⇐⇒ (x ≤_1 x′) ∧ ((x <_1 x′) ∨ (y ≤_2 y′)),
(x, y) ≤_Y (x′, y′) ⇐⇒ (y ≤_2 y′) ∧ ((y <_2 y′) ∨ (x ≤_1 x′)).

We refer to this specific structure (U, ≤_X, ≤_Y) as the 2-linear order-structure, since both ≤_X and ≤_Y are linear. The 2-linear order-structure is the counterpart of the 2D product structure defined in Definition 14. Recall that we have shown that the MDL-problem for a 2D structure is NP-hard even though the cover structure is made up of two simple partitions, each of which is on its own tractable. We will see that the same type of complexity increase seems to hold for the 2-linear order-structure.
Though linear orders individually yield a tractable MDL-problem, together the L MDL-problem for the 2-linear order-structure is hard. We first identify a sublanguage L_≤^{P+} ⊆ L for the 2-linear structure that has an NP-complete MDL-problem. It is very much similar to the disjunctive product language (Definition 16).

Definition 29 (Some Product Sublanguages). We define L_≤^{P+}(U, ≤_X, ≤_Y), or simply L_≤^{P+}, as

— ε ∈ L_≤^{P+},
— ((a →_X b) · (a →_Y b)) ∈ L_≤^{P+} if a and b are in X × Y,
— (s + t) ∈ L_≤^{P+} if s, t ∈ L_≤^{P+}, and
— nothing else is in L_≤^{P+}.

A natural generalization of L_≤^{P+} is to allow (s + t), (s − t), and (s · t) as part of the language. We call the more general language L_≤^P.

Expressions of the form (a →_X b) · (a →_Y b) are really descriptions of rectangles. Since a ∈ X × Y, it is a pair (a_1, a_2) (though still one symbol), and the same holds for b, b = (b_1, b_2). The points expressed by (a →_X b) · (a →_Y b) are exactly those in {(x, y) ∈ X × Y : a_1 ≤_1 x ≤_1 b_1 and a_2 ≤_2 y ≤_2 b_2}. The points a and b are then the bounding corners of this rectangle. This connection of expressions to unions of rectangles leads to an immediate result.

PROPOSITION 14. The L_≤^{P+}-MDL-problem is NP-complete.

PROOF. It is directly reducible from the rectangle covering problem [Keil 1999; Garey and Johnson 1979].

The expressions in the more general language L_≤^P also have a geometric interpretation: they are the general rectangle decompositions of axis-aligned polygons, allowing set union, difference, and intersection. Generalized polygon decomposition has been studied in Tor and Middleditch [1984] and Batchelor [1980], in the context of using only components that are convex polygons. So far, we are not aware that anyone has shown whether this more general decomposition problem is NP-hard or not.
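The rectangle reading of (a →_X b) · (a →_Y b) can be checked mechanically on a small grid. The following sketch (our own illustrative names, a 5 × 5 integer grid) materializes the two lexicographic orders of Definition 28 and verifies that the intersection of the two intervals is exactly an axis-aligned rectangle.

```python
# A small check, on a 5 x 5 integer grid, that (a ->_X b) . (a ->_Y b)
# under the lexicographic orders of Definition 28 denotes exactly the
# axis-aligned rectangle with bounding corners a and b. Names are ours.

GRID = [(x, y) for x in range(5) for y in range(5)]

def le_X(p, q):
    """(x, y) <=_X (x', y'): lexicographic along X."""
    (x, y), (x2, y2) = p, q
    return x <= x2 and (x < x2 or y <= y2)

def le_Y(p, q):
    """(x, y) <=_Y (x', y'): lexicographic along Y."""
    (x, y), (x2, y2) = p, q
    return y <= y2 and (y < y2 or x <= x2)

def interval(a, b, le):
    """[a -> b] = {c : a <= c and c <= b} under the given order."""
    return {c for c in GRID if le(a, c) and le(c, b)}

def rect_expr(a, b):
    """[(a ->_X b) . (a ->_Y b)]: intersection of the two intervals."""
    return interval(a, b, le_X) & interval(a, b, le_Y)

def rectangle(a, b):
    """The rectangle with corners a and b, computed coordinate-wise."""
    (a1, a2), (b1, b2) = a, b
    return {(x, y) for (x, y) in GRID if a1 <= x <= b1 and a2 <= y <= b2}
```

Neither lexicographic interval alone is a rectangle; it is their intersection that carves out exactly {(x, y) : a_1 ≤ x ≤ b_1, a_2 ≤ y ≤ b_2}, for every pair of corners a ≤ b coordinate-wise.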
We believe that the MDL-problems for L_≤^P and for the most general language L for the 2-linear order-structure are NP-hard.

4. APPLICATIONS OF COMPACT EXPRESSIONS

We give two examples of practical applications of compact expressions: summarization of large query answers, and the application used as motivation in the Introduction, namely the reduction in length of SELECT queries in a relational-OLAP system.

The MDL principle has been proposed as a guiding principle for the summarization of large query results in data mining applications [Agrawal et al. 1998; Lakshmanan et al. 1999]. Our theory of compact expressions is clearly applicable to concise summarization of hierarchical query results. We demonstrate how compact expressions can be used in summarizing keyword-search results in a hierarchically organized data format, such as XML.

Fig. 12. An XML document with each node having a unique name.

Fig. 13. The tree representing the XML document.

We then propose to view an OLAP data cube as a mapping whose domain is a structured set, and queries as expressions of subsets of the domain. We argue that it is generally beneficial to rewrite the query to be as short as possible, that is, to express the subset of the domain using a compact expression. Although, in general, shorter queries are not necessarily faster, for the family of simple SELECT queries in a typical relational-OLAP storage setting, compact expressions have a performance advantage.

4.1 Summarizing Keyword-Search Results

We view an XML document as a labeled tree with the leaf nodes being the content. We assume that each node has a unique name. For instance, for the XML document in Figure 12, the corresponding tree is as shown in Figure 13. The structured set corresponding to the tree has the leaf nodes as the elements of the universe, and the names of the higher nodes as the alphabet Σ.
The interpretation of a symbol is simply the set of its descendant leaf nodes. The result of a keyword search for the word "WINTER" is {Section:1.1, Section:1.2, Section:1.3}, which is compactly expressed by "Chapter:1", while a keyword search for the word "IS" results in {Section:1.1, Section:1.3, Section:2.1}. By the decomposition algorithm (Theorem 4), its compact expression is "Book:root − Section:1.2"; the word "IS" is found everywhere in the book except Section:1.2. One way of summarizing hierarchical data is by the lowest common ancestors (LCA) of the answer set [Lakshmanan et al. 1999], that is, the set of closest ancestor nodes whose descendants are exactly the answer set. LCA-based summarization would give as a representation "Section:1.1 + Section:1.3 + Chapter:2", which is less informative in this case.

4.2 Compact Expressions as Better OLAP SELECT Queries

In this section, we focus on SELECT queries in relational OLAP (ROLAP). In ROLAP scenarios, an OLAP cube is stored in a database using tables with a star schema. A SELECT query specifying the restrictions on each dimension is then an SQL query of the form:

SELECT measure FROM cube
WHERE dim1 IN (· · ·) AND dim2 IN (· · ·) AND dim3 IN (· · ·) · · ·

In many cases, this SQL statement can be very long, and consequently problematic for many back-end relational database management systems (RDBMSs).³ We argue that rewriting the predicates using compact expressions will alleviate this problem. First we model OLAP dimensions as hierarchically structured sets, and OLAP cubes as functions whose domain is a multidimensional hierarchy.

4.2.1 Modeling OLAP Cubes and Selection Queries Using Structures and Expressions. There has been a plethora of formal models of OLAP databases: Gyssens and Lakshmanan [1997], Hurtado and Mendelzon [2002], Agrawal et al.
[1997], and Cabibbo and Torlone [1997, 1998]. These models are formal in the sense that they provide precise semantics for the data model and the query language. The focus is on the expressiveness of the model and the query language, but not on the performance issues of query execution. In this section, we show how structured sets and their languages can be used to model basic multidimensional databases and their query languages.

Similarly to the approaches in Agrawal et al. [1997] and Cabibbo and Torlone [1998], we model an OLAP dataset as a function. Formally, a dataset is a function D : U → M ∪̇ {null}, where the domain U is the universe of a structured set (U, Σ), and the codomain M holds the measure values. If a point u ∈ U does not have a measure, then D(u) = null.

If (U, Σ) is a multidimensional hierarchy, then we say that D is a multidimensional hierarchical cube, or simply a cube. In this case, U = D_1 × D_2 × · · · × D_N and Σ = ∪̇_{1≤i≤N} Σ_i, such that each (D_i, Σ_i), and hence (U, Σ_i), is a hierarchy. We call the (D_i, Σ_i) the dimensions and N the dimensionality of D.

To better illustrate how this captures an OLAP cube, let us consider the following example. Consider an OLAP cube with three dimensions, TIME, PRODUCT, and STORE, as shown in Figure 14. There are two measures: Total Sales and Sales Count. We will refer to this cube as SALES. To model this OLAP cube, the dimensions themselves are structured sets: PRODUCT is a granular hierarchy with alphabet Σ_Product = Name ∪̇ Family. Similarly, the alphabet for the STORE dimension is Σ_Store = Street ∪̇ City. Finally, the TIME dimension is not a hierarchy, but a linear order-structure. To represent a data cube, we define the domain (U, Σ) to be the product structure:

³ In the experience of one of the authors, relational back ends will often poorly execute or even reject SQL queries of excessive length, and thus some ROLAP implementations break them down into multiple queries, with a consequent increase in overhead.
U = Month × Name × Street. The alphabet is Σ = Σ_Time ∪ Σ_Product ∪ Σ_Store. The interpretation is the usual one when forming a product structure, as was done in Section 2.3. The codomain is M = ℝ × ℕ, containing pairs of real and natural numbers for the Total Sales and Sales Count. A data cube is then D : U → M.

Fig. 14. The dimensions of SALES.

We consider only the simple selection queries: a subset V of the domain is specified in a query q, and the answer to the query is the function D|V, the restriction of the function D to the subset V. The propositional language L(U, Σ), or the product sublanguages L^{P+}(U, Σ) and L^P(U, Σ), can be used to describe the subset V. We write q(s) to indicate that the expression s represents the region of interest for the query q. The answer for q(s) is then D|[s].

For example, let s_1 = 01 · Beverage · Canal and s_2 = 01 · (Soda + Beer + Spirits) · Canal. Since s_1 and s_2 are equivalent, the queries q(s_1) and q(s_2) will have the same answer: the sales information on beverage products for January for the store on Canal street.

01 Soda Canal – –
01 Beer Canal – –
01 Spirits Canal – –

4.2.2 Single and Multi-OLAP Query Optimization. Given a query q(V) for some subset V ⊆ U, the objective of optimization is to find a compact expression s for V such that the query can be expressed as q(s). A region of interest V is typically generated automatically by report-writing software⁴ and is often expressed as an explicit list of elements of U. When the back-end storage of the OLAP cube is a relational database, evaluating the query q(V) can be cumbersome and inefficient. We shall soon see that there is a performance advantage in evaluating a compact expression of V when the storage is a relational database. Since (U, Σ) is a multidimensional hierarchical structure, computing s is in itself an NP-complete problem (Theorem 6).
However, in the special case that V = A_1 × A_2 × · · · × A_N, where each A_i is a subset of the dimension D_i, the problem can be solved efficiently (Theorem 4). We can compute a compact expression s_i for each A_i, and then express V by s_1 · s_2 · · · s_N. This expression of V is economical both in length and in its evaluation, since it makes use of higher-level symbols in each dimension as much as possible, which is beneficial to the query execution. A minor problem is that, by its strict definition, L^P(U, Σ) does not include the range operator →, and hence we cannot officially take advantage of an ordered dimension D_k, such as TIME for the cube SALES. This problem can be overcome if we carefully introduce → into the language, allowing σ → σ′ only when σ and σ′ are in the universe of an ordered dimension D_k. We would then have the subexpression s_k in s be L(D_k, ≤)-compact. In practice, we only consider dimensions that are linearly ordered structures, so s_k is easily constructed (by Corollary 7).

Consider our sample cube SALES with the dimensions in Figure 14. Typically, when stored in a relational database, the OLAP cube would be mapped to a star schema with dimension tables TIME(Month), PRODUCT(Family, Name), and STORE(City, Street). The dimension tables would look as in Figure 14. The measures are stored in a fact table FACT(Month, Name, Street, TotalSales, SalesCount). For convenience, we assume that all the dimension tables are joined with the fact table to form a full view OLAPVIEW(Month, Family, Name, City, Street, TotalSales, SalesCount).

⁴ Available from MicroStrategy and Brio. Go online to www.microstrategy.com and www.brio.com.
A report on the sales information for the products in Beverage, for the stores on streets Canal and Grand in New York, for the first four months of the year, has the region V = {Jan, Feb, Mar, Apr} × {Soda, Beer, Spirits} × {Canal, Grand}. A naive translation of q(V) into SQL would result in a needlessly long statement:

SELECT * FROM OLAPVIEW
WHERE Month IN ('01', '02', '03', '04')
AND Name IN ('Soda', 'Beer', 'Spirits')
AND Street IN ('Canal', 'Grand');

The L^P-compact expression of V is s = (01 → 04) · Beverage · (NewYork − Broadway). The corresponding SQL for q(s) is

SELECT * FROM OLAPVIEW
WHERE (Month BETWEEN '01' AND '04')
AND Family = 'Beverage'
AND City = 'NewYork' AND Street <> 'Broadway';

Note that the SQL statement for q(s) makes use of higher-level symbols (such as Beverage and NewYork) instead of the lower-level symbols (Soda, Beer, …). This is because the algorithm tries to rewrite the expression using as many higher-level symbols as possible without increasing the length of the expression. Since there are fewer symbols in the higher level than in the lower level, the higher-level indices are smaller and can be accessed more quickly; the index-access time for the rewritten query is therefore cut down. For instance, there are three symbols at the Family level but 10 at the Name level, which means that evaluating Family = 'Beverage' is faster than evaluating Name IN ('Soda', 'Beer', 'Spirits'), provided that proper indices are built. The same applies to the City and Street predicates.

Given a family of queries {q(V_i)}, it is advantageous to consider them simultaneously and to form the amalgamated query q(∪_i V_i). From the minimal-length point of view, this is motivated by the simple fact that ‖∪_i V_i‖ ≤ Σ_i ‖V_i‖, that is, the descriptive length of the union is always at most as long as the sum of the descriptive lengths of the individual parts.
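The single-query, per-dimension rewriting illustrated above can be sketched as follows. The dimension metadata and helper names are our own, modeled on Figure 14; each A_i is rewritten independently, replacing a full family of low-level symbols by its parent symbol, a near-full family by the parent plus exclusions, and a contiguous run in an ordered dimension by a BETWEEN range.

```python
# Sketch of the per-dimension predicate rewriting discussed above.
# The PRODUCT/STORE/MONTHS metadata is illustrative, modeled on
# Figure 14; the helper names are our own assumptions.

PRODUCT = {"Beverage": {"Soda", "Beer", "Spirits"},
           "Snack": {"Chips", "Nuts"}}
STORE = {"NewYork": {"Canal", "Grand", "Broadway"},
         "Boston": {"Main"}}
MONTHS = [f"{m:02d}" for m in range(1, 13)]    # a linearly ordered dimension

def rewrite_hier(col_low, col_high, subset, hierarchy):
    """Prefer a parent-level symbol, or parent minus exclusions, over IN (...)."""
    for parent, children in hierarchy.items():
        if subset == children:                       # whole family selected
            return f"{col_high} = '{parent}'"
        if subset < children and len(children - subset) < len(subset):
            excl = " AND ".join(f"{col_low} <> '{c}'"
                                for c in sorted(children - subset))
            return f"{col_high} = '{parent}' AND {excl}"
    vals = ", ".join(f"'{v}'" for v in sorted(subset))
    return f"{col_low} IN ({vals})"

def rewrite_linear(col, subset, order):
    """Replace a contiguous run in an ordered dimension by BETWEEN."""
    idx = sorted(order.index(v) for v in subset)
    if idx == list(range(idx[0], idx[-1] + 1)):      # one maximal segment
        return f"{col} BETWEEN '{order[idx[0]]}' AND '{order[idx[-1]]}'"
    vals = ", ".join(f"'{order[i]}'" for i in idx)
    return f"{col} IN ({vals})"

where = " AND ".join([
    rewrite_linear("Month", {"01", "02", "03", "04"}, MONTHS),
    rewrite_hier("Name", "Family", {"Soda", "Beer", "Spirits"}, PRODUCT),
    rewrite_hier("Street", "City", {"Canal", "Grand"}, STORE),
])
```

Here `where` comes out as the short predicate of q(s): Month BETWEEN '01' AND '04' AND Family = 'Beverage' AND City = 'NewYork' AND Street <> 'Broadway'.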
The possibility of a reduction in length lies in the potential overlap among the V_i. Practically, computing a compact expression for ∪_i V_i is difficult since, even though each V_i is a Cartesian product, ∪_i V_i is hardly ever a Cartesian product, rendering our single-query optimization ineffective. We must resort to heuristic approaches. As mentioned in Section 5.1, some existing heuristics can be applied, such as the greedy growth algorithm in Agrawal et al. [1998] or the polygon covering algorithm in Kumar and Ramesh [1999]. Unfortunately, they all assume linearly ordered dimensions, so the approximation factor given in Kumar and Ramesh [1999] does not hold.

4.3 Performance Considerations

It is clear that our proposed optimization techniques require additional, and possibly intensive, access to the dimensional structures. Often these structures are mapped to dimension tables which, along with the fact table, are stored in a relational database. This means that traversal of the dimension hierarchies is costly. So, to make the optimization overhead minimal and this approach practical, we need to index the dimensions using tree-based native data structures. The dimensions are usually of manageable size and slowly varying, making it possible for them to be more heavily indexed. The fact table, however, is fast changing and much larger in size, potentially spanning multiple remote storage servers. With this in mind, it is reasonable to expend effort on optimizing the retrieval queries by exploiting the structural information of the dimensions. Indeed, many commercially available OLAP systems coincide with this type of architecture. For instance, ESSBASE⁵ has a separate dimension index which is much smaller and faster to access than the data page file.

5. RELATED WORK

5.1 Some Related MDL Problems

The idea of minimal description length has been a classical theme in the areas of machine learning [Lam and Bacchus 1994] and statistics [Hansen and Yu 2001].
There, the motivation is to select a model that adequately explains the observations while having an economical representation. In computational geometry, the interest behind the polygon covering problem is to represent a given polygon using a minimal number of simpler components [Keil 1999]. Work more relevant to this article has been done with an emphasis on the compact representation of data sets [Lodi et al. 1979; Edmonds et al. 2001; Agrawal et al. 1998; Lakshmanan et al. 1999].

⁵ Available from Hyperion. Go online to www.essbase.com.

In Agrawal et al. [1998] and Lakshmanan et al. [1999], the authors were interested in the succinct summarization of query answers for multidimensional databases. This is especially important for complex queries, such as those in data mining, because the user needs to easily comprehend the result of the query. The general guideline for improving the comprehensibility of the query result is to reduce its descriptive length. In Agrawal et al. [1998], a set of clusters is identified, and each cluster is basically a subset of a Cartesian product space. It is assumed that each dimension is numerical and discrete, and hence linearly ordered. The authors proposed to describe each cluster using a disjunctive normal form (DNF) expression. For instance,

((30 ≤ age ≤ 50) ∧ (4K ≤ salary ≤ 8K)) ∨ ((40 ≤ age ≤ 60) ∧ (2K ≤ salary ≤ 6K))

is a cluster in the two-dimensional space of age and salary. In the framework of structured sets, the universe is the Cartesian product of the values of age and salary, equipped with the two orders ≤_age and ≤_salary. A cluster is then a subset to be represented, and the DNF representations are expressions in L^{P+}(U, ≤_age, ≤_salary). Minimizing the expression is of course NP-hard, as it coincides with the well-studied rectilinear polygon covering problem.
Dimensions (such as geography or product) that do not have natural orderings but are categorically structured are treated as ordered dimensions. A simple greedy growth algorithm was proposed, in which rectangles are grown until they are bounded by the boundary of the dataset. Lakshmanan and colleagues continued to examine the compact representation of multidimensional subsets in Lakshmanan et al. [1999], where they relaxed the accuracy of the representation in order to gain a reduction in the descriptive length. Some points in the product space are "blue"; these are the points to be represented. Some are "red"; these must not be included in the representation. The rest are "white," which are considered harmless but unnecessary when included in the representation. As part of the problem, there is an upper limit on the number of white points that can be included. When the limit is set to zero, the problem reduces to the minimal DNF expressions of Agrawal et al. [1998]. A number of heuristic algorithms were given to solve the MDL-problem when dimensions are spatial. Also in Lakshmanan et al. [1999], the authors considered the case when all the dimensions are hierarchical. A polynomial-time algorithm was given to solve the MDL-problem for hierarchical dimensions. Interestingly enough, this algorithm finds the "optimal" expression. This is in sharp contrast with our NP-hardness results for multidimensional hierarchical structures in Theorem 6. This seemingly paradoxical disagreement comes from the fact that, in Lakshmanan et al. [1999], the language is much more restricted than even L_P^+ for multidimensional hierarchical structures. Specifically, it does not allow general product expressions, which are the source of the complexity. These algorithms, such as greedy growth in Agrawal et al. [1998], can be adapted to handle unordered dimensions, and can therefore serve as heuristics for our L_P^+ MDL-problem for multidimensional cover structures.
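To make the greedy growth idea concrete, here is a minimal Python sketch for the two-dimensional discrete case. It illustrates only the general strategy: the published algorithm's growth order, seed selection, and treatment of categorical dimensions are not reproduced, and all helper names are ours.

```python
# Simplified greedy growth (after the idea in Agrawal et al. [1998]):
# repeatedly pick an uncovered point of the set and grow an axis-aligned
# rectangle around it while every cell it covers stays inside the set.
# Growth order and tie-breaking here are simplifying assumptions.

def grow(points, seed):
    x0, y0 = seed
    x1, y1 = seed
    changed = True
    while changed:
        changed = False
        # try to extend one step: left, down, right, up
        for dx0, dy0, dx1, dy1 in ((-1, 0, 0, 0), (0, -1, 0, 0),
                                   (0, 0, 1, 0), (0, 0, 0, 1)):
            nx0, ny0, nx1, ny1 = x0 + dx0, y0 + dy0, x1 + dx1, y1 + dy1
            cells = {(x, y) for x in range(nx0, nx1 + 1)
                            for y in range(ny0, ny1 + 1)}
            if cells <= points:            # rectangle still inside the set
                x0, y0, x1, y1 = nx0, ny0, nx1, ny1
                changed = True
    return (x0, y0), (x1, y1)

def greedy_cover(points):
    uncovered, rects = set(points), []
    while uncovered:
        lo, hi = grow(points, min(uncovered))   # deterministic seed choice
        rects.append((lo, hi))
        uncovered -= {(x, y) for x in range(lo[0], hi[0] + 1)
                             for y in range(lo[1], hi[1] + 1)}
    return rects

# An L-shaped set is covered by two (possibly overlapping) rectangles:
L = {(0, 0), (0, 1), (0, 2), (1, 0), (2, 0)}
print(greedy_cover(L))
```

Because expressions in L_P^+ denote unions, overlapping rectangles are harmless; only the number of terms matters for the descriptive length.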
For multilinear order structures, which correspond to the case when the dimensions are linearly ordered, the MDL-problem is essentially the polygon decomposition problem. Much is known about the approximation of rectilinear polygon covering [Kumar and Ramesh 1999; Levcopoulos and Gudmundsson 1997].

246 • K. Q. Pu and A. O. Mendelzon

Kumar and Ramesh provided an approximation algorithm that covers a given rectilinear polygon with rectangles. It was shown to approximate the optimum within a factor of O(√(log n)), where n is the minimum of the number of vertical and horizontal edges. Since the trivial reduction of the L_P^+(U, ≤_X, ≤_Y) MDL-problem from the rectilinear polygon covering problem preserves the approximation factor, it too can be approximated to a factor of O(√(log n)). We do not know the exact complexity of the MDL-problem for L_P(U, ≤_X, ≤_Y) (nor for the even more general language L(U, ≤_X, ≤_Y)), but it corresponds to the generalized polygon covering problem, where both set union and set difference are allowed. Some algorithms [Batchelor 1980; Tor and Middleditch 1984; Keil 1999] exist for generalized polygon decomposition, but none necessarily gives the minimal cover.

5.2 Some Related Query Optimization Techniques

Multi-OLAP query optimization has been considered by Liang and Orlowska [2000], Zhao et al. [1998], and Kalnis and Papadias [2001]. The underlying model for an OLAP cube is relational, and they considered very low-level physical data access costs, such as input/output cost, table join cost, and table scan cost. The motivation for multiquery optimization in such a setting is that the access plans may share a common set of physical operations, which can be executed once if all queries are evaluated simultaneously. The authors indicated, though did not explicitly prove, that this is in general NP-hard.
Our discussion in Section 4.2.2 relies on the given model of multidimensional databases and a very simplified cost model: the length of the expression for the query. But the conclusion is the same: redundancy ought to be removed, but removing it maximally is intractable. This indicates that the query length, though not so rigorously justified, is a good measure to optimize and can serve as a guideline in OLAP query optimization. It is important to point out that in Liang and Orlowska [2000], Zhao et al. [1998], and Kalnis and Papadias [2001], the content of the dimensions is not considered by the optimizer, as it is thought of as part of the database. Our computation of query expression reduction, in contrast, makes explicit use of the dimensional structures. We argue that, in OLAP applications, this is valid because of the slowly varying nature of dimensions. Using dimensions to rewrite a multidimensional query also appeared in Park et al. [2001], in which the authors took the dimension tables into account while rewriting the query.

6. CONCLUSION AND FUTURE WORK

We have defined structured sets, languages expressing their subsets, and the corresponding MDL-problems. The two types of structures we introduced are cover structures and order-structures; the former corresponds to categorical classification and the latter to sequential or partial ordering. In both cases, the MDL-problem is NP-complete in the most general setting. We further studied specialized instances of these structures. We restricted the cover structures to partitions and hierarchies, and the order-structures to linear orders, and showed that these restricted structures are simple in the sense that they enjoy enough algebraic regularity that their MDL-problems can be solved in polynomial time. However, when there are two dimensions, each of which is simple, the resulting MDL-problem becomes hard.
We have shown that the MDL-problem associated with two-dimensional partitions is NP-complete. In the case of a two-dimensional linearly ordered structure, we demonstrated that the MDL-problem with respect to the syntactically restricted language L_P^+ corresponds to the rectangular covering problem, which is well known to be NP-complete; but the complexity of the MDL-problem over the unrestricted language remains unknown. We summarize as follows (P = polynomial time, NPC = NP-complete):

                       Cover structure                 Order-structure
                   Partition  Hierarchy  General      Linear  Partial
One-dimensional        P          P        NPC           P      NPC
Multidimensional      NPC        NPC       NPC           ?      NPC

Structured sets arise naturally in databases. We have seen that simple XML documents can be viewed as hierarchically structured sets and OLAP cubes as multidimensionally structured sets. We showed that compact expressions are useful in succinctly summarizing query answers and in the query optimization of SELECT queries in OLAP. So far, only very simple multidimensional database structures, in which dimensions are either hierarchical or linearly ordered, have been considered. It would be desirable to generalize this. For instance, in the TIME dimension, we only considered the chronological linear order for months, but months are also hierarchically organized into quarters and then years, making TIME a hybrid between a cover structure and an order-structure. Of course, the most general case of such structures has an intractable MDL-problem, and we are interested in finding a reasonably relaxed class of tractable hybrid structures and providing an algorithm for computing compact expressions in these structures. We have only considered simple selection OLAP queries. Our future work will include incorporating aggregation and enriching the language of structured sets with more constructs, so that a larger class of queries can be encompassed by the framework.

ACKNOWLEDGMENTS

We thank the anonymous referees for their helpful comments.
REFERENCES

AGRAWAL, R., GEHRKE, J., GUNOPULOS, D., AND RAGHAVAN, P. 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of SIGMOD 1998. ACM Press, New York, NY, 94–105.
AGRAWAL, R., GUPTA, A., AND SARAWAGI, S. 1997. Modeling multidimensional databases. In Proceedings of ICDE 1997. 232–243.
BATCHELOR, B. 1980. Hierarchical shape description based on convex hulls of concavities. J. Cybernet. 10, 205–210.
CABIBBO, L. AND TORLONE, R. 1997. Querying multidimensional databases. In Proceedings of the 6th DBPL Workshop. 319–335.
CABIBBO, L. AND TORLONE, R. 1998. A logical approach to multidimensional databases. In Proceedings of EDBT 1998. 183–197.
EDMONDS, J., GRYZ, J., LIANG, D., AND MILLER, R. J. 2001. Mining for empty rectangles in large data sets. In Proceedings of ICDT 2001. 174–188.
GAREY, M. R. AND JOHNSON, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York, NY.
GYSSENS, M. AND LAKSHMANAN, L. 1997. A foundation for multi-dimensional databases. In Proceedings of VLDB 1997. 106–115.
HANSEN, M. H. AND YU, B. 2001. Model selection and the principle of minimum description length. J. Amer. Statist. Assoc. 96, 454, 746–774.
HURTADO, C. AND MENDELZON, A. 2002. OLAP dimensional constraints. In Proceedings of PODS 2002. 169–179.
KALNIS, P. AND PAPADIAS, D. 2001. Optimization algorithms for simultaneous multidimensional queries in OLAP environments. Lecture Notes in Computer Science, vol. 2114. Springer-Verlag, Berlin, Germany, 264–273.
KEIL, J. 1999. Polygon decomposition. In Handbook of Computational Geometry. Elsevier Science, Amsterdam, The Netherlands, Chap. 11, 491–518.
KIMBALL, R. 1996. The Data Warehouse Toolkit. Wiley, New York, NY.
KUMAR, V. S. A. AND RAMESH, H. 1999. Covering rectilinear polygons with axis-parallel rectangles.
In Proceedings of the ACM Symposium on Theory of Computing 1999. ACM Press, New York, NY, 445–454.
LAKSHMANAN, L., NG, R. T., WANG, C. X., ZHOU, X., AND JOHNSON, T. J. 1999. The generalized MDL approach for summarization. In Proceedings of VLDB 1999. 445–454.
LAM, W. AND BACCHUS, F. 1994. Learning Bayesian belief networks: An approach based on the MDL principle. Comput. Intel. 10, 269–293.
LEVCOPOULOS, C. AND GUDMUNDSSON, J. 1997. Approximation algorithms for covering polygons with squares and similar problems. In Proceedings of the International Workshop on Randomization and Approximation Techniques in Computer Science. Lecture Notes in Computer Science, vol. 1269. Springer, Berlin, Germany, 27–41.
LIANG, W. AND ORLOWSKA, M. 2000. Optimizing multiple dimensional queries simultaneously in multidimensional databases. VLDB J. 8, 319–338.
LODI, E., LUCCIO, F., MUGNAI, C., AND PAGLI, L. 1979. On two-dimensional data organization I. Fundam. Inform. 2, 211–226.
PARK, C., KIM, M., AND LEE, Y. 2001. Rewriting OLAP queries using materialized views and dimension hierarchies in data warehouses. In Proceedings of ICDE 2001. 515–523.
TOR, S. AND MIDDLEDITCH, A. 1984. Convex decomposition of simple polygons. ACM Trans. Graph. 3, 244–265.
ZHAO, Y., DESHPANDE, P., NAUGHTON, J., AND SHUKLA, A. 1998. Simultaneous optimization and evaluation of multiple dimensional queries. In Proceedings of SIGMOD 1998. 271–282.

Received November 2003; revised June 2004; accepted September 2004

What's Hot and What's Not: Tracking Most Frequent Items Dynamically

GRAHAM CORMODE and S. MUTHUKRISHNAN
Rutgers University

Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the "hot items" in the relation: those that appear many times (most frequently, or more than some threshold).
For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in many applications. We present new methods for dynamically determining the hot items at any time in a relation which is undergoing deletion operations as well as inserts. Our methods maintain small-space data structures that monitor the transactions on the relation and, when required, quickly output all hot items without rescanning the relation in the database. With user-specified probability, all hot items are correctly reported. Our methods rely on ideas from "group testing." They are simple to implement, and have provable quality, space, and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithms are accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications
General Terms: Algorithms, Measurement
Additional Key Words and Phrases: Data stream processing, approximate query answering

The first author was supported by NSF ITR 0220280 and NSF EIA 02-05116; the second author was supported by NSF EIA 0087022, NSF ITR 0220280, and NSF EIA 02-05116. This is an extended version of an article which originally appeared as Cormode and Muthukrishnan [2003]. Authors' current addresses: G. Cormode, Room 2B-315, Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974; email: graham@dimacs.rutgers.edu; S. Muthukrishnan, Room 319, CoRE Building, Department of Computer and Information Sciences, 110 Frelinghuysen Road, Piscataway, NJ 08854; email: muthu@cs.rutgers.edu.
© 2005 ACM 0362-5915/05/0300-0249 $5.00
ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 249–278.

1. INTRODUCTION

One of the most basic statistics on a database relation is that of which items are hot, that is, occur frequently; the set of hot items can change over time. This gives a useful measure of the skew of the data. High-biased and end-biased histograms [Ioannidis and Christodoulakis 1993; Ioannidis and Poosala 1995] specifically focus on hot items to summarize data distributions for selectivity estimation. Iceberg queries generalize the notion of hot items to aggregate functions over an attribute (or set of attributes) in order to find aggregate values above a specified threshold. Hot item sets in market data are influential in decision support systems. They also influence caching, load balancing, and other system performance issues. There are other areas—such as data warehousing, data mining, and information retrieval—where hot items find applications. Keeping track of hot items also arises in application domains outside traditional databases. For example, in telecommunication networks such as the Internet and the telephone network, it is of great importance for network operators to see meaningful statistics about the operation of the network.
Keeping track of which network addresses are generating the most traffic allows management of the network, as well as giving a warning sign if this pattern begins to change unexpectedly. This has been studied extensively in the context of anomaly detection [Barbara et al. 2001; Demaine et al. 2002; Gilbert et al. 2001; Karp et al. 2003]. Our focus in this article is on dynamically maintaining hot items in the presence of delete and insert transactions. In many of the motivating applications above, the underlying data distribution changes, sometimes quite rapidly. Transactional databases undergo insert and delete operations, and it is important to propagate these changes to the statistics maintained on the database relations in a timely and accurate manner. In the context of continuous iceberg queries, this is apt, since the iceberg aggregates have to reflect new data items that modify the underlying relations. In the networking application cited above, network connections start and end over time, and hot items change significantly over time. A thorough discussion by Gibbons and Matias [1999] described many applications for finding hot items and the challenges in maintaining them over a changing database relation. Also, Fang et al. [1998] presented an influential case for finding and maintaining hot items and, more generally, iceberg queries.

Formally, the problem is as follows. We imagine that we observe a sequence of n transactions on items. Without loss of generality, we assume that the item identifiers are integers in the range 1 to m. Throughout, we will assume the RAM model of computation, where all quantities and item identifiers can be encoded in one machine word. The net occurrence of any item x at time t, denoted n_x(t), is the number of times it has been inserted less the number of times it has been deleted. The current frequency of any item is then given by f_x(t) = n_x(t) / Σ_{y=1}^{m} n_y(t). The most frequent item at time t is the one with f_x(t) = max_y f_y(t). The k most frequent items at time t are those with the k largest values of f_x(t). We are interested in the related notion of frequent items that we call hot items. An item x is said to be a hot item if f_x(t) > 1/(k + 1), that is, if it accounts for a significant fraction of the entire dataset; here k is a parameter. Clearly, there can be at most k hot items, and there may be none. We assume throughout that a basic integrity constraint is maintained: n_x(t) is nonnegative for every item (the number of deletions never exceeds the number of insertions). From now on, we drop the index t, and all occurrences will be treated as being taken at the current timestep t.

Our main results are highly efficient, randomized algorithms for maintaining hot items. There are three important characteristics to consider: the space used, the time to update the data structure following each transaction (the update time), and the time to produce the hot items (the query time). Our algorithms monitor the changes to the data distribution and maintain O(k log(k) log(m))-space summary data structures. Processing each transaction takes time O(log(k) log(m)). When queried, we can find all hot items in time O(k log(k) log(m)) from the summary data structure, without scanning the underlying relation. Additionally, given a user-specified parameter ε, the algorithms return no items whose frequency is less than 1/(k + 1) − ε. More formally, for any user-specified probability δ, the algorithm succeeds with probability at least 1 − δ, as is standard in randomized algorithms.
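The definitions above can be made concrete with a naive exact tracker (our own illustration, not one of the article's algorithms): it stores every net count n_x explicitly, and therefore uses Θ(m) space, exactly the cost the summary structures in this article are designed to avoid. It remains useful as a correctness oracle when testing approximate methods.

```python
# Naive exact baseline for the hot-items definitions: keeps n_x for
# every item (Theta(m) space). Names here are illustrative.
from collections import defaultdict

class ExactHotItems:
    def __init__(self, k):
        self.k = k
        self.n = 0                      # total net occurrences
        self.counts = defaultdict(int)  # n_x for every item x

    def insert(self, x):
        self.counts[x] += 1
        self.n += 1

    def delete(self, x):
        assert self.counts[x] > 0       # integrity constraint: n_x >= 0
        self.counts[x] -= 1
        self.n -= 1

    def hot(self):
        # x is hot iff f_x = n_x / n > 1/(k+1); at most k such items exist
        return {x for x, c in self.counts.items()
                if self.n and c > self.n / (self.k + 1)}

t = ExactHotItems(k=2)          # hot = frequency above 1/3
for x in [1, 1, 1, 2, 3, 2, 2]:
    t.insert(x)
print(sorted(t.hot()))          # items 1 and 2 each have 3/7 > 1/3
for _ in range(3):
    t.delete(2)
print(sorted(t.hot()))          # after deletions, only item 1 has 3/4 > 1/3
```

Note how deletions can change which items are hot; this is precisely what makes sampling- and filter-based summaries insufficient in the dynamic setting.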
Since k is typically very small compared to the size of the data, our results here maintain small summary data structures—significantly sublinear in the dataset size—and accurately detect hot items at any time in the presence of the full repertoire of inserts and deletes. Despite extensive work on this problem (summarized in Section 2), most of the prior work with comparable guarantees works only for insert-only transactions. Prior work that deals with the fully general situation, where both inserts and deletes are present, cannot provide the guarantees we give without rescanning the underlying database relation. Thus, our result is the first provable result for maintaining hot items with small space.

A common approach to summarizing a data distribution or finding hot items relies on keeping samples of the underlying database relation. These samples—deterministic or randomized—can be updated if data items are only inserted, and can then faithfully represent the underlying relation. However, in the presence of deletes, in particular in cases where the data distribution changes significantly over time, samples cannot be maintained without rescanning the database relation. For example, if there are very many deletions, the entire set of sampled values may be erased from the relation by a sequence of deletes.

We present two different approaches for solving the problem. Our first result relies on random sampling to construct groups (O(k log(k)) sets) of items, which we further divide deterministically into a small number (log m) of subgroups. Our summary data structure comprises a sum of the items in each group and subgroup. The grouping is based on error-correcting codes, and the entire procedure may be thought of as "group testing," which is described in more detail later.
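The deterministic subgrouping can be illustrated in isolation. The sketch below is a simplified illustration (class and method names are ours): for a single group, it keeps one counter per bit position of the item identifier. If the group contains a majority item, its identifier can be decoded bit by bit, and deletions are handled symmetrically to insertions. The full scheme of this article applies such a gadget inside O(k log k) randomly chosen groups.

```python
# Simplified "group testing" gadget: one group over the whole domain,
# with log m subgroups defined by the bits of the item identifier.
# If the group holds a majority item, each of its bits wins its vote.

class MajorityGadget:
    def __init__(self, bits):
        self.bits = bits
        self.total = 0
        self.bit_counts = [0] * bits    # count of items with bit j set

    def update(self, x, delta):         # delta = +1 insert, -1 delete
        self.total += delta
        for j in range(self.bits):
            if (x >> j) & 1:
                self.bit_counts[j] += delta

    def candidate(self):
        # bit j of the majority item is 1 iff a majority of the group has it
        return sum((1 << j)
                   for j in range(self.bits)
                   if 2 * self.bit_counts[j] > self.total)

g = MajorityGadget(bits=4)
for x in [5, 3, 5, 7, 5]:
    g.update(x, +1)
g.update(3, -1)                 # a deletion is simply a negative update
print(g.candidate())            # 5 is the majority of the remaining items
```

The key property is that the counters form a linear function of the net counts, so insertions and deletions commute, which is exactly what the counter-based algorithms of Section 2.1.1 lack.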
The second result makes use of log m small-space "sketches" that act as oracles to approximate the count of any item or of certain groups of items, and uses an intuitive divide-and-conquer approach to find the hot items. This is a different style of group testing, and the two methods give different guarantees for the problem. We also give additional time and space tradeoffs for both methods, where the time to process each update can be reduced by constant factors at the cost of devoting extra space to the data structures. We perform a set of experiments on large datasets, which allow us to characterize further the advantages of each approach. We also see that, in practice, the methods given outperform their theoretical guarantees, and can operate very quickly using a small amount of space while still giving almost perfect results.

Once the hot items have been identified, a secondary problem is to approximate the counts n_x of these items. We do not focus on this problem, since there are many existing solutions which can be applied to the problem of estimating n_x for a given x in the presence of insertions and deletions [Gilbert et al. 2002b; Charikar et al. 2002; Cormode and Muthukrishnan 2004a]. However, we observe that for the solutions we propose, no additional storage is needed, since the information needed to estimate the counts of items is already present in the data structures that we propose. We will show how to estimate the counts of individual items, but we do not give experimental results, since experiments for these estimators can be found in prior work.

The rest of the article is organized as follows. In Section 2, we summarize previous work, which is rather extensive. In Sections 3 and 4 we present our algorithms and prove their guarantees, and we compare the different approaches in Section 5.
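The divide-and-conquer idea can be sketched with an exact range-count oracle standing in for the sketches (our illustration; in the actual algorithm the oracle is approximate and randomized): recurse into a dyadic range only while its total count could still contain a hot item.

```python
# Divide-and-conquer search for hot items over the domain [lo, hi],
# using a count oracle for dyadic ranges. With an exact oracle this is
# deterministic; substituting approximate sketches for `range_count`
# gives the small-space, randomized flavor described in the text.
from collections import Counter

def find_hot(lo, hi, range_count, threshold):
    """Return items in [lo, hi] whose total count exceeds threshold."""
    if range_count(lo, hi) <= threshold:
        return []                       # no hot item can hide in this range
    if lo == hi:
        return [lo]
    mid = (lo + hi) // 2
    return (find_hot(lo, mid, range_count, threshold) +
            find_hot(mid + 1, hi, range_count, threshold))

data = Counter({1: 6, 4: 1, 7: 5, 8: 1})        # net counts over items 1..8
n, k = sum(data.values()), 2
oracle = lambda lo, hi: sum(c for x, c in data.items() if lo <= x <= hi)
print(find_hot(1, 8, oracle, n / (k + 1)))      # hot items: [1, 7]
```

Since at most k items can be hot, at most k root-to-leaf paths survive the pruning, so the number of oracle calls is O(k log m).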
In Section 6, we present an experimental study of our algorithms using synthetic data as well as real network data addressing the application domain cited earlier, and show that our algorithms are effective and practical. Conclusions and closing remarks are given in Section 7.

2. PRELIMINARIES

If one is allowed O(m) space, then a simple heap data structure will process each insert or delete operation in O(log m) time and find the hot items in O(k log m) time in the worst case [Aho et al. 1987]. Our focus here is on algorithms that only maintain a summary data structure, that is, one that uses sublinear space as it monitors inserts and deletes to the data. In a fundamental article, Alon et al. [1996] proved that estimating f*(t) = max_x f_x(t) is impossible with o(m) space. Estimating the k most frequent items is at least as hard. Hence, research in this area studies related, relaxed versions of the problems. For example, finding hot items, that is, items each of which has frequency above 1/(k + 1), is one such related problem. The lower bound of Alon et al. [1996] does not directly apply to this problem. But a simple information-theoretic argument suffices to show that solving this problem exactly requires the storage of a large amount of information if we give a strong guarantee about the output. We provide the simple argument here for completeness.

LEMMA 2.1. Any algorithm which guarantees to find all and only items which have frequency greater than 1/(k + 1) must store Ω(m) bits.

PROOF. Consider a set S ⊆ {1, …, m}. Transform S into a sequence of n = |S| insertions of items by including x exactly once if and only if x ∈ S. Now process these transactions with the proposed algorithm. We can then use the algorithm to extract whether x ∈ S or not: for some x, insert ⌊n/k⌋ copies of x. Suppose x ∉ S; then the frequency of x is ⌊n/k⌋/(n + ⌊n/k⌋) ≤ ⌊n/k⌋/((k + 1)⌊n/k⌋) = 1/(k + 1), and so x will not be output.
On the other hand, if x ∈ S, then the frequency of x is (⌊n/k⌋ + 1)/(n + ⌊n/k⌋) > (n/k)/(n + n/k) = 1/(k + 1), and so x will be output. Hence we can extract the set S, and so the space stored must be Ω(m), since, by an information-theoretic argument, the space to store an arbitrary subset S is m bits.

Table I. Summary of Previous Results on Insert-Only Methods (LV (Las Vegas) and MC (Monte Carlo) are types of randomized algorithms. See Motwani and Raghavan [1995] for details.)

Algorithm                                 Type                          Time per item          Space
Lossy Counting [Manku and Motwani 2002]   Deterministic                 O(log(n/k)) amortized  Ω(k log(n/k))
Misra-Gries [Misra and Gries 1982]        Deterministic                 O(log k) amortized     O(k)
Frequent [Demaine et al. 2002]            Randomized (LV)               O(1) expected          O(k)
Count Sketch [Charikar et al. 2002]       Approximate, randomized (MC)  O(log(1/δ))            Ω((k/ε²) log n)

This also applies to randomized algorithms. Any algorithm which guarantees to output all hot items with probability at least 1 − δ, for some constant δ, must also use Ω(m) space. This follows by observing that the above reduction corresponds to the Index problem in communication complexity [Kushilevitz and Nisan 1997], which has one-round communication complexity Ω(m). If the data structure stored were o(m) in size, then it could be sent as a message, and this would contradict the communication complexity lower bound. This argument suggests that, if we are to use less than Ω(m) space, then we must sometimes output items which are not hot, since we will endeavor to include every hot item in the output. In our guarantees, we will instead guarantee that (with arbitrary probability) all hot items are output and no items which are far from being hot will be output. That is, no item which has frequency less than 1/(k + 1) − ε will be output, for some user-specified parameter ε.
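The reduction in the proof above can be run concretely. The sketch below (our illustration) plays the role of "any algorithm" with an exact counter and recovers membership in S from the hot-items test:

```python
# A concrete run of the Lemma 2.1 reduction: membership of x in S is
# recovered from whether x is reported hot after inserting floor(n/k)
# extra copies of x. An exact counter stands in for the algorithm.
from collections import Counter

def is_member(S, x, k):
    counts = Counter(S)                 # one insertion per element of S
    n = len(S)
    counts[x] += n // k                 # insert floor(n/k) copies of x
    total = n + n // k
    return counts[x] / total > 1 / (k + 1)   # the hot-item test

S = {2, 3, 5, 7, 11, 13}
print([x for x in range(1, 15) if is_member(S, x, 3)])
```

Since the query side recovers an arbitrary subset of {1, …, m}, any summary supporting the exact guarantee must hold m bits of information, which is the content of the lemma.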
2.1 Prior Work

Finding which items are hot is a problem with a history stretching back over two decades. We divide the prior results into groups: those which find frequent items by keeping counts of particular items; those which use a filter to test each item; and those which accommodate deletions in a heuristic fashion. Each of these approaches is explained in detail below. The most relevant works mentioned are summarized in Table I.

2.1.1 Insert-Only Algorithms with Item Counts. The earliest work on finding frequent items considered the problem of finding an item which occurred more than half of the time [Boyer and Moore 1982; Fischer and Salzberg 1982]. This procedure can be viewed as a two-pass algorithm: after one pass over the data, a candidate is found, which is guaranteed to be the majority element if any such element exists. A second pass verifies the frequency of the item. Only a constant amount of space is used. A natural generalization of this method to find items which occur more than n/k times in two passes was given by Misra and Gries [1982]. The total time to process n items is O(n log k), with space O(k) (recall that we assume throughout that any item label or counter can be stored in constant space). In the Misra and Gries implementation, the time to process any item is bounded by O(k log k), but this cost is only incurred O(n/k) times, giving the amortized time bound. The first pass generates a set of at most k candidates for the hot items, and the second pass computes the frequency of each candidate exactly, so the infrequent items can be pruned out. It is possible to drop the second pass, in which case at most k items will be output, among which all hot items are guaranteed to be included.
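A minimal Python sketch of the two-pass Misra-Gries scheme described above (our rendering; the plain-dictionary implementation is ours, and the data structures that achieve the O(log k) amortized bound are not reproduced):

```python
# Two-pass Misra-Gries: k-1 counters guarantee that every item occurring
# more than n/k times survives pass one; pass two prunes false positives.

def misra_gries_pass1(stream, k):
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:                            # all counters busy: decrement all
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return set(counters)                 # superset of items with count > n/k

def misra_gries(stream, k):
    candidates = misra_gries_pass1(stream, k)
    n = len(stream)
    exact = {x: stream.count(x) for x in candidates}   # second pass
    return {x for x, c in exact.items() if c > n / k}

s = [1, 2, 1, 3, 1, 2, 1, 4, 1]
print(misra_gries(s, k=3))               # item 1 occurs 5 > 9/3 times
```

Each decrement step removes one occurrence of k distinct items (the arriving one plus k−1 counted ones), so an item with more than n/k occurrences can never be decremented to zero every time, which is why it must survive pass one.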
Recent interest in processing data streams, which can be viewed as one-pass algorithms with limited storage, has reopened interest in this problem (see surveys such as those by Muthukrishnan [2003] and Garofalakis et al. [2002]). Several authors [Demaine et al. 2002; Karp et al. 2003] have rediscovered the algorithm of Misra and Gries [1982] and, using more sophisticated data structures, have been able to process each item in expected O(1) time while still keeping only O(k) space. As before, the output is guaranteed to include all hot items, but some others will be included in the output, about which no guarantee of frequency is made. A similar idea was used by Manku and Motwani [2002] with the stronger guarantee of finding all items which occur more than n/k times and not reporting any that occur fewer than n(1/k − ε) times. The space required is bounded by O((1/ε) log n)—note that ε ≤ 1/k, and so the space is effectively Ω(k log(n/k)). If we set ε = c/k for some small c, then it requires time at worst O(k log(n/k)) per item, but this cost is incurred only once every 1/ε items, and so the total time is O(n log(n/k)). Another recent contribution was that of Babcock and Olston [2003]. This is not immediately comparable to our work, since their focus was on maintaining the top-k items in a distributed environment, and the goal was to minimize communication. Counts of all items were maintained exactly at each location, so the memory space was Ω(m). All of these mentioned algorithms are deterministic in their operation: the output is solely a function of the input stream and the parameter k.

All the methods discussed thus far have certain features in common: in particular, they all hold some number of counters, each of which counts the number of times a single item is seen in the sequence. These counters are incremented whenever their corresponding item is observed, and are decremented or reallocated under certain circumstances.
As a consequence, it is not possible to directly adapt these algorithms to the dynamic case, where items are deleted as well as inserted. We would like the data structure to have the same contents following the deletion of an item as if that item had never been inserted. But it is possible to insert an item so that it takes up a counter, and then later delete it: it is not possible to decide which item would otherwise have taken up this counter. So the state of the algorithm will differ from the state reached without the insertions and deletions of that item.

2.1.2 Insert-Only Algorithms with Filters. An alternative approach to finding frequent items is based on constructing a data structure which can be used as a filter; this has been suggested several times, with different ways of constructing such filters. The general procedure is as follows: as each item arrives, the filter is updated to reflect this arrival, and then the filter is used to test whether this item is above the threshold. If it is, then it is retained (for example, in a heap data structure). At output time, all retained items can be rechecked with the filter, and those which pass the filter are output. An important point to note is that, in the presence of deletions, this filter approach cannot work directly, since it relies on testing each item as it arrives. In some cases, the filter can be updated to reflect item deletions. However, it is important to realize that this does not allow the current hot items to be found: after some deletions, items seen in the past may become hot items. The filter method can only pick up items which are hot when they reach the filter; it cannot retrieve items from the past which have since become frequent. The earliest filter method appears to be due to Fang et al.
[1998], where it was used in the context of iceberg queries. The authors advocated a second pass over the data to count exactly those items which passed the filter. An article which has stimulated interest in finding frequent items in the networking community was by Estan and Varghese [2002], who proposed a variety of filters to detect network addresses which are responsible for a large fraction of the bandwidth. In both these articles, the analysis assumed very strong hash functions which exhibit “perfect” randomness. An important recent result was that of Charikar et al. [2002], who gave a filter-based method using only limited (pairwise) independent hash functions. These were used to give an algorithm to find k items whose frequency was at least (1 − ε) times the frequency of the kth most frequent item, with probability 1 − δ. If we wish to only find items with count greater than n/(k + 1), then the space used is O((k/ε²) log(n/δ)). A heap of frequent items is kept, and if the current item exceeds the threshold, then the least frequent item in the heap is ejected, and the current item inserted. We shall return to this work in Section 4.1, when we adapt and use the filter as the basis of a more advanced algorithm to find hot items. We will describe the algorithm in full detail, and give an analysis of how it can be used as part of a solution to the hot items problem.

2.1.3 Insert and Delete Algorithms. Previous work that studied hot items in the presence of both inserts and deletes is sparse [Gibbons and Matias 1998, 1999]. These articles proposed methods to maintain a sample of items and a count of the number of times each item occurs in the data set, and focused on the harder problem of monitoring the k most frequent items. These methods work provably for the insert-only case, but provide no guarantees for the fully dynamic case with deletions. However, the authors studied through experiments how effective these samples are for the deletion case. Gibbons et al.
[1997] presented methods to maintain various histograms in the presence of inserts and deletes using a “backing sample,” but these methods too need access to a large portion of the data periodically in the presence of deletes. A recent theoretical work presented provable algorithms for maintaining histograms with guaranteed accuracy and small space [Gilbert et al. 2002a].

256 • G. Cormode and S. Muthukrishnan

The methods in that article can yield algorithms for maintaining hot items, but the methods are rather sophisticated and use powerful range-summable random variables, resulting in k log^O(1) n space and time algorithms where the O(1) term is quite large. We draw some inspiration from the methods in that article—we will use ideas similar to the “sketching” developed in Gilbert et al. [2002a], but our overall methods are much simpler and more efficient. Finally, recent work in maintaining quantiles [Gilbert et al. 2002b] is similar to ours, since it keeps the sum of items in random subsets. However, our result is, of necessity, more involved: it uses a random group generation phase based on group testing, which was not needed in Gilbert et al. [2002b]. Also, once such groups are generated, we maintain sums of deterministic sets (in contrast to the random sets in Gilbert et al. [2002b]), given again by error-correcting codes. Finally, our algorithm is more efficient than the Ω(k² log² m) space and time algorithms given in Gilbert et al. [2002b].

2.2 Our Approach

We propose some new approaches to this problem, based on ideas from group testing and error-correcting codes. Our algorithms depend on ideas drawn from group testing [Du and Hwang 1993].
The idea of group testing is to arrange a number of tests, each of which groups together a number of the m items in order to ﬁnd up to k items which test “positive.” Each test reports either “positive” or “negative” to indicate whether there is a positive item among the group, or whether none of them is positive. The familiar puzzle of how to use a pan balance to ﬁnd one “positive” coin among n good coins, of equal weight, where the positive coin is heavier than the good coins, is an example of group testing. The goal is to minimize the number of tests, where each test in the group testing is applied to a subset of the items (a group). Our goal of ﬁnding up to k hot items can be neatly mapped onto an instance of group testing: the hot items are the positive items we want to ﬁnd. Group testing methods can be categorized as adaptive or nonadaptive. In adaptive group testing, the members of the next set of groups to test can be speciﬁed after learning the outcome of the previous tests. Each set of tests is called a round, and adaptive group testing methods are evaluated in terms of the number of rounds, as well as the number of tests, required. By contrast, nonadaptive group testing has only one round, and so all groups must be chosen without any information about which groups tested positive. We shall give two main solutions for ﬁnding frequent items, one based on nonadaptive and the other on adaptive group testing. For each, we must describe how the groups are formed from the items, and how the tests are performed. An additional challenge is that our tests here are not perfect, but have some chance of failure (reporting the wrong result). We will prove that, in spite of this, our algorithms can guarantee ﬁnding all hot items with high probability. The algorithms we propose differ in the nature of the guarantees that they give, and result in different time and space guarantees. 
In our experimental studies, we were able to explore these differences in more detail, and to describe the different situations to which each of these algorithms is best suited.

3. NONADAPTIVE GROUP TESTING

Our general procedure is as follows: we divide all items up into several (overlapping) groups. For each transaction on an item x, we determine which groups it is included in (denoting these G(x)). Each group is associated with a counter, and for an insertion we increment the counter for all of G(x); for a deletion, we correspondingly decrement these counters. The test will be whether the count for a subset exceeds a certain threshold: this is evidence that there may be a hot item within the set. Identifying the hot items is a matter of putting together the information from the different tests to find an overall answer. There are a number of challenges involved in following this approach: (1) bounding the number of groups required; (2) finding a concise representation of the groups; and (3) giving an efficient way to go from the results of tests to the set of hot items. We shall be able to address all of these issues. To give greater insight into this problem, we first give a simple solution to the k = 1 case, which is to find an item that occurs more than half of the time. Later, we will consider the more general problem of finding k > 1 hot items, which will use the procedure given below as a subroutine.

3.1 Finding the Majority Item

If an item occurs more than half the time, then it is said to be the majority item. While finding the majority item is mostly straightforward in the insertions-only case (it is solved in constant space and constant time per insertion by the algorithms of Boyer and Moore [1982] and Fischer and Salzberg [1982]), in the dynamic case it looks less trivial.
We might have identified an item which is very frequent, only for this item to be the subject of a large number of deletions, meaning that some other item is now in the majority. We give an algorithm to solve this problem by keeping log2 m + 1 counters. The first counter, c_0, merely keeps track of n(t) = Σ_x n_x(t), which is how many items are “live”: in other words, we increment this counter on every insertion, and decrement it on every deletion. The remaining counters are denoted c_1 · · · c_{log2 m}. We make use of the function bit(x, j), which reports the value of the jth bit of the binary representation of the integer x; and gt(x, y), which returns 1 if x > y and 0 otherwise. Our procedures are as follows:

Insertion of item x: increment each counter c_j such that bit(x, j) = 1, in time O(log m).
Deletion of x: decrement each counter c_j such that bit(x, j) = 1, in time O(log m).
Search: if there is a majority, then it is given by Σ_{j=1}^{log2 m} 2^{j−1} gt(c_j, n/2), computed in time O(log m).

The arrangement of the counters is shown graphically in Figure 1. The two procedures of this method—one to process updates, another to identify the majority element—are given in Figure 2 (where trans denotes whether the transaction is an insertion or a deletion).

THEOREM 3.1. The algorithm in Figure 2 finds a majority item, if there is one, with time O(log m) per update and search operation.

Fig. 1. Each test includes half of the range [1 · · · m], corresponding to the binary representation of values.
Fig. 2. Algorithm to find the majority element in a sequence of updates.

PROOF. We make two observations: first, that the state of the data structure is equivalent to that following a sequence of c_0 insertions only, and second, that in the insertions-only case, this algorithm identifies a majority element.
For the first point, it suffices to observe that the effect of each deletion of an element x is precisely to cancel out the effect of a prior insertion of that element. Following a sequence of I insertions and D deletions, the state is precisely that obtained if there had been I − D = n insertions only. The second part relies on the fact that if there is an item whose count is greater than n/2 (that is, it is in the majority), then for any way of dividing the elements into two sets, the set containing the majority element will have weight greater than n/2, and the other will have weight less than n/2. The tests are arranged so that each test determines the value of a particular bit of the index of the majority element. For example, the first test determines whether its index is even or odd by dividing on the basis of the least significant bit. The log m tests with binary outcomes are necessary and sufficient to determine the index of the majority element. Note that this algorithm is completely deterministic, and guarantees always to find the majority item if there is one. If there is no such item, then some item will still be returned, and it will not be possible to tell the difference based on the information stored. The simple structure of the tests is standard in group testing, and also resembles the structure of the Hamming single error-correcting code.

3.2 Finding k Hot Items

When we perform a test based on comparing the count of items in two buckets, we extract a single bit of information: whether there is a hot item present in the set or not. This leads immediately to a lower bound on the number of tests necessary: to locate k items among m locations requires log2 (m choose k) ≥ k log2(m/k) bits.
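Before generalizing, the Section 3.1 scheme (which the general algorithm will reuse as a subroutine within each group) can be sketched as follows; a minimal illustration under our own naming, with bits indexed from 1 as in the text:

```python
def make_counters(m_bits):
    # c[0] tracks n, the number of live items;
    # c[j] counts live items whose jth bit is 1.
    return [0] * (m_bits + 1)

def update(c, x, delta):
    """delta = +1 for an insertion of item x, -1 for a deletion."""
    c[0] += delta
    for j in range(1, len(c)):
        if (x >> (j - 1)) & 1:  # bit(x, j) = 1
            c[j] += delta

def majority(c):
    """If some live item occurs more than c[0]/2 times, recover it
    bit by bit: bit j of the majority item is 1 iff c[j] > c[0]/2."""
    return sum(1 << (j - 1) for j in range(1, len(c)) if 2 * c[j] > c[0])
```

As the proof of Theorem 3.1 notes, deletions exactly cancel insertions here, so the recovered item is correct even after heavy churn; if no majority exists, some item is still returned.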
We make the following observation: suppose we selected a group of items to monitor which happened to contain exactly one hot item. Then we could apply the algorithm of Section 3.1 to this group (splitting it into a further log m subsets) and, by keeping log m counters, identify which item was the hot one. We would simply have to “weigh” each bucket, and, provided that the total weight of other items in the group was not too great, the hot item would always be in the heavier of the two buckets. We could choose each group as a completely random subset of the items, and apply the algorithm for finding a single majority item described at the start of this section. But for a completely random selection of items, in order to store the description of the groups we would have to list every member of every group explicitly. This would consume a very large amount of space, at least linear in m. So instead, we shall look for a concise way to describe each group, so that given an item we can quickly determine which groups it is a member of. We shall make use of hash functions, which will map items onto the integers 1 · · · W, for some W that we shall specify later. Each group will consist of all items which are mapped to the same value by a particular hash function. If the hash functions have a concise representation, then this describes the groups in a concise fashion. It is important to understand exactly how strong the hash functions need to be to guarantee good results.

3.2.1 Hash Functions. We will make use of universal hash functions derived from those given by Carter and Wegman [1979]. We define a family of hash functions f_{a,b} as follows: fix a prime P > m > W, and draw a and b uniformly at random from the range [0 · · · P − 1]. Then set f_{a,b}(x) = ((ax + b) mod P) mod W. Using members of this family of functions will define our groups. Each hash function is defined by a and b, which are integers less than P.
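This hash family has a very concise representation, just the pair (a, b). A minimal sketch (our own code; note that for convenience it indexes groups 0 · · · W − 1 rather than 1 · · · W):

```python
import random

P = 2**31 - 1  # fix a prime P > m > W (a Mersenne prime is convenient)

def make_group_hash(W, rng=random):
    """Draw f_{a,b}(x) = ((a*x + b) mod P) mod W from the
    Carter-Wegman family; all items mapped to the same value
    form one group, so only (a, b) needs to be stored."""
    a, b = rng.randrange(P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % W

h = make_group_hash(8)  # the group of item x is simply h(x)
```

Storing a and b takes O(log m) bits per function, and pairwise collision probability is at most 1/W, which is Fact 3.2 below.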
P itself is chosen to be O(m), and so the space required to represent each hash function is O(log m) bits.

Fact 3.2 (Proposition 7 of Carter and Wegman [1979]). Over all choices of a and b, for x ≠ y, Pr[f_{a,b}(x) = f_{a,b}(y)] ≤ 1/W.

We can now describe the data structures that we will keep in order to allow us to find up to k hot items.

3.2.2 Nonadaptive Group Testing Data Structure. The group testing data structure is initialized with two parameters W and T, and has three components:

— a three-dimensional array of counters c, of size T × W × (log(m) + 1);
— T universal hash functions h, defined by a[1 · · · T] and b[1 · · · T] so that h_i = f_{a[i],b[i]};
— the count n of the current number of items.

Fig. 3. Procedures for finding hot items using nonadaptive group testing.

The data structure is initialized by setting all the counters, c[1][0][0] to c[T][W − 1][log m], to zero, and by choosing values for each entry of a and b uniformly at random in the range [0 · · · P − 1]. The space used by the data structure is O(TW log m). We shall specify values for W and T later. We will write h_i to indicate the ith hash function, so h_i(x) = ((a[i] · x + b[i]) mod P) mod W. Let G_{i,j} = {x | h_i(x) = j} be the (i, j)th group. We will use c[i][j][0] to keep the count of the current number of items within G_{i,j}. For each such group, we shall also keep counts for log m subgroups, defined as G_{i,j,l} = {x | x ∈ G_{i,j} ∧ bit(x, l) = 1}. These correspond to the groups we kept for finding a majority item. We will use c[i][j][l] to keep the count of the current number of items within subgroup G_{i,j,l}. This leads to the following update procedure.

3.2.3 Update Procedure.
Our procedure in processing an input item x is to determine which groups it belongs to, and to update the log m counters for each of these groups based on the bit representation of x, in exactly the same way as the algorithm for finding a majority element. If the transaction is an insertion, then we add one to the appropriate counters, and subtract one for a deletion. The current count of items is also maintained. This procedure is shown in pseudocode as PROCESSITEM(x, trans, T, W) in Figure 3. The time to perform an update is the time taken to compute the T hash functions, and to modify O(T log m) counters. At any point, we can search the data structure to find hot items. Various checks are made to avoid including in the output any items which are not hot. In group testing terms, the test that we will use is whether the count for a group or subgroup exceeds the threshold needed for an item to be hot, which is n/(k + 1). Note that any group which contains a hot item will pass this test, but it is possible that a group which does not contain a hot item can also pass this test. We will later analyze the probability of such an event, and show that it can be made quite small.

3.2.4 Search Procedure. For each group, we will use the information about the group and its subgroups to test whether there is a hot item in the group, and if so, to extract the identity of the hot item. We process each group G_{i,j} in turn. First, we test whether there can be a hot item in the group. If c[i][j][0] ≤ n/(k + 1), then there cannot be a hot item in the group, and so the group is rejected. Then we look at the count of every subgroup, compared to the count of the whole group, and consider the four possible cases:

c[i][j][l] > n/(k+1)?   c[i][j][0] − c[i][j][l] > n/(k+1)?   Conclusion
No    No     There cannot be a hot item in the group, so reject the group.
No    Yes    If a hot item x is in the group, then bit(x, l) = 0.
Yes   No     If a hot item x is in the group, then bit(x, l) = 1.
Yes   Yes    It is not possible to identify the hot item, so reject the group.

If the group is not rejected, then the identity of the candidate hot item, x, can be recovered from the tests. Some verification of the candidate can then be carried out.

— The candidate item must belong to the group it was found in, so check that h_i(x) = j.
— If the candidate item is hot, then every group it belongs to should be above the threshold, so check that c[i][h_i(x)][0] > n/(k + 1) for all i.

The time to find all hot items is O(T²W log m). There can be at most TW candidates returned, and checking them all takes worst-case time O(T) each. The full algorithms are illustrated in Figure 3. We now show that for appropriate choices of T and W we can first ensure that all hot items are found, and second ensure that no items are output which are far from being hot.

LEMMA 3.3. Choosing W ≥ 2k and T = log2(k/δ) for a user-chosen parameter δ ensures that the probability of all hot items being output is at least 1 − δ.

PROOF. Consider each hot item x in turn, remembering that there are at most k of these. Using Fact 3.2 about the hash functions, the probability of any other item falling into the same group as x under the ith hash function is at most 1/W ≤ 1/(2k). Using linearity of expectation, the expectation of the total frequency of other items which land in the same group as item x is

E[Σ_{y≠x, h_i(y)=h_i(x)} f_y] = Σ_{y≠x} f_y · Pr[h_i(y) = h_i(x)] ≤ Σ_{y≠x} f_y/(2k) ≤ (1 − f_x)/(2k) ≤ 1/(2(k + 1)).   (1)

Our test cannot fail if the total weight of other items which fall in the same bucket is less than 1/(k + 1). This is because each time we compare the counts of items in the group, we conclude that the hot item is in the half with the greater count.
If the total frequency of other items is less than 1/(k + 1), then the hot item will always be in the heavier half, and so, using a similar argument to that for the majority case, we will be able to read off the index of the hot item using the results of the log m groups. The probability of failing due to the weight of other items in the same bucket being more than 1/(k + 1) is bounded by the Markov inequality by 1/2, since this threshold is at least twice the expectation. So the probability that we fail on every one of the T independent tests is less than (1/2)^{log2(k/δ)} = δ/k. Using the union bound over all hot items, the probability of any of them failing is less than δ, and so every hot item is output with probability at least 1 − δ.

LEMMA 3.4. For any user-specified fraction ε ≤ 1/(k + 1), if we set W ≥ 2/ε and T = log2(k/δ), then the probability of outputting any item y with f_y < 1/(k + 1) − ε is at most δ/k.

PROOF. This lemma follows because of the checks we perform on every item before outputting it. Given a candidate item, we check that every group it is a member of is above the threshold. Suppose the frequency of the item y is less than 1/(k + 1) − ε. Then the total frequency of the other items which fall in the same group under hash function i must be at least ε to push the count for the group over the threshold for the test to return positive. By the same argument as in the above lemma, the probability of this event is at most 1/2. So the probability that this occurs in all groups is bounded by (1/2)^{log2(k/δ)} = δ/k.

Putting these two lemmas together allows us to state our main result on nonadaptive group testing:

THEOREM 3.5. With probability at least 1 − δ, we can find all hot items whose frequency is more than 1/(k + 1), and, given ε ≤ 1/(k + 1), with probability at least 1 − δ/k each item which is output has frequency at least 1/(k + 1) − ε, using space O((1/ε) log(m) log(k/δ)) words.
Each update takes time O(log(m) log(k/δ)). Queries take time no more than O((1/ε) log2(k/δ) log m).

PROOF. This follows by setting W = 2/ε and T = log2(k/δ), and applying the above two lemmas. To process an item, we compute T hash functions, and update T log m counters, giving the time cost. To extract the hot items involves a scan over the data structure in linear time, plus a check on each hot item found that takes time at most O(T), giving total time O(T²W log m).

Next, we describe additional properties of our method which imply its stability and resilience.

COROLLARY 3.6. The data structure created with T = log2(k/δ) and W ≥ 2k can be used to find hot items with parameter k′ for any k′ < k, with the same probability of success 1 − δ.

PROOF. Observe in Lemma 3.3 that, to find k′ hot items, we required W ≥ 2k′. If we use a data structure created with W ≥ 2k, then W ≥ 2k > 2k′, and so the data structure can be used for any value of k′ less than the value k it was created for. Similarly, we have more tests than we need, which can only help the accuracy of the group testing. All other aspects of the data structure are identical. So, if we run the procedure with the higher threshold, then with probability at least 1 − δ, we will find the hot items. This property means that we can fix k to be as large as we want, and are then able to find hot items with any frequency greater than 1/(k′ + 1) determined at query time.

COROLLARY 3.7. The output of the algorithm is the same for any reordering of the input data.

PROOF. During any insertion or deletion, the algorithm takes the same action and does not inspect the contents of the memory. It just adds or subtracts values from the counters, as a function solely of the item value. Since addition and subtraction commute, the corollary follows.

3.2.5 Estimation of Count of Hot Items.
Once the hot items have been identified, we may wish additionally to estimate the count, n_x, of each of these items. One approach would be to keep a second data structure enabling the estimation of the counts to be made. Such data structures are typically compact, fast to update, and give accurate answers for items whose count is large, that is, hot items [Gilbert et al. 2002b; Charikar et al. 2002; Cormode and Muthukrishnan 2004a]. However, note that the data structure that we keep already embeds a structure that allows us to compute an estimate of the weight of each item [Cormode and Muthukrishnan 2004a].

COROLLARY 3.8. Computing min_i c[i][h_i(x)][0] gives a good estimate for n_x with probability at least 1 − (δ/k).

PROOF. This follows from the proofs of Lemma 3.3 and Lemma 3.4. Each estimate c[i][h_i(x)][0] = n_x + Σ_{y≠x, h_i(y)=h_i(x)} n_y. But by Lemma 3.4, this additional noise is bounded by εn with probability at least 1/2, as shown in Equation (1). Taking the minimum over all estimates amplifies this probability to 1 − (δ/k).

3.3 Time-Space Tradeoff

In certain situations, when transactions are occurring at very high rates, it is vital to make the update procedure as fast as possible. One of the drawbacks of the current procedure is that it depends on the product of T and log m, which can be slow for items with large identifiers. To reduce the time dependency on T, note that the data structure is intrinsically parallelizable: each of the T hash functions can be applied in parallel, and the relevant counts modified separately. In the experimental section we will show that good results are observed even for very small values of T; therefore, the main bottleneck is the dependence on log m. The dependency on log m arises because we need to recover the identifier of each hot item, and we do this 1 bit at a time. Our observation here is that we can find the identifier in different units, for example, 1 byte at a time, at the expense of extra space usage.
Formally, define dig(x, i, b) to be the ith digit in the integer x when x is written in base b ≥ 2. Within each group, we keep (b − 1) × log_b m subgroups: the (i, j)th subgroup counts how many items have dig(x, i, b) = j, for i = 1 · · · log_b m and j = 1 · · · b − 1. We do not need to keep a subgroup for j = 0, since this count can be computed from the other counts for that group. Note that b = 2 corresponds to the binary case discussed already, and b = m corresponds to the simple strategy of keeping a count for every item.

THEOREM 3.9. Using the above procedure, with probability at least 1 − δ we can find all hot items whose frequency is more than 1/(k + 1), and with probability at least 1 − (δ/k) each item which is output has frequency at least 1/(k + 1) − ε, using space O((b/ε) log_b(m) log(k/δ)) words. Each update takes time O(log_b(m) log(k/δ)), and queries take O((b/ε) log_b(m) log2(k/δ)) time.

PROOF. Each subgroup now allows us to read off one digit in the base-b representation of the identifier of any hot item x. Lemma 3.3 applies to this situation just as before, as does Lemma 3.4. This leads us to set W and T as before. We have to update one counter for each digit in the base-b representation of each item for each transaction, which corresponds to log_b m counters per test, giving an update time of O(T log_b(m)). The space required is for the counters to record the subgroups of the TW groups, and there are (b − 1) log_b(m) subgroups of every group, giving the space bounds.

For efficient implementations, it will generally be preferable to choose b to be a power of 2, since this allows efficient computation of indices using bit-level operations (shifts and masks). The space cost can be relatively high for speedups: choosing b = 2^8 means that each update operation is eight times faster than for b = 2, but requires 32 times more space.
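The digit decomposition behind this tradeoff can be sketched as follows (illustrative helper names are ours):

```python
def dig(x, i, b):
    """The ith digit (i = 1, 2, ...) of x written in base b."""
    return (x // b ** (i - 1)) % b

def base_b_digits(m, b):
    """Number of base-b digits needed for identifiers up to m."""
    d = 0
    while b ** d < m:
        d += 1
    return d

def counters_per_group(m, b):
    """(b - 1) subgroup counters per digit; the j = 0 digit is
    implied by the group total and the other digit counts."""
    return (b - 1) * base_b_digits(m, b)
```

For example, with m = 2^16, b = 2 needs 16 counters per group while b = 2^8 needs 510, roughly the 32-times space factor quoted above in exchange for 8-times fewer counter updates per transaction.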
A more modest value of b may strike the right balance: choosing b = 4 doubles the update speed, while the space required increases by 50%. We investigate the effects of this tradeoff further in our experimental study.

4. ADAPTIVE GROUP TESTING

The more flexible model of adaptive group testing allows conceptually simpler choices of groups, although the data structures required to support the tests become more involved. The idea is a very natural “divide-and-conquer” style approach, and as such may seem straightforward. We give the full details here to emphasize the relation between viewing this as an adaptive group testing procedure and the above nonadaptive group testing approach. Also, this method does not seem to have been published before, so we give the full description for completeness. Consider again the problem of finding a majority item, assuming that one exists. Then an adaptive group testing strategy is as follows: test whether the count of all items in the range {1 · · · m/2} is above n/2, and also whether the count of all items in the range {m/2 + 1 · · · m} is over the threshold. Recurse on whichever half contains more than half the items, and the majority item is found in log2 m rounds. The question is: how do we support this adaptive strategy as transactions are seen? As counts increase and decrease, we do not know in advance which queries will be posed, and so the solution seems to be to keep counts for every test that could be posed—but there are Ω(m) such tests, which is too much to store. The solution comes from observing that we do not need to know counts exactly; rather, it suffices to use approximate counts, and these can be supported using a data structure that is much smaller, with size dependent on the quality of approximation.

Fig. 4. Adaptive group testing algorithms.
We shall make use of the fact that the range of items can be mapped onto the integers 1 · · · m. We will initially describe an adaptive group testing method in terms of an oracle that is assumed to give exact answers, and then show how this oracle can be realized approximately.

Definition 4.1. A dyadic range sum oracle returns the (approximate) sum of the counts of items in the range l = i·2^j + 1 · · · r = (i + 1)·2^j, for 0 ≤ j ≤ log m and 0 ≤ i ≤ m/2^j.

Using such an oracle, which reflects the effect of items arriving and departing, it is possible to find all the hot items with the following binary-search divide-and-conquer procedure. For simplicity of presentation, we assume that m, the range of items, is a power of 2. Beginning with the full range, recursively split in two. If the total count of any range is less than n/(k + 1), then do not split it further. Else, continue splitting until a hot item is found. It follows that O(k log(m/k)) calls are made to the oracle. The procedure is presented as ADAPTIVEGROUPTEST on the right in Figure 4.

In order to implement dyadic range sum oracles, define an approximate count oracle to return the (approximate) count of the item x. A dyadic range sum oracle can be implemented using j = 0 · · · log m approximate count oracles: for each item x in the stream, insert ⌊x/2^j⌋ into the jth approximate count oracle, for all j. Recent work has given several methods of implementing the approximate count oracle, which can be updated to reflect the arrival or departure of any item. We now list three examples of these and give their space and update time bounds:

— The “tug of war sketch” technique of Alon et al. [1999] uses space and time O((1/ε²) log(1/δ)) to approximate any count up to ±εn with probability at least 1 − δ.
— The method of random subset sums described in Gilbert et al. [2002b] uses space and time O((1/ε²) log(1/δ)).
— The method of Charikar et al. [2002]
builds a structure which can be used to approximate the count of any item correct up to ±εn in space O((1/ε²) log(1/δ)) and time per update O(log(1/δ)).

The fastest of these methods is that of Charikar et al. [2002], and so we shall adopt this as the basis of our adaptive group testing solution. In the next section we describe and analyze the data structure and algorithms for our purpose of finding hot items.

4.1 CCFC Count Sketch

We shall briefly describe and analyze the CCFC count sketch.¹ This is a different and shorter analysis compared to that given in Charikar et al. [2002], since here the goal is to estimate each count to within an error in terms of the total count of all items, rather than in terms of the count of the kth most frequent item, as was the case in the original article.

4.1.1 Data Structure. The data structure used consists of a table of counters t, with width W and height T, initialized to zero. We also keep T pairs of universal hash functions: h_1 · · · h_T, which map items onto 1 · · · W, and g_1 · · · g_T, which map items onto {−1, +1}.

4.1.2 Update Routine. When an insert transaction of item x occurs, we update t[i][h_i(x)] ← t[i][h_i(x)] + g_i(x) for all i = 1 · · · T. For a delete transaction, we update t[i][h_i(x)] ← t[i][h_i(x)] − g_i(x) for all i = 1 · · · T.

4.1.3 Estimation. To estimate the count of x, compute median_i(t[i][h_i(x)] · g_i(x)).

4.1.4 Analysis. Use the random variable X_i to denote t[i][h_i(x)] · g_i(x). The expectation of each estimate is

E(X_i) = n_x + Σ_{y≠x} n_y · Pr[h_i(y) = h_i(x)] · (Pr[g_i(x) = g_i(y)] − Pr[g_i(x) ≠ g_i(y)]) = n_x,

since Pr[g_i(x) = g_i(y)] = 1/2.
The variance of each estimate is

Var(X_i) = E(X_i²) − E(X_i)² (2)

= E(g_i²(x) · (t[i][h_i(x)])²) − n_x² (3)

= 2 Σ_{y≠x, z} n_y n_z Pr[h_i(y) = h_i(z)] (Pr[g_i(x) = g_i(y)] − Pr[g_i(x) ≠ g_i(y)]) (4)

+ n_x² + Σ_{y≠x} g_i²(y) n_y² Pr[h_i(y) = h_i(x)] − n_x² (5)

= Σ_{y≠x} n_y²/W ≤ n²/W. (6)

Using the Chebyshev inequality, it follows that Pr[|X_i − n_x| > √2 · n/√W] < 1/2. Taking the median of T estimates amplifies this probability to 2^(−T/4), by a standard Chernoff bounds argument [Motwani and Raghavan 1995].

¹CCFC denotes the initials of the authors of Charikar et al. [2002].

What's Hot and What's Not: Tracking Most Frequent Items Dynamically • 267

4.1.5 Space and Time. The space used is for the W T counters and the 2T hash functions. The time taken for each update is the time to compute the 2T hash functions and update T counters.

THEOREM 4.2. By setting W = 2/ε² and T = 4 log(1/δ), we can estimate the count of any item up to error ±εn with probability at least 1 − δ.

4.2 Adaptive Group Testing Using the CCFC Count Sketch

We can now implement an adaptive group testing solution to finding hot items. The basic idea is to apply the adaptive binary search procedure, using the above count sketch to implement the dyadic range sum oracle. The full procedure is shown in Figure 4.

THEOREM 4.3. Setting W = 2/ε² and T = 4 log(k log m/δ) allows us to find every item with frequency greater than 1/(k + 1) + ε, and to report no item with frequency less than 1/(k + 1) − ε, with probability at least 1 − δ. The space used is O((1/ε²) log(m) log(k log m/δ)) words, and the time to perform each update is O(log(m) log(k log m/δ)). The query time is O(k log(m) log(k log m/δ)), with probability at least 1 − δ.

PROOF. We set the probability of failure of each oracle call to be low (δ/(k log m)), so that, for the O(k log m) queries that we pose to the oracle, there is probability at most δ of any of them failing, by the union bound.
Hence, we can assume that, with probability at least 1 − δ, all approximations are within the ±εn error bound. Then, when we search for hot items, any range containing a hot item will have its approximate count reduced by at most εn. This will allow us to find the hot item, and output it if its frequency is at least 1/(k + 1) + ε. Any item which is output must pass the final test, based on the count of just that item, which will not happen if its frequency is less than 1/(k + 1) − ε.

Space is needed for log(m) sketches, each of which has size O(T W) words. For these settings of T and W, we obtain the space bounds listed in the theorem. The time per update is that needed to compute 2T log(m) hash values, and then to update up to this many counters, which gives the stated update time.

4.2.1 Hot Item Count Estimation. Note that we can immediately extract the estimated counts for each hot item using the data structure, since the count of item x is given by the lowest-level approximate count oracle. Hence, the count n_x is estimated with error at most εn, in time O(log(m) log(k log m/δ)).

4.3 Time-Space Tradeoffs

As with the nonadaptive group testing method, the time cost for updates depends on T and log m. Again, in practice we found that small values of T could be used, and that computation of the hash functions could be parallelized for extra speedup. Here, the dependency on log m is again the limiting factor. A similar trick to the nonadaptive case is possible, to change the update time dependency to log_b m for arbitrary b: instead of basing the oracle on dyadic ranges, base it on b-adic ranges. Then only log_b m sketches need to be updated for each transaction. However, under this modification, the same guarantees do not hold.
In order to extract the hot items, many more queries are needed: instead of making at most two queries per hot item per level, we make at most b queries per hot item per level, and so we need to reduce the probability of making a mistake to reflect this. One solution would be to modify T to give a guarantee, but this can lose the point of the exercise, which is to reduce the cost of each update. So instead we treat this as a heuristic to try out in practice, and see how well it performs.

A more concrete improvement to space and time bounds comes from observing that it is wasteful to keep sketches for high levels in the hierarchy, since there are very few items to monitor. It is therefore an improvement to keep exact counts for items at high levels in the hierarchy.

5. COMPARISON BETWEEN METHODS AND EXTENSIONS

We have described two methods to find hot items after observing a sequence of insertion and deletion transactions, and proved that they can give guarantees about the quality of their output. These are the first methods able to give such guarantees in the presence of deletions, and we now go on to compare these two different approaches. We will also briefly discuss how they can be adapted when the input may come in other formats.

Under the theoretical analysis, it is clear that the adaptive and nonadaptive methods have some features in common. Both make use of universal hash functions to map items to counters where counts are maintained. However, the theoretical bounds on the adaptive search procedure look somewhat weaker than those on the nonadaptive methods. To give a guarantee of not outputting items which are more than ε from being hot items, the adaptive group testing depends on 1/ε² in space, whereas nonadaptive testing uses 1/ε. The update times look quite similar, depending on the product of the number of tests, T, and the bit depth of the universe, log_b(m).
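To make the adaptive method concrete before the empirical comparison, it can be sketched end to end: per-level counters implement the dyadic range sum oracle of Definition 4.1 (level j keys items by ⌊(x−1)/2^j⌋, so each dyadic range is a single key), and a recursive split implements ADAPTIVEGROUPTEST. Exact Counters stand in here for the approximate sketches; the class name and interface are our own.

```python
from collections import Counter

class AdaptiveGroupTest:
    """Adaptive group testing over items 1..m, with m a power of 2.

    Exact per-level counters stand in for the approximate count oracles;
    level j keys items by floor((x - 1) / 2**j), so the dyadic range
    i*2^j + 1 .. (i+1)*2^j is exactly the set of items with key i.
    """
    def __init__(self, m, k):
        self.m, self.k, self.n = m, k, 0
        self.levels = m.bit_length()                 # j = 0 .. log2(m)
        self.count = [Counter() for _ in range(self.levels)]

    def update(self, x, delta=1):                    # +1 insert, -1 delete
        self.n += delta
        for j in range(self.levels):
            self.count[j][(x - 1) >> j] += delta

    def hot_items(self):
        threshold, hot = self.n / (self.k + 1), []

        def search(j, i):                            # range i*2^j+1 .. (i+1)*2^j
            if self.count[j][i] <= threshold:
                return                               # prune: no hot item inside
            if j == 0:
                hot.append(i + 1)                    # single item over threshold
            else:
                search(j - 1, 2 * i)                 # recursively split in two
                search(j - 1, 2 * i + 1)

        search(self.levels - 1, 0)                   # begin with the full range
        return sorted(hot)
```

Because pruned subranges are never queried, only O(k log(m/k)) oracle lookups are made, matching the bound stated in Section 4.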
It will be important to see how these methods perform in practice, since these are only worst-case guarantees. In order to compare these methods in concrete terms, we shall use the same values of T and W for adaptive and nonadaptive group testing in our tests, so that both methods are allocated approximately the same amount of space.

Another difference is that adaptive group testing requires many more hash function evaluations to process each transaction compared to nonadaptive group testing. This is because adaptive group testing computes a different hash for each of the log m prefixes of the item, whereas nonadaptive group testing computes one hash function to map the item to a group, and then allocates it to subgroups based on its binary representation. Although the universal hash functions can be implemented quite efficiently [Thorup 2000], this extra processing time can become apparent for high transaction rates.

5.1 Other Update Models

In this work we assume that we modify counts by one each time, to model insertions or deletions. But there is no reason to insist on this: the above proofs work for arbitrary count distributions, hence it is possible to allow the counts to be modified by arbitrary increments or decrements, in the same update time bounds. The counts can even include fractional values if so desired. This holds for both the adaptive and nonadaptive methods. Another feature is that it is straightforward to combine the data structures for the merge of two distributions: provided both data structures were created using the same parameters and hash functions, summing the counters coordinatewise gives the same set of counts as if the whole distribution had been processed by a single data structure.
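This merge property can be checked directly. In the sketch below, `make_table` is our toy stand-in for the sketch construction; sharing a seed models "created using the same parameters and hash functions," and the property holds because every counter update is additive.

```python
import random

def make_table(stream, W, T, seed=0):
    """Process a stream of (item, delta) pairs into a T x W signed-counter table."""
    rng = random.Random(seed)
    salts = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(T)]
    t = [[0] * W for _ in range(T)]
    for x, delta in stream:
        for i, (hs, gs) in enumerate(salts):
            sign = 1 if hash((gs, x)) & 1 else -1     # g: items -> {-1, +1}
            t[i][hash((hs, x)) % W] += delta * sign   # h: items -> 0..W-1
    return t

def merge(ta, tb):
    # summing the counters coordinatewise equals processing the merged stream
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(ta, tb)]
```

Any linear sketch has this behavior; the point is that distributed sites can build tables independently and combine them later with no loss.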
This should be contrasted to other approaches [Babcock and Olston 2003], which also compute the overall hot items from multiple sources, but keep a large amount of space at each location: there, the focus is on minimizing the amount of communication. Immediate comparison of the approaches is not possible, but for periodic updates (say, every minute) it would be interesting to compare the communication used by the two methods.

6. EXPERIMENTS

6.1 Evaluation

To evaluate our approach, we implemented our group testing algorithms in C. We also implemented two algorithms which operate on nondynamic data: the algorithm Lossy Counting [Manku and Motwani 2002] and Frequent [Demaine et al. 2002]. Neither algorithm is able to cope with the case of the deletion of an item, and there is no obvious modification to accommodate deletions and still guarantee the quality of the output. We instead performed a "best effort" modification: since both algorithms keep counters for certain items, which are incremented when that item is inserted, we modified the algorithms to decrement the counter whenever the corresponding item was deleted. When an item without a counter was deleted, then we took no action.² This modification ensures that when the algorithms encounter an inserts-only dataset, their action is the same as the original algorithms. Code for our implementations is available on the Web, from http://www.cs.rutgers.edu/~muthu/massdal-code-index.html.

6.1.1 Evaluation Criteria. We ran tests on both synthetic and real data, and measured the time and space usage of all four methods. Evaluation was carried out on a 2.4-GHz desktop PC with 512-MB RAM. In order to evaluate the quality of the results, we used two standard measures: recall and precision.

Definition 6.1. The recall of an experiment to find hot items is the proportion of the hot items that are found by the method. The precision is the proportion of items identified by the algorithm which are hot items.
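Definition 6.1 translates directly into code (the function name is ours):

```python
def recall_and_precision(reported, true_hot):
    """Recall: fraction of hot items found. Precision: fraction of output that is hot."""
    reported, true_hot = set(reported), set(true_hot)
    found = reported & true_hot
    recall = len(found) / len(true_hot) if true_hot else 1.0
    precision = len(found) / len(reported) if reported else 1.0
    return recall, precision
```

For instance, reporting every item in the universe gives recall 1.0 but very low precision, which is exactly the tension discussed next.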
It will be interesting to see how these properties interact. For example, if an algorithm outputs every item in the range 1 · · · m, then it clearly has perfect recall (every hot item is indeed included in the output), but its precision is very poor. At the other extreme, an algorithm which is able to identify only the most frequent item will have perfect precision, but may have low recall if there are many hot items. For example, the Frequent algorithm gives guarantees on the recall of its output, but does not strongly bound the precision, whereas, for Lossy Counting, the parameter ε affects the precision indirectly (depending on the properties of the sequence). Meanwhile, our group testing methods give probabilistic guarantees of perfect recall and good precision.

²Many variations of this theme are possible. Our experimental results here that compare our algorithms to modifications of Lossy Counting [Manku and Motwani 2002] and Frequent [Demaine et al. 2002] should be considered proof-of-concept only.

Fig. 5. Experiments on a sequence of 10^7 insertion-only transactions. Left: testing recall (proportion of the hot items reported). Right: testing precision (proportion of the output items which were hot).

6.1.2 Setting of Parameters. In all our experiments, we set ε = 1/(k + 1), and hence set W = 2(k + 1), since this keeps the memory usage quite small. In practice, we found that this setting of ε gave quite good results for our group testing methods, and that smaller values of ε did not significantly improve the results. In all the experiments, we ran both group testing methods with the same values of W and T, which ensured that on most base experiments they used the same amount of space. In our experiments, we looked at the effect of varying the value of the parameters T and b.
We gave the parameter ε to each algorithm and saw how much space it used to give a guarantee based on this ε. In general, the deterministic methods used less space than the group testing methods. However, when we made additional space available to the deterministic methods, equivalent to that used by the group testing approaches, we did not see any significant improvement in their precision, and we saw a similar pattern of dependency on the Zipf parameter.

6.2 Insertions-Only Data

Although our methods have been designed for the challenges of transaction sequences that contain a mix of insertions and deletions, we first evaluated a sequence of transactions which contained only insertions. These were generated by a Zipf distribution, whose parameter was varied from 0 (uniform) to 3 (highly skewed). We set k = 1000, so we were looking for all items with frequency 0.1% and higher. Throughout, we worked with a universe of size m = 2^32.

Our first observation on the performance of group testing-based methods is that they gave good results with very small values of T. The plots in Figure 5 show the precision and recall of the methods with T = 2, meaning that each item was placed in two groups in nonadaptive group testing, and two estimates were computed for each count in adaptive group testing. Nonadaptive group testing is denoted as algorithm "NAGT," and adaptive group testing as algorithm "Adapt." Note that, on this data set, the algorithms Lossy Counting and Frequent both achieved perfect recall; that is, they returned every hot item. This is not surprising: the deterministic guarantees ensure that they will find all hot items when the data consists of inserts only.

Fig. 6. Experiments on synthetic data consisting of 10^7 transactions.
Group testing approaches did well here: nonadaptive got almost perfect recall, and adaptive missed only a few items for near-uniform distributions. On distributions with a small Zipf parameter, many items had counts which were close to the threshold for being a hot item, meaning that adaptive group testing can easily miss an item which is just over the threshold, or include an item which is just below. This is also visible in the precision results: while nonadaptive group testing included no items which were not hot, adaptive group testing did include some. However, the deterministic methods also did quite badly on precision, frequently including many items which were not hot in their output. For this value of ε, Lossy Counting did much better than Frequent, but consistently worse than group testing. As we increased T, both nonadaptive and adaptive group testing got perfect precision and recall on all distributions. For the experiment illustrated, the group testing methods both used about 100 KB of space each, while the deterministic methods used a smaller amount of space (around half as much).

6.3 Synthetic Data with Insertions and Deletions

We created synthetic datasets designed to test the behavior when confronted with a sequence including deletes. The datasets were created in three equal parts: first, a sequence of insertions distributed uniformly over a small range; next, a sequence of inserts drawn from a Zipf distribution with varying parameters; last, a sequence of deletes distributed uniformly over the same range as the starting sequence. The net effect of this sequence should be that the first and last groups of transactions would (mostly) cancel out, leaving the "true" signal from the Zipf distribution. The dataset was designed to test whether the algorithms could find this signal from the added noise.
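This three-part construction can be sketched as follows. The truncated inverse-CDF Zipf sampler and all names here are our own choices for illustration, not necessarily the generator used in the experiments.

```python
import random

def zipf_sample(rng, m, alpha):
    # inverse-CDF sampling from a Zipf(alpha) distribution truncated to 1..m
    weights = [1.0 / (i ** alpha) for i in range(1, m + 1)]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights, 1):
        r -= w
        if r <= 0:
            return i
    return m

def make_dataset(n, m, alpha, seed=0):
    """Three equal parts: uniform inserts, Zipf inserts (the signal), uniform deletes."""
    rng = random.Random(seed)
    third = n // 3
    noise_in = [(rng.randrange(1, m + 1), +1) for _ in range(third)]
    signal = [(zipf_sample(rng, m, alpha), +1) for _ in range(third)]
    noise_out = [(rng.randrange(1, m + 1), -1) for _ in range(third)]
    return noise_in + signal + noise_out
```

Since the uniform inserts and deletes cancel only in aggregate, individual item counts retain some residual noise around the Zipf signal, which is what makes the task nontrivial.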
We generated a dataset of 10,000,000 items, so it was possible to compute the exact answers for comparison, and searched for the k = 1000 hot items while varying the Zipf parameter of the signal. The results are shown in Figure 6, with the recall plotted on the left and the precision on the right. Each data point comes from one trial, rather than averaging over multiple repetitions.

The purpose of this experiment was to demonstrate a scenario where insert-only algorithms would not be able to cope when the dataset included many deletes (in this case, one in three of the transactions was a deletion). Lossy Counting performed worst on both recall and precision, while Frequent managed to get good recall only when the signal was very skewed, meaning the hot items had very high frequencies compared to all other items. Even when the recall of the other algorithms was reasonably good (finding around three-quarters of the hot items), their precision was very poor: for every hot item that was reported, around 10 infrequent items were also included in the output, and we could not distinguish between these two types. Meanwhile, both group testing approaches succeeded in finding almost all hot items, and output few infrequent items.

There is a price to pay for the extra power of the group testing algorithms: they take longer to process each item under our implementation, and require more memory. However, these memory requirements are all very small compared to the size of the dataset: both group testing methods used 187 kB, while Lossy Counting allocated 40 kB on average, and Frequent used 136 kB.³ In a later section, we look at the time and space costs of the group testing methods in more detail.

6.4 Real Data with Insertions and Deletions

We obtained data from one of AT&T's networks for part of a day, totaling around 100 MB.
This consisted of a sequence of new telephone connections being initiated, and subsequently closed. The duration of the connections varied considerably, meaning that at any one time there were huge numbers of connections in place. In total, there were 3.5 million transactions. We ran the algorithms on this dynamic sequence in order to test their ability to operate on naturally occurring sequences. After every 100,000 transactions we posed the query to find all (source, destination) pairs with a current frequency greater than 1%. We were grouping connections by their regional codes, giving many millions of possible pairs, m, although we discovered that geographically neighboring areas generated the most communication. This meant that there were significant numbers of pairings achieving the target frequency. Again, we computed recall and precision for the three algorithms, with the results shown in Figure 7; we set T = 2 again and ran nonadaptive group testing (NAGT) and adaptive group testing (Adapt).

The nonadaptive group testing approach is shown to be justified here on real data. In terms of both recall and precision, it is nearly perfect. On one occasion, it overlooked a hot item, and a few times it included items which were not hot. Under certain circumstances this may be acceptable if the items included are "nearly hot," that is, are just under the threshold for being considered hot. However, we did not pursue this line. In the same amount of space, adaptive group testing did almost as well, although its recall and precision were both less good overall than nonadaptive. Both methods reached perfect precision and recall as T was increased: nonadaptive group testing achieved perfect scores for T = 3, and adaptive for T = 7.

Lossy Counting performed generally poorly on this dynamic dataset, its quality of results swinging wildly between readings but on average finding only half the hot items. The recall of the Frequent algorithm looked reasonably good, especially as time progressed, but its precision, which began poorly, appeared to degrade further. One possible explanation is that the algorithm was collecting all items which were ever hot, and outputting these whether they were currently hot or not. Certainly, it output between two and three times as many items as were currently hot, meaning that its output necessarily contained many infrequent items.

Next, we ran tests which demonstrated the flexibility of our approach. As noted in Section 3.2, if we create a set of counters for nonadaptive group testing for a particular frequency level f = 1/(k + 1), then we can use these counters to answer a query for a higher frequency level without any need for recomputation. To test this, we computed the data structure for the first million items of the real data set based on a frequency level of 0.5%. We then asked for all hot items for a variety of frequencies between 10% and 0.5%. The results are shown in Figure 8. As predicted, the recall level was the same (100% throughout), and precision was high, with a few nonhot items included at various points. We then examined how far below its designed capability we could push the group testing algorithm, and ran queries asking for hot items with progressively lower frequencies.

³These reflected the space allocated for the insert-only algorithms, based on upper bounds on the space needed. This was done to avoid complicated and costly memory allocation while processing transactions.

Fig. 7. Performance results on real data.

Fig. 8. Choosing the frequency level at query time: the data structure was built for queries at the 0.5% level, but was then tested with queries ranging from 10% to 0.01%.
For nonadaptive group testing with T = 1, the quality of the recall began deteriorating after the query frequency descended below 0.5%, but for T = 3 the results maintained an impressive level of recall down to around the 0.05% level, after which the quality deteriorated (around this point, the threshold for being considered a hot item was down to a count in single figures, due to deletions removing previously inserted items). Throughout, the precision of both sets of results was very high, close to perfect even when used far below the intended range of operation.

6.5 Timing Results

On the real data, we timed how long it took to process transactions, as we varied certain parameters of the methods. We also plotted the time taken by the insert-only methods for comparison. Timing results are shown in Figure 9. On the left are timing results for working through the whole data set. As we would expect, the time scaled roughly linearly with the number of transactions processed. Nonadaptive group testing was a few times slower than the insertion-only methods, which were very fast. With T = 2, nonadaptive group testing processed over a million transactions per second. Adaptive group testing was somewhat slower. Although asymptotically the two methods have the same update cost, here we see the effect of the difference in the methods: since adaptive group testing computes many more hash functions than nonadaptive (see Section 5), the cost of this computation is clear. It is therefore desirable to look at how to reduce the number of hash function computations done by adaptive group testing. Applying the ideas discussed in Sections 3.3 and 4.3, we tried varying the parameter b from 2. The results for this are shown on the right in Figure 9.

Fig. 9. Timing results on real data.
Here, we plot the time to process two million transactions for different values of b, against T, the number of repetitions of the process. It can be seen that increasing b does indeed bring down the cost of adaptive and nonadaptive group testing. For T = 1, nonadaptive group testing becomes competitive with the insertion methods in terms of the time to process each transaction. We also measured the output time for each method. The adaptive group testing approach took an average of 5 ms per query, while nonadaptive group testing took 2 ms. The deterministic approaches took less than 1 ms per query.

6.6 Time-Space Tradeoffs

To see in more detail the effect of varying b, we plotted the time to process two million transactions for eight different values of b (2, 4, 8, 16, 32, 64, 128, and 256) and three values of T (1, 2, 3) at k = 100. The results are shown in Figure 10. Although increasing b does improve the update time for every method, the effect becomes much less pronounced for larger values of b, suggesting that most of the benefit is to be had for small values of b. The benefit seems strongest for adaptive group testing, which has the most to gain. Nonadaptive group testing still computes T functions per item, so eventually the benefit of larger b is insignificant compared to this fixed cost.

For nonadaptive group testing, the space must increase as b increases. We plotted this on the right in Figure 10. It can be seen that the space increases quite significantly for large values of b, as predicted. For b = 2 and T = 1, the space used is about 12 kB, while for b = 256, the space has increased to 460 kB. For T = 2 and T = 3, the space used is twice and three times this, respectively.

Fig. 10. Time and space costs of varying b.

Fig. 11. Precision and recall on real data as b and T vary.
It is important to see the effect of this tradeoff on accuracy as well. For nonadaptive group testing, the precision and recall remained the same (100% for both) as b and T were varied. For adaptive group testing, we kept the space fixed and looked at how the accuracy varied for different values of T. The results are given in Figure 11. It can be seen that there is little variation in the recall with b, but it increases slightly with T, as we would expect. For precision, the difference is more pronounced. For small values of T, increasing b to speed up processing has an immediate effect on the precision: more items which are not hot are included in the output as b increases. For larger values of T, this effect is reduced: increasing b does not affect precision by as much. Note that the transaction processing time is proportional to T/log(b), so it seems that good tradeoffs are achieved for T = 1 and b = 4, and for T = 3 and b = 8 or 16. Looking at Figure 10, we see that these points achieve similar update times, of approximately one million items per second in our experiments.

7. CONCLUSIONS

We have proposed two new methods for identifying hot items which occur more than some frequency threshold. These are the first methods which can cope with dynamic datasets, that is, the removal as well as the addition of items. They perform to a high degree of accuracy in practice, as guaranteed by our analysis of the algorithms, and are quite simple to implement. In our experimental analysis, it seemed that the approach based on nonadaptive group testing was slightly preferable to the one based on adaptive group testing, in terms of recall, precision, and time. Recently, we have taken these ideas of using group testing techniques to identify items of interest in small space, and applied them to other problems.
For example, consider finding the items which have the biggest frequency difference between two datasets. Using a similar arrangement of groups, but a different test, allows us to find such items while processing transactions at very high rates and keeping only small summaries for each dataset [Cormode and Muthukrishnan 2004b]. This is of interest in a number of scenarios, such as trend analysis, financial datasets, and anomaly detection [Yi et al. 2000]. One point of interest is that, for that scenario, it is straightforward to generalize the nonadaptive group testing approach, but the adaptive group testing approach cannot be applied so easily.

Our approach of group testing may have application to other problems, notably in designing summary data structures for the maintenance of other statistics of interest, and in data stream applications. An interesting open problem is to find combinatorial designs which can achieve the same properties as our randomly chosen groups, in order to give a fully deterministic construction for maintaining hot items. The main challenge here is to find good "decoding" methods: given the results of testing various groups, how do we determine what the hot items are? We need such methods that work quickly and in small space.

A significant problem that we have not approached here is that of continuously monitoring the hot items, that is, maintaining a list of all items that are hot, and keeping it updated as transactions are observed. A simple solution is to keep the same data structure, and to run the query procedure when needed, say once every second, or whenever n has changed by more than k. (After an item is inserted, it is easy to check whether it is now a hot item. Following deletions, other items can become hot, but the threshold of n/(k + 1) only changes when n has decreased by k + 1.) In our experiments, the cost of running queries is a matter of milliseconds, and so this is quite a cheap operation to perform.
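One way to sketch this simple monitoring policy is the following, with exact counts standing in for the data structure; the choice of "re-query once n has changed by more than k" is one of the triggers suggested above, and the function name and interface are ours.

```python
def continuous_hot_items(transactions, k):
    """Yield the current hot set after each (item, delta) transaction.

    Exact counts stand in for the sketch; a full re-query runs only when n
    has changed by more than k since the last one, and each freshly
    inserted item is checked individually against the threshold.
    """
    counts, n, last_n, hot = {}, 0, 0, set()
    for x, delta in transactions:
        counts[x] = counts.get(x, 0) + delta
        n += delta
        if delta > 0 and counts[x] > n / (k + 1):
            hot.add(x)                 # the inserted item may have become hot
        if abs(n - last_n) > k:        # threshold has moved enough: re-query
            hot = {y for y, c in counts.items() if c > n / (k + 1)}
            last_n = n
        yield set(hot)
```

Between re-queries the maintained set is only approximate (items can drift across the moving threshold), which is why a more general solution is still called for.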
In some situations this is sufficient, but a more general solution is needed for the full version of this problem.

ACKNOWLEDGMENTS

We thank the anonymous referees for many helpful suggestions.

REFERENCES

AHO, A. V., HOPCROFT, J. E., AND ULLMAN, J. D. 1987. Data Structures and Algorithms. Addison-Wesley, Reading, MA.

ALON, N., GIBBONS, P., MATIAS, Y., AND SZEGEDY, M. 1999. Tracking join and self-join sizes in limited storage. In Proceedings of the Eighteenth ACM Symposium on Principles of Database Systems. 10–20.

ALON, N., MATIAS, Y., AND SZEGEDY, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing. 20–29. Journal version in J. Comput. Syst. Sci. 58, 137–147, 1999.

BABCOCK, B. AND OLSTON, C. 2003. Distributed top-k monitoring. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

BARBARA, D., WU, N., AND JAJODIA, S. 2001. Detecting novel network intrusions using Bayes estimators. In Proceedings of the First SIAM International Conference on Data Mining.

BOYER, B. AND MOORE, J. 1982. A fast majority vote algorithm. Tech. Rep. 35. Institute for Computer Science, University of Texas at Austin, Austin, TX.

CARTER, J. L. AND WEGMAN, M. N. 1979. Universal classes of hash functions. J. Comput. Syst. Sci. 18, 2, 143–154.

CHARIKAR, M., CHEN, K., AND FARACH-COLTON, M. 2002. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). 693–703.

CORMODE, G. AND MUTHUKRISHNAN, S. 2003. What's hot and what's not: Tracking most frequent items dynamically. In Proceedings of the ACM Conference on Principles of Database Systems. 296–306.

CORMODE, G. AND MUTHUKRISHNAN, S. 2004a.
An improved data stream summary: The count-min sketch and its applications. J. Algorithms. In press.

CORMODE, G. AND MUTHUKRISHNAN, S. 2004b. What's new: Finding significant differences in network data streams. In Proceedings of IEEE Infocom.

DEMAINE, E., LÓPEZ-ORTIZ, A., AND MUNRO, J. I. 2002. Frequency estimation of Internet packet streams with limited space. In Proceedings of the 10th Annual European Symposium on Algorithms. Lecture Notes in Computer Science, vol. 2461. Springer, Berlin, Germany, 348–360.

DU, D.-Z. AND HWANG, F. 1993. Combinatorial Group Testing and Its Applications. Series on Applied Mathematics, vol. 3. World Scientific, Singapore.

ESTAN, C. AND VARGHESE, G. 2002. New directions in traffic measurement and accounting. In Proceedings of ACM SIGCOMM. Journal version in Comput. Commun. Rev. 32, 4, 323–338.

FANG, M., SHIVAKUMAR, N., GARCIA-MOLINA, H., MOTWANI, R., AND ULLMAN, J. D. 1998. Computing iceberg queries efficiently. In Proceedings of the International Conference on Very Large Data Bases. 299–310.

FISCHER, M. AND SALZBERG, S. 1982. Finding a majority among n votes: Solution to problem 81-5. J. Algorith. 3, 4, 376–379.

GAROFALAKIS, M., GEHRKE, J., AND RASTOGI, R. 2002. Querying and mining data streams: You only get one look. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

GIBBONS, P. AND MATIAS, Y. 1998. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Journal version in ACM SIGMOD Rec. 27, 331–342.

GIBBONS, P. AND MATIAS, Y. 1999. Synopsis structures for massive data sets. DIMACS Series in Discrete Mathematics and Theoretical Computer Science A.

GIBBONS, P. B., MATIAS, Y., AND POOSALA, V. 1997. Fast incremental maintenance of approximate histograms. In Proceedings of the International Conference on Very Large Data Bases. 466–475.
GILBERT, A., GUHA, S., INDYK, P., KOTIDIS, Y., MUTHUKRISHNAN, S., AND STRAUSS, M. 2002a. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the 34th ACM Symposium on the Theory of Computing. 389–398.
GILBERT, A., KOTIDIS, Y., MUTHUKRISHNAN, S., AND STRAUSS, M. 2001. QuickSAND: Quick summary and analysis of network data. DIMACS Tech. Rep. 2001-43. Available online at http://dimacs.rutgers.edu/TechnicalReports/.
GILBERT, A. C., KOTIDIS, Y., MUTHUKRISHNAN, S., AND STRAUSS, M. 2002b. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the International Conference on Very Large Data Bases. 454–465.
IOANNIDIS, Y. E. AND CHRISTODOULAKIS, S. 1993. Optimal histograms for limiting worst-case error propagation in the size of the join radius. ACM Trans. Database Syst. 18, 4, 709–748.
IOANNIDIS, Y. E. AND POOSALA, V. 1995. Balancing histogram optimality and practicality for query result size estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 233–244.
KARP, R., PAPADIMITRIOU, C., AND SHENKER, S. 2003. A simple algorithm for finding frequent elements in sets and bags. ACM Trans. Database Syst. 28, 51–55.
KUSHILEVITZ, E. AND NISAN, N. 1997. Communication Complexity. Cambridge University Press, Cambridge, U.K.
MANKU, G. AND MOTWANI, R. 2002. Approximate frequency counts over data streams. In Proceedings of the International Conference on Very Large Data Bases. 346–357.
MISRA, J. AND GRIES, D. 1982. Finding repeated elements. Sci. Comput. Programm. 2, 143–152.
MOTWANI, R. AND RAGHAVAN, P. 1995. Randomized Algorithms. Cambridge University Press, Cambridge, U.K.
MUTHUKRISHNAN, S. 2003. Data streams: Algorithms and applications. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms. Available online at http://athos.rutgers.edu/~muthu/stream-1-1.ps.
THORUP, M. 2000. Even strongly universal hashing is pretty fast. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms. 496–497.
YI, B.-K., SIDIROPOULOS, N., JOHNSON, T., JAGADISH, H., FALOUTSOS, C., AND BILIRIS, A. 2000. Online data mining for co-evolving time sequences. In Proceedings of the 16th International Conference on Data Engineering (ICDE'00). 13–22.

Received October 2003; revised June 2004; accepted September 2004

XML Stream Processing Using Tree-Edit Distance Embeddings

MINOS GAROFALAKIS
Bell Labs, Lucent Technologies
and
AMIT KUMAR
Indian Institute of Technology

We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L1 vector space while guaranteeing a (worst-case) upper bound of O(log² n · log* n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds.
To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems—Query processing; G.2.1 [Discrete Mathematics]: Combinatorics—Combinatorial algorithms
General Terms: Algorithms, Performance, Theory
Additional Key Words and Phrases: XML, data streams, data synopses, approximate query processing, tree-edit distance, metric-space embeddings

A preliminary version of this article appeared in Proceedings of the 22nd Annual ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (San Diego, CA, June) [Garofalakis and Kumar 2003].
Authors' addresses: M. Garofalakis, Bell Labs, Lucent Technologies, 600 Mountain Ave., Murray Hill, NJ 07974; email: minos@research.bell-labs.com; A. Kumar, Department of Computer Science and Engineering, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India; email: amitk@cse.iitd.ernet.in.
© 2005 ACM 0362-5915/05/0300-0279 $5.00
ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 279–332.

1. INTRODUCTION

The Extensible Markup Language (XML) is rapidly emerging as the new standard for data representation and exchange on the Internet. The simple, self-describing nature of the XML standard promises to enable a broad suite of next-generation Internet applications, ranging from intelligent Web searching and querying to electronic commerce. In many respects, XML documents are instances of semistructured data: the underlying data model comprises an ordered, labeled tree of element nodes, where each element can be either an atomic data item or a composite data collection consisting of references (represented as edges) to child elements in the XML tree. Further, labels (or tags) stored with XML data elements describe the actual semantics of the data, rather than simply specifying how elements are to be displayed (as in HTML). Thus, XML data is tree-structured and self-describing.

The flexibility of the XML data model makes it a very natural and powerful tool for representing data from a wide variety of Internet data sources. Of course, given the typical autonomy of such sources, identical or similar data instances can be represented using different XML-document tree structures. For example, different online news sources may use distinct document type descriptor (DTD) schemas to export their news stories, leading to different node labels and tree structures. Even when the same DTD is used, the resulting XML trees may not have the same structure, due to the presence of optional elements and attributes [Guha et al. 2002].

Given such structural differences and inconsistencies, correlating XML data across different sources needs to rely on approximate XML-document matching, where the approximation is quantified through an appropriate general distance metric between XML data trees. Such a metric for comparing ordered labeled trees has been developed by the combinatorial pattern matching community in the form of tree-edit distance [Apostolico and Galil 1997; Zhang and Shasha 1989].
In a nutshell, the tree-edit distance metric is the natural generalization of edit distance from the string domain; thus, the tree-edit distance between two tree structures represents the minimum number of basic edit operations (node inserts, deletes, and relabels) needed to transform one tree to the other.

Tree-edit distance is a natural metric for correlating and discovering approximate matches in XML document collections (e.g., through an appropriately defined similarity-join operation).[1] The problem becomes particularly challenging in the context of streaming XML data sources, that is, when such correlation queries must be evaluated over continuous XML data streams that arrive and need to be processed on a continuous basis, without the benefit of several passes over a static, persistent data image. Algorithms for correlating such XML data streams would need to work under very stringent constraints, typically providing (approximate) results to user queries while (a) looking at the relevant XML data only once and in a fixed order (determined by the stream-arrival pattern) and (b) using a small amount of memory (typically, logarithmic or polylogarithmic in the size of the stream) [Alon et al. 1996, 1999; Dobra et al. 2002; Gilbert et al. 2001].

[1] Specific semantics associated with XML node labels and tree-edit operations can be captured using a generalized, weighted tree-edit distance metric that associates different weights/costs with different operations. Extending the algorithms and results in this article to weighted tree-edit distance is an interesting open problem.

Fig. 1. Example DTD fragments (a) and (b) and XML document trees (c) and (d) for autonomous bibliographic Web sources.
Of course, such streaming-XML techniques are more generally applicable in the context of huge, terabyte XML databases, where performing multiple passes over the data to compute an exact result can be prohibitively expensive. In such scenarios, having single-pass, space-efficient XML query-processing algorithms that produce good-quality approximate answers offers a very viable and attractive alternative [Babcock et al. 2002; Garofalakis et al. 2002].

Example 1.1. Consider the problem of integrating XML data from two autonomous, bibliographic Web sources WS1 and WS2. One of the key issues in such data-integration scenarios is that of detecting (approximate) duplicates across the two sources [Dasu and Johnson 2003]. For autonomously managed XML sources, such duplicate-detection tasks are complicated by the fact that the sources could be using different DTD structures to describe their entries. As a simple example, Figures 1(a) and 1(b) depict the two different DTD fragments employed by WS1 and WS2 (respectively) to describe XML trees for academic publications; clearly, WS1 uses a slightly different set of tags (i.e., article instead of paper) as well as a "deeper" DTD structure (by adding the type and authors structuring elements). Figures 1(c) and 1(d) depict two example XML document trees T1 and T2 from WS1 and WS2, respectively; even though the two trees have structural differences, it is obvious that T1 and T2 represent the same publication. In fact, it is easy to see that T1 and T2 are within a tree-edit distance of 3 (i.e., one relabel and two delete operations on T1).
Approximate duplicate detection across WS1 and WS2 can be naturally expressed as a tree-edit distance similarity-join operation that returns the pairs of trees (T1, T2) ∈ WS1 × WS2 that are within a tree-edit distance of τ, where the user/application-defined similarity threshold τ is set to a value ≥ 3 to perhaps account for other possible differences in the joining tree structures (e.g., missing or misspelled coauthor names). A single-pass, space-efficient technique for approximating such similarity joins as the document trees from the two XML data sources are streaming in would provide an invaluable data-integration tool; for instance, estimates of the similarity-join result size (i.e., the number of approximate duplicate entries) can provide useful indicators of the degree of overlap (i.e., "content similarity") or coverage (i.e., "completeness") of autonomous XML data sources [Dasu and Johnson 2003; Florescu et al. 1997].

1.1 Prior Work

Techniques for data reduction and approximate query processing for both relational and XML databases have received considerable attention from the database research community in recent years [Acharya et al. 1999; Chakrabarti et al. 2000; Garofalakis and Gibbons 2001; Ioannidis and Poosala 1999; Polyzotis and Garofalakis 2002; Polyzotis et al. 2004; Vitter and Wang 1999]. The vast majority of such proposals, however, rely on the assumption of a static data set, which enables several passes over the data to construct effective data synopses (such as histograms [Ioannidis and Poosala 1999] or Haar wavelets [Chakrabarti et al. 2000; Vitter and Wang 1999]); clearly, this assumption renders such solutions inapplicable in a data-stream setting.
Massive, continuous data streams arise naturally in a variety of different application domains, including network monitoring, retail-chain and ATM transaction processing, Web-server record logging, and so on. As a result, we are witnessing a recent surge of interest in data-stream computation, which has led to several (theoretical and practical) studies proposing novel one-pass algorithms with limited memory requirements for different problems; examples include quantile and order-statistics computation [Greenwald and Khanna 2001; Gilbert et al. 2002b]; distinct-element counting [Bar-Yossef et al. 2002; Cormode et al. 2002a]; frequent itemset counting [Charikar et al. 2002; Manku and Motwani 2002]; estimating frequency moments, join sizes, and difference norms [Alon et al. 1996, 1999; Dobra et al. 2002; Feigenbaum et al. 1999; Indyk 2000]; and computing one- or multidimensional histograms or Haar wavelet decompositions [Gilbert et al. 2002a; Gilbert et al. 2001; Thaper et al. 2002]. All these articles rely on an approximate query-processing model, typically based on an appropriate underlying stream-synopsis data structure. (A different approach, explored by the Stanford STREAM project [Arasu et al. 2002], is to characterize subclasses of queries that can be computed exactly with bounded memory.) The synopses of choice for a number of the above-cited data-streaming articles are based on the key idea of pseudorandom sketches which, essentially, can be thought of as simple, randomized linear projections of the underlying data item(s) (assumed to be points in some numeric vector space). Recent work on XML-based publish/subscribe systems has dealt with XML document streams, but only in the context of simple, predicate-based filtering of individual documents [Altinel and Franklin 2000; Chan et al. 2002; Diao et al.
2003; Gupta and Suciu 2003; Lakshmanan and Parthasarathy 2002]; more recent work has also considered possible transformations of the XML documents in order to produce customized output [Diao and Franklin 2003]. Clearly, the problem of efficiently correlating XML documents across one or more input streams gives rise to a drastically different set of issues.

Guha et al. [2002] discussed several different algorithms for performing tree-edit distance joins over XML databases. Their work introduced easier-to-compute bounds on the tree-edit distance metric and other heuristics that can significantly reduce the computational cost incurred due to all-pairs tree-edit distance computations. However, Guha et al. focused solely on exact join computation, and their algorithms require multiple passes over the data; this obviously renders them inapplicable in a data-stream setting.

1.2 Our Contributions

All earlier work on correlating continuous data streams (through, e.g., join or norm computations) in small space has relied on the assumption of flat, relational data items over some appropriate numeric vector space; this is certainly the case with the sketch-based synopsis mechanism (discussed above), which has been the algorithmic tool of choice for most of these earlier research efforts. Unfortunately, this limitation renders earlier streaming results useless for directly dealing with streams of structured objects defined over a complex metric space, such as XML-document streams with a tree-edit distance metric.

In this article, we propose the first known solution to the problem of approximating (in small space) the result of correlation queries based on tree-edit distance (such as the tree-edit distance similarity joins described in Example 1.1) over continuous XML data streams.
The centerpiece of our solution is a novel algorithm for effectively (i.e., "obliviously" [Indyk 2001]) embedding streaming XML and the tree-edit distance metric into a numeric vector space equipped with the standard L1 distance norm, while guaranteeing a worst-case upper bound of O(log² n · log* n) on the distance distortion between any data trees with at most n nodes.[2] Our embedding is completely deterministic and relies on parsing an XML tree into a hierarchy of special subtrees. Our parsing makes use of a deterministic coin-tossing process recently introduced by Cormode and Muthukrishnan [2002] for embedding a variant of the string-edit distance (that, in addition to standard string edits, includes an atomic "substring move" operation) into L1; however, since we are dealing with general trees rather than flat strings, our embedding algorithm and its analysis are significantly more complex, and result in different bounds on the distance distortion.[3]

We also demonstrate how our vector-space embedding construction can be combined with earlier sketching techniques [Alon et al. 1999; Dobra et al. 2002; Indyk 2000] to obtain novel algorithms for (1) constructing a small sketch synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in tree-edit distance computations, and (2) estimating the result size of a tree-edit-distance similarity join over two streams of XML documents. Finally, we present results from an empirical study of our embedding algorithm with both synthetic and real-life XML data trees. Our experimental results offer some preliminary validation of our approach, demonstrating that the average-case behavior of our techniques over realistic data sets is much better than what our theoretical worst-case distortion bounds would predict, and revealing several interesting characteristics of our algorithms in practice. To the best of our knowledge, ours are the first algorithmic results on oblivious tree-edit distance embeddings, and on effectively correlating continuous, massive streams of XML data.

We believe that our embedding algorithm also has other important applications. For instance, exact tree-edit distance computation is typically a computationally expensive problem that can require up to O(n⁴) time (for the conventional tree-edit distance metric [Apostolico and Galil 1997; Zhang and Shasha 1989]), and is, in fact, NP-hard for the variant of tree-edit distance considered in this article (even for the simpler case of flat strings [Shapira and Storer 2002]). In contrast, our embedding scheme can be used to provide an approximate tree-edit distance (to within a guaranteed O(log² n · log* n) factor) in near-linear, that is, O(n log* n), time.

1.3 Organization

The remainder of this article is organized as follows. Section 2 presents background material on XML, tree-edit distance, and data-streaming techniques.

[2] All log's in this article denote base-2 logarithms; log* n denotes the number of log applications required to reduce n to a quantity that is ≤ 1, and is a very slowly increasing function of n.
[3] Note that other known techniques for approximating string-edit distance based on the decomposition of strings into q-grams [Ukkonen 1992; Gravano et al. 2001] only give one-sided error guarantees, essentially offering no guaranteed upper bound on the distance distortion. For instance, it is not difficult to construct examples of very distinct strings with nearly identical q-gram sets (i.e., arbitrarily large distortion). Furthermore, to the best of our knowledge, the results in Ukkonen [1992] have not been extended to the case of trees and tree-edit distance.
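The iterated logarithm log* n defined in footnote 2 can be made concrete with a few lines of code. This is a small illustrative snippet; the function name is ours, not the article's.

```python
import math

def log_star(n) -> int:
    """Iterated logarithm: the number of base-2 log applications
    needed to bring n down to a value <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```

Even for astronomically large n (e.g., n = 2^65536, where log* n = 5), log* n stays tiny, which is why the O(log² n · log* n) distortion bound is dominated by the log² n factor.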
In Section 3, we present an overview of our approach for correlating XML data streams based on tree-edit distance embeddings. Section 4 presents our embedding algorithm in detail and proves its small-time and low distance-distortion guarantees. We then discuss two important applications of our algorithm for XML stream processing, namely (1) building a sketch synopsis of a massive, streaming XML data tree (Section 5), and (2) approximating similarity joins over streams of XML documents (Section 6). We present the results of our empirical study with synthetic and real-life XML data in Section 7. Finally, Section 8 outlines our conclusions. The Appendix provides ancillary lemmas (and their proofs) for the upper bound result.

2. PRELIMINARIES

2.1 XML Data Model and Tree-Edit Distance

An XML document is essentially an ordered, labeled tree T, where each node in T represents an XML element and is characterized by a label taken from a fixed alphabet σ of string literals. Node labels capture the semantics of XML elements, and edges in T capture element nesting in the XML data. Without loss of generality, we assume that the alphabet σ captures all node labels, literals, and atomic values that can appear in an XML tree (e.g., based on the underlying DTD(s)); we also focus on the ordered, labeled tree structure of the XML data and ignore the raw character data content inside nodes with string labels (PCDATA, CDATA, etc.). We use |T| and |σ| to denote the number of nodes in T and the number of symbols in σ, respectively.

Fig. 2. Example XML tree and tree-edit operation.

Given two XML document trees T1 and T2, the tree-edit distance between T1 and T2 (denoted by d(T1, T2)) is defined as the minimum number of tree-edit operations needed to transform one tree into the other.
The standard set of tree-edit operations [Apostolico and Galil 1997; Zhang and Shasha 1989] includes (1) relabeling (i.e., changing the label) of a tree node v; (2) deleting a tree node v (and moving all of v's children under its parent); and (3) inserting a new node v under a node w and moving a contiguous subsequence of w's children (and their descendants) under the new node v. (Note that the node-insertion operation is essentially the complement of node deletion.) An example XML tree and tree-edit operation are depicted in Figure 2. In this article, we consider a variant of the tree-edit distance metric, termed tree-edit distance with subtree moves, that, in addition to the above three standard edit operations, allows a subtree to be moved under a new node in the tree in one step. We believe that subtree moves make sense as a primitive edit operation in the context of XML data: identical substructures can appear in different locations (for example, due to a slight variation of the DTD), and rearranging such substructures should probably be considered as basic an operation as node insertion or deletion. In the remainder of this article, the term tree-edit distance assumes the four primitive edit operations described above, namely, node relabelings, deletions, insertions, and subtree moves.[4]

2.2 Data Streams and Basic Pseudorandom Sketching

In a data-streaming environment, data-processing algorithms are allowed to see the incoming data records (e.g., relational tuples or XML documents) only once, as they are streaming in from (possibly) different data sources [Alon et al. 1996, 1999; Dobra et al. 2002]. Backtracking over the stream and explicit access to past data records are impossible. The data-processing algorithm is also allowed a small amount of memory, typically logarithmic or polylogarithmic in the data-stream size, in order to maintain concise synopsis data structures for the input stream(s).
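To make the edit operations concrete, here is a minimal ordered-labeled-tree model in Python with relabel, delete, and subtree move (insertion being the complement of deletion). This is our own illustrative sketch, not code from the article; the class and function names are hypothetical.

```python
class Node:
    """Ordered, labeled tree node (minimal illustrative model)."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def relabel(node, new_label):
    """Edit op (1): change a node's label."""
    node.label = new_label

def delete(parent, i):
    """Edit op (2): delete the ith child of parent, splicing the
    deleted node's children into its position (order preserved)."""
    victim = parent.children[i]
    parent.children[i:i + 1] = victim.children

def move_subtree(old_parent, i, new_parent):
    """Edit op (4): detach the ith child of old_parent and
    re-attach the whole subtree under new_parent in one step."""
    subtree = old_parent.children.pop(i)
    new_parent.children.append(subtree)
```

For instance, deleting an `authors` wrapper node promotes its `name` children to the grandparent, mirroring the kind of "flattening" step used in Example 1.1 to reconcile the two DTD structures.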
In addition to their small-space requirement, these synopses should also be easily computable in a single pass over the data and with small per-record processing time. At any point in time, the algorithm can combine the maintained collection of synopses to produce an approximate result.

[4] The problem of designing efficient (i.e., "oblivious"), guaranteed-distortion embedding schemes for the standard tree-edit distance metric remains open; of course, this is also true for the much simpler standard string-edit distance metric (i.e., without "substring moves") [Cormode and Muthukrishnan 2002].

We focus on one particular type of stream synopses, namely, pseudorandom sketches; sketches have provided effective solutions for several streaming problems, including join and multijoin processing [Alon et al. 1996, 1999; Dobra et al. 2002], norm computation [Feigenbaum et al. 1999; Indyk 2000], distinct-element counting [Cormode et al. 2002a], and histogram or Haar-wavelet construction [Gilbert et al. 2001; Thaper et al. 2002]. We describe the basics of pseudorandom sketching schemes using a simple binary-join cardinality estimation query [Alon et al. 1999]. More specifically, assume that we want to estimate Q = COUNT(R1 ⋈_A R2), that is, the cardinality of the binary equijoin of two streaming relations R1 and R2 over a (numeric) attribute (or set of attributes) A, whose values we assume (without loss of generality) to range over {1, ..., N}. (Note that, by the definition of the equijoin operator, the two join attributes have identical value domains.) Letting f_k(i) (k = 1, 2; i = 1, ..., N) denote the frequency of the ith value in R_k, it is easy to see that Q = Σ_{i=1}^{N} f_1(i) f_2(i). Clearly, estimating this join size exactly requires at least Ω(N) space, making an exact solution impractical for a data-stream setting. In their seminal work, Alon et al.
[Alon et al. 1996, 1999] proposed a randomized technique that can offer strong probabilistic guarantees on the quality of the resulting join-size estimate while using space that can be significantly smaller than N. Briefly, the key idea is to (1) build an atomic sketch X_k (essentially, a randomized linear projection) of the distribution vector for each input stream R_k (k = 1, 2) (such a sketch can be easily computed over the streaming values of R_k in only O(log N) space) and (2) use the atomic sketches X_1 and X_2 to define a random variable X_Q such that (a) X_Q is an unbiased (i.e., correct on expectation) randomized estimator for the target join size, so that E[X_Q] = Q, and (b) X_Q's variance (Var[X_Q]) can be appropriately upper-bounded to allow for probabilistic guarantees on the quality of the Q estimate. More formally, this random variable X_Q is constructed on-line from the two data streams as follows:

— Select a family of four-wise independent binary random variates {ξ_i : i = 1, ..., N}, where each ξ_i ∈ {−1, +1} and P[ξ_i = +1] = P[ξ_i = −1] = 1/2 (i.e., E[ξ_i] = 0). Informally, the four-wise independence condition means that, for any 4-tuple of ξ_i variates and for any 4-tuple of {−1, +1} values, the probability that the values of the variates coincide with those in the {−1, +1} 4-tuple is exactly 1/16 (the product of the equality probabilities for each individual ξ_i). The crucial point here is that, by employing known tools (e.g., orthogonal arrays) for the explicit construction of small sample spaces supporting four-wise independence, such families can be efficiently constructed on-line using only O(log N) space [Alon et al. 1996].

— Define X_Q = X_1 · X_2, where the atomic sketch X_k is defined simply as X_k = Σ_{i=1}^{N} f_k(i) ξ_i, for k = 1, 2.
Again, note that each X_k is a simple randomized linear projection (inner product) of the frequency vector of R_k.A with the vector of ξ_i's that can be efficiently generated from the streaming values of A as follows: start a counter with X_k = 0 and simply add ξ_i to X_k whenever the ith value of A is observed in the R_k stream.

The quality of the estimation guarantees can be improved using a standard boosting technique that maintains several independent identically distributed (iid) instantiations of the above process, and uses averaging and median-selection operators over the X_Q estimates to boost accuracy and probabilistic confidence [Alon et al. 1996]. (Independent instances can be constructed by simply selecting independent random seeds for generating the families of four-wise independent ξ_i's for each instance.) We use the term (atomic) AMS sketch to describe a randomized linear projection computed in the above-described manner over a data stream. Letting SJ_k (k = 1, 2) denote the self-join size of R_k.A (i.e., SJ_k = Σ_{i=1}^{N} f_k(i)²), the following theorem [Alon et al. 1999] shows how sketching can be applied for estimating binary-join sizes in limited space. (By standard Chernoff bounds [Motwani and Raghavan 1995], using median-selection over O(log(1/δ)) of the averages computed in Theorem 2.1 allows the confidence in the estimate to be boosted to 1 − δ, for any pre-specified δ < 1.)

THEOREM 2.1 [ALON ET AL. 1999]. Let the atomic AMS sketches X_1 and X_2 be as defined above. Then, E[X_Q] = E[X_1 X_2] = Q and Var(X_Q) ≤ 2 · SJ_1 · SJ_2. Thus, averaging the X_Q estimates over O(SJ_1 · SJ_2 / (ε² Q²)) iid instantiations of the basic scheme guarantees an estimate that lies within a relative error of at most ε from Q with constant probability > 1/2.
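The AMS construction above can be simulated in a few lines of Python. This is an illustrative sketch only (all names are ours): for brevity it draws fully independent random signs ξ_i per instantiation, which are in particular four-wise independent, rather than the O(log N)-space orthogonal-array construction the article describes.

```python
import random

def atomic_ams_sketch(stream, xi):
    """Atomic AMS sketch X_k = sum_i f_k(i) * xi[i], maintained one
    stream value at a time: add xi[v] whenever value v arrives."""
    x = 0.0
    for v in stream:
        x += xi[v]
    return x

def estimate_join_size(r1, r2, domain, copies=5000, seed=7):
    """Average X_Q = X_1 * X_2 over iid sketch instantiations.
    Each X_Q is unbiased (E[X_Q] = Q); averaging shrinks the variance."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(copies):
        # Fresh +/-1 signs for each instantiation (a new random "seed").
        xi = {v: rng.choice((-1.0, 1.0)) for v in domain}
        total += atomic_ams_sketch(r1, xi) * atomic_ams_sketch(r2, xi)
    return total / copies
```

For example, for streams r1 = [1, 1, 2, 3] and r2 = [1, 2, 2, 4] the true join size is Q = Σ_i f_1(i) f_2(i) = 2·1 + 1·2 = 4, and the averaged estimate concentrates around 4, consistent with the variance bound of Theorem 2.1 (here 2 · SJ_1 · SJ_2 = 72 per instantiation).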
It should be noted that the space-usage bounds stated in Theorem 2.1 capture the worst-case behavior of AMS-sketching-based estimation—empirical results with synthetic and real-life data sets have demonstrated that the average-case behavior of the AMS scheme is much better [Alon et al. 1999]. More recent work has led to improved AMS-sketching-based estimators with provably better space-usage guarantees (that actually match the lower bounds shown by Alon et al. [1999]) [Ganguly et al. 2004], and has demonstrated that AMS-sketching techniques can be extended to effectively handle one or more complex multijoin aggregate SQL queries over a collection of relational streams [Dobra et al. 2002, 2004].

Indyk [2000] discussed a different type of pseudorandom sketches which are, once again, defined as randomized linear projections X_k = Σ_{i=1}^{N} f_k(i)ξ_i of a streaming input frequency vector for the values in R_k, but using random variates {ξ_i} drawn from a p-stable distribution (which can again be generated in small space, i.e., O(log N) space) in the X_k computation. The class of p-stable distributions has been studied for some time (see, e.g., Nolan [2004]; Uchaikin and Zolotarev [1999])—they are known to exist for any p ∈ (0, 2], and include well-known distribution functions, for example, the Cauchy distribution (for p = 1) and the Gaussian distribution (for p = 2). As the following theorem demonstrates, such p-stable sketches can provide accurate probabilistic estimates for the L_p-difference norm of streaming frequency vectors in small space, for any p ∈ (0, 2].

THEOREM 2.2 [INDYK 2000]. Let p ∈ (0, 2], and define the p-stable sketch for the R_k stream as X_k = Σ_{i=1}^{N} f_k(i)ξ_i, where the {ξ_i} variates are drawn from a p-stable distribution (k = 1, 2). Assume that we have built l = O(log(1/δ)/ε²) iid pairs of p-stable sketches {X_1^j, X_2^j} (j = 1, ..., l), and define X = median{|X_1^1 − X_2^1|, ..., |X_1^l − X_2^l|}. Then, X lies within a relative error of at most ε of the L_p-difference norm ||f_1 − f_2||_p = [Σ_i |f_1(i) − f_2(i)|^p]^{1/p} with probability ≥ 1 − δ.

288 • M. Garofalakis and A. Kumar

More recently, Cormode et al. [2002a] have also shown that, with small values of p (i.e., p → 0), p-stable sketches can provide very effective estimates for the Hamming (i.e., L_0) norm (or, the number of distinct values) over continuous streams of updates.

3. OUR APPROACH: AN OVERVIEW

The key element of our methodology for correlating continuous XML data streams is a novel algorithm that embeds ordered, labeled trees and the tree-edit distance metric as points in a (numeric) multidimensional vector space equipped with the standard L_1 vector distance, while guaranteeing a small distortion of the distance metric. In other words, our techniques rely on mapping each XML tree T to a numeric vector V(T) such that the tree-edit distances between the original trees are well-approximated by the L_1 vector distances of the tree images under the mapping; that is, for any two XML trees S and T, the L_1 distance ||V(S) − V(T)||_1 = Σ_j |V(S)[j] − V(T)[j]| gives a good approximation of the tree-edit distance d(S, T).

Besides guaranteeing a small bound on the distance distortion, to be applicable in a data-stream setting, such an embedding algorithm needs to satisfy two additional requirements: (1) the embedding should require small space and time per data tree in the stream; and (2) the embedding should be oblivious, that is, the vector image V(T) of a tree T cannot depend on other trees in the input stream(s) (since we cannot explicitly store or backtrack to past stream items). Our embedding algorithm satisfies all these requirements.

There is an extensive literature on low-distortion embeddings of metric spaces into normed vector spaces; for an excellent survey of the results in this area, please see the recent article by Indyk [2001].
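Before turning to the embedding itself, the sketching side of the pipeline (Theorem 2.2) can be made concrete. The Python fragment below is a minimal, illustrative p = 1 (Cauchy) estimator of the L_1-difference norm; the function names are ours, and the ξ_i are materialized over the value domain for clarity, whereas a true streaming implementation would generate them pseudorandomly in O(log N) space. For p = 1 the median of |Cauchy| equals 1, so the sample median needs no extra scaling.

```python
import math
import random

def cauchy(rng):
    # Standard Cauchy variate: 1-stable, so sum_i c_i * xi_i is distributed
    # as ||c||_1 times a single Cauchy variate.
    return math.tan(math.pi * (rng.random() - 0.5))

def stable_sketch(freq, xis):
    # X = sum_i f(i) * xi_i  (a randomized linear projection of f).
    return sum(f * xis[i] for i, f in freq.items())

def estimate_l1_difference(freq1, freq2, l=401, seed=0):
    """median{ |X1^j - X2^j| : j = 1..l }, as in Theorem 2.2 with p = 1;
    the median of |Cauchy| is 1, so no further scaling is needed."""
    domain = set(freq1) | set(freq2)
    diffs = []
    for j in range(l):
        rng = random.Random(seed * 100003 + j)  # fresh xi family per pair
        xis = {i: cauchy(rng) for i in domain}
        diffs.append(abs(stable_sketch(freq1, xis) - stable_sketch(freq2, xis)))
    diffs.sort()
    return diffs[l // 2]
```

The same skeleton is what Section 5 later combines with our tree embedding: the frequency vectors there are the (sparse) vector images of trees.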
A key result in this area is Bourgain's lemma, proving that an arbitrary finite metric space is embeddable in an L_2 vector space with logarithmic distortion; unfortunately, Bourgain's technique is neither small-space nor oblivious (i.e., it requires knowledge of the complete metric space), so there is no obvious way to apply it in a data-stream setting [Indyk 2001]. To the best of our knowledge, our algorithm gives the first oblivious, small space/time vector-space embedding for a complex tree-edit distance metric.

Given our algorithm for approximately embedding streaming XML trees and tree-edit distance in an L_1 vector space, known streaming techniques (like the sketching methods discussed in Section 2.2) now become relevant. In this article, we focus on two important applications of our results in the context of streaming XML, and propose novel algorithms for (1) building a small sketch synopsis of a massive, streaming XML data tree, and (2) approximating the size of a similarity join over XML streams. Once again, these are the first results on correlating (in small space) massive XML data streams based on the tree-edit distance metric.

3.1 Technical Roadmap

The development of the technical material in this article is organized as follows. Section 4 describes our embedding algorithm for the tree-edit distance metric (termed TREEEMBED) in detail. In a nutshell, TREEEMBED constructs a hierarchical parsing of an input XML tree by iteratively contracting edges to produce successively smaller trees; our parsing makes repeated use of a recently proposed label-grouping procedure [Cormode and Muthukrishnan 2002] for contracting chains and leaf siblings in the tree. The bulk of Section 4 is devoted to proving the small-time and low distance-distortion guarantees of our TREEEMBED algorithm (Theorem 4.2).
Then, in Section 5, we demonstrate how our embedding algorithm can be combined with the 1-stable sketching technique of Indyk [2000] to build a small sketch synopsis of a massive, streaming XML tree that can be used as a concise surrogate for the tree in approximate tree-edit distance computations. Most importantly, we show that the properties of our embedding allow us to parse the tree and build this sketch in small space and in one pass, as the nodes of the tree are streaming by, without ever backtracking on the data (Theorem 5.1). Finally, Section 6 shows how to combine our embedding algorithm with both 1-stable and AMS sketching in order to estimate (in limited space) the result size of an approximate tree-edit-distance similarity join over two continuous streams of XML documents (Theorem 6.1).

4. OUR TREE-EDIT DISTANCE EMBEDDING ALGORITHM

4.1 Definitions and Overview

In this section, we describe our embedding algorithm for the tree-edit distance metric (termed TREEEMBED) in detail, and prove its small-time and low distance-distortion guarantees. We start by introducing some necessary definitions and notational conventions.

Consider an ordered, labeled tree T over alphabet σ, and let n = |T|. Also, let v be a node in T, and let s denote a contiguous subsequence of children of node v in T. If the nodes in s are all leaves, then we refer to s as a contiguous leaf-child subsequence of v. (A leaf child of v that is not adjacent to any other leaf child of v is called a lone leaf child of v.) We use T[v, s] to denote the subtree of T obtained as the union of all subtrees rooted at nodes in s and node v itself, retaining all node labels. We also use the notation T◦[v, s] to denote exactly the same subtree as T[v, s], except that we do not associate any label with the root node v of the subtree. We define a valid subtree of T as any subtree of the form T[v, s], T◦[v, s], or a path of degree-2 nodes (i.e., a chain) possibly ending in a leaf node in T.
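To make the T[v, s] notation concrete, here is a small, hypothetical Python representation of ordered, labeled trees, together with the extraction of T[v, s] and its root-unlabeled variant. The Node class and function names are ours, not part of the paper, and the extracted subtree shares node objects with the original (no deep copy).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """An ordered, labeled tree node; the order of `children` matters."""
    label: object
    children: list = field(default_factory=list)

def subtree(v, start, length, label_root=True):
    """T[v, s] for the contiguous child subsequence
    s = v.children[start : start + length]: the union of the subtrees
    rooted at the nodes of s, plus node v itself. With label_root=False,
    the root carries no label (the root-unlabeled variant in the text)."""
    s = v.children[start:start + length]
    return Node(v.label if label_root else None, s)
```

For example, with a root 'a' whose children are 'b', 'c' (with child 'd'), and 'e', taking s = ('c', 'e') yields the subtree rooted at 'a' with exactly those two child subtrees.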
At a high level, our TREEEMBED algorithm produces a hierarchical parsing of T into a multiset T(T) of special valid subtrees by stepping through a number of edge-contraction phases producing successively smaller trees. A key component of our solution (discussed later in this section) is the recently proposed deterministic coin tossing procedure of Cormode and Muthukrishnan [2002] for grouping symbols in a string—TREEEMBED employs that procedure repeatedly during each contraction phase to merge tree nodes in a chain as well as sibling leaf nodes. The vector image V(T) of T is essentially the "characteristic vector" for the multiset T(T) (over the space of all possible valid subtrees). Our analysis shows that the number of edge-contraction phases in T's parsing is O(log n), and that, even though the dimensionality of V(T) is, in general, exponential in n, our construction guarantees that V(T) is also very sparse: the total number of nonzero components in V(T) is only O(n). Furthermore, we demonstrate that our TREEEMBED algorithm runs in near-linear, that is, O(n log* n), time. Finally, we prove the upper and lower bounds on the distance distortion guaranteed by our embedding scheme.

4.2 The Cormode-Muthukrishnan Grouping Procedure

Clearly, the technical crux lies in the details of our hierarchical parsing process for T that produces the valid-subtree multiset T(T). A basic element of our solution is the string-processing subroutine presented by Cormode and Muthukrishnan [2002] that uses deterministic coin tossing to find landmarks in an input string S, which are then used to split S into groups of two or three consecutive symbols.
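The deterministic-coin-tossing primitive underlying such landmark-based grouping is the classic Cole-Vishkin alphabet reduction. The one-round sketch below is our own illustrative code, not the actual CM-Group procedure; it shows the key locality property: each position's new, exponentially smaller label depends only on that position and its left neighbor, so a distant edit cannot change it. It assumes adjacent input labels are distinct (runs of equal symbols would be collapsed beforehand).

```python
def reduce_labels(labels):
    """One round of Cole-Vishkin deterministic coin tossing. Position i's
    new label encodes the index of the lowest bit in which labels[i]
    differs from labels[i - 1], together with labels[i]'s value at that
    bit; adjacent new labels are again guaranteed to be distinct.
    Precondition: adjacent input labels are distinct."""
    out = [labels[0] & 1]  # boundary convention for the first position
    for prev, cur in zip(labels, labels[1:]):
        diff = prev ^ cur
        bit = (diff & -diff).bit_length() - 1  # lowest differing bit index
        out.append(2 * bit + ((cur >> bit) & 1))
    return out
```

Iterating such rounds O(log* k) times shrinks the labels to a constant-size range, from which landmarks can then be chosen by purely local rules; this is consistent with the locality radius summarized in Theorem 4.1.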
A landmark is essentially a symbol y (say, at location j) of the input string S with the following key property: if S is transformed into S′ by an edit operation (say, a symbol insertion) at a location l far away from j (i.e., |l − j| ≫ 1), then the Cormode-Muthukrishnan string-processing algorithm ensures that y is still designated as a landmark in S′. Due to space constraints, we do not give the details of their elegant landmark-based grouping technique (termed CM-Group in the remainder of this article) in our discussion—they can be found in Cormode and Muthukrishnan [2002]. Here, we only summarize a couple of the key properties of CM-Group that are required for the analysis of our embedding scheme in the following theorem.

THEOREM 4.1 [CORMODE AND MUTHUKRISHNAN 2002]. Given a string of length k, the CM-Group procedure runs in time O(k log* k). Furthermore, the closest landmark to any symbol x in the string is determined by at most log* k + 5 consecutive symbols to the left of x, and at most five consecutive symbols to the right of x.

Intuitively, Theorem 4.1 states that, for any given symbol x in a string of length k, the group of (two or three) consecutive symbols chosen (by CM-Group) to include x depends only on the symbols lying in a radius of at most log* k + 5 to the left and right of x. Thus, a string-edit operation occurring outside this local neighborhood of symbol x is guaranteed not to affect the group formed containing x. As we will see, this property of the CM-Group procedure is crucial in proving the distance-distortion bounds for our TREEEMBED algorithm. Similarly, the O(k log* k) complexity of CM-Group plays an important role in determining the running time of TREEEMBED.

4.3 The TREEEMBED Algorithm

As mentioned earlier, our TREEEMBED algorithm constructs a hierarchical parsing of T in several phases.
In phase i, the algorithm builds an ordered, labeled tree T^i that is obtained from the tree T^{i−1} of the previous phase by contracting certain edges. (The initial tree T^0 is exactly the original input tree T.) Thus, each node v ∈ T^i corresponds to a connected subtree of T—in fact, by construction, our TREEEMBED algorithm guarantees that this subtree will be a valid subtree of T. Let v(T) denote the valid subtree of T corresponding to node v ∈ T^i. Determining the node label for v uses a hash function h() that maps the set of all valid subtrees of T to new labels in a one-to-one fashion with high probability; thus, the label of v ∈ T^i is defined as the hash-function value h(v(T)). As we demonstrate in Section 7.1, such a valid-subtree-naming function can be computed in small space/time using an adaptation of the Karp-Rabin string fingerprinting algorithm [Karp and Rabin 1987]. Note that the existence of such an efficient naming function is crucial in guaranteeing the small space/time properties of our embedding algorithm, since maintaining the exact valid subtrees v(T) is infeasible; for example, near the end of our parsing, such subtrees are of size O(|T|).^5

The pseudocode description of our TREEEMBED embedding algorithm is depicted in Figure 3. As described above, our algorithm builds a hierarchical parsing structure (i.e., a hierarchy of contracted trees T^i) over the input tree T, until the tree is contracted to a single node (|T^i| = 1). The multiset T(T) of valid subtrees produced by our parsing for T contains all valid subtrees corresponding to all nodes of the final hierarchical parsing structure, tagged with a phase label to distinguish between subtrees in different phases; that is, T(T) comprises all <v(T), i> for all nodes v ∈ T^i over all phases i (Step 18).
Finally, we define the L_1 vector image V(T) of T to be the "characteristic vector" of the multiset T(T); in other words, V(T)[<t, i>] := number of times the <t, i> subtree-phase combination appears in T(T). (We use the notation V_i(T) to denote the restriction of V(T) to only subtrees occurring at phase i.) A small example execution of the hierarchical tree parsing in our embedding algorithm is depicted pictorially in Figure 4. The L_1 distance between the vector images of two trees S and T is defined in the standard manner, that is, ||V(T) − V(S)||_1 = Σ_{x ∈ T(T) ∪ T(S)} |V(T)[x] − V(S)[x]|. In the remainder of this section, we prove our main theorem on the near-linear time complexity of our L_1 embedding algorithm and the logarithmic distortion bounds that our embedding guarantees for the tree-edit distance metric.

^5 An implicit assumption made in our running-time analysis of TREEEMBED (which is also present in the complexity analysis of CM-Group in Cormode and Muthukrishnan [2002]—see Theorem 4.1) is that the fingerprints produced by the naming function h() fit in a single memory word and, thus, can be manipulated in constant (i.e., O(1)) time. If that is not the case, then an additional multiplicative factor of O(log |T|) must be included in the running-time complexity to account for the length of such fingerprints (see Section 7.1).

Fig. 3. Our tree-embedding algorithm.

Fig. 4. Example of hierarchical tree parsing.

THEOREM 4.2. The TREEEMBED algorithm constructs the vector image V(T) of an input tree T in time O(|T| log* |T|); further, the vector V(T) contains at most O(|T|) nonzero components. Finally, given two trees S and T with n = max{|S|, |T|}, we have d(S, T) ≤ 5 · ||V(T) − V(S)||_1, and ||V(T) − V(S)||_1 = O(log² n log* n) · d(S, T).
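Since V(T) has only O(|T|) nonzero components, it can be represented as a sparse map from <fingerprint, phase> pairs to multiplicities, with the L_1 distance computed over the union of nonzero coordinates. The following is a minimal sketch of that bookkeeping, assuming the subtree fingerprints (e.g., produced by the naming function h() of Section 7.1) are already available; the function names are ours.

```python
from collections import Counter

def characteristic_vector(tagged_subtrees):
    """Sparse V(T): maps each <fingerprint, phase> pair to the number of
    times it occurs in the multiset T(T) emitted by the parsing."""
    return Counter(tagged_subtrees)

def l1_distance(v_s, v_t):
    # ||V(S) - V(T)||_1, summed over the union of nonzero coordinates
    # (a missing key in a Counter counts as 0).
    return sum(abs(v_s[x] - v_t[x]) for x in set(v_s) | set(v_t))
```

This is exactly the quantity that the sketching machinery of Section 2.2 later estimates without ever materializing the vectors.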
It is important to note here that, for certain special cases (i.e., when T is a simple chain or a "star"), our TREEEMBED algorithm essentially degrades to the string-edit distance embedding algorithm of Cormode and Muthukrishnan [2002]. This, of course, implies that, for such special cases, their even tighter O(log n log* n) bounds on the worst-case distance distortion are applicable. As another special-case example, Figure 5 depicts the initial steps in the parsing of a full binary tree T; note that, after two contraction phases, our parsing essentially reduces a full binary tree of depth h to one of depth h − 1 (thus decreasing the size of the tree by a factor of about 1/2).

Fig. 5. Example of parsing steps for the special case of a full binary tree.

As a first step in the proof of Theorem 4.2, we demonstrate the following lemma, which bounds the number of parsing phases. The key here is to show that the number of tree nodes goes down by a constant factor during each contraction phase of our embedding algorithm (Steps 3–16).

LEMMA 4.3. The number of phases for our TREEEMBED algorithm on an input tree T is O(log |T|).

PROOF. We partition the node set of T into several subsets as follows. First, define A(T) = {v ∈ T : v is a nonroot node of degree 2 (i.e., with only one child), or v is a leaf child of a nonroot node of degree 2}, and B(T) = {v ∈ T : v is a node of degree ≥ 3 (i.e., with at least two children), or v is the root node of T}. Clearly, A(T) ∪ B(T) contains all internal (i.e., nonleaf) nodes of T; in particular, A(T) contains all nodes appearing in (degree-2) chains in T (including potential leaf nodes at the end of such chains). Thus, the set of remaining nodes of T, say L(T), comprises only leaf nodes of T which have at least one sibling or are children of the root.
Let v be a leaf child of some node u, and let s_v denote the maximal contiguous set of leaf children of u which contains v. We further partition the leftover set of leaf nodes L(T) as follows: L1(T) = {v ∈ L(T) : |s_v| ≥ 2}, L2(T) = {v ∈ L(T) : |s_v| = 1 and v is the leftmost such child of its parent}, and L3(T) = L(T) − L1(T) − L2(T) = {v ∈ L(T) : |s_v| = 1 and v is not the leftmost such child of its parent}. For notational convenience, we also use A(T) to denote the set cardinality |A(T)|, and similarly for the other sets. We first prove the following ancillary claim.

CLAIM 4.4. For any rooted tree T with at least two nodes, L3(T) ≤ L2(T) + A(T)/2 − 1.

PROOF. We prove this claim by induction on the number of nodes in T. Suppose T has only two nodes. Then, clearly, L3(T) = 0, L2(T) = 1, and A(T) = 0. Thus, the claim is true for the base case.

Suppose the claim is true for all rooted trees with less than n nodes. Let T have n nodes and let r be the root of T. First, consider the case when r has only one child node (say, s), and let T′ be the subtree rooted at s. By induction, L3(T′) ≤ L2(T′) + A(T′)/2 − 1. Clearly, L3(T) = L3(T′). Is L2(T) equal to L2(T′)? It is not hard to see that the only case when a node u can occur in L2(T′) but not in L2(T) is when s has only one child, u, which also happens to be a leaf. In this case, obviously, u ∈ L2(T′) (since it is the sole leaf child of the root of T′), whereas in T, u is the end-leaf of a chain, so it is counted in A(T) and, thus, u ∉ L2(T). On the other hand, it is easy to see that both s and u are in A(T) − A(T′) in this case, so that L2(T′) + A(T′)/2 = L2(T) + A(T)/2. Thus, the claim is true in this case as well.

Now, consider the case when the root node r of T has at least two children.
We construct several smaller subtrees, each of which is rooted at r (but contains only a subset of r's descendants). Let u_1, ..., u_k be the leaf children of r such that s_{u_i} = {u_i} (i.e., they have no leaf siblings); thus, by definition, u_1 ∈ L2(T), whereas u_i ∈ L3(T) for all i = 2, ..., k. We define the subtrees T_1, ..., T_{k+1} as follows. For each i = 1, ..., k + 1, T_i is the set of all descendants of r (including r itself) that lie to the right of leaf u_{i−1} and to the left of leaf u_i (as special cases, T_1 is the subtree to the left of u_1 and T_{k+1} is the subtree to the right of u_k). Note that T_1 and T_{k+1} may not contain any nodes (other than the root node r), but, by the definition of the u_i's, all other T_i subtrees are guaranteed to contain at least one node other than r. Now, by induction, we have that L3(T_i) ≤ L2(T_i) + A(T_i)/2 − 1 for all subtrees T_i, except perhaps for T_1 and T_{k+1} (if they only comprise a sole root node, in which case, of course, the L2, L3, and A subsets above are all empty). Adding all these inequalities, we have

Σ_i L3(T_i) ≤ Σ_i L2(T_i) + Σ_i A(T_i)/2 − (k − 1),   (1)

where we only have k − 1 on the right-hand side since T_1 and T_{k+1} may not contribute a −1 to this summation.

Now, it is easy to see that, if u ∈ A(T_i), then u ∈ A(T) as well; thus, A(T) = Σ_i A(T_i). Suppose u ∈ L2(T_i), and let w denote the parent of u. Note that w cannot be the root node r of T_i. Indeed, suppose that w = r; then, since u ∉ {u_1, ..., u_k}, s_u contains a leaf node other than u which is also not in T_i (since u ∈ L2(T_i)). But then, it must be the case that u is adjacent to one of the leaves u_1, ..., u_k, which is impossible; thus, w ≠ r, which, of course, implies that u ∈ L2(T) as well. Conversely, suppose that u ∈ L2(T); then, either u = u_1 or the parent of u is in one of the subtrees T_i. In the latter case, u ∈ L2(T_i). Thus, L2(T) = Σ_i L2(T_i) + 1.
Finally, we can argue in a similar manner that, for each i = 1, ..., k + 1, L3(T_i) ⊂ L3(T). Furthermore, if u ∈ L3(T), then either u ∈ {u_2, ..., u_k} or u ∈ L3(T_i) for some i. Thus, L3(T) = Σ_i L3(T_i) + k − 1. Putting everything together, we have

L3(T) = Σ_i L3(T_i) + k − 1
      ≤ Σ_i L2(T_i) + Σ_i A(T_i)/2   (by Inequality (1))
      = L2(T) + A(T)/2 − 1.

This completes the inductive proof argument.

With Claim 4.4 in place, we now proceed to show that the number of nodes in the tree goes down by a constant factor after each contraction phase of our parsing. Recall that T^i is the tree at the beginning of the (i + 1)th phase, and let L′(T^{i+1}) ⊆ L(T^{i+1}) denote the subset of leaf nodes in L(T^{i+1}) that are created by contracting a chain in T^i. We claim that

B(T^{i+1}) ≤ B(T^i)  and  B(T^{i+1}) + A(T^{i+1}) + L′(T^{i+1}) ≤ B(T^i) + A(T^i)/2.   (2)

Indeed, it is easy to see that all nodes with degree at least three (i.e., with at least two children) in T^{i+1} must have had degree at least three in T^i as well; this obviously proves the first inequality. Furthermore, note that any node in B(T^{i+1}) corresponds to a unique node in B(T^i). Now, consider a node u in A(T^{i+1}) ∪ L′(T^{i+1}). There are two possible cases depending on how node u is formed. In the first case, u is formed by collapsing some degree-2 (i.e., chain) nodes (and, possibly, a chain-terminating leaf) in A(T^i)—then, by virtue of the CM-Group procedure, u corresponds to at least two distinct nodes of A(T^i). In the second case, there is a node w ∈ B(T^i) and a leaf child of w that is collapsed into w to get u—then, u corresponds to a unique node of B(T^i). The second inequality follows easily from the above discussion.

During the (i + 1)th contraction phase, the number of leaves in L1(T^i) is clearly reduced by at least one-half (again, due to the properties of CM-Group). Furthermore, note that all leaves in L2(T^i) are merged into their parent nodes and, thus, disappear. Now, the leaves in L3(T^i) do not change; so, we need to bound the size of this leaf-node set. By Claim 4.4, we have that L3(T^i) ≤ L2(T^i) + A(T^i)/2—adding 2 · L3(T^i) to both sides and multiplying across by 1/3, this inequality gives

L3(T^i) ≤ L2(T^i)/3 + (2/3) · L3(T^i) + A(T^i)/6.

Thus, the number of leaf nodes in L(T^{i+1}) − L′(T^{i+1}) can be upper-bounded as follows:

L(T^{i+1}) − L′(T^{i+1}) ≤ L1(T^i)/2 + L2(T^i)/3 + (2/3) · L3(T^i) + A(T^i)/6 ≤ (2/3) · L(T^i) + A(T^i)/6.

Combined with Inequality (2), this implies that the total number of nodes in T^{i+1} is

A(T^{i+1}) + B(T^{i+1}) + L(T^{i+1}) ≤ A(T^i)/2 + B(T^i) + (2/3) · L(T^i) + A(T^i)/6 ≤ B(T^i) + (2/3) · (A(T^i) + L(T^i)).

Now, observe that B(T^i) ≤ A(T^i) + L(T^i) (the number of nodes of degree more than two is at most the number of leaves in any tree)—the above inequality then gives

A(T^{i+1}) + B(T^{i+1}) + L(T^{i+1}) ≤ (5/6) · B(T^i) + (2/3) · (A(T^i) + L(T^i)) + (1/6) · B(T^i) ≤ (5/6) · (A(T^i) + B(T^i) + L(T^i)).

Thus, when going from tree T^i to T^{i+1}, the number of nodes goes down by a constant factor ≤ 5/6. This obviously implies that the number of parsing phases for our TREEEMBED algorithm is O(log |T|), and completes the proof.

The proof of Lemma 4.3 immediately implies that the total number of nodes in the entire hierarchical parsing structure for T is only O(|T|). Thus, the vector image V(T) built by our algorithm is a very sparse vector. To see this, note that the number of all possible ordered, labeled trees of size at most n that can be built using the label alphabet σ is O((4|σ|)^n) (see, e.g., Knuth [1973]); thus, by Lemma 4.3, the dimensionality needed for our vector image V() to capture input trees of size n is O((4|σ|)^n log n).
However, for a given tree T, only O(|T|) of these dimensions can contain nonzero counts. Lemma 4.3, in conjunction with the fact that the CM-Group procedure runs in time O(k log* k) for a string of size k (Theorem 4.1), also implies that our TREEEMBED algorithm runs in O(|T| log* |T|) time on input T. The following two subsections establish the distance-distortion bounds stated in Theorem 4.2.

An immediate implication of the above results is that we can use our embedding algorithm to compute the approximate (to within a guaranteed O(log² n log* n) factor) tree-edit distance between T and S in O(n log* n) (i.e., near-linear) time. The time complexity of exact tree-edit distance computation is significantly higher: conventional tree-edit distance (without subtree moves) is solvable in O(|T||S| d_T d_S) time (where d_T (d_S) is the depth of T (respectively, S)) [Apostolico and Galil 1997; Zhang and Shasha 1989], whereas in the presence of subtree moves the problem becomes NP-hard even for the simple case of flat strings [Shapira and Storer 2002].

4.4 Upper-Bound Proof

Suppose we are given a tree T with n nodes, and let ℓ denote the quantity log* n + 5. As a first step in our proof, we demonstrate that showing the upper-bound result in Theorem 4.2 can actually be reduced to a simpler problem, namely, that of bounding the L_1 distance between the vector image of T and the vector image of a 2-tree forest created when removing a valid subtree from T. More formally, consider a (valid) subtree of T of the form T◦[v, s] for some contiguous subset of children s of v (recall that the root of T◦[v, s] has no label). Let us delete T◦[v, s] from T, and let T_2 denote the resulting subtree; furthermore, let T_1 denote the deleted subtree T◦[v, s].
Thus, we have broken T into a 2-tree forest comprising T_1 = T◦[v, s] and T_2 = T − T_1 (see the leftmost portion of Figure 8 for an example). We now compare the following two vectors. The first vector, V(T), is obtained by applying our TREEEMBED parsing procedure to T. For the second vector, we apply TREEEMBED to each of the trees T_1 and T_2 individually, and then add the corresponding vectors V(T_1) and V(T_2) component-wise—call this vector V(T_1 + T_2) = V(T_1) + V(T_2). (Throughout this section, we use (T_1 + T_2) to denote the 2-tree forest composed of T_1 and T_2.) Our goal is to prove the following theorem.

THEOREM 4.5. The L_1 distance between the vectors V(T) and V(T_1 + T_2) is at most O(log² n log* n).

Let us first see how this result directly implies the upper bound stated in Theorem 4.2.

PROOF OF THE UPPER BOUND IN THEOREM 4.2. It is sufficient to consider the case when the tree-edit distance between S and T is 1 and show that, in this case, the L_1 distance between V(S) and V(T) is ≤ O(log² n log* n). First, assume that T is obtained from S by deleting a leaf node v. Let the parent of v be w. Define s = {v}, and delete S◦[w, s] from S. This splits S into T and S◦[w, s]—call this S_1. Theorem 4.5 then implies that ||V(S) − V(T + S_1)||_1 = ||V(S) − (V(T) + V(S_1))||_1 ≤ O(log² n log* n). But, it is easy to see that the vector V(S_1) has only three nonzero components, all equal to 1; this is since S_1 is basically a 2-node tree that is reduced to a single node after one contraction phase of TREEEMBED. Thus, ||V(S_1)||_1 = ||(V(T) + V(S_1)) − V(T)||_1 ≤ 3. Then, a simple application of the triangle inequality for the L_1 norm gives ||V(S) − V(T)||_1 ≤ O(log² n log* n). Note that, since insertion of a leaf node is the inverse of a leaf-node deletion, the same holds for this case as well.

Now, let v be a node in S and s be a contiguous set of children of v.
Suppose T is obtained from S by moving the subtree S◦[v, s], that is, deleting this subtree and making it a child of another node x in S.^6 Let S_1 denote S◦[v, s], and let S_2 denote the tree obtained by deleting S_1 from S. Theorem 4.5 implies that ||V(S) − V(S_1 + S_2)||_1 ≤ O(log² n log* n). Note, however, that we can also picture (S_1 + S_2) as the forest obtained by deleting S_1 from T. Thus, ||V(T) − V(S_1 + S_2)||_1 is also ≤ O(log² n log* n). Once again, the triangle inequality for L_1 easily implies the result.

Finally, suppose we delete a nonleaf node v from S. Let the parent of v be w. All children of v now become children of w. We can think of this process as follows. Let s be the children of v. First, we move S◦[v, s] and make it a child of w. At this point, v is a leaf node, so we are just deleting a leaf node now. Thus, the result for this case follows easily from the arguments above for deleting a leaf node and moving a subtree.

^6 This is a slightly "generalized" subtree move, since it allows for a contiguous (sub)sequence of sibling subtrees to be moved in one step. However, it is easy to see that it can be simulated with only three simpler edit operations, namely, a node insertion, a single-subtree move, and a node deletion. Thus, our results trivially carry over to the case of "single-subtree move" edit operations.

As a consequence, it is sufficient to prove Theorem 4.5. Our proof proceeds along the following lines. We define an influence region for each tree T^i in our hierarchical parsing (i = 0, ..., O(log n))—the intuition here is that the influence region for T^i captures the complete set of nodes in T^i whose parsing could have been affected by the change (i.e., the splitting of T into (T_1 + T_2)). Initially (i.e., in tree T^0), this region is just the node v at which we deleted the T_1 subtree.
But, obviously, this region grows as we proceed to subsequent phases of our parsing. We then argue that, if we ignore this influence region in T^i and the corresponding region in the parsing of the (T_1 + T_2) forest, then the resulting sets of valid subtrees look very similar (in any phase i). Thus, if we can bound the rate at which this influence region grows during our hierarchical parsing, we can also bound the L_1 distance between the two resulting characteristic vectors. The key intuition behind bounding the size of the influence region is as follows: when we effect a change at some node v of T, nodes far away from v in the tree remain unaffected, in the sense that the subtree in which such nodes are grouped during the next phase of our hierarchical parsing remains unchanged. As we will see, this fact hinges on the properties of the CM-Group procedure used for grouping nodes during each phase of TREEEMBED (Theorem 4.1).

The discussion of our proof in the remainder of this section is structured as follows. First, we formally define influence regions, giving the set of rules for "growing" such regions of nodes across consecutive phases of our parsing. Second, we demonstrate that, for any parsing phase i, if we ignore the influence regions in the current (i.e., phase-(i + 1)) trees produced by TREEEMBED on input T and (T_1 + T_2), then we can find a one-to-one, onto mapping between the nodes in the remaining portions of the current T and (T_1 + T_2) that pairs up identical valid subtrees. Third, we bound the size of the influence region during each phase of our parsing. Finally, we show that the upper bound on the L_1 distance of V(T) and V(T_1 + T_2) follows as a direct consequence of the above facts.

We now proceed with the proof of Theorem 4.5. Define (T_1 + T_2)^i as the 2-tree forest corresponding to (T_1 + T_2) at the beginning of the (i + 1)th parsing phase.
We say that a node x ∈ T^{i+1} contains a node x′ ∈ T^i if the set of nodes in T^i which are merged to form x contains x′. As earlier, any node w in T^i corresponds to a valid subtree w(T) of T; furthermore, it is easy to see that if w and w′ are two distinct nodes of T^i, then the w(T) and w′(T) subtrees are disjoint. (The same obviously holds for the parsing of each of T_1, T_2.)

For each tree T^i, we mark certain nodes; intuitively, this node-marking defines the influence region of T^i mentioned above. Let M^i be the set of marked nodes (i.e., the influence region) in T^i (see Figure 6(a) for an example). The generic structure of the influence region M^i satisfies the following: (1) M^i is a connected subtree of T^i that always contains the node v (at which the T_1 subtree was removed), that is, the node in T^i which contains v (denoted by v^i) is always in M^i; (2) there is a center node c^i ∈ M^i, and M^i may contain some ancestor nodes of c^i—but all such ancestors (except perhaps for c^i itself) must be of degree 2 only, and should form a connected path; and (3) M^i may also contain some descendants of the center node c^i. Finally, certain (unmarked) nodes in T^i − M^i are identified as corner nodes—intuitively, these are nodes whose parsing will be affected when they are shrunk down to a leaf node.

Fig. 6. (a) The subtree induced by the bold edges corresponds to the nodes in M^i. (b) Node z becomes the center of N^i.

Once again, the key idea is that the influence region M^i captures the set of those nodes in T^i whose parsing in TREEEMBED may have been affected by the change we made at node v. Now, in the next phase, the changes in M^i can potentially affect some more nodes.
Thus, we now try to determine which nodes M^i can affect; that is, assuming the change at v has influenced all nodes in M^i, which are the nodes in T^i whose parsing (during phase (i + 1)) can change as a result of this. To capture this newly affected set of nodes, we define an extended influence region N^i in T^i; this intuitively corresponds to the (worst-case) subset of nodes in T^i whose parsing can potentially be affected by the changes in M^i. First, add all nodes in M^i to N^i. We define the center node z of the extended influence region N^i as follows. We say that a descendant node u of v^i (which contains v) in T^i is a removed descendant of v^i if and only if its corresponding subtree u(T) in the base tree T is entirely contained within the removed subtree T[v, s]. (Note that, initially, v^0 = v is trivially a removed descendant of v^0.) Now, let w be the highest node in M^i; clearly, w is an ancestor of the current center node c^i as well as the v^i node in T^i. If all the descendants of w are either in M^i or are removed descendants of v^i, then define the center z to be the parent of node w, and add z to N^i (see Figure 6(b)); otherwise, define the center z of N^i to be the same as c^i. The idea here is that the grouping of w's parent in the next phase can change only if the entire subtree under w has been affected by the removal of the T[v, s] subtree. Otherwise, if there exist nodes under w in T^i whose parsing remains unchanged and that have not been deleted by the subtree removal, then the mere existence of these nodes in T^i means that it is impossible for TREEEMBED to group w's parent in a different manner during the next phase of the (T1 + T2) parsing in any case. Once the center node z of N^i has been fixed, we also add nodes to N^i according to the following set of rules (see Figures 7(a) and (b) for examples).

M. Garofalakis and A. Kumar
Fig. 7. (a) The nodes in dotted circles get added to N^i due to Rules (i), (ii), and (iii). (b) The nodes in the dotted circle get added to N^i due to Rule (iv); note that all descendants of the center z which are not descendants of u are in M^i. (c) Node u moves up to z, turning nodes a and b into corner nodes.

(i) Suppose u is a leaf child of the (new) center z or the v^i node in T^i; furthermore, assume there is some sibling u' of u such that the following conditions are satisfied: u' ∈ M^i or u' is a corner leaf node, the set of nodes s(u, u') between u and u' are leaves, and |s(u, u')| ≤ ℓ. Then, add u to N^i. (In particular, note that any leaf child of z which is a corner node gets added to N^i.)

(ii) Let u be the leftmost lone leaf child of the center z which is not already in M^i (if such a child exists); then, add u to N^i. Similarly, for the v^i node in T^i, let u be a leaf child of v^i such that one of the following conditions is satisfied: (a) u is the leftmost lone leaf child of v^i when considering only the removed descendants of v^i; or (b) u is the leftmost lone leaf child of v^i when ignoring all removed descendants of v^i. Then, add u to N^i.

(iii) Let w be the highest node in M^i ∪ {z} (so it is an ancestor of the center node z). Let u be an ancestor of w. Suppose it is the case that all nodes between u and w, except perhaps w, have degree 2, and the length of the path joining u and w is at most ℓ; then, add u to N^i.

(iv) Suppose there is a child u of the center z or the v^i node in T^i such that one of the following conditions is satisfied: (a) u is not a removed descendant of v^i and all descendants of all siblings of u (other than u itself) are either already in M^i or are removed descendants of v^i; or (b) u is a removed descendant of v^i (and, hence, a child of v^i) and all removed descendants of v^i which are not descendants of u are in M^i. Then, let u' be the lowest descendant of u which is in M^i.
If u'' is any descendant of u' such that the path joining them contains degree-2 nodes only (including the end-points), and has length at most ℓ, then add u'' to N^i.

Let us briefly describe why we need these four rules. We basically want to make sure that we include all those nodes in N^i whose parsing can potentially be affected if we delete or modify the nodes in M^i (given, of course, the removal of the T[v, s] subtree). The first three rules, in conjunction with the properties of our TREEEMBED parsing, are easily seen to capture this fact. The last rule is a little more subtle. Suppose u is a child of z (so that we are in clause (a) of Rule (iv)); furthermore, assume that all descendants of z except perhaps those of u are either already in M^i or have been deleted with the removal of T[v, s]. Remember that all nodes in M^i have been modified due to the change effected at v, so they may not be present at all in the corresponding picture for (T1 + T2) (i.e., the (T1 + T2)^i forest). But, if we just ignore M^i and the removed descendants of v^i, then z becomes a node of degree 2 only, which would obviously affect how u and its degree-2 descendants are parsed in (T1 + T2)^i (compared to their parsing in T^i). Rule (iv) is designed to capture exactly such scenarios; in particular, note that clauses (a) and (b) in the rule are meant to capture the potential creation of such degree-2 chains in the remainder subtree T2^i and the deleted subtree T1^i, respectively. We now consider the rule for marking corner nodes in T^i. Once again, the intuition is that certain (unaffected) nodes in T^i − M^i (actually, in T^i − N^i) are marked as corner nodes so that we can "remember" that their parsing will be affected when they are shrunk down to a leaf.
Suppose the center node z has at least two children, and a leftmost lone leaf child u; note that, by Rule (ii), u ∈ N^i. If any of the two immediate siblings of u are not in N^i, then we mark them as corner nodes (see Figure 7(c)). The key observation here is that, when parsing T^i, u is going to be merged into z and disappear; however, we need to somehow "remember" that a (potentially) affected node u was there, since its existence could affect the parsing of its sibling nodes when they are shrunk down to leaves. Marking u's immediate siblings in T^i as corner nodes essentially achieves this effect. Having described the (worst-case) extended influence region N^i in T^i, let us now define M^{i+1}, that is, the influence region at the next level of our parsing of T. M^{i+1} is precisely the set of those nodes in T^{i+1} which contain a node of N^i. The center of M^{i+1} is the node which contains the center node z of N^i; furthermore, any node in T^{i+1} which contains a corner node is again marked as a corner node. Initially, define M^0 = {v} (and, obviously, v^0 = c^0 = v). Furthermore, if v has a child node immediately on the left (right) of the removed child subsequence s, then that node as well as the leftmost (respectively, rightmost) node in s are marked as corner nodes. The reason, of course, is that these ≤ 4 nodes may be parsed in a different manner when they are shrunk down to leaves during the parsing of T1 and T2. Based on the above set of rules, it is easy to see that M^i and N^i are always connected subtrees of T^i. It is also important to note that the extended influence region N^i is defined in such a manner that the parsing of all nodes in T^i − N^i cannot be affected by the changes in M^i. This fact should become clear as we proceed with the details of the proofs in the remainder of this section.

Example 4.6.
Figure 8 depicts the first three phases of a simple example parsing for T and (T1 + T2), in the case of a 4-level full binary tree T that is split by removing the right subtree of the root (i.e., T1 = T[x3, {x6, x7}], T2 = T − T1).

Fig. 8. Example of TREEEMBED parsing phases for T and (T1 + T2) in the case of a full binary tree, highlighting the influence regions M^i in T^i and the corresponding P^i regions in (T1 + T2)^i ("o" denotes an unlabeled node).

We use subscripted x's and y's to label the nodes in T^i and (T1 + T2)^i to emphasize the fact that these tree nodes are parsed independently by TREEEMBED; furthermore, we employ the subscripts to capture the original subtrees of T and (T1 + T2) represented by nodes in later phases of our parsing. Of course, it should be clear that x and y nodes with identical subscripts refer to identical (valid) subtrees of the original tree T; for instance, both x_{4,8,9} ∈ T^2 and y_{4,8,9} ∈ T2^2 represent the same subtree T[x4, {x8, x9}] = {x4, x8, x9} of T. As depicted in Figure 8, the initial influence region of T is simply M^0 = {x3} (with v^0 = c^0 = x3). Since, clearly, all descendants of x3 are removed descendants of v^0, the center z for the extended influence region N^0 moves up to the parent node x1 of x3 (and none of our other rules are applicable); thus, N^0 = {x1, x3} and, obviously, M^1 = {x1, x3}. This is crucial since (as shown in Figure 8), due to the removal of T1, nodes y1 and y3 are processed in a very different manner in the remainder subtree T2^0 (i.e., y3 is merged up into y1 as its leftmost lone leaf child). Now, for T^1, none of our rules for extending the influence region apply and, consequently, N^1 = M^2 = {x1, x3}.
The key thing to note here is that, for each parsing phase i, ignoring the nodes in the influence region M^i (and the "corresponding" nodes in (T1 + T2)^i), the remaining nodes of T^i and (T1 + T2)^i have been parsed in an identical manner by TREEEMBED (and correspond to an identical subset of valid subtrees in T); in other words, their corresponding characteristic vectors in our embedding are exactly the same. We now proceed to formalize these observations. Given the influence region M^i of T^i, we define a corresponding node set, P^i, in the (T1 + T2)^i forest. In what follows, we prove that the nodes in T^i − M^i and (T1 + T2)^i − P^i can be matched in some manner, so that each pair of matched nodes corresponds to identical valid subtrees in T and (T1 + T2), respectively.

Fig. 9. f maps from T^i − M^i to (T1 + T2)^i − P^i.

The node set P^i in (T1 + T2)^i is defined as follows (see Figure 8 for examples). P^i always contains the root node of T1^i. Furthermore, a node u ∈ (T1 + T2)^i is in P^i if and only if there exists a node u' ∈ M^i such that the intersection u(T1 + T2) ∩ u'(T) is nonempty (as expected, u(T1 + T2) denotes the valid subtree corresponding to u in (T1 + T2)). We demonstrate that our solution always maintains the following invariant.

INVARIANT 4.7. Given any node x ∈ T^i − M^i, there exists a node y = f(x) in (T1 + T2)^i − P^i such that x(T) and y(T1 + T2) are identical valid subtrees on the exact same subset of nodes in the original tree T. Conversely, given a node y ∈ (T1 + T2)^i − P^i, there exists a node x ∈ T^i − M^i such that x(T) = y(T1 + T2). Thus, there always exists a one-to-one, onto mapping f from T^i − M^i to (T1 + T2)^i − P^i (Figure 9).
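The invariant is what lets the later counting argument charge the entire per-phase L1 difference to the influence regions: outside M^i and P^i, the two multisets of valid subtrees match exactly under f. The effect on the per-phase characteristic vectors can be sketched in a few lines (an illustrative toy, not part of the paper's algorithm; subtree "types" are encoded as plain strings here for simplicity):

```python
from collections import Counter

def phase_l1(types_a, types_b):
    """L1 distance between the characteristic vectors of one parsing
    phase, where each phase is given as the multiset (list) of
    valid-subtree 'types' it produced."""
    ca, cb = Counter(types_a), Counter(types_b)
    return sum(abs(ca[t] - cb[t]) for t in ca.keys() | cb.keys())

# Outside the influence regions the multisets coincide (matched by f),
# so only the small regions M^i and P^i can contribute to the distance:
common = ["t1", "t2", "t2", "t3"]
assert phase_l1(common + ["m"], common + ["p"]) == 2
```

The point of the toy: the unmatched part of each multiset is contained in an influence region, so the per-phase L1 contribution is at most |M^i| + |P^i|.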
In other words, if we ignore M^i and P^i from T^i and (T1 + T2)^i (respectively), then the two remaining forests of valid subtrees in this phase are identical.

Example 4.8. Continuing with our binary-tree parsing example in Figure 8, it is easy to see that, in this case, the mapping f : T^i − M^i −→ (T1 + T2)^i − P^i simply maps every x node in T^i − M^i to the y node in (T1 + T2)^i − P^i with the same subscript, which obviously corresponds to exactly the same valid subtree of T; for instance, y_{10,11} = f(x_{10,11}) and both nodes correspond to the same valid subtree T[x5, {x10, x11}]. Thus, the collections of valid subtrees for T^i − M^i and (T1 + T2)^i − P^i are identical (i.e., the L1 distance of their corresponding characteristic vectors is zero); this implies that, for example, the contribution of T^1 and (T1 + T2)^1 to the difference of the embedding vectors V(T) and V(T1 + T2) is upper-bounded by |M^1| = 2.

Clearly, Invariant 4.7 is true in the beginning (i.e., M^0 = {v}, P^0 = {v, root(T1)}). Suppose our invariant remains true for T^i and (T1 + T2)^i. We now need to prove it for T^{i+1} and (T1 + T2)^{i+1}. As previously, let N^i ⊇ M^i be the extended influence region in T^i. Fix a node w in T^i − N^i, and let w' be the corresponding node in (T1 + T2)^i − P^i (i.e., w' = f(w)). Suppose w is contained in node q ∈ T^{i+1} and w' is contained in node q' ∈ (T1 + T2)^{i+1}.

LEMMA 4.9. Given a node w in T^i − N^i, let q, q' be as defined above. If q(T) and q'(T1 + T2) are identical subtrees for any node w ∈ T^i − N^i, then Invariant 4.7 holds for T^{i+1} and (T1 + T2)^{i+1} as well.

PROOF. We have to demonstrate the following facts. If x is a node in T^{i+1} − M^{i+1}, then there exists a node y ∈ (T1 + T2)^{i+1} − P^{i+1} such that y(T1 + T2) = x(T). Conversely, given a node y ∈ (T1 + T2)^{i+1} − P^{i+1}, there is a node x ∈ T^{i+1} − M^{i+1} such that x(T) = y(T1 + T2).
Suppose the condition in the lemma holds. Let x be a node in T^{i+1} − M^{i+1}. Let x' be a node in T^i such that x contains x'. Clearly, x' ∉ N^i, otherwise x would be in M^{i+1}. Let y' = f(x'), and let y be the node in (T1 + T2)^{i+1} which contains y'. By the hypothesis of the lemma, x(T) and y(T1 + T2) are identical subtrees. It remains to check that y ∉ P^{i+1}. Since y(T1 + T2) = x(T), y(T1 + T2) is disjoint from z(T) for any z ∈ T^{i+1}, z ≠ x. By the definition of the P^i node sets, since x ∉ M^{i+1}, we have that y ∈ (T1 + T2)^{i+1} − P^{i+1}. Let us prove the converse now. Suppose y ∈ (T1 + T2)^{i+1} − P^{i+1}. Let y' be a node in (T1 + T2)^i such that y contains y'. If y' ∈ P^i, then (by definition) there exists a node x' ∈ M^i such that x'(T) ∩ y'(T1 + T2) ≠ ∅. Let x be the node in T^{i+1} which contains x'. Since x' ∈ N^i, x ∈ M^{i+1}. Now, x(T) ∩ y(T1 + T2) ⊇ x'(T) ∩ y'(T1 + T2) ≠ ∅. But then y should be in P^{i+1}, a contradiction. Therefore, y' ∉ P^i. By the invariant for T^i, there is a node x' ∈ T^i − M^i such that y' = f(x'). Let x be the node in T^{i+1} containing x'. Again, if x' ∈ N^i, then x ∈ M^{i+1}. But then x(T) ∩ y(T1 + T2) ⊇ x'(T) ∩ y'(T1 + T2), which is nonempty because x'(T) = y'(T1 + T2). This would imply that y ∈ P^{i+1}. So, x' ∉ N^i. But then, by the hypothesis of the lemma, x(T) = y(T1 + T2). Further, x cannot be in M^{i+1}, otherwise y would be in P^{i+1}. Thus, the lemma is true.

It is, therefore, sufficient to prove that, for any pair of nodes w ∈ T^i − N^i, w' = f(w) ∈ (T1 + T2)^i − P^i, the corresponding encompassing nodes q ∈ T^{i+1} and q' ∈ (T1 + T2)^{i+1} map to identical valid subtrees, that is, q(T) = q'(T1 + T2). This is what we seek to do next. Our proof uses a detailed, case-by-case analysis of how node w gets parsed in T^i. For each case, we demonstrate that w' will also get parsed in exactly the same manner in the forest (T1 + T2)^i.
In the interest of space and continuity, we defer the details of this proof to the Appendix. Thus, we have established the fact that, if we look at the vectors V(T) and V(T1 + T2), the nodes corresponding to phase i of V(T) which are not present in V(T1 + T2) are guaranteed to be a subset of M^i. Our next step is to bound the size of M^i.

LEMMA 4.10. The influence region M^i for tree T^i consists of at most O(i log* n) nodes.

PROOF. Note that, during each parsing phase, Rule (iii) adds at most ℓ nodes of degree at most 2 to the extended influence region N^i. It is not difficult to see that Rule (iv) also adds at most 4ℓ nodes of degree at most 2 to N^i during each phase; indeed, note that, for instance, there is at most one child node u of z which is not in M^i and satisfies one of the clauses of Rule (iv). So, adding over the first i stages of our algorithm, the number of such nodes in M^i can be at most O(i log* n). Thus, we only need to bound the number of nodes that get added to the influence region due to Rules (i) and (ii). We now want to count the number of leaf children of the center node c^i which are in M^i. Let k_i be the number of children of c^i which become leaves for the first time in T^i and are marked as corner nodes. Let C_i be the nodes in M^i which were added as the leaf children of the center node of T^{i'}, for some i' < i. Then, we claim that C_i can be partitioned into at most 1 + Σ_{j=1}^{i−1} k_j contiguous sets such that each set has at most 4ℓ elements. We prove this by induction on i. So, suppose it is true for T^i. Consider such a contiguous set of leaves in C_i, call it C1^i, where |C1^i| ≤ 4ℓ. We may add up to ℓ consecutive leaf children of c^i on either side of C1^i to the extended influence region N^i. Thus, this set may grow to a size of 6ℓ contiguous leaves.
But when we parse this set (using CM-Group), we reduce its size by at least half. Thus, this set will now contain at most 3ℓ leaves (which is at most 4ℓ). Therefore, each of the 1 + Σ_{j=1}^{i−1} k_j contiguous sets in C_i corresponds to a contiguous set in T^{i+1} of size at most 4ℓ. Now, we may add other leaf children of c^i to N^i. This can happen only if a corner node becomes a leaf. In this case, at most ℓ consecutive leaves on either side of this node are added to N^i (by Rule (i)); thus, we may add k_i more such sets of consecutive leaves to N^i. This completes our inductive argument. But note that, in any phase, at most two new corner nodes (i.e., the immediate siblings of the center node's leftmost lone leaf child) can be added. (And, of course, we also start out with at most four nodes marked as corners inside and next to the removed child subsequence s.) So, Σ_{j=1}^{i} k_j ≤ 2i + 2. This shows that the number of nodes in C_i is O(i log* n). The contribution toward M^i of the leaf children of the v^i node can also be upper bounded by O(i log* n) using a very similar argument. This completes the proof.

We now need to bound the nodes in (T1 + T2)^i which are not in T^i. But this can be done in an exactly analogous manner if we switch the roles of T and T1 + T2 in the proofs above. Thus, we can define a subset Q^i of (T1 + T2)^i and a one-to-one, onto mapping g from (T1 + T2)^i − Q^i to a subset of T^i such that g(w)(T) = w(T1 + T2) for every w ∈ (T1 + T2)^i − Q^i. Furthermore, we can show in a similar manner that |Q^i| ≤ O(i log* n). We are now ready to complete the proof of Theorem 4.5.

PROOF OF THEOREM 4.5. Fix a phase i. Consider those subtrees t such that V_i(T)[⟨t, i⟩] ≥ V_i(T1 + T2)[⟨t, i⟩]. In other words, t appears more frequently in the parsed tree T^i than in (T1 + T2)^i. Let the set of such subtrees be denoted by S. We first observe that

|M^i| ≥ Σ_{t∈S} ( V_i(T)[⟨t, i⟩] − V_i(T1 + T2)[⟨t, i⟩] ).

Indeed, consider a tree t ∈ S.
Let V1 be the set of vertices u in T^i such that u(T) = t. Similarly, define the set V2 in (T1 + T2)^i. So, |V1| − |V2| = V_i(T)[⟨t, i⟩] − V_i(T1 + T2)[⟨t, i⟩]. Now, the function f must map a vertex in V1 − M^i to a vertex in V2. Since f is one-to-one, V1 − M^i can have at most |V2| nodes. In other words, M^i must contain |V1| − |V2| nodes from V1. Adding this up for all such subtrees in S gives us the inequality above. We can write a similar inequality for Q^i. Adding these up, we get

|M^i| + |Q^i| ≥ Σ_t |V_i(T)[⟨t, i⟩] − V_i(T1 + T2)[⟨t, i⟩]|,

where the summation is over all subtrees t. Adding over all parsing phases i, we have

||V(T) − V(T1 + T2)||_1 ≤ Σ_{i=1}^{O(log n)} O(i log* n) = O(log^2 n log* n).

This completes our proof argument.

4.5 Lower-Bound Proof

Our proof follows along the lower-bound proof of Cormode and Muthukrishnan [2002], in that it does not make use of any special properties of our hierarchical tree parsing; instead, we only assume that the parsing structure built on top of the data tree is of bounded degree k (in our case, of course, k = 3). The idea is then to show how, given two data trees S and T, we can use the "credit" from the L1 difference of their vector embeddings ||V(T) − V(S)||_1 to transform S into T. As in Cormode and Muthukrishnan [2002], our proof is constructive and shows how the overall parsing structure for S (including S itself at the leaves) can be transformed into that for T; the transformation is performed level-by-level in a bottom-up fashion (starting from the leaves of the parsing structure). (The distance-distortion lower bound for our embedding is an immediate consequence of Lemma 4.11 with k = 3.^7)

LEMMA 4.11.
Assuming a hierarchical parsing structure with degree at most k (k ≥ 2), the overall parsing structure for tree S can be transformed into exactly that of tree T with at most (2k − 1) ||V(T) − V(S)||_1 tree-edit operations (node inserts, deletes, relabels, and subtree moves).

PROOF. As in Cormode and Muthukrishnan [2002], we first perform a top-down pass over the parsing structure of S, marking all nodes x whose subgraph appears in both parse-tree structures, making sure that the number of marked x nodes at level (i.e., phase) i of the parse tree does not exceed V_i(T)[x] (we use x instead of v(x) to also denote the valid subtree corresponding to x in order to simplify the notation). Descendants of marked nodes are also marked. Marked nodes are "protected" during the parse-tree transformation process described below, in the sense that we do not allow an edit operation to split a marked node. We proceed bottom-up over the parsing structure for S in O(log n) rounds (where n = max{|S|, |T|}), ensuring that after the end of round i we have created an S_i such that ||V_i(T) − V_i(S_i)||_1 = 0. The base case (i.e., level 0) deals with simple node labels and creates S_0 in a fairly straightforward way: for each label a, if V_0(S)[a] > V_0(T)[a], then we delete (V_0(S)[a] − V_0(T)[a]) unmarked copies of a; otherwise, if V_0(S)[a] < V_0(T)[a], then we add (V_0(T)[a] − V_0(S)[a]) leaf nodes labeled a at some location of S. In each case, we perform |V_0(S)[a] − V_0(T)[a]| edit operations, which is exactly the contribution of label a to ||V_0(T) − V_0(S)||_1.

Footnote 7: It is probably worth noting at this point that the subtree-move operation is needed only to establish the distortion lower-bound result in this section; that is, the upper bound shown in Section 4.1 holds for the standard tree-edit distance metric as well.

Fig. 10. Forming a level-i node x.
It is easy to see that, at the end of the above process, we have ||V_0(T) − V_0(S_0)||_1 = 0. Inductively, assume that, when we start the transformation at level i, we have enough nodes at level i − 1; that is, ||V_{i−1}(T) − V_{i−1}(S_{i−1})||_1 = 0. We show how to create S_i using at most (2k − 1) ||V_i(T) − V_i(S_i)||_1 subtree-move operations. Consider a node x at level i (again, to simplify the notation, we also use x to denote the corresponding valid subtree). If V_i(S)[x] > V_i(T)[x], then we have exactly V_i(T)[x] marked x nodes at level i of S's parse tree that we will not alter; the remaining copies will be split to form other level-i nodes as described next. If V_i(S)[x] < V_i(T)[x], then we need to build an extra (V_i(T)[x] − V_i(S)[x]) copies of the x node at level i. We demonstrate how each such copy can be built by using ≤ (2k − 1) subtree-move operations in order to bring together ≤ k level-(i − 1) nodes to form x (note that the existence of these level-(i − 1) nodes is guaranteed by the fact that ||V_{i−1}(T) − V_{i−1}(S_{i−1})||_1 = 0). Since (V_i(T)[x] − V_i(S)[x]) is exactly the contribution of x to ||V_i(T) − V_i(S_i)||_1, the overall transformation for level i requires at most (2k − 1) ||V_i(T) − V_i(S_i)||_1 edit operations. To see how we form the x node at level i, note that, based on our embedding algorithm, there are three distinct cases for the formation of x from level-(i − 1) nodes, as depicted in Figures 10(a)–10(c). In case (a), x is formed by "folding" the (no-siblings) leftmost leaf child v2 of a node v1 into its parent; we can create the scenario depicted in Figure 10(a) easily with two subtree moves: one to remove any potential subtree rooted at the level-(i − 1) node v2 (we can place it under v2's original parent at the level-(i − 1) tree), and one to move the (leaf) v2 under the v1 node.
Similarly, for the scenarios depicted in cases (b) and (c), we basically need at most k subtree moves to turn the nodes involved into leaves, and at most k − 1 additional moves to move these leaves into the right formation around one of these ≤ k nodes. Thus, we can create each copy of x with ≤ (2k − 1) subtree-move operations. At the end of this process, we have ||V_i(T) − V_i(S_i)||_1 = 0. Note that we do not care where in the level-i tree we create the x node; the exact placement will be taken care of at higher levels of the parsing structure. This completes the proof.

5. SKETCHING A MASSIVE, STREAMING XML DATA TREE

In this section, we describe how our tree-edit distance embedding algorithm can be used to obtain a small, pseudorandom sketch synopsis of a massive XML data tree in the streaming model. This sketch synopsis requires only small (logarithmic) space, and it can be used as a much smaller surrogate for the entire data tree in approximate tree-edit distance computations, with guaranteed error bounds on the quality of the approximation based on the distortion bounds guaranteed by our embedding. Most importantly, as we show in this section, the properties of our embedding algorithm are the key that allows us to build this sketch synopsis in small space as the nodes of the tree are streaming by, without ever backtracking on the data. More specifically, consider the problem of embedding a data tree T of size n into a vector space, but this time assume that T is truly massive (i.e., n far exceeds the amount of available storage). Instead, we assume that we see the nodes of T as a continuous data stream in some a priori determined order. In the theorem below, we assume that the nodes of T arrive in the order of a preorder (i.e., depth-first and left-to-right) traversal of T.
(Note, for example, that this is exactly the ordering of XML elements produced by the event-based SAX parsing interface (sax.sourceforge.net/).) The theorem demonstrates that the vector V(T) constructed for T by our L1 embedding algorithm can then be constructed in space O(d log^2 n log* n), where d denotes the depth of T. The sketch of T is essentially a sketch of the V(T) vector (denoted by sketch(V(T))) that can be used for L1 distance calculations in the embedding vector space. Such an L1 sketch of V(T) can be obtained (in small space) using the 1-stable sketching algorithms of Indyk [2000] (see Theorem 2.2).

THEOREM 5.1. A sketch sketch(V(T)) to allow approximate tree-edit distance computations can be computed over the stream of nodes in the preorder traversal of an n-node XML data tree T using O(d log^2 n log* n) space and O(log d log^2 n (log* n)^2) time per node, where d denotes the depth of T. Then, assuming sketch vectors of size O(log(1/δ)) and for an appropriate combining function f(), f(sketch(V(S)), sketch(V(T))) gives an estimate of the tree-edit distance d(S, T) to within a relative error of O(log^2 n log* n) with probability of at least 1 − δ.

The proof of Theorem 5.1 hinges on the fact that, based on our proof in Section 4.4, given a node v on a root-to-leaf path of T and for each of the O(log n) levels of the parsing structure above v, we only need to retain a local neighborhood (i.e., influence region) of nodes of size at most O(log n log* n) to determine the effect of adding an incoming subtree under T. The O(d) multiplicative factor is needed since, as the tree is streaming in preorder, we do not really know where a new node will attach itself to T; thus, we have to maintain O(d) such influence regions. Given that most real-life XML data trees are reasonably "bushy," we expect that, typically, d ≪ n, or d = O(polylog(n)).
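To make the role of the O(d) factor concrete, here is a minimal sketch (not the paper's implementation) of consuming a preorder node stream while keeping only the current root-to-leaf path as state; the paper's algorithm additionally maintains an O(log n log* n)-size influence region per parsing level for each of these O(d) path positions. The (label, depth) event format is an assumption made for illustration; a SAX startElement/endElement stream carries equivalent information.

```python
def stream_preorder(events):
    """Process a preorder node stream keeping only the current
    root-to-leaf path -- O(d) state, where d is the tree depth.
    Each event is a (label, depth) pair; returns the maximum amount
    of path state ever held."""
    path = []                       # labels on the current root-to-leaf path
    max_state = 0
    for label, depth in events:
        del path[depth:]            # pop nodes that are no longer open
        path.append(label)          # the new node attaches under path[-1]
        max_state = max(max_state, len(path))
        # a real implementation would update the per-level influence
        # regions attached to each node of `path` here
    return max_state

# A 4-node tree (root r, children a and c, grandchild b) in preorder:
events = [("r", 0), ("a", 1), ("b", 2), ("c", 1)]
assert stream_preorder(events) == 3   # never more than depth-many nodes held
```

The design point mirrors the theorem: a preorder arrival order guarantees that once a node's subtree is closed, it can never be touched again, so only the open path (plus its per-level summaries) must stay in memory.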
The f() combining function is basically a median-selection over the absolute component-wise differences of the two sketch vectors (Theorem 2.2). The details of the proof for Theorem 5.1 follow easily from the above discussion and the results of Indyk [2000].

6. APPROXIMATE SIMILARITY JOINS OVER XML DOCUMENT STREAMS

We now consider the problem of computing (in limited space) the cardinality of an approximate tree-edit-distance similarity join over two continuous data streams of XML documents S1 and S2. Note that this is a distinctly different streaming problem from the one examined in Section 5: we now assume massive, continuous streams of short XML documents that we want to join based on tree-edit distance; thus, the limiting factor is no longer the size of an individual data tree (which is assumed small and constant), but rather the number of trees in the stream(s). The documents in each S_i stream can arrive in any order, and our goal is to produce an accurate estimate for the similarity-join cardinality |SimJoin(S1, S2, τ)| = |{(S, T) ∈ S1 × S2 : d(S, T) ≤ τ}|, that is, the number of pairs in S1 × S2 that are within a tree-edit distance of τ from each other (where the similarity threshold τ is a user/application-defined parameter). Such a space-efficient, one-pass approximate similarity-join algorithm would obviously be very useful in processing huge XML databases, integrating streaming XML data sources, and so on. Once again, the first key step is to utilize our tree-edit distance embedding algorithm on each streaming document tree T ∈ S_i (i = 1, 2) to construct a (low-distortion) image V(T) of T as a point in an appropriate multidimensional vector space.
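The 1-stable sketching and median-based combining function used above can be sketched as follows (a simplified illustration under stated assumptions: dense vectors, Cauchy variates drawn as ratios of independent normals, and a shared seed standing in for the small-space pseudorandom generation of Indyk's scheme):

```python
import random

def cauchy_sketch(vec, k=201, seed=7):
    """k-dimensional 1-stable sketch: component j is the inner product
    of vec with a vector of standard Cauchy variates (each a ratio of
    two independent N(0,1) draws). The shared seed makes all sketched
    vectors use the same variates, as sketching requires."""
    rng = random.Random(seed)
    sketch = []
    for _ in range(k):
        xi = [rng.gauss(0.0, 1.0) / rng.gauss(0.0, 1.0) for _ in vec]
        sketch.append(sum(x * f for x, f in zip(xi, vec)))
    return sketch

def l1_estimate(s1, s2):
    """Median of component-wise |differences|: by 1-stability, each
    difference is ||f1 - f2||_1 times a standard Cauchy variate, and
    the median of |standard Cauchy| is 1 (the combining f() of
    Theorem 2.2)."""
    diffs = sorted(abs(a - b) for a, b in zip(s1, s2))
    return diffs[len(diffs) // 2]

f1 = [3.0, 0.0, 1.0, 0.0]
f2 = [1.0, 0.0, 0.0, 2.0]     # true L1 distance: 2 + 1 + 2 = 5
est = l1_estimate(cauchy_sketch(f1), cauchy_sketch(f2))
# est approximates the true L1 distance of 5
```

Note that the sketch is linear in the input vector, which is what allows it to be maintained incrementally as vector components are updated by a stream.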
We then obtain a lower-dimensional vector of 1-stable sketches of V(T) that approximately preserves L1 distances in the original vector space, as described by Indyk [2000]. Our tree-edit distance similarity join has now essentially been transformed into an L1-distance similarity join in the embedding, low-dimensional vector space. The final step then performs an additional level of AMS sketching over the stream of points in the embedding L1 vector space in order to build a randomized, sketch-based estimate for |SimJoin(S1, S2, τ)|.^8 The following theorem shows how an atomic sketch-based estimate can be constructed in small space over the streaming XML data trees; to boost accuracy and probabilistic confidence, several independent atomic-estimate instances can be used (as in Alon et al. [1996, 1999]; Dobra et al. [2002]; see also Theorem 2.1).

THEOREM 6.1. Let |SimJoin(S1, S2, τ)| denote the cardinality of the tree-edit distance similarity join between two XML document streams S1 and S2, where document distances are approximated to within a factor of O(log^2 b log* b) with constant probability, and b is a (constant) upper bound on the size of each document tree. Define k = k(δ, ε) = O(log(1/δ))^{O(1/ε)}. An atomic, sketch-based estimate for |SimJoin(S1, S2, τ)| can be constructed in O(b + k(δ, ε) log N) space and O(b log* b + k(δ, ε) log N) time per document, where δ, ε are constants < 1 that control the accuracy of the distance estimates and N denotes the length of the input stream(s).

Footnote 8: Assuming constant-sized trees, a straightforward approach to our similarity-join problem would be to exhaustively build all trees within a τ-radius of an incoming tree, and then just sketch (the fingerprints of) these trees directly using AMS for the similarity-join estimate.
The key problem with such a "direct" approach is the computational cost per incoming tree: given a tree T with b nodes and an edit-distance radius of τ, the cost of the brute-force enumeration of all trees in the τ-neighborhood of T would be at least O(b^τ), which is probably prohibitive (except for very small values of b and τ).

310 • M. Garofalakis and A. Kumar

PROOF. Our algorithm for producing an atomic sketch estimate for the similarity-join cardinality uses two distinct levels of sketching. Assume an input tree T (in one of the input streams). The first level of sketching uses our L1 embedding algorithm in conjunction with the L1-sketching technique of Indyk [2000] (i.e., with 1-stable (Cauchy) random variates) to map T to a lower-dimensional vector of O(k(δ, ε)) iid sketching values sketch(V(T)). This mapping of an input tree T to a point in an O(k(δ, ε))-dimensional vector space can be done in space O(b + k(δ, ε) log N): this covers the O(b) space to store and parse the tree,[9] and the O(log N) space required to generate the 1-stable random variates for each of the O(k(δ, ε)) sketch-value computations (and store the sketch values themselves). (Note that O(log N) space is sufficient since we know that there are at most O(Nb) nonzero components in all the V(T) vectors in the entire data stream.) A key property of this mapping is that the L1 distances of the V(T) vectors are approximately preserved in this new O(k(δ, ε))-dimensional vector space with constant probability, as stated in the following theorem from Indyk [2000].

THEOREM 6.2 (INDYK 2000). Let f1 and f2 denote N-dimensional numeric vectors rendered as a stream of updates, and let {X1^j, X2^j : j = 1, ..., k} denote k = k(δ, ε) = O(log(1/δ))^O(1/ε) iid pairs of 1-stable sketches Xl^j = Σ_{i=1..N} fl(i) ξi^j; also, define Xl as the k-dimensional vector (Xl^1, ..., Xl^k) (l = 1, 2; the {ξi^j} are 1-stable (Cauchy) random variates). Then, the L1-difference norm of the k-dimensional sketch vectors ||X1 − X2||_1 satisfies (1) ||X1 − X2||_1 ≥ ||f1 − f2||_1 with probability ≥ 1 − δ; and (2) ||X1 − X2||_1 ≤ (1 + ε) · ||f1 − f2||_1 with probability ≥ ε.

Intuitively, Theorem 6.2 states that, if we use 1-stable sketches as a dimensionality-reduction tool for L1 (that is, for mapping a point in a high, O(N)-dimensional L1-normed space to a lower, k-dimensional L1-normed space, instead of using median selection as in Theorem 2.2), then we can only provide weaker, asymmetric guarantees on the L1 distance distortion. In short, we can guarantee small distance contraction with high probability (i.e., 1 − δ), but we can guarantee small distance expansion only with constant probability (i.e., ε). (Note that the exact manner in which the δ, ε parameters control the error and confidence in the approximate L1-distance estimates is formally stated in Theorem 6.2.) The reason for using this version of Indyk's results in our similarity-join scenario is that, as mentioned earlier, we need to perform an (approximate) streaming similarity-join computation over the mapped space of sketch vectors, which appears to be infeasible when the median-selection operator is used. The second level of sketching in our construction will produce a pseudorandom AMS sketch (Section 2.2) of the point distribution (in the embedding vector space) for each input data stream. To deal with an L1 τ-similarity join, the basic equi-join AMS-sketching technique discussed in Section 2.2 needs to be appropriately adapted.

[9] Of course, for large trees, the small-space optimizations of Section 5 can be used (assuming preorder node arrivals).
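To make the 1-stable sketching of Theorem 6.2 concrete, the following is a minimal sketch in Python. The dimensions, seed, and variable names are illustrative only, and the real construction generates the ξ variates pseudorandomly in O(log N) space rather than storing them explicitly.

```python
import math
import random

def cauchy(rng):
    # Standard Cauchy (1-stable) variate via the inverse CDF.
    return math.tan(math.pi * (rng.random() - 0.5))

def stable_sketch(f, xi):
    # One atomic 1-stable sketch per row of variates xi:
    #   X^j = sum_i f[i] * xi[j][i]
    return [sum(fi * row[i] for i, fi in enumerate(f)) for row in xi]

rng = random.Random(42)
n, k = 50, 8                      # original dimension, sketch dimension
xi = [[cauchy(rng) for _ in range(n)] for _ in range(k)]

f1 = [1.0] * n
f2 = [1.0] * 25 + [0.0] * 25      # so ||f1 - f2||_1 = 25
x1, x2 = stable_sketch(f1, xi), stable_sketch(f2, xi)

l1_true = sum(abs(a - b) for a, b in zip(f1, f2))
l1_sketch = sum(abs(a - b) for a, b in zip(x1, x2))
# Per Theorem 6.2, contraction below l1_true is unlikely, while the
# expansion bound holds only with constant probability.
print(l1_true)   # 25.0
```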
The key idea here is to view each incoming "point" sketch(V(T)) in one of the two data streams, say S1, as an L1 region of points (i.e., a multidimensional hypercube) of radius τ centered around sketch(V(T)) in the embedding O(k(δ, ε))-dimensional vector space when building an AMS-sketch synopsis for stream S1. Essentially, this means that when T (i.e., sketch(V(T))) is seen in the S1 input, instead of simply adding the random variate ξi (where the index i = sketch(V(T))) to the atomic AMS-sketch estimate X_S1 for S1, we update X_S1 by adding Σ_{j ∈ n(i,τ)} ξj, where n(i, τ) denotes the L1 neighborhood of radius τ of i = sketch(V(T)) in the embedding vector space (i.e., n(i, τ) = {j : ||i − j||_1 ≤ τ}). Note that this special processing is only carried out on the S1 stream; the AMS sketch X_S2 for the second XML stream S2 is updated in the standard manner. It is then fairly simple to show (see Section 2.2) that the product X_S1 · X_S2 gives an unbiased, atomic sketching estimate for the cardinality of the L1 τ-similarity join of S1 and S2 in the embedding O(k(δ, ε))-dimensional vector space. In terms of processing time per document, note that, in addition to the time cost of our embedding process, the first level of (1-stable) sketching can be done in small time using the techniques discussed by Indyk [2000]. The second level of (AMS) sketching can also be implemented using standard AMS-sketching techniques, with the difference that (for one of the two streams) updating would require summation of ξ variates over an L1 neighborhood of radius τ in an O(k(δ, ε))-dimensional vector space. Thus, a naive, brute-force technique that simply iterates over all these variates would increase the per-document sketching cost by a multiplicative factor of O(|n(i, τ)|) = O(τ^k(δ,ε)) ≈ O((1/δ)^k(δ,ε)) in the worst case; however, efficiently range-summable sketching variates, as in Feigenbaum et al.
[1999], can be used to reduce this multiplicative factor to only O(log |n(i, τ)|) = O(k(δ, ε)). Again, note that, by Indyk's L1 dimensionality-reduction result (Theorem 6.2), Theorem 6.1 only guarantees that our estimation algorithm approximates tree-edit distances with constant probability. In other words, this means that a constant fraction of the points in the τ-neighborhood of a given point could be missed. Furthermore, the very recent results of Charikar and Sahai [2002] prove that no sketching method (based on randomized linear projections) can provide a high-probability dimensionality-reduction tool for L1; in other words, there is no analogue of the Johnson-Lindenstrauss (JL) lemma [Johnson and Lindenstrauss 1984] for the L1 norm. Thus, there seems to be no obvious way to strengthen Theorem 6.1 with high-probability distance estimates. The following corollary shows that high-probability estimates are possible if we allow for an extra O(√b) multiplicative factor in the distance distortion. The idea here is to use L2 vector norms to approximate L1 norms, exploiting the fact that each V(T) vector has at most O(b) nonzero components, and then use standard, high-probability L2 dimensionality reduction (e.g., through the JL construction). Of course, a different approach that could give stronger results would be to try to embed tree-edit distance directly into L2, but this remains an open problem.

COROLLARY 6.3. The tree-edit distances for the estimation of the similarity-join cardinality |SimJoin(S1, S2, τ)| in Theorem 6.1 can be estimated with high probability to within a factor of O(√b log² b log* b).

7. EXPERIMENTAL STUDY

In this section, we present the results of an empirical study that we have conducted using the oblivious tree-edit distance embedding algorithm developed in this article.
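Returning briefly to the τ-neighborhood expansion used for the S1 stream above, it can be sketched as follows. This is a toy version with explicit integer grid points and a dictionary of ±1 variates; the function names are illustrative, and the efficient range-summable variates of Feigenbaum et al. [1999] are not implemented here.

```python
from itertools import product

def l1_neighborhood(i, tau):
    """All integer grid points j with ||i - j||_1 <= tau around point i."""
    nbrs = []
    for offs in product(range(-tau, tau + 1), repeat=len(i)):
        if sum(abs(o) for o in offs) <= tau:
            nbrs.append(tuple(c + o for c, o in zip(i, offs)))
    return nbrs

def update_sketch_s1(x_s1, point, tau, xi):
    """Add xi_j for every j in the radius-tau L1 ball around `point`
    (the S2 stream would instead add the single variate xi[point])."""
    for j in l1_neighborhood(point, tau):
        x_s1 += xi.get(j, 0)   # xi: grid point -> {-1, +1} variate
    return x_s1

# In k = 2 dimensions, |n(i, 1)| = 5 and |n(i, 2)| = 13.
print(len(l1_neighborhood((0, 0), 1)), len(l1_neighborhood((0, 0), 2)))
```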
Several earlier studies have verified (both analytically and experimentally) the effectiveness of the pseudorandom sketching techniques employed in Sections 5 and 6 in approximating join cardinalities and different vector norms; see, for example, Alon et al. [1996, 1999]; Cormode et al. [2002a, 2002b]; Dobra et al. [2002]; Gilbert et al. [2002a]; Indyk et al. [2000]; Indyk [2000]; Thaper et al. [2002]. Thus, the primary focus of our experimental study here is to quantify the average-case behavior of our embedding algorithm (TREEEMBED) in terms of the observed tree-edit distance distortion on realistic (both synthetic and real-life) XML data trees. As our findings demonstrate, the average-case behavior of our TREEEMBED algorithm is indeed significantly better than that predicted by the theoretical (worst-case) distortion bounds shown earlier in this article. Furthermore, our experimental results reveal several other properties and characteristics of our embedding scheme with interesting implications for its potential use in practice. Our implementation was carried out in C++; all experiments reported in this section were performed on a 1-GHz Intel Pentium-IV machine with 256 MB of main memory running RedHat Linux 9.0.

7.1 Implementation, Testbed, and Methodology

7.1.1 Implementation Details: Subtree Fingerprinting. A key point in our implementation was the use of Karp-Rabin (KR) probabilistic fingerprints [Karp and Rabin 1987] for assigning hash labels h(t) to valid subtrees t of the input tree T in a one-to-one manner (with high probability). The KR algorithm was originally designed for strings, so, in order to use it for trees, our implementation makes use of the flattened, parenthesized string representation of valid subtrees of T to obtain the corresponding tree fingerprint (treating parentheses as special delimiter labels in the underlying alphabet).
An important property of the KR string-fingerprinting scheme is its ability to easily produce the fingerprint h(s1 s2) of the concatenation of two strings s1 and s2 given only their individual fingerprints h(s1) and h(s2) [Karp and Rabin 1987]. This is especially important in the context of our data-stream processing algorithms since, clearly, we cannot afford to retain entire subtrees of the original (streaming) XML data tree T in order to compute the corresponding fingerprint in the current phase of our hierarchical tree parsing—the result would be space requirements linear in |T| for each parsing phase. Thus, we need to be able to compute the fingerprints of valid subtrees corresponding to nodes v ∈ T^i using only the fingerprints from the nodes in T^(i−1) that were contracted by TREEEMBED to obtain node v. This turns out to be nontrivial since, unlike the string case where the only possible options are left or right concatenation, TREEEMBED can merge the underlying subtrees of T in several different ways (Figures 3 and 4).

Fig. 11. Example of subtree fingerprint propagation. ("⊥" denotes an empty fingerprint and "∗" separates the left and right parts of an incomplete fingerprint.)

The solution we adopted in our TREEEMBED implementation is based on the idea of maintaining, for each node in the current phase v ∈ T^i, a collection of subtree fingerprints corresponding to the child subtrees of v in the original data tree T. Briefly, all of v's fingerprints start out in an empty state, and a fingerprint becomes complete (meaning that it contains the complete fingerprint for the corresponding child subtree) once the last node along that branch is merged into node v.
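The concatenation property exploited here can be illustrated with a simple polynomial fingerprint. This is only a sketch: the modulus, radix, and function names below are illustrative, and a production Karp-Rabin scheme would choose its parameters randomly.

```python
P = 2_147_483_647   # prime modulus (hypothetical choice)
R = 256             # radix

def fingerprint(s):
    """Karp-Rabin-style polynomial fingerprint: sum of ord(s[i]) * R^i mod P."""
    h = 0
    for ch in reversed(s):
        h = (h * R + ord(ch)) % P
    return h

def concat_fp(fp1, len1, fp2):
    """Fingerprint of s1+s2 from the fingerprints alone:
    h(s1 s2) = h(s1) + R^len(s1) * h(s2)  (mod P)."""
    return (fp1 + pow(R, len1, P) * fp2) % P

# Parenthesized subtree strings, as in the flattened tree representation.
s1, s2 = "(a(b)", "(c))"
assert fingerprint(s1 + s2) == concat_fp(fingerprint(s1), len(s1), fingerprint(s2))
```

This is exactly why incomplete fingerprints only need to keep a left and a right part: merging in the rest of a subtree reduces to modular concatenations of existing fingerprints.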
Child fingerprints of v can also be in an incomplete state, meaning that the subtree along the corresponding branch has only been partially merged into v; in order to correctly merge in the remaining subtree, an incomplete fingerprint consists of both a left and a right part that will eventually enclose the fingerprint propagated up by the rest of the subtree. The key to our solution is, of course, that we can always compute the fingerprint of a valid subtree at phase i by simple concatenations of the fingerprints from nodes in phase i − 1. (Note that a sequence of complete child fingerprints can always be concatenated to save space, if necessary.) Figure 11 illustrates the key ideas in our subtree fingerprinting scheme following a simple example scenario of edge contractions. To simplify the exposition, the figure uses parenthesized node-label strings instead of actual numeric KR fingerprints of these strings; again, the key here is that fingerprints for new nodes (obtained through contractions in the current phase) are computed by simply concatenating existing KR fingerprints. Fingerprinting and merging for subtrees rooted at unlabeled nodes (Figure 4) can also be easily handled in our scheme. The KR-fingerprinting scheme maps each string in an input collection of strings to a number in the range [0, p], where p is a prime number large enough to ensure that distinct input strings are mapped to distinct numbers with sufficiently high probability. Given that the total number of valid subtrees created during our hierarchical parsing of an input XML tree T is guaranteed to be only O(|T|), we chose the prime p for our subtree fingerprinting to be p = Θ(|T|²)—this clearly suffices to ensure high-probability one-to-one fingerprints in our scheme.

7.1.2 Experimental Methodology.
One of the main metrics used in our study to gauge the effectiveness of our tree-edit distance embedding scheme is the distance-distortion ratio, which is defined, for a given pair of XML data trees S and T, as the quantity DDR(S, T) = ||V(S) − V(T)||_1 / d(S, T) (where d(S, T) is the tree-edit distance of S and T, and V(S), V(T) are the vector embeddings computed by our TREEEMBED algorithm). Based on our initial experimental results, we also decided to employ a heuristic, normalized distance-distortion ratio metric NormDDR(S, T) in which the L1 vector distance ||V(S) − V(T)||_1 is normalized by the maximum of the depths of the parse trees (produced by TREEEMBED) for S and T; in other words, letting ρ(S), ρ(T) denote the number of TREEEMBED parsing phases for S and T (respectively), we define NormDDR(S, T) = DDR(S, T) / max{ρ(S), ρ(T)}. (We discuss the rationale behind our NormDDR metric later in this section.) Unfortunately, the problem of computing the exact tree-edit distance with subtree-move operations (i.e., d(S, T) above) turns out to be NP-hard—this is a direct implication of the recent NP-hardness result of Shapira and Storer [2002] for the simpler string-edit distance problem in the presence of substring moves. Furthermore, to the best of our knowledge, no other efficient approximation algorithms have been proposed for our tree-edit distance computation problem. Given the intractability of exact d(S, T) computation and the lack of other viable alternatives (the sizes of our data sets preclude any brute-force, exhaustive technique), we decided to base our experimental methodology on the idea of performing random tree-edit perturbations on input XML trees.
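A minimal sketch of such a redundancy-checked random perturbation script follows. The function name and operation labels are hypothetical, and only the duplicate-relabel and duplicate-move checks are enforced here; the full rule set is described next.

```python
import random

def rnd_edits(node_ids, n_ops, seed=0):
    """Grow a random edit script, skipping operations whose
    (node, operation) signature is already in the `seen` set
    (covers checks like "don't relabel the same node twice")."""
    rng = random.Random(seed)
    seen, script = set(), []
    while len(script) < n_ops:
        sig = (rng.choice(node_ids),
               rng.choice(["relabel", "move", "insert", "delete"]))
        if sig[1] in ("relabel", "move") and sig in seen:
            continue          # redundant: same node edited the same way
        seen.add(sig)
        script.append(sig)
    return script

script = rnd_edits(list(range(100)), 20, seed=7)
print(len(script))  # -> 20
```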
Briefly, given an XML tree T, we apply a script rndEdits() of random tree-edit operations (inserts, deletes, relabels, and subtree moves) on randomly selected nodes of T to obtain a perturbed tree rndEdits(T). Special care is taken in the creation of the rndEdits() edit script in order to avoid redundant operations. Specifically, the key idea is to grow the rndEdits() script incrementally, storing a signature for each randomly chosen (node, operation) combination inside a set data structure; then, once a new random (node, operation) pair is selected, we employ our stored set of signatures together with a simple set of rules to check that the new edit operation is not redundant before entering it into rndEdits(). Examples of such redundant-operation checks include the following: (1) do not relabel the same node more than once, (2) do not move the same subtree more than once, (3) do not delete a previously inserted node, (4) do not insert a node in exactly the same location as a previously deleted node, and so on. Even though our set of rules is not guaranteed to eliminate all possible redundancies in rndEdits(), we have found it to be quite effective in practice. Finally, we compute an (approximate) distance-distortion ratio DDR(T, rndEdits(T)), where d(T, rndEdits(T)) is approximated as d(T, rndEdits(T)) ≈ |rndEdits()|, that is, the number of tree-edit operations in our random script—since we explicitly try to avoid redundant edits, this is bound to be a reasonably good approximation of the true tree-edit distance (with moves) between the original and modified tree.

7.1.3 Data Sets. We used both synthetic and real-life XML data trees of varying sizes in our empirical study. These trees were obtained from (1) XMark [Schmidt et al.
2002], a synthetic XML data benchmark intended to model the activities of an on-line auction site (www.xml-benchmark.org/), and (2) SwissProt, a real-life XML data set comprising curated protein sequences and accompanying annotations (us.expasy.org/sprot/). We controlled the size of the XMark data trees using the "scaling factor" input to the XMark data generator. SwissProt is a fairly large real-life XML data collection (of total size over 165 MB)—in order to control the size of our input SwissProt trees, we used a simple sampling procedure that randomly selects a subset of top-level <Entry> nodes from SwissProt's full tree with a certain sampling probability (where, of course, larger sampling probabilities imply larger generated subtrees).

Fig. 12. TREEEMBED distance distortion ratios for small (a) XMark, and (b) SwissProt data trees.
Fig. 13. TREEEMBED distance distortion ratios for medium (a) XMark, and (b) SwissProt data trees.

For both data sets, we partitioned the set of input data trees into three broad classes: (1) a small class comprising trees with sizes approximately between 400 and 1200 nodes; (2) a medium class with trees of sizes approximately between 2000 and 20,000 nodes; and (3) a large class with trees of sizes approximately between 100,000 and 600,000 nodes. The number of random tree-edit operations in our edit scripts (|rndEdits()|) was typically varied between 20–200 for small trees, 20–600 for medium trees, and 200–20,000 for large trees; in order to smooth out randomization effects, our results were averaged over five distinct runs of our algorithms using different random seeds for generating the random tree-edit script. The numbers presented in the following section are indicative of our results on all data sets tested.

7.2 Experimental Results

7.2.1 TREEEMBED Distance Distortions for Varying Data-Set Sizes.
The plots in Figures 12, 13, and 14 depict several observed tree-edit distance-distortion ratios obtained through our TREEEMBED algorithm for (a) XMark and (b) SwissProt, in the case of small, medium, and large data trees (respectively).

Fig. 14. TREEEMBED distance distortion ratios for large (a) XMark, and (b) SwissProt data trees.

We plot the distance-distortion ratio as a function of the number of random tree edits in our edit script; thus, based on our discussion in Section 7.1, the x axis in our plots essentially corresponds to the true tree-edit distance value between the original and modified input trees. Our numbers clearly show that the distortions imposed by our L1 vector embedding scheme on the true tree-edit distance typically vary between a factor of 4–20 on small inputs, a factor of 5–30 on medium inputs, and a factor of 10–35 on the large XMark and SwissProt trees. It is important to note that these experimental distortion ratios are obviously much better (by an order of magnitude or more) than what the pessimistic worst-case bounds in our analysis would predict for TREEEMBED. More specifically, based on the size of the trees (n) in our experiments, it is easy to verify that our worst-case distortion bound of log² n log* n (even ignoring all the constant factors in our analysis and those in Cormode and Muthukrishnan [2002]) gives values in the (approximate) ranges 230–300 (for small trees), 360–600 (for medium trees), and 850–1,100 (for large trees); our experimental distortion numbers are clearly much better. An additional interesting finding in all of our experiments (with both XMark and SwissProt data) is that our tree-edit distance estimates based on the L1 difference of the embedding vector images consistently overestimate (i.e., expand) the actual distance; in other words, for all of our experimental runs, DDR(T, S) ≥ 1.
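The worst-case bound log² n log* n quoted above can be tabulated directly. The log base and constant factors used by the authors are not specified, so the values produced by the base-2 reading below are only indicative of the order of magnitude, not an exact reproduction of the quoted ranges.

```python
import math

def log_star(n):
    """Iterated base-2 logarithm: how many times log2 must be
    applied before the value drops to <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

def worst_case_distortion(n):
    # log^2 n * log* n, ignoring all constant factors.
    return (math.log2(n) ** 2) * log_star(n)

# Representative tree sizes from the small, medium, and large classes.
for n in (400, 20_000, 600_000):
    print(n, round(worst_case_distortion(n)))
```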
Furthermore, note that the range of our experimental (over)estimation errors appears to grow quite slowly over a wide range of values for the tree-size parameter n (for instance, when moving from trees with n ≈ 4,000 nodes to trees with n ≈ 600,000 nodes). These observations, along with a closer examination of some of our experimental results and the specifics of our TREEEMBED embedding procedure, motivate the introduction of our normalized distance-distortion ratio metric (discussed below).

7.2.2 A Heuristic for Normalizing the L1 Difference: The NormDDR Metric. Our experimental distance-distortion ratio numbers clearly demonstrate that our TREEEMBED algorithm satisfies the theoretical worst-case distortion guarantees shown in this article, typically improving on these worst-case bounds by well over an order of magnitude on synthetic and real-life data. Still, it is not entirely clear how to interpret the importance of these numbers for real-life, practical XML-processing scenarios. Distance overestimation ratios in the range of 5–30 are obviously quite high and could potentially lead to poor sketching-based query estimates (e.g., for a streaming XML similarity join). Based on our experimental observations and the details of our TREEEMBED algorithm, we now propose a simple heuristic rule for normalizing the L1 difference of the image vectors that could potentially be used to provide more useful tree-edit distance estimates. Consider an input tree T to our TREEEMBED procedure, and let ρ(T) denote the number of phases in the parsing of T. Now, assume that we effect a single edit operation (e.g., a node relabel) at the bottom level (i.e., tree T^0 = T) of our parsing to convert T to a new tree S.
It is not difficult to see that this one edit operation is going to "hit" (i.e., affect the corresponding valid subtree of) at least one node at each of the ρ(T) parsing phases of T, thus resulting in an L1 difference ||V(T) − V(S)||_1 on the order of ρ(T). In other words, even though d(T, S) = 1, just by going through the different parsing phases, the effect of that single edit operation on T is amplified by a factor of O(ρ(T)) in the resulting L1 distance. Generalizing from this simple scenario, consider a situation where T is modified by a relatively small number of edit operations (with respect to the size of T) applied to nodes randomly spread throughout T. The key observation here is that, since we have a small number of changes at locations spread throughout T, the effects of these changes on the different parsing phases of T will remain pretty much independent until near the end of the parsing; in other words, the nodes "hit"/affected by different edit operations will not be merged until the very late stages of our hierarchical parsing. Thus, under this scenario, we would once again expect the original edit distance to be amplified by a factor of O(ρ(T)) in the resulting L1 distance. A closer examination of some of our experimental results validated the above intuition. Remember that our rndEdits() script does in fact choose the target nodes for tree-edit operations randomly throughout the input tree T; furthermore, as expected, the impact of the parse-tree depth ρ(T) on the approximate tree-edit distance estimates is more evident when the number of edit operations in rndEdits() is relatively small compared to the size of T. This obviously explains the clear downward trend for the distance-distortion ratios in Figures 12–14.
Based on the above discussion, we propose normalizing the L1 distance of the image vectors in our embedding by the maximum parse-tree depth; that is, we estimate d(S, T) using the ratio ||V(S) − V(T)||_1 / max{ρ(S), ρ(T)}. Figure 15 depicts our experimental numbers for the corresponding normalized distance-distortion ratio NormDDR(S, T) = DDR(S, T) / max{ρ(S), ρ(T)} for several XMark and SwissProt data trees of varying sizes.

Fig. 15. TREEEMBED normalized distance distortion ratios for (a) XMark, and (b) SwissProt data trees.

Clearly, the normalized L1 distance gives us much better tree-edit distance estimates in our experimental setting, typically ranging between a factor of 0.5 and 2.0 of the true tree-edit distance. Such distortions could be acceptable for several real-life application scenarios, especially when dealing with data collections with well-defined, well-separated structural clusters of XML documents (as we typically expect to be the case in practice). Of course, we should stress that normalizing the L1 distance estimate by the parse-tree depth is only a heuristic solution that is not directly supported by the theoretical analysis of TREEEMBED (Section 4). This heuristic may work well for the case of a small number of randomly spread edit operations; however, when such operations are "clustered" in T or their number is fairly large with respect to |T|, dividing by ρ(T) may result in significantly underestimating the actual tree-edit distance (see the clear trend in Figure 15). Still, our normalization heuristic may prove useful in certain scenarios, for example, when dealing with streams of large XML documents that, based on some prior knowledge, cannot be radically different from each other (i.e., they are all within an edit-distance radius which is much smaller than the document sizes).

7.2.3 Effect of Tree Depth.
SwissProt is a fairly shallow XML data set (of maximum depth ≤ 5); thus, to study the potential effect of tree depth on the estimation accuracy of our embedding, we concentrate solely on trees produced from the XMark data generator. More specifically, our methodology is as follows. We generate large (400,000-node) XMark data trees and, for a given value of the tree-depth parameter, we prune all nodes below that depth. Then, we make sure that the resulting pruned trees T at different depths all have the same approximate target size t using the following iterative rule: while |T| is larger (smaller) than t, pick a random node x in T and delete (respectively, replicate) its subtree at x's parent (making, of course, sure that the tree resulting from this operation is not too far from our target size and that the depth of the tree does not change). Finally, we run our rndEdits() scripts on these pruned trees with varying numbers of specified edit operations, and measure the observed normalized and unnormalized distance-distortion ratios for each depth value. The plots in Figure 16 depict the observed unnormalized and normalized distance-distortion ratios as a function of tree depth for a pruned-tree target size of 100,000 nodes and for different numbers of tree-edit operations.

Fig. 16. TREEEMBED unnormalized (a) and normalized (b) distance distortion ratios as a function of tree depth for 100,000-node (pruned) XMark data trees.

Our experimental numbers clearly indicate that the estimation accuracy of our embedding scheme does not have any direct dependence on the depth of the input tree(s)—the key experimental parameters affecting the quality of our estimates appear to be the size of the tree and the number of tree-edit operations.

7.2.4 Using TREEEMBED for Approximate Document Ranking.
We now experimentally explore a different potential use for our XML-tree signatures in the context of approximate XML-document ranking based on the tree-edit distance similarity metric. In this setup, we are given a target XML document T and a number of incoming XML documents that are within different tree-edit distance ranges from T. The goal is to quickly rank incoming documents based on their tree-edit distance from T, such that if d(S1, T) < d(S2, T) then S1 is ranked "higher" than S2. Since computing the exact tree-edit distances can be very expensive computationally, we would like to have efficient, easy-to-compute tree-edit distance estimates that can be used to approximately rank incoming documents. Our idea is to use TREEEMBED to produce L1 vector signatures for both the target document T and each incoming document Si, and use the L1 distances ||V(T) − V(Si)||_1 for the approximate ranking of the Si's. The key observation here, of course, is that, for effective document ranking, it is crucial for our estimation techniques to preserve the relative ranking of individual tree-edit distances (rather than to accurately estimate each distance). Our experimental results demonstrate that our embedding schemes could provide a useful tool in this context. For our document-ranking experiments, we vary the size of the target document T between 10,000 and 200,000 nodes. For a given target T and tree-edit distance d, we generate 40 different trees Si at distance d from T (using different runs of our rndEdits() script). We vary the tree-edit distance d in three distinct ranges (10–50, 100–500, and 1000–3000) and, for a given value of d, we measure the observed range of (a) L1 distances ||V(T) − V(Si)||_1, and (b) normalized L1 distances ||V(T) − V(Si)||_1 / max{ρ(T), ρ(Si)}, over the corresponding set of 40 Si trees. Our experimental results for 50,000-node XMark and SwissProt data trees are shown in Table I.
Table I. Approximate Document-Ranking Results: 50K-Node XMark and SwissProt Data Trees

                     XMark Data                     SwissProt Data
  d(T, Si)    L1 Distance    Normalized      L1 Distance    Normalized
                             L1 Distance                    L1 Distance
  10          386–822        22.7–48.3       325–562        20.3–35.1
  20          766–1083       45–63.7         688–883        43–55.2
  30          994–1417       58.4–83.3       901–1212       56.3–75.7
  40          1318–1580      77.5–92.9       1258–1542      78.6–96.4
  50          1499–1915      88.1–112.6      1400–1667      87.5–104.2
  100         2581–3194      151.8–187.8     2461–2792      153.8–174.5
  200         4519–4992      265.8–293.6     4278–4831      267.4–301.9
  300         6181–6571      363.5–386.5     6040–6294      377.5–393.4
  400         7700–8411      452.9–494.7     7437–8126      464.8–507.9
  500         8940–9653      525.8–567.8     8696–9246      543.5–577.9
  1000        15278–16083    898.7–946       14615–15443    913.4–965.2
  1500        20933–21393    1231.3–1258.4   19357–20171    1209.8–1260.7
  2000        25114–25974    1477.29–1527.9  27599–27916    1724.9–1744.7
  2500        29537–30251    1737.4–1779.4   28562–29331    1785.1–1822.2
  3000        33228–34199    1954.6–2011.7   30545–31452    1909–1965.7

Note that in almost all cases, the approximate tree-edit distance ranges provided by our two L1-distance metrics for the Si sets (1) are completely disjoint, and (2) preserve the ranking of the corresponding true edit distances d(T, Si)—this, of course, implies that our L1 estimates correctly rank all the Si input trees in most of our test cases. The only situation where our observed L1-estimate ranges show some (typically small) overlap is for very small differences in tree-edit distance, that is, when |d(T, Si) − d(T, Sj)| = 10 in Table I. Thus, for such small edit-distance separations (remember that we are dealing with 50,000-node trees), it is possible for our L1 estimates to misclassify certain input documents; still, a closer examination of our results shows that, even in these cases, the percentage of misclassifications is always below 17.5% (i.e., at most 7 out of 40 documents).
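The ranking procedure evaluated here amounts to sorting incoming documents by the L1 distance between vector signatures; a minimal sketch follows, using hypothetical 4-dimensional signatures in place of real TREEEMBED images.

```python
def l1_dist(u, v):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def rank_by_signature(target_sig, candidate_sigs):
    """Return candidate indices ordered by increasing L1 distance
    between their vector signatures and the target's signature."""
    return sorted(range(len(candidate_sigs)),
                  key=lambda i: l1_dist(target_sig, candidate_sigs[i]))

# Hypothetical signatures; candidate 2 is closest to the target.
target = [5, 0, 3, 1]
cands = [[9, 4, 3, 1], [5, 0, 9, 9], [5, 1, 3, 1]]
print(rank_by_signature(target, cands))  # -> [2, 0, 1]
```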
It is worth noting that our approximate document-ranking setup is, in fact, closely related to a simple version of the approximate similarity-join scenarios discussed in Section 6. In a sense, our goal here is to correctly identify the "closest" approximate duplicates of a target document T in a collection of input documents Si that are within different tree-edit distances of T; these closest duplicates essentially represent the subset of Si documents that would join with T (for an appropriate setting of the similarity threshold to account for the distance distortion; see Theorem 6.1). Thus, assuming that the L1/AMS sketching techniques developed in Section 6 correctly preserve the L1-distance ranges of the underlying image vectors, our ranking results provide an indication of the percentages of false positives/negatives in the approximate similarity-join operation (based on the overlap between different distance ranges), and the required "distance separation" between document clusters in the joined XML streams to suppress such estimation errors.

7.2.5 Running Time and Space Requirements. Table II depicts the observed running times and memory footprints for our TREEEMBED embedding algorithm over XMark data trees of various sizes (the results for SwissProt are very similar and are omitted).

XML Stream Processing Using Tree-Edit Distance Embeddings

Table II. TREEEMBED Running Times and Memory Footprints: XMark Data Trees

  Tree Size    Document Size   TREEEMBED Running Time   TREEEMBED Memory Footprint
  20K nodes    1.1 MB          2.5 s                    3.0 MB
  50K nodes    2.9 MB          6.3 s                    7.9 MB
  100K nodes   5.7 MB          12.7 s                   16.7 MB
  150K nodes   8.9 MB          19.7 s                   25.0 MB
  200K nodes   11.7 MB         26.5 s                   33.3 MB
  250K nodes   14.5 MB         34.4 s                   41.9 MB
  300K nodes   17.4 MB         41.2 s                   49.1 MB
  350K nodes   20.5 MB         49.9 s                   57.9 MB
  400K nodes   23.5 MB         58.2 s                   66.4 MB
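As a quick sanity check on the footprint numbers in Table II, the ratio of memory footprint to document size can be computed directly (the script is ours, written against the published table):

```python
# (document size MB, memory footprint MB) pairs taken from Table II.
measurements = [(1.1, 3.0), (2.9, 7.9), (5.7, 16.7), (8.9, 25.0),
                (11.7, 33.3), (14.5, 41.9), (17.4, 49.1),
                (20.5, 57.9), (23.5, 66.4)]

# Footprint-to-input-size ratio for each measured tree.
factors = [mem / doc for doc, mem in measurements]
print(max(factors) <= 3.0)  # True
```

The ratio stays essentially constant across a 20x growth in input size, i.e., the footprint grows linearly in the document size with a constant factor below 3.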
We should, of course, note here that our current TREEEMBED implementation does not employ the small-space optimizations discussed in Section 5 (that is, we always build the full XML tree in memory); still, as our numbers show, the memory requirements of our scheme grow only linearly (with a small constant factor ≤ 3) in the size of the input document. Furthermore, our embedding algorithm gives very fast running times; for instance, our TREEEMBED code takes less than 1 min to build the L1 vector image of a 400,000-node XML tree. Thus, once again, compared to computationally expensive, exact tree-edit distance calculations, our techniques can provide a very efficient, approximate alternative.

8. CONCLUSIONS

In this article, we have presented the first algorithmic results on the problem of effectively correlating (in small space) massive XML data streams based on approximate tree-edit distance computations. Our solution relies on a novel algorithm for obliviously embedding XML trees as points in an L1 vector space while guaranteeing a logarithmic worst-case upper bound on the distance distortion. We have combined our embedding algorithm with pseudorandom sketching techniques to obtain novel, small-space algorithms for building concise sketch synopses and approximating similarity joins over streaming XML data. An empirical study with synthetic and real-life data sets has validated our approach, demonstrating that the behavior of our embedding scheme over realistic XML trees is much better than what would be predicted based on our worst-case distortion bounds, and revealing several interesting properties of our algorithms in practice. Our embedding result also has other important algorithmic applications, for example, as a tool for very fast, approximate tree-edit distance computations.

APPENDIX: ANCILLARY LEMMAS FOR THE UPPER-BOUND PROOF

In this section, we complete the upper bound proof for the distortion of our embedding algorithm.
Recall the terminology of Section 4.4. We have defined sets M^i and P^i as subsets of T^i and (T1 + T2)^i, respectively. We have assumed by induction that for every x ∈ T^i − M^i there exists a unique f(x) ∈ (T1 + T2)^i such that the trees x(T) and f(x)(T1 + T2) look identical. In other words, if we forget about the regions M^i and P^i, the remaining forests are basically the same. Now, we want to prove this fact for the next stage of parsing. In order to do this, we defined another region N^i which encloses M^i and in some way represents the region which can get influenced by M^i in the next stage of parsing. So we pick a node w ∈ T^i − N^i; let w′ be the corresponding node in (T1 + T2)^i. So we know that w(T) and w′(T1 + T2) are the same trees. Now, w gets absorbed in a node q in T^(i+1) and w′ in q′ in (T1 + T2)^(i+1). Our goal now is to show that q(T) and q′(T1 + T2) are identical trees. We will do this by following a natural procedure: we will show that w and w′ get parsed in exactly the same manner, that is, if w gets merged with a leaf, then the same thing happens to w′. But note that it is not enough to just show this fact. For example, if w and w′ get merged with leaves l and l′, we have to show that l(T) and l′(T1 + T2) are also identical subtrees. Thus, we first need to go through a set of technical lemmas, which show that the mapping f preserves the neighborhood of w as well. For example, we need to show facts like: the parent of w and the parent of w′ get associated by f as well. So, we need to explore the properties of f and first show that it preserves sibling relations and parent-child relations. Once we are armed with these lemmas, we just need to prove the following facts:

— If w is a leaf and is merged with its parent, then the same happens to w′.
— If w is a leaf and is merged with some of its leaf siblings, then w′ is also a leaf and gets merged with the corresponding siblings.
— If w has a leaf child which gets merged into it, then w′ also has a corresponding leaf child which gets merged into it.
— If w is a degree-2 node in a chain, and gets merged with some other such nodes, then the same fact applies to w′.

Clearly, the above facts will be enough to prove the result we want. As we mentioned before, we need to analyze some properties of the parsing algorithm and the function f, so that we can set up a correspondence between neighborhoods of w and w′. We proceed to do this first. We first show the connection between a node w and the associated tree w(T). The following fact is easy to see.

CLAIM A.1. Let x and y be two distinct nodes in T^i. x is a parent of y iff x(T) contains a node which is the parent of a node in y(T).

PROOF. The proof is by induction on i. Let us say the fact is true for i − 1. Suppose x is a parent of y in T^i. Let X be the set of nodes in T^(i−1) which got merged to form x. Define Y similarly. Then there must be a node in X which is the parent of a node in Y. The rest follows by induction on these two nodes. The reverse direction is similar.

LEMMA A.2. Suppose x ∈ T^i is a leaf. Then x(T) has the following property: if y ∈ x(T), then all descendants of y are in x(T).

PROOF. Suppose not. Then there exist nodes a, b ∈ T such that a is the parent of b, and yet a ∈ x(T), b ∉ x(T). Suppose b is in a node y in T^i. But then the claim above implies that x is not a leaf, a contradiction.

Now, we go on to consider the case when x has at least two children. Consider the nodes in T^(i−1) which formed x.
Only two cases can happen: either x was present in T^(i−1), or T^(i−1) contains a node u with at least two children and x is obtained by collapsing a leaf child into u. In either case, x corresponds to a unique node with at least two children in T^(i−1). Carrying this argument back all the way to T^0, we get the following fact.

CLAIM A.3. Let x be a node with at least two children in T^i. Then x(T) is a subtree of T which looks as follows: there is a unique node with at least two children, call it x0, such that all nodes in x(T) are descendants of x0. Further, if y ∈ x(T), y ≠ x0, then all descendants of y (in T) are also in x(T).

The proof of the fact above is again by induction, using the previous two claims.

CLAIM A.4. Let x and y be two nodes in T^i. Suppose x is a sibling of y. Then x(T) contains a node which is a sibling of a node in y(T). Conversely, if x(T) contains a node which is a sibling of a node in y(T), then x and y are either siblings or one of them is the parent of the other.

PROOF. Suppose x is a sibling of y. Let w be their common parent. w has at least two children. By Claim A.3, w(T) contains a node w0 such that if z ∈ w(T), z ≠ w0, then w(T) contains all descendants of z. Claim A.1 implies that there is a node a ∈ x(T) and a node b ∈ w(T) such that a is a child of b. We claim that b = w0. Indeed, otherwise all descendants of b, in particular a, would have been in w(T). Similarly, there is a node c in y(T) whose parent is w0. But then a and c are siblings in T.

Conversely, suppose there is a node in x(T) which is a sibling of a node in y(T). Let the common parent of these two nodes in T be w. Let w′ be the node in T^i containing w. If w′ is x, then x is the parent of y. So, assume w′ is not x or y. It follows from Claim A.1 that w′ is the parent of x and y. So, x and y are siblings.

Recall that we associate a set P^i with M^i. We already know that M^i is a connected subtree.
Of course, we cannot say the same for P^i because T1 + T2 itself is not connected. But we can prove the following fact.

LEMMA A.5. P^i restricted to T1^i or T2^i is a connected set.

PROOF. Suppose P^i is not connected. Then there exist two nodes x, y in the same component of (T1 + T2)^i such that x, y ∈ P^i but at least one internal node in the path between x and y is not in P^i. We can in fact assume that all internal nodes in this path are not in P^i (otherwise, we can replace x and y by two nodes on this path which are in P^i but none of the nodes between them are in P^i). Let this path be x, a1, . . . , an, y. Let bi ∈ T^i − M^i be such that f(bi) = ai.

First observe that bi is adjacent with bi−1. Indeed, suppose ai is the parent of ai−1 (the other case, where ai is a child of ai−1, is similar). By Claim A.1, there is a node in ai(T1 + T2) which is the parent of a node in ai−1(T1 + T2). Since ai(T1 + T2) = bi(T) and ai−1(T1 + T2) = bi−1(T), we see that there is a node in bi(T) which is the parent of a node in bi−1(T). Applying Claim A.1, we see that bi is adjacent with bi−1. Thus, b1, . . . , bn is a path.

Since x is adjacent with a1, there is a node x0 in x(T1 + T2) which is adjacent with a node c1 in a1(T1 + T2) (again, using Claim A.1). If x0 is not the root of T1, then x0 is also a node in T. So there is a node x′ ∈ T^i such that x0 ∈ x′(T). Clearly x′ ∈ M^i; otherwise, f(x′) must be x (since f(x′)(T1 + T2) and x(T1 + T2) would share the node x0 and so must be the same). Now, x′ must be adjacent with b1 (because x0 is adjacent with c1 in T; note that c1 is a node in T as well). The other case arises when x0 is the root of T1. So x0 is the parent of c1. But then v is the parent of c1 in T. Let x′ be the node containing v in the tree T^i. So x′ ∈ M^i and is adjacent with b1.
Thus, we get a node x′ ∈ M^i such that x′ is adjacent with b1. In fact, if x is the parent (child) of a1, then the same applies to x′ and b1 (and vice versa). Similarly, there is a node y′ ∈ M^i adjacent with bn. So if x′ and y′ are different nodes in M^i, then this contradicts the fact that M^i is connected. So x′ = y′. To avoid any cycles in T^i, it must be the case that b1 = · · · = bn. First observe that, in this case, x and y are children of a1. The only other possibility is that x is the parent of a1 and y a child of a1; but then x′ is the parent of b1 and y′ a child of b1, and so x′, y′ cannot be the same nodes. Thus, we have that x, y are children of a1 and x′ is a child of b1.

By Claim A.3, a1(T1 + T2) has a node a′ which is the parent of a node x0 ∈ x(T1 + T2) and a node y0 ∈ y(T1 + T2). By definition of x′ (and y′), x0, y0 ∈ x′(T). Further, a′ ∈ b1(T). Consider the largest integer j such that the nodes containing x0 and y0 in the tree T^j were different; call these nodes x1 and y1. Let b′ be the node containing a′. So x1 and y1 are children of b′. Also j < i. When we parse T^j, we merge x1 and y1 into a single node z ∈ T^(j+1). However, in (T1 + T2)^(j+1), the nodes containing x0 and y0 are different. So z ∈ M^(j+1). So one of the nodes of T^j which merged into z must have been in N^j. Since x1 and y1 are siblings and we merge them, it must be the case that they are leaves. Thus, the only nodes in T^j which get merged to form z are leaf children of b′. So one of these leaf children is in N^j. Since N^j is a connected set and has size at least 2, b′ ∈ N^j. But then b1 ∈ M^i, a contradiction.

We now show the fact that f preserves parent-child and sibling relations.

LEMMA A.6. Suppose x and y are two nodes in T^i − M^i. If x is the parent of y, then f(x) is the parent of f(y). If x is a sibling of y, and f(x), f(y) are in the same component of (T1 + T2)^i, then f(x) is a sibling of f(y). The converse of these facts is also true.

PROOF.
Suppose x is the parent of y. By Claim A.1, there is a node x′ ∈ x(T) and a node y′ ∈ y(T) such that x′ is the parent of y′ in T. x′ and y′ are nodes in T1 ∪ T2 as well. Unless x′ = v, x′ is the parent of y′ in T1 ∪ T2 as well. If x′ = v, x would be in M^i, which is not the case. Thus, x′ is the parent of y′ in T1 ∪ T2. Since f(x)(T1 + T2) = x(T) and f(y)(T1 + T2) = y(T), another application of Claim A.1 to T1 ∪ T2 implies that f(x) is a parent of f(y). The other fact can be proved similarly using Claim A.4. The converse can be shown similarly.

LEMMA A.7. Suppose x is a leaf node in T^i − M^i. Then f(x) is a leaf in (T1 + T2)^i − P^i. The converse is also true.

PROOF. Suppose f(x) has a child z′ in (T1 + T2)^i. So there are nodes a ∈ f(x)(T1 + T2) and b ∈ z′(T1 + T2) such that a is the parent of b in T1 ∪ T2. Let z be the node in T^i which contains b. Since f(x)(T1 + T2) = x(T), x should be the parent of z, which is a contradiction. So f(x) is a leaf as well. The converse can be shown similarly.

LEMMA A.8. Suppose a node u in M^i has at least two children. Let x be a child of u. If there is a node in x(T) which is an immediate sibling of a node in u(T), then x is a corner node.

PROOF. By Claim A.3, u(T) has a node u0 such that any other node in u(T) has all its descendants in u(T). Now, x(T) has a node x0 and u(T) has a node u1 such that x0 and u1 have a common parent. So this common parent must be u0. Consider the highest j for which the node in T^j containing u0 was distinct from the node in T^j containing u1; call these y and z, respectively. Clearly, j < i. While parsing T^j, we moved z up to its parent y. But then we must have marked all immediate siblings of z as corner nodes. In particular, the node containing x0 must have been a corner node. This implies that x must be a corner node.
We now state a useful property of the set M^i.

LEMMA A.9. Suppose x is a node in M^i which has at least two children and at least one child of x is not in M^i. Then x is either ci or vi. Similarly, if x is a node in N^i which has at least two children such that at least one of them is not in N^i, then x is either vi or the center node of N^i.

PROOF. The proof follows easily by induction on i. When i = 1, M^i is simply vi, so there is nothing to prove. So assume the induction hypothesis is true for some value of i. The only case when N^i − M^i will have a node with more than two children is when the center z of N^i is different from ci. If ci = vi, we have nothing to prove because the new center of M^(i+1) will be the node containing z. So assume that ci is not the same as vi. But then all children of ci must be in M^i (only then do we move the center of N^i to a new node). So the node containing ci in M^(i+1) will have all its children in M^(i+1). This proves the lemma.

Recall that w is defined to be a node in T^i − N^i and w′ = f(w) ∈ (T1 + T2)^i. We want to show that w and w′ are parsed in the same manner, that is, q(T) = q′(T1 + T2) (using the notation in Section 4.4). We now show that the two nodes will be parsed identically.

Fig. 17. Proof of Lemma A.11 when w″ ∈ P^i.

LEMMA A.10. Suppose w is a leaf (so w′ is a leaf as well). If w is a lone leaf child of its parent, then so is w′. The converse is also true.

PROOF. Suppose w is a lone leaf child of its parent, call it u. Let u′ be the parent of w′ in (T1 + T2)^i. Suppose, for the sake of contradiction, that w′ is not a lone child of u′. So it has an immediate, say left, sibling w″, which is also a leaf. We first argue that w″ ∈ P^i. Suppose not. Then w″ corresponds to a node x in T^i − M^i, that is, w″ = f(x). So x is also a leaf and a left sibling of w. But w is a lone child of u.
So it must happen that there is a nonleaf child of u, call it y, between x and w (because w is a lone leaf child). Observe that y ∈ M^i; otherwise f(y) would lie between w″ and w′. Since this holds for all nodes y between w and x, we should have added w to N^i (according to Rule (i)), a contradiction.

So it follows that w″ ∈ P^i. Let x be the immediate left sibling of w (if any); x ∉ M^i, otherwise we would have added w to N^i. So f(x) is a left sibling of w′ (if they are in the same component). So w″ lies between f(x) and w′. Since there is no node between x and w in T^i, all nodes in w″(T1 + T2) (we can think of these as nodes of T) must be part of u(T). But then, by Lemma A.8, w should have been a corner node and should have been added to N^i. The converse can be shown similarly.

LEMMA A.11. Suppose w is the leftmost lone leaf child of its parent u. Then u ∉ M^i. Further, w′ is the leftmost lone leaf child of its parent u′.

PROOF. Suppose u ∈ M^i. Then we would have added w to N^i, a contradiction. So u ∉ M^i. Define u′ = f(u). Then u′ is the parent of w′ (using Claim A.1). We already know that w′ is a lone leaf child of u′ (using the lemma above). Suppose it is not the leftmost such child. Let w″ be a lone leaf child of u′ which is to the left of w′. First we argue that w″ ∉ P^i. Suppose w″ ∈ P^i (see Figure 17). Consider the nodes in w″(T1 + T2). One of these nodes must be a child of a node in u′(T1 + T2). Let this node be z0. z0 is also a node in T. Further, z0 is a child of a node in u(T) because u(T) = u′(T1 + T2). Let z be the node in T^i containing z0. We claim that z ∈ M^i. Otherwise, f(z)(T1 + T2) = z(T). So z0 ∈ f(z)(T1 + T2). But z0 ∈ w″(T1 + T2). Then it must be the case that w″ = f(z). But then w″ ∉ P^i, a contradiction. So z ∈ M^i. Note that z is a child of u.

Let y be a descendant of z in T^i. Suppose y ∉ M^i.
Then f(y) is a node in (T1 + T2)^i such that y(T) = f(y)(T1 + T2). Since y is a descendant of u, f(y) is also a descendant of u′. If all nodes between f(y) and u′ were not in P^i, then we would get a path between y and u in T^i such that no internal node is in M^i. But this is not true (since z ∈ M^i). Thus, there is an internal node in this path which is in P^i; call this node z′.

So we have the situation that u′ has a leaf child w″ and another descendant z′ which are in P^i. But u′ ∉ P^i. This violates the fact that P^i is connected (Lemma A.5). So all descendants of z are in M^i. But then we would have added u to N^i and, consequently, w to N^i.

Thus, w″ ∉ P^i. So there is a node a ∈ T^i − M^i such that f(a) = w″. So a is a leaf as well. Suppose a ∈ N^i. Since N^i has at least two nodes and N^i is a connected set, it must be the case that u ∈ N^i. Note that we never add a node with at least two children to N^i using Rules (i)–(iv). So it must be the case that u is the center of N^i. But then we would have added w to N^i, a contradiction.

Thus, it follows that a ∉ N^i. But then, by Lemma A.10, a is a lone leaf child as well. This contradicts the fact that w is the leftmost such child.

The lemma above shows that if w is merged with its parent, then so is w′, and q(T) = q′(T1 + T2).

LEMMA A.12. Suppose w is a leaf node. Let its immediate siblings on the right (left) be w0 = w, w1, . . . , wk, where all nodes except perhaps wk are leaves. Further, suppose k < ℓ. Then w0, . . . , wk are not in M^i. Moreover, the immediate right (left) siblings of f(w) in (T1 + T2)^i are f(w0), f(w1), . . . , f(wk).

PROOF. Let u be the parent of w. If wj ∈ M^i, then w would be added to N^i. So none of the nodes w0, . . . , wk are in M^i. All we have to show now is that f(wj) is an immediate left sibling of f(wj+1). Suppose not. Let x′ be a sibling between f(wj) and f(wj+1). All nodes in x′(T1 + T2) must be in u(T).
Lemma A.8 now implies that wj must be a corner node. But then w would be in N^i, a contradiction again.

The lemma above shows that, if w was merged with a set of siblings, then w′ will be merged with the same set of siblings.

LEMMA A.13. Suppose w is a node with at least two children. Then f(w) also has at least two children. Let y be the leftmost lone leaf child of w. Then y ∉ M^i and f(y) is the leftmost lone child of f(w).

PROOF. Suppose w has a child u such that all descendants of u are in M^i. Then w would be added to N^i, a contradiction. Let u1 and u2 be two children of w; so not all descendants of u1 or u2 are in M^i. Since M^i is a connected set, it can contain at most one of u1 and u2. So suppose u1 ∉ M^i. Then f(u1) is a child of f(w). Further, let x be a descendant of u2 which is not in M^i. Then f(x) is a descendant of f(w), but not a descendant of f(u1). So, f(w) has at least two children.

We claim that y ∉ N^i. Indeed, if y ∈ N^i, then the fact that N^i is a connected set implies that N^i = {y}. So M^i = {y}. But then w would be in N^i, a contradiction. So y ∉ N^i. But then Lemma A.11 implies that f(y) is the leftmost lone leaf child of f(w) as well.

The lemma above implies that if w is merged with a leaf child, then w′ is also merged with the same leaf child.

LEMMA A.14. Suppose w is a degree-2 node. Let (w0 = w, w1, . . . , wk) be an ancestor-to-descendant path of length at most ℓ − 1, such that all nodes except perhaps wk are of degree 2. Then w0, . . . , wk−1 are in T^i − M^i. Further, f(w0), . . . , f(wk−1) forms a path of degree-2 nodes in (T1 + T2)^i. If wk is a degree-2 node, then wk ∉ M^i and f(wk) is adjacent with f(wk−1). If wk has degree at least 3, then the neighbor of f(wk−1) other than f(wk−2) is of degree at least 3 as well.

PROOF. First consider the case when w0, . . . , wk are nodes of degree 2. None of them can be in M^i; otherwise w would be in N^i.
Thus, f(w0), . . . , f(wk) is also a chain in (T1 + T2)^i. Now we need to argue that these nodes have degree 2 as well. But this is true from the fact that wi and f(wi) represent the same tree in T.

So suppose wk has at least two children. If wk ∉ M^i, then f(wk) also has at least two children. Hence assume that wk ∈ M^i. If w is an ancestor of wk, then w would be added to N^i. So assume w is a descendant of wk. Further, if all descendants of wk which are not descendants of wk−1 are in M^i, then w ∈ N^i. So wk has a descendant x such that x ∉ M^i and x is not a descendant of wk−1. Now consider the tree wk(T): since wk has at least two children, wk(T) has a unique highest node, z, such that all nodes in wk(T) are descendants of z. There is a node in wk−1(T) which is a child of z. Let w′k be the node in (T1 + T2)^i containing z. Then w′k is the parent of f(wk−1). Further, f(x) is a descendant of w′k, but not a descendant of f(wk−1). So w′k has degree at least 3.

The lemma above shows that if w is a degree-2 node which is merged with some other degree-2 nodes, then w′ will be merged with the same nodes. One final case remains.

LEMMA A.15. Suppose w is not merged with any node in T^i. Then w′ is also not merged with any node in (T1 + T2)^i.

PROOF. First consider the case when w is a leaf. Let the parent of w be u. We know that w′ is also a leaf. Let u′ be its parent. If w′ has siblings which are leaves, then Lemma A.10 implies that w is also not a lone leaf. But then w would be merged with a sibling, which is a contradiction. So w′ must be a lone child of u′. Suppose w′ is the leftmost lone leaf child of u′. Let x be the leftmost lone leaf child of u. We know that x ≠ w; otherwise w would merge with its parent. If x ∉ N^i, then Lemma A.11 implies that f(x) is the leftmost lone leaf child of u′, which is not true because f(w) = w′ and f is 1-1. So x ∈ N^i. Now we claim that u ∈ N^i.
Indeed, if not, the fact that N^i is a connected set implies that N^i = {x}. But M^i is a subset of N^i, and so M^i must also be {x}. But then u would be added to N^i. So we can assume u ∈ N^i. Now, Lemma A.9 implies that u is either vi or the center of N^i.

Suppose u is the center z of N^i. Let y be the leftmost lone leaf child of u which is not in M^i; y cannot be the same as w, because y gets added to N^i. But then f(y) is a lone leaf child to the left of w′, a contradiction.

So now assume u is the same as vi. Two cases can happen: either w is a removed descendant of vi or not, and in either case we can argue using Rule (ii) that w should be in N^i. Thus, we have shown that if w is a leaf node, then w′ also does not get merged with any other node.

Now suppose w′ has a lone leaf child x′. If x′ ∉ P^i, then there is a leaf node x ∉ N^i such that f(x) = x′. But then x is also a lone leaf child of w (Lemma A.10). So w would also merge with one of its leaf children, a contradiction.

Finally, suppose w′ is a degree-2 node, and either the parent or the child of w′ is of degree 2. First observe that w must also be of degree 2; otherwise it would have at least two children, and since f(w) = w′ and w′ has only one child, both these children would have to be in M^i. But then, due to the connectedness of M^i, w would also be in M^i, a contradiction. So w also has only one child, call it x. Let the child of w′ be x′. Suppose x′ has only one child. x must have at least two children; otherwise w would also be merged with a node. Now all but at most one child of x will be in M^i. If x has more than two children, then the fact that M^i is connected implies that x is in M^i as well. But then w would be added to N^i. So x has exactly two children: one of these is in M^i, the other not in M^i. Now x ∈ N^i; otherwise x′ would also have at least two children.
So x is the new center node of N^i. But then w would be added to the set N^i, a contradiction.

Now let the parent of w be y and that of w′ be y′. Suppose y′ has only one child. Then Lemma A.10 implies that y also has only one child. But then w would be merged with one of the nodes, a contradiction. Thus, w′ is not merged with any node as well.

Thus, we have demonstrated the invariant for T^(i+1) and (T1 + T2)^(i+1).

ACKNOWLEDGMENTS

Most of this work was done while the second author was with Bell Labs. The authors thank the anonymous referees for insightful comments on the article, and Graham Cormode for helpful discussions related to this work.
Kumar SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Madison, WI). 221– 232. BABCOCK, B., BABU, S., DATAR, M., MOTWANI, R., AND WIDOM, J. 2002. Models and issues in data stream systems. In Proceedings of the Twenty-First ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Madison, WI). 1–16. BAR-YOSSEF, Z., JAYRAM, T., KUMAR, R., SIVAKUMAR, D., AND TREVISAN, L. 2002. Counting dis- tinct elements in a data stream. In Proceedings of the 6th International Workshop on Ran- domization and Approximation Techniques in Computer Science (RANDOM’02), (Cambridge, MA). CHAKRABARTI, K., GAROFALAKIS, M., RASTOGI, R., AND SHIM, K. 2000. Approximate query processing using wavelets. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 111–122. CHAN, C.-Y., FELBER, P., GAROFALAKIS, M., AND RASTOGI, R. 2002. Efﬁcient ﬁltering of XML doc- uments with XPath expressions. In Proceedings of the Eighteenth International Conference on Data Engineering (San Jose, CA). CHARIKAR, M., CHEN, K., AND FARACH-COLTON, M. 2002. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages, and Programming (Malaga, Spain). CHARIKAR, M. AND SAHAI, A. 2002. Dimension reduction in the l 1 norm. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science (Vancouver, B.C., Canada). CORMODE, G., DATAR, M., INDYK, P., AND MUTHUKRISHNAN, S. 2002a. Comparing data streams using hamming norms (how to zero in). In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 335–345. CORMODE, G., INDYK, P., KOUDAS, N., AND MUTHUKRISHNAN, S. 2002b. Fast mining of massive tabular data via approximate distance computations. In Proceedings of the Eighteenth International Conference on Data Engineering (San Jose, CA). CORMODE, G. AND MUTHUKRISHNAN, S. 2002. The string edit distance matching problem with moves. 
In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, CA). DASU, T. AND JOHNSON, T. 2003. Exploratory Data Mining and Data Cleaning. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, NY. DIAO, Y., ALTINEL, M., FRANKLIN, M. J., ZHANG, H., AND FISCHER, P. 2003. Path sharing and predicate evaluation for high-performance XML ﬁltering. ACM Trans. Database Syst. 28, 4 (Dec.), 467–516. DIAO, Y. AND FRANKLIN, M. 2003. Query processing for high-volume XML message broker- ing. In Proceedings of the 29th International Conference on Very Large Data Bases (Berlin, Germany). DOBRA, A., GAROFALAKIS, M., GEHRKE, J., AND RASTOGI, R. 2002. Processing complex aggregate queries over data streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). 61–72. DOBRA, A., GAROFALAKIS, M., GEHRKE, J., AND RASTOGI, R. 2004. Sketch-based multi-query process- ing over data streams. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT’2004, Heraklion-Crete, Greece). FEIGENBAUM, J., KANNAN, S., STRAUSS, M., AND VISWANATHAN, M. 1999. An approximate L1 -difference algorithm for massive data streams. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science (New York City, NY). FLORESCU, D., KOLLER, D., AND LEVY, A. 1997. Using probabilistic information in data integration. In Proceedings of the 23rd International Conference on Very Large Data Bases (Athens, Greece). GANGULY, S., GAROFALAKIS, M., AND RASTOGI, R. 2004. Processing data-stream join aggregates using skimmed sketches. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT’2004, Heraklion-Crete, Greece). GAROFALAKIS, M., GEHRKE, J., AND RASTOGI, R. 2002. Querying and mining data streams: you only get one look (Tutorial). In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 
GAROFALAKIS, M. AND GIBBONS, P. B. 2001. Approximate query processing: Taming the terabytes (Tutorial). In Proceedings of the 27th International Conference on Very Large Data Bases (Roma, Italy).
GAROFALAKIS, M. AND KUMAR, A. 2003. Correlating XML data streams using tree-edit distance embeddings. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (San Diego, CA). 143–154.
GILBERT, A. C., GUHA, S., INDYK, P., KOTIDIS, Y., MUTHUKRISHNAN, S., AND STRAUSS, M. J. 2002a. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the 34th Annual ACM Symposium on the Theory of Computing (Montreal, P.Q., Canada).
GILBERT, A. C., KOTIDIS, Y., MUTHUKRISHNAN, S., AND STRAUSS, M. J. 2001. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the 27th International Conference on Very Large Data Bases (Rome, Italy).
GILBERT, A. C., KOTIDIS, Y., MUTHUKRISHNAN, S., AND STRAUSS, M. J. 2002b. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 454–465.
GRAVANO, L., IPEIROTIS, P. G., JAGADISH, H., KOUDAS, N., MUTHUKRISHNAN, S., AND SRIVASTAVA, D. 2001. Approximate string joins in a database (almost) for free. In Proceedings of the 27th International Conference on Very Large Data Bases (Rome, Italy).
GREENWALD, M. AND KHANNA, S. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (Santa Barbara, CA).
GUHA, S., JAGADISH, H., KOUDAS, N., SRIVASTAVA, D., AND YU, T. 2002. Approximate XML joins. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI).
GUPTA, A. AND SUCIU, D. 2003.
Stream processing of XPath queries with predicates. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (San Diego, CA).
INDYK, P. 2000. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (Redondo Beach, CA). 189–197.
INDYK, P. 2001. Algorithmic aspects of geometric embeddings. In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science (Las Vegas, NV).
INDYK, P., KOUDAS, N., AND MUTHUKRISHNAN, S. 2000. Identifying representative trends in massive time series data sets using sketches. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 363–372.
IOANNIDIS, Y. E. AND POOSALA, V. 1999. Histogram-based approximation of set-valued query answers. In Proceedings of the 25th International Conference on Very Large Data Bases (Edinburgh, Scotland).
JOHNSON, W. B. AND LINDENSTRAUSS, J. 1984. Extensions of Lipschitz mappings into Hilbert space. Contemp. Math. 26, 189–206.
KARP, R. M. AND RABIN, M. O. 1987. Efficient randomized pattern-matching algorithms. IBM J. Res. Devel. 31, 2 (Mar.), 249–260.
KNUTH, D. E. 1973. The Art of Computer Programming (Vol. 1: Fundamental Algorithms). Addison-Wesley, Reading, MA.
LAKSHMANAN, L. V. S. AND PARTHASARATHY, S. 2002. On efficient matching of streaming XML documents and queries. In Proceedings of the 8th International Conference on Extending Database Technology (EDBT'2002) (Prague, Czech Republic).
MANKU, G. S. AND MOTWANI, R. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 346–357.
MOTWANI, R. AND RAGHAVAN, P. 1995. Randomized Algorithms. Cambridge University Press, Cambridge, U.K.
NOLAN, J. P. 2004. Stable distributions: Models for heavy-tailed data.
Available online at http://academic2.american.edu/~jpnolan/stable/stable.html.
POLYZOTIS, N. AND GAROFALAKIS, M. 2002. Statistical synopses for graph-structured XML databases. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI).
POLYZOTIS, N., GAROFALAKIS, M., AND IOANNIDIS, Y. 2004. Approximate XML query answers. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (Paris, France).
SCHMIDT, A., WAAS, F., KERSTEN, M., CAREY, M. J., MANOLESCU, I., AND BUSSE, R. 2002. XMark: A benchmark for XML data management. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China).
SHAPIRA, D. AND STORER, J. A. 2002. Edit distance with move operations. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching (CPM'2002) (Fukuoka, Japan). 85–98.
THAPER, N., GUHA, S., INDYK, P., AND KOUDAS, N. 2002. Dynamic multidimensional histograms. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). 428–439.
UCHAIKIN, V. V. AND ZOLOTAREV, V. M. 1999. Chance and Stability: Stable Distributions and their Applications. VSP, Utrecht, The Netherlands.
UKKONEN, E. 1992. Approximate string matching with q-grams and maximal matches. Theoret. Comput. Sci. 92, 191–211.
VITTER, J. S. AND WANG, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, PA).
ZHANG, K. AND SHASHA, D. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 18, 6 (Dec.), 1245–1262.