Exchanging Intensional XML Data

TOVA MILO
INRIA and Tel-Aviv University
SERGE ABITEBOUL
INRIA
BERND AMANN
Cedric-CNAM and INRIA-Futurs
and
OMAR BENJELLOUN and FRED DANG NGOC
INRIA


XML is becoming the universal format for data exchange between applications. Recently, the emer-
gence of Web services as standard means of publishing and accessing data on the Web introduced
a new class of XML documents, which we call intensional documents. These are XML documents
where some of the data is given explicitly while other parts are defined only intensionally by means
of embedded calls to Web services.
    When such documents are exchanged between applications, one has the choice of whether or
not to materialize the intensional data (i.e., to invoke the embedded calls) before the document
is sent. This choice may be influenced by various parameters, such as performance and security
considerations. This article addresses the problem of guiding this materialization process.
    We argue that—like for regular XML data—schemas (à la DTD and XML Schema) can be used
to control the exchange of intensional data and, in particular, to determine which data should be
materialized before sending a document, and which should not. We formalize the problem and
provide algorithms to solve it. We also present an implementation that complies with real-life
standards for XML data, schemas, and Web services, and is used in the Active XML system. We
illustrate the usefulness of this approach through a real-life application for peer-to-peer news
exchange.
Categories and Subject Descriptors: H.2.5 [Database Management]: Heterogeneous Databases
General Terms: Algorithms, Languages, Verification
Additional Key Words and Phrases: Data exchange, intensional information, typing, Web services,
XML


This work was partially supported by EU IST project DBGlobe (IST 2001-32645).
This work was done while T. Milo, O. Benjelloun, and F. D. Ngoc were at INRIA-Futurs.
Authors’ current addresses: T. Milo, School of Computer Science, Tel Aviv University, Ramat Aviv,
Tel Aviv 69978, Israel; email: milo@cs.tau.ac.il; S. Abiteboul and B. Amann, INRIA-Futurs, Parc
Club Orsay-University, 4 Rue Jean Monod, 91893 Orsay Cedex, France; email: {serge.abiteboul,
bernd.amann}@inria.fr; O. Benjelloun, Gates Hall 4A, Room 433, Stanford University, Stanford,
CA 94305-9040; email: benjelloun@db.stanford.edu; F. D. Ngoc, France Telecom R&D and LRI,
38–40, rue du Général Leclerc, 92794 Issy-les-Moulineaux, France; email: Frederic.dangngoc@
rd.francetelecom.com.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is
granted without fee provided that the copies are not made or distributed for profit or commercial
advantage, the copyright notice, the title of the publication, and its date appear, and notice is given
that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to
redistribute to lists requires prior specific permission and/or a fee.
© 2005 ACM 0362-5915/05/0300-0001 $5.00



1. INTRODUCTION
XML, a self-describing semistructured data model, is becoming the standard
format for data exchange between applications. Recently, the use of XML doc-
uments where some parts of the data are given explicitly, while others consist
of programs that generate data, started gaining popularity. We refer to such
documents as intensional documents, since some of their data are defined by
programs. We term materialization the process of evaluating some of the pro-
grams included in an intensional XML document and replacing them by their
results. The goal of this article is to study the new issues raised by the exchange
of such intensional XML documents between applications, and, in particular,
how to decide which parts of the data should be materialized before the docu-
ment is sent and which should not.
   This work was developed in the context of the Active XML system
[Abiteboul et al. 2002, 2003b] (see also the Active XML homepage at
http://www-rocq.inria.fr/verso/Gemo/Projects/axml). The latter is centered
around the notion of Active XML documents, which are XML documents
where parts of the content are explicit XML data whereas other parts are
generated by calls to Web services. In the present article, we are only concerned with
certain aspects of Active XML that are also relevant to many other systems.
Therefore, we use the more general term of intensional documents to denote
documents with such features.
   To understand the problem, let us first highlight an essential difference be-
tween the exchange of regular XML data and that of intensional XML data.
In frameworks such as those of Sun1 or PHP,2 intensional data is provided
by programming constructs embedded inside documents. Upon request, all the
code is evaluated and replaced by its result to obtain a regular, fully mate-
rialized HTML or XML document, which is then sent. In other terms, only
extensional data is exchanged. This simple scenario has recently changed due
to the emergence of standards for Web services such as SOAP, WSDL,3 and
UDDI.4 Web services are becoming the standard means to access, describe and
advertise valuable, dynamic, up-to-date sources of information over the Web.
Recent frameworks such as Active XML, but also Macromedia MX5 and Apache
Jelly6 started allowing for the definition of intensional data, by embedding calls
to Web services inside documents.
   This new generation of intensional documents has a property that we view
here as crucial: since Web services can essentially be called from everywhere on
the Web, one does not need to materialize all the intensional data before sending
a document. Instead, a more flexible data exchange paradigm is possible, where
the sender sends an intensional document, and gives the receiver the freedom

1 See Sun’s Java server pages (JSP) online at http://java.sun.com/products/jsp.
2 See the PHP hypertext preprocessor at http://www.php.net.
3 See the W3C Web services activity at http://www.w3.org/2002/ws.
4 UDDI stands for Universal Description, Discovery, and Integration of Business for the Web. Go
online to http://www.uddi.org.
5 Macromedia Coldfusion MX. Go online to http://www.macromedia.com/.
6 Jelly: Executable xml. Go online to http://jakarta.apache.org/commons/sandbox/jelly.



to materialize the data if and when needed. In general, one can use a hybrid
approach, where some data is materialized by the sender before the document
is sent, and some by the receiver.
   As a simple example, consider an intensional document for the Web page
of a local newspaper. It may contain some extensional XML data, such as its
name, address, and some general information about the newspaper, and some
intensional fragments, for example, one for the current temperature in the
city, obtained from a weather forecast Web service, and a list of current art
exhibits, obtained, say, from the TimeOut local guide. In the traditional setting,
upon request, all calls would be activated, and the resulting fully materialized
document would be sent to the client. We allow for more flexible scenarios, where
the newspaper reader could also receive a (smaller) intensional document, or
one where some of the data is materialized (e.g., the art exhibits) and some is
left intensional (e.g., the temperature). A benefit that can be seen immediately
is that the user is now able to get the weather forecast whenever she pleases,
just by activating the corresponding service call, without having to reload the
whole newspaper document.
   Before getting to the description of the technical solution we propose, let us
first see some of the considerations that may guide the choice of whether or not
to materialize some intensional data:
— Performance. The decision of whether to execute calls before or after the data
  transfer may be influenced by the current system load or the cost of commu-
  nication. For instance, if the sender’s system is overloaded or communication
  is expensive, the sender may prefer to send smaller files and delegate as
  much materialization of the data as possible to the receiver. Otherwise, it
  may decide to materialize as much data as possible before transmission, in
  order to reduce the processing on the receiver’s side.
— Capabilities. Although Web services may in principle be called remotely from
  everywhere on the Internet, it may be the case that the particular receiver
  of the intensional document cannot perform them, for example, a newspa-
  per reader’s browser may not be able to handle the intensional parts of a
  document. And even if it does, the user may not have access to a particular
  service, for example, because of the lack of access rights. In such cases, it is
  compulsory to materialize the corresponding information before sending the
  document.
— Security. Even if the receiver is capable of invoking service calls, she may
  prefer not to do so for security reasons. Indeed, service calls may have side
  effects. Receiving intensional data from an untrusted party and invoking the
  calls embedded in it may thus lead to severe security violations. To overcome
  this problem, the receiver may decide to refuse documents with calls to ser-
  vices that do not belong to some specific list. It is then the responsibility of
  a helpful sender to materialize all the data generated by such service calls
  before sending the document.
— Functionalities. Last but not least, the choice may be guided by the applica-
  tion. In some cases, for example, for a UDDI-like service registry, the origin of
  the information is what is truly requested by the receiver, and hence service




                    Fig. 1. Data exchange scenario for intensional documents.


    calls should not be materialized. In other cases, one may prefer to hide the
    true origin of the information, for example, for confidentiality reasons, or be-
    cause it is an asset of the sender, so the data must be materialized. Finally,
  calling services might also involve some fees that should be paid by one or
    the other party.

   Observe that the data returned by a service may itself contain some inten-
sional parts. As a simple example, TimeOut may return a list of 10 exhibits,
along with a service call to get more. Therefore, the decision of materializing
some information or not is inherently a recursive process. For instance, for
clients who cannot handle intensional documents, the newspaper server needs
to recursively materialize the entire document before sending it.
   How can one guide the materialization of data? For purely extensional data,
schemas (like DTD and XML Schema) are used to specify the desired format
of the exchanged data. Similarly, we use schemas to control the exchange of
intensional data and, in particular, the invocation of service calls. The novelty
here is that schemas also entail information about which parts of the data are
allowed to be intensional and which service calls may appear in the documents,
and where. Before sending information, the sender must check if the data,
in its current structure, matches the schema expected by the receiver. If not,
the sender must perform the required calls for transforming the data into the
desired structure, if this is possible.
   A typical such scenario is depicted in Figure 1. The sender and the re-
ceiver, based on their personal policies, have agreed on a specific data exchange
schema. Now, consider some particular data t to be sent (represented by the
grey triangle in the figure). In fact, this document represents a set of equiv-
alent, increasingly materialized, pieces of information—the documents that
may be obtained from t by materializing some of the service calls (q, g, and f).

Among them, the sender must find at least one document conforming to the
exchange schema (e.g., the dashed one) and send it.
   This schema-based approach is particularly relevant in the context of Web
services, since their input parameters and their results must match particular
XML Schemas, which are specified in their WSDL descriptions. The techniques
presented in this article can be used to achieve that.
   The contributions of the article are as follows:

(1) We provide a simple but flexible XML-based syntax to embed service calls
    in XML documents, and introduce an extension of XML Schema for describ-
    ing the required structure of the exchanged data. This consists in adding
    new type constructors for service call nodes. In particular, our typing dis-
    tinguishes between accepting a concrete type, for example, a temperature
    element, and accepting a service call returning some data of this type, for
    example, () → temperature.
(2) Given a document t and a data exchange schema, the sender needs to decide
    which data has to be materialized. We present algorithms that, based on
    schema and data analysis, find an effective sequence of call invocations, if
    such a sequence exists (or detect a failure if it does not). The algorithms
    provide different levels of guarantee of success for this rewriting process,
    ranging from “sure” success to a “possible” one.
(3) At a higher level, in order to check compatibility between applications, the
    sender may wish to verify that all the documents generated by its appli-
    cation may be sent to the target receiver, which involves comparing two
    schemas. We show that this problem can be easily reduced to the previous
    one.
(4) We illustrate the flexibility of the proposed paradigm through a real-life
    application: peer-to-peer news syndication. We will show that Web services
    can be customized by using and enforcing several exchange schemas.

   As explained above, our algorithms find an effective sequence of call invoca-
tions, if one exists, and detect failure otherwise. In a more general context, an er-
ror may arise because of type discrepancies between the caller and the receiver.
One may then want to modify the data and convert it to the right structure,
using data translation techniques such as those provided by Cluet et al. [1998]
and Doan et al. [2001]. As a simple example, one may need to convert a temper-
ature from Celsius degrees to Fahrenheit. In our context, this would amount to
plugging (possibly automatically) intermediary external services to perform the
needed data conversions. Existing data conversion algorithms can be adapted
to determine when conversion is needed. Our typing algorithms can be used to
check that the conversions lead to matching types. Data conversion techniques
are complementary and could be added to our framework. But the focus here
is on partially materializing the given data to match the specified schema.
   The core technique of this work is based on automata theory. For presentation
reasons, we first detail a simplified version of the main algorithm. We then
describe a more dynamic, optimized one, that is based on the same core idea
and is used in our implementation.

   Although the problems studied in this article are related to standard typing
problems in programming languages [Mitchell 1990], they differ here due to
the regular expressions present in XML schemas. Indeed, the general problem
that will be formalized here was recently shown to be undecidable by Muscholl
et al. [2004]. We will introduce a restriction that is practically founded, and
leads to a tractable solution.
   All the ideas presented here have been implemented and tested in the context
of the Active XML system [Abiteboul et al. 2002] (see also the Active XML homepage
at http://www-rocq.inria.fr/verso/Gemo/Projects/axml). This
system provides persistent storage for intensional documents with embedded
calls to Web services, along with active features to automatically trigger these
services and thus enrich/update the intensional documents. Furthermore, it al-
lows developers to declaratively specify Web services that support intensional
documents as input and output parameters. We used the algorithms described
here to implement a module that controls the types of documents being sent to
(and returned by) these Web services. This module is in charge of materializing
the appropriate data fragments to meet the interface requirements.
   In the following, we assume that the reader is familiar with XML and its typ-
ing languages (DTD or XML Schema). Although some basic knowledge about
SOAP and WSDL might be helpful to understand the details of the implemen-
tation, it is not necessary.
   The article is organized as follows: Section 2 describes a simple data model
and schema specification language and formalizes the general problem. Ad-
ditional features for a richer data model that facilitate the design of real life
applications are also introduced informally. Section 3 focuses on difficulties that
arise in this context, and presents the key restriction that we consider. It also
introduces the notions of “safe” and “possible” rewritings, which are studied in
Sections 4 and 5, respectively. The problem of checking compatibility between in-
tensional schemas is considered in Section 6. The implementation is described
in Section 7. Then, we present in Section 8 an application of the algorithms
to Web services customization, in the context of peer-to-peer news syndication.
The last section studies related works and concludes the article.

2. THE MODEL AND THE PROBLEM
To simplify the presentation, we start by formalizing the problem using a simple
data model and a DTD-like schema specification. More precisely, we define the
notion of rewriting, which corresponds to the process of invoking some service
calls in an intensional document, in order to make it conform to a given schema.
Once this is clear, we explain how things can be extended to provide the features
ignored by the first simple model, and in particular we show how richer schemas
are taken into account.

2.1 The Simple Model
We first define documents, then move to schemas, before formalizing the key
notion of rewritings, and stating the results obtained in this setting, which will
be detailed in the following sections.




                      Fig. 2. An intensional document before/after a call.

   2.1.1 Simple Intensional XML Documents. We model intensional XML
documents as ordered labeled trees consisting of two types of nodes: data nodes
and function nodes. The latter correspond to service calls. We assume the existence
of some disjoint domains: 𝒩 of nodes, ℒ of labels, ℱ of function names,7
and 𝒟 of data values. In the sequel we use v, u, w to denote nodes, a, b, c to
denote labels, and f, g, q to denote function names.
   Definition 2.1. An intensional document d is an expression (T, λ), where
T = (N, E, <) is an ordered tree. N ⊂ 𝒩 is a finite set of nodes, E ⊂ N × N are
the edges, < associates with each node in N a total order on its children, and
λ : N → ℒ ∪ ℱ ∪ 𝒟 is a labeling function for the nodes, where only leaf nodes
may be assigned data values from 𝒟.
   Nodes with a label in ℒ ∪ 𝒟 are called data nodes while those with a label
in ℱ are called function nodes. The children subtrees of a function node are
the function parameters. When the function is called, these subtrees are passed
to it. The return value then replaces the function node in the document. This
is illustrated in Figure 2, where data nodes are represented by circles, func-
tion nodes are represented by squares, and data values are quoted. Here, the
Get Temp Web service is invoked with the city name as a parameter. It returns a
temp element, which replaces the function node. An example of the actual XML
representation of intensional documents is given in Section 7. Observe that
the parameter subtrees and the return values may themselves be intensional
documents, that is, contain function nodes.
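   To make the tree model concrete, here is a small illustrative sketch in Python (ours, not the Active XML implementation; the concrete parameter and return values are made up, and underscores in function names are just our identifier convention). It represents an intensional document as an ordered labeled tree of data and function nodes, and shows how invoking a call replaces the function node by the returned subtree, as in Figure 2.

    # Minimal sketch of the document model of Definition 2.1 (illustrative only).
    class Node:
        def __init__(self, label, children=None, is_function=False):
            self.label = label                # label in L, data value in D, or name in F
            self.children = children or []    # ordered children (call parameters for functions)
            self.is_function = is_function    # True for function (service call) nodes

    def materialize(parent, call_node, returned_forest):
        """Replace a function node among parent's children by the forest it returned."""
        i = parent.children.index(call_node)
        parent.children[i:i + 1] = returned_forest

    # A document in the spirit of Figure 2(a): a newspaper with an embedded Get Temp call.
    newspaper = Node("newspaper", [
        Node("title", [Node("...")]),
        Node("date", [Node("...")]),
        Node("Get_Temp", [Node("city", [Node("Paris")])], is_function=True),
        Node("TimeOut", [Node("Paris")], is_function=True),
    ])

    # Invoking Get Temp returns a temp element, which replaces the call node (Figure 2(b)).
    call = newspaper.children[2]
    materialize(newspaper, call, [Node("temp", [Node("5")])])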
   2.1.2 Simple Schemas. We next define simple DTD-like schemas for in-
tensional documents. The specification associates (1) a regular expression with
each element name that describes the structure of the corresponding elements,
and (2) a pair of regular expressions with each function name that describe the
function signature, namely, its input and output types.
   Definition 2.2. A document schema s is an expression (L, F, τ), where L ⊂
ℒ and F ⊂ ℱ are finite sets of labels and function names, respectively; τ is a
function that maps each label name l ∈ L to a regular expression over L ∪ F or
to the keyword "data" (for atomic data), and maps each function name f ∈ F
to a pair of such expressions, called the input and output type of f and denoted
by τin(f) and τout(f).

7 We assume in this model that function names identify Web service operations. This translates in
the implementation to several parameters (URL, operation name, . . . ) that allow one to invoke the
Web services.


    For instance, the following is an example of a schema, which we refer to as (∗):
             data:
             τ(newspaper)    = title.date.(Get Temp | temp).(TimeOut | exhibit∗)
             τ(title)        = data
             τ(date)         = data
             τ(temp)         = data
             τ(city)         = data
             τ(exhibit)      = title.(Get Date | date)

             functions:
             τin(Get Temp)   = city
             τout(Get Temp)  = temp
             τin(TimeOut)    = data
             τout(TimeOut)   = (exhibit | performance)∗
             τin(Get Date)   = title
             τout(Get Date)  = date

  We next define the semantics of a schema, that is, the set of its instances.
To do so, if R is a regular expression over L ∪ F , we denote by lang(R) the
regular language defined by R. The expression lang(data) denotes the set of
data values in D.

   Definition 2.3. An intensional document t is an instance of a schema s =
(L, F, τ ) if for each data node (respectively function node) n ∈ t with label
l ∈ L (respectively l ∈ F ), the labels of n’s children form a word in lang(τ (l ))
(respectively in lang(τin (l ))).
   For a function name f ∈ F, a sequence t1, . . . , tn of intensional trees is an
input instance (respectively output instance) of f if the labels of the roots form
a word in lang(τin(f)) (respectively lang(τout(f))), and all the trees are instances8
of s.
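   As a small illustration of Definition 2.3 (our own sketch, not taken from the paper): since the test only looks, node by node, at the word formed by the children's labels, one can encode every label and function name as a single character and check each content model with an ordinary regular expression. The one-letter codes, the underscores in names, and the document encoding below are our conventions; the document mirrors Figure 2(a) and the schema is (∗).

    import re

    # Documents as (label, is_function, [children]) triples; data values are plain strings.
    doc = ("newspaper", False, [
        ("title", False, ["..."]),
        ("date", False, ["..."]),
        ("Get_Temp", True, [("city", False, ["Paris"])]),
        ("TimeOut", True, ["Paris"]),
    ])

    # One-character codes for labels and function names; "#" stands for an atomic data value.
    CODE = {"title": "t", "date": "d", "temp": "m", "city": "c", "exhibit": "e",
            "performance": "p", "Get_Temp": "G", "TimeOut": "T", "Get_Date": "D"}

    TAU = {"newspaper": "td(G|m)(T|e*)", "title": "#", "date": "#", "temp": "#",
           "city": "#", "exhibit": "t(D|d)"}                      # content models of (∗)
    TAU_IN = {"Get_Temp": "c", "TimeOut": "#", "Get_Date": "t"}   # function input types

    def is_instance(node):
        """Definition 2.3: every node's children word must match its content model."""
        if isinstance(node, str):                                 # atomic data value
            return True
        label, is_function, children = node
        word = "".join("#" if isinstance(c, str) else CODE[c[0]] for c in children)
        model = TAU_IN[label] if is_function else TAU[label]
        return (re.fullmatch(model, word) is not None
                and all(is_instance(c) for c in children))

    print(is_instance(doc))   # True: the document is an instance of the schema (∗)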

   It is easy to see that the document of Figure 2(a) is an instance of the schema
of (∗), but not of a schema with τ′ identical to τ above, except for

               (∗∗) τ′(newspaper) = title.date.temp.(TimeOut | exhibit∗).

However, since τout(Get Temp) = temp, the document can always be turned into
an instance of the schema of (∗∗), by invoking the Get Temp service call and
replacing it by its return value. On the other hand, consider a schema with τ′′
identical to τ, except for

                      (∗∗∗) τ′′(newspaper) = title.date.temp.exhibit∗.

According to its signature, a call to TimeOut may also return performance
elements. Therefore, in general, the document may not become an instance
of the schema of (∗ ∗ ∗). However, it is possible that it becomes one (if

8 Like in DTDs, every subtree conforms to the same schema as the whole document.


TimeOut returns a sequence of exhibits). The only way to know is to call the
service.
   This type of “on-line” testing is fine if the calls have no side effects or do not
cost money. If they do, we might want to warn the sender, before invoking the
call, that the overall process may not succeed, and see if she wants to proceed
nevertheless.

   2.1.3 Rewritings. When the proper invocation of service calls leads for sure
to the desired structure, we say that the rewriting is safe, and when it only pos-
sibly does, that this is a possible rewriting. These notions are formalized next.
   Definition 2.4. For a tree t, we say that t →_v t′ if t′ is obtained from t by
selecting a function node v in t with some label f and replacing it by an arbitrary
output instance of f.9 If t →_v1 t1 →_v2 t2 · · · →_vn tn we say that t rewrites into tn,
denoted t →* tn. The nodes v1, . . . , vn are called the rewriting sequence. The set
of all trees t′ s.t. t →* t′ is denoted ext(t).
   Note that in the rewriting process, the replacement of a function node v by
its output instance is independent of any function semantics. In particular,
we may replace two occurrences of the same function by two different output
instances. Stretching the semantics somewhat, this can be interpreted as if the
value returned by the function changes over time. This captures the behavior
of real life Web services, like a temperature or stock exchange service, where
two consecutive calls may return a different result.
   Definition 2.5. Let t be a tree and s a schema. We say that t possibly rewrites
into s if ext(t) contains some instance of s. We say that t safely rewrites into s
either if t is already an instance of s, or if there exists some node v in t such
that all trees t′ where t →_v t′ safely rewrite into s.
   The fact that t safely rewrites into s means that we can be sure, without
actually making any call, that we can choose a sequence of calls that will turn
t into an instance of s. For instance, the document of Figure 2(a) safely rewrites
into the schema of (∗∗) but only possibly rewrites into that of (∗ ∗ ∗).
   Finally, to check compatibility between applications, we may want to check
whether all documents generated by one application (e.g., the sender applica-
tion) can be safely rewritten into the structure required by the second applica-
tion (e.g., the agreed data exchange format).
   Definition 2.6. Let s be a schema with some distinguished label r called
the root label. We say that s safely rewrites into another schema s′ if all the
instances t of s with root label r rewrite safely into instances of s′.
   For instance, consider the schema of (∗) presented above with newspaper as
the root label. This schema safely rewrites into the schema of (∗∗) but does not
safely rewrite into the one of (∗ ∗ ∗).

9 By replacing the node by an output instance we mean that the node v and the subtree rooted at it
are deleted from t, and the forest trees t1 , . . . , tn of some output instance of f are plugged at the
place of v (as children of v’s parent).


  2.1.4 The Results. Going back to the data exchange scenario described in
the introduction, we can now specify our main contributions:
(1) We present an algorithm that tests whether a document t can be safely
    rewritten into some schema s and, if so, provides an effective rewriting
    sequence, and
(2) When safe rewriting is not possible, we present an algorithm that tests
    whether t may be possibly rewritten into s, and finds a possibly successful
    rewriting sequence, if one exists.
(3) We also provide an algorithm for testing, given two schemas, whether one
    can be safely rewritten into the other.

2.2 A Richer Data Model
In order to make our presentation clear, and to simplify the definition of doc-
ument and schema rewritings, we used a very simple data model and schema
language. We will now present some useful extensions that bring more expres-
sive power, and facilitate the design of real life applications.

   2.2.1 Function Patterns. The schemas we have seen so far specify that a
particular function, identified by its name, may appear in the document. But
sometimes, one does not know in advance which functions will be used at a
given place, and yet may want to allow their usage, provided that they con-
form to certain conditions. For instance, we may have several editions of the
newspaper of Figure 2(a), for different cities. A common intensional schema for
such documents should not require the use of a particular Get temp function,
but rather allow for a set of functions, which have a suitable signature: they
should accept as a single parameter a city element, and return a temperature el-
ement, as previously defined in τ . The particular weather forecast service that
will be used may depend on the city and be, for instance, retrieved from some
UDDI service registry. One may also want to enforce some security policies, for
example, be allowed to specify that the allowed functions should return only
extensional results.
   To specify such sets of functions, we use function patterns. A function pattern
definition consists of a boolean predicate over function names and a function
signature. A function belongs to the pattern if its name satisfies the Boolean
predicate and its signature is the same as the required one. A more liberal defi-
nition would be one that requires that the function signature only be subsumed
by the one specified in the definition, that is, that every instance of the former
be also an instance of the latter. This is possible but computationally heavier,
since it entails checking inclusion between the tree languages defined by the
two schemas.
   In terms of implementation, one can assume that this new Boolean predicate
is implemented as a Web service that takes a function name as input and
returns true or false.
   To take this feature into account in our model, we define 𝒫 to be a domain of
function pattern names. A schema s = (L, F, P, τ) now also contains, in addition
to the elements and functions, a set of function patterns P ⊂ 𝒫. τ associates with

each function pattern p ∈ P a signature and a Boolean predicate over function
names. We can now, for instance, write a schema for our local newspapers as
           τ(newspaper)    = title.date.(Forecast | temp).(TimeOut | exhibit∗)
           τname(Forecast) = UDDIF ∧ InACL
           τin(Forecast)   = city
           τout(Forecast)  = temp
This schema enforces the fact that the function used in the document has the
proper signature and satisfies the Boolean predicates UDDIF and InACL. The
first predicate (UDDIF ) is a Web service that checks if the given function (ser-
vice) is registered in some particular UDDI registry. Predicate InACL then
verifies if the caller has the necessary access privileges for executing the given
function (calling the service). More generally, any Web service that allows the
verification of some property of the particular function node in the document
(here, the weather forecast service), possibly with respect to some contextual
information (e.g., the identity of the caller, the system date, etc.) can be used.
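   In code, a function pattern is simply a pair (name predicate, signature), and membership is the test described above. The sketch below is our own illustration: the predicate bodies and the service names are invented stand-ins for the UDDIF and InACL Web services.

    from dataclasses import dataclass
    from typing import Callable, Tuple

    @dataclass
    class FunctionPattern:
        name_predicate: Callable[[str], bool]   # e.g., the conjunction of UDDIF and InACL
        signature: Tuple[str, str]              # (input type, output type), as regular expressions

        def admits(self, func_name, func_signature):
            # A function belongs to the pattern if its name satisfies the predicate
            # and its signature is the same as the required one.
            return self.name_predicate(func_name) and func_signature == self.signature

    def uddif(name):    # stand-in: is the service registered in some UDDI registry?
        return name in {"Get_Temp", "Other_Forecast"}

    def in_acl(name):   # stand-in: does the caller have access rights to the service?
        return name != "Other_Forecast"

    forecast = FunctionPattern(lambda n: uddif(n) and in_acl(n), ("city", "temp"))
    print(forecast.admits("Get_Temp", ("city", "temp")))        # True
    print(forecast.admits("Other_Forecast", ("city", "temp")))  # False: fails InACL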

   2.2.2 Wildcards. Together with function patterns, one may also use wild-
cards in schemas. Their use is already common for data. In XML Schema, the
keyword any expresses the fact that a certain part of a document may con-
tain an arbitrary element, attribute, or even an unconstrained subtree. XML
Schema further allows one to restrict wildcards to (or exclude from them) cer-
tain domains of data, based on their namespace.10 This extends naturally to
our context. We consider the namespace of a function node in an intensional
document to be the namespace of the called Web service.11 Therefore, we can
use wildcards to allow certain document parts to contain arbitrary sub-trees
with arbitrary functions, or restrict them to (respectively exclude from them)
certain classes of functions.
   We believe that the combination of wildcards and function patterns provides
a good level of flexibility to describe the structure of documents. For instance,
one may specify that the temperature is obtained from an arbitrary function
that returns a correct temp element, but may take any argument, being data
or function call.

   2.2.3 Restricted Service Invocations. Another interesting extension is the
following: we assumed so far that all the functions appearing in a document
may be invoked in a rewriting, in order to match a given schema. This is not
always the case, for the same reasons as mentioned in the Introduction (secu-
rity, cost, access rights, etc.). The logic of rewritings will have to take this into
account, essentially by considering, among all possible rewritings, only a proper
subset. For that, the function names/patterns in the schema can be partitioned
into two disjoint groups of invocable and noninvocable ones. A legal rewriting is
then one that invokes only invocable functions. The notions of safe and possible

10 The W3C XML activity. Go online to www.w3.org/XML.
11 Which is described in its WSDL description and, in our model, is one of the components of the
function name.


rewritings extend naturally to consider only legal rewritings. Since we are in-
terested here only in such rewritings, whenever we talk in the sequel about a
function invocation, we mean an invocable one.

    2.2.4 XML, XML Schema, and WSDL. The simple XML trees considered
above ignore a number of features of XML, such as attributes, and use a single
domain for data values. A richer setting may be obtained by using the full
fledged XML data model (see footnote 10). Similarly, richer schemas may be
defined by adopting XML Schema (see footnote 10), rather than using the simple
DTD-like schema used above. Indeed, our implementation is based on the full
XML model and on an extension of XML Schema.
    In our prototype, function calls embedded in XML documents are represented
by special function elements that identify the Web services to be invoked and
specify the value of input parameters. XML Schemas are enriched for inten-
sional documents (to form XML Schema_int) by function and function pattern
definitions. In both cases, things are very much along the lines of the sim-
ple model we used above. We will see an example and more details of this in
Section 7.
    Function signatures are usually specified by service providers as WSDL def-
initions. We similarly extend WSDL to allow the use of XML Schema_int instead
of just XML Schema for type specifications, and we term this extended language
WSDL_int.
    While intensional XML documents use a standard XML syntax, XML
Schema_int schemas do not comply with the XML Schema syntax. The exten-
sion is minimal, and very much along the lines of the simple syntax we used
above. We will also see an example and more details in Section 7. Note that
this is not the case for WSDL, since its specification does not enforce the use
of a specific schema language. Therefore WSDL_int documents are valid WSDL
documents.

3. EXCHANGING INTENSIONAL DATA
We start by considering document rewriting. Schema rewriting is considered
later in Section 6.
   Given a document t that the sender wishes to send, and a data exchange
schema s, the sender needs to rewrite t into s. A possible process is the following:
(1) Check if t safely rewrites to s and if so, find a rewriting sequence, namely,
    a sequence of functions that need to be invoked to transform t into the
    required structure (preferably the shortest or cheapest one, according to
    some criteria).
(2) If a safe rewriting does not exist, check whether at least t may rewrite to s.
    If it is acceptable to do so (the sender accepts that the rewriting may fail),
    try to find a successful rewriting sequence if one exists (preferably with the
    least side effects on the path to find it, and at the least cost).
A variant is to combine safe and possible rewritings. For instance, one could con-
sider a mixed approach that first invokes some function calls and then attempts
from there to find safe rewritings. There are many alternative strategies.
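   The sender-side control flow just described can be sketched as follows (ours; the two helpers are placeholders for the algorithms of Sections 4 and 5, and in the "possible" case the sequence is in reality discovered while invoking calls rather than computed up front).

    def safe_rewriting_sequence(document, schema):
        """Return a sequence of calls whose invocation surely yields an instance of
        schema, or None if no safe rewriting exists (Section 4)."""
        raise NotImplementedError

    def possible_rewriting_sequence(document, schema):
        """Try to find a rewriting that may yield an instance of schema; it can fail
        even after some calls have been made (Section 5)."""
        raise NotImplementedError

    def choose_rewriting(document, exchange_schema, accept_possible_failure=False):
        seq = safe_rewriting_sequence(document, exchange_schema)          # step (1)
        if seq is not None:
            return ("safe", seq)
        if accept_possible_failure:                                       # step (2)
            seq = possible_rewriting_sequence(document, exchange_schema)
            if seq is not None:
                return ("possible", seq)
        return ("failure", None)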

   We will first consider safe document rewriting, then move to possible rewrit-
ing, and finally consider the mixed approach. As in the previous section, to
simplify the presentation, we first consider the problems in the context of the
simple data model defined above. Then in Section 7 we will show that the pro-
posed solutions naturally extend to richer data/schemas and in particular to
the context of full fledged XML and XML Schema.
   Before presenting solutions, let us first explain some of the difficulties that
one encounters when attempting to rewrite a document to a desired exchange
schema. While the examples given in the previous sections were rather simple—
and one could determine by a simple observation of the document which service
calls need to be materialized—things may be much more complex in general.
We explain next why this is the case and present a restriction that will make
the problem tractable.

3.1 Going Back and Forth
The rewriting sequence may depend on the answers being returned by the
functions: we may call one function at some place in the document, and then
decide, possibly based on its answer, that another function in the new data or
in a different part of the document needs to be called, and so on. In general,
this may force us to analyze the same portion of the document many times,
reexamining the same function call again and again, deciding at each iteration
whether, based on the answers returned so far, the function now needs to be
called or not. Such an iterative process may naturally be very expensive. We
thus restrict our attention here to a simpler class of “one-pass” left-to-right
rewritings12 where, for each node, the children are processed from left to right,
and once a child function is invoked, no further invocations are applied to its
left-hand sibling functions (i.e., successive children invocations are limited to
the new children functions possibly returned by the call, plus the right-hand
siblings). This restriction also applies to the results of function calls, which are
also processed in a left-to-right manner.
   Observe that, in general, with this restriction, one can miss a successful
rewriting that is not left-to-right. In all the real-life examples that we consid-
ered, left-to-right rewritings were not limiting.

3.2 Infinite Search Space
The essence of safe rewriting is that it succeeds no matter what specific an-
swers, among the possible ones, the invoked functions return. The domain of
the possible answers of each function is determined by its output type. Since the
regular expression defining this type may contain starred (“*”) subexpressions,
the domain is infinite, and the safe rewriting should account for each possible
element in this infinite domain. Moreover, the result of a service call may con-
tain intensional data, namely, other function calls. In general the number of
such new functions may be unbounded. For instance, consider a Get Exhibits

12 One could similarly choose right-to-left.


function, with output type
                                τout(Get Exhibits) = Get Exhibit∗.
  When Get Exhibits is invoked, an arbitrarily large number of Get Exhibit
functions may be returned, and one has to check for each of the occurrences
whether this particular function call needs to be invoked and whether, after
the invocation, the document can still be (safely) rewritten into the desired
schema.

3.3 Recursive Calls
As explained above, when a function is invoked, the returned data may itself
contain new calls. To conform to the target schema, these calls may need to be
triggered as well. The answer again may contain some new calls, etc. This may
lead to infinite computations. Observe that such recursive situations do occur
in practice. For example, a search engine Web service may return, for a given
keyword, some document URLs plus (possibly) a function node for obtaining
more answers. Calling this function, one can obtain a new list and perhaps
another function node, etc. If the target schema requires plain XML data, we
need to repeatedly call the functions until all the data has been obtained. In
this example, and often in general, one may want to bound the recursion. This
suggests the following definition and our corresponding restriction:
  Definition 3.1. For a rewriting sequence t →_v1 t1 · · · →_vn tn, we say that a
function node vj depends on a function node vi if vj ∈ ti but vj ∉ ti−1 (namely, if
the node vj was returned by the invocation of the function vi).
  We say that a rewriting sequence is of depth k if the dependency graph among
the nodes contains no paths of length greater than k.
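   For illustration (our code): since a function node is returned by at most one invocation, the dependency graph of Definition 3.1 is a forest, and the depth of a rewriting sequence is simply the length of its longest chain of dependencies.

    from functools import lru_cache

    def rewriting_depth(invoked_nodes, returned_by):
        """returned_by maps a function node to the node whose invocation returned it
        (absent for function nodes already present in the original document)."""
        @lru_cache(maxsize=None)
        def depth(v):
            parent = returned_by.get(v)
            return 0 if parent is None else 1 + depth(parent)
        return max((depth(v) for v in invoked_nodes), default=0)

    # Example: f was in the original document, g was returned by f, and h by g:
    # the longest dependency path has length 2, so this is a 2-depth rewriting.
    print(rewriting_depth(["f", "g", "h"], {"g": "f", "h": "g"}))   # 2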

  The restriction. The restriction that we will impose below is the following:
We will consider only k-depth left-to-right rewritings.

   Note that while this restriction limits the search space, the latter remains
infinite, due to the starred subexpressions appearing in the schema. However,
under this restriction, we can exhibit a finite representation (based on au-
tomata) of the search space and use automata-based techniques to solve the
safe rewriting problem.
   Even with this restriction, the framework is general enough to handle most
practical cases. The problem of arbitrary safe rewriting (without the left-to-
right k-depth restriction) was recently shown to be undecidable [Muscholl et al.
2004]. Further work by the same authors [Muscholl et al. 2004; Segoufin 2003]
has shown that the left-to-right safe rewriting problem is actually decidable,
without the k-depth restriction, but the corresponding algorithms have a much
higher complexity (EXPTIME or 2EXPTIME, depending on whether the target
language is deterministic or not)—and thus are mostly of theoretical interest.

4. SAFE REWRITING
In this section, we present an algorithm for k-depth safe rewriting.

    We are given a document tree t and a schema s0 = (L0 , F0 , τ0 ) describing the
signature of all the functions in the document (as well as the elements/functions
used in these signatures). This corresponds to having a WSDL description for
each service being used, which is a normal requirement for Web services. We
are also given a data exchange schema s = (L, F, τ ), and our goal is to safely
rewrite t into s (with a k-depth left-to-right rewriting).
    To simplify, we assume that function types are the same in s0 and s, including
definitions of the corresponding subelements. This is reasonable since the func-
tion definitions represent the WSDL description of the functions, as given by
the service providers. While this assumption simplifies the rewriting process,
it is not essential. The algorithm can be extended to handle distinct signatures.
    For clarity, we decompose the presentation of the algorithm into three parts:

(1) The first part explains how to deal with function parameters. The main
    point is that, since the parameters may themselves contain other function
    calls (with parameters), the tree rewriting starts from the deepest function
    calls and recursively moves upward.
(2) The second part explains how the rewriting in each such iteration is per-
    formed. The key observation is that this can be achieved by traversing the
    tree from top to bottom, handling one node (and its direct children) at a time.
(3) Finally, the third and most intricate part explains how each such node,
    and its direct children, is handled. In particular, we show how to decide
    which of the functions among these children needs to be invoked in order
    to make the node fit the desired structure.

For presentation reasons, we give here a simplified version of the actual algo-
rithm used in the implementation. To optimize the computation, a more dy-
namic variant, based on the same idea, is used there. We explain the main
principles of this variant in Section 7.

4.1 Rewriting Function Parameters
To invoke a function, its parameters should be of the right type. If they are
not, they should be rewritten to fit that type. When rewriting the parameters,
again, the functions appearing in them can be invoked only if their own pa-
rameters are (or can be rewritten into) the expected input type. We thus start
from the “deepest” functions, that is, those having no function occurrences in
the parameters, and recursively move upward:

— For the deepest functions, we verify that their parameters are indeed in-
  stances of the corresponding input types. If not, the rewriting fails.
— Then moving upward, we look at a function f and its parameters. All the func-
  tions appearing in these parameters were already handled—namely, their
  parameters can be safely rewritten to the appropriate type. We thus ignore
  the parameters of these lower level calls (together with all the functions in-
  cluded in them) and just try to safely rewrite f ’s own parameters into the
  required structure. If this is not possible, the rewriting fails, for the same
  reason as above.

At the end of this process we know that all the outermost function calls in t are
fine. We can thus ignore their parameters (and whatever functions appear
in them) and need to safely rewrite t into s by invoking only these outermost calls.
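   This bottom-up phase can be sketched as follows (our illustration, reusing the Node representation sketched earlier in Section 2.1.1; safe_rewrite_forest is a placeholder for the word-level rewriting of Section 4.3 applied to a forest of parameters).

    def safe_rewrite_forest(forest, input_type):
        """Placeholder: can the forest be safely rewritten to match input_type,
        invoking only the outermost calls occurring in it? (Section 4.3)"""
        raise NotImplementedError

    def parameters_ok(node, tau_in):
        """Handle function nodes bottom-up (deepest first): a call is usable only if its
        own parameters, the nested calls having already been handled, can be safely
        rewritten into its input type. Returns False as soon as one call cannot be fixed."""
        if not all(parameters_ok(c, tau_in) for c in node.children):
            return False
        if node.is_function:
            return safe_rewrite_forest(node.children, tau_in[node.label])
        return True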

4.2 Top Down Traversal
In each iteration of the above recursive procedure we are given a tree (or a
forest) where the parameters of all the outermost functions have already been
handled, and we need to safely rewrite the tree (forest) by invoking only these
outermost functions. To do that we can traverse the tree (forest) top down, treating
at each step a single node and its immediate children.
   Consider a node n whose children labels form a word w. Note that the subtree
rooted at n can be safely rewritten into the target schema s = (L, F, τ) if and
only if (1) w can be safely rewritten into a word in lang(τ(label(n))), and (2)
each of n's children subtrees can itself be safely rewritten into an instance of
s. Note that since we assumed that s0 and s agree on function types, we only
need to rewrite the original children of n and not those that are returned by
function invocations. Therefore, we can start from the root and, going down,
for each node n try to safely rewrite the sequence of its children into a word
in lang(τ (label(n))). The algorithm succeeds if all these individual rewritings
succeed.
   The safe rewriting of a word w involves the invocation of functions in w
and (recursively) new functions that are added to w by those invocations. To
conclude the description of our rewriting algorithm we thus only need to explain
how this is done.
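   In the same spirit, the top-down traversal can be sketched as follows (our illustration, again over the Node representation; safe_rewrite_word stands for the word-level algorithm of Figure 3 and Section 4.4): the subtree rooted at a node is handled by rewriting the word of its children's labels against the node's content model and then recursing into the children.

    def safe_rewrite_word(word, content_model, output_types):
        """Placeholder for the algorithm of Figure 3 / Section 4.4."""
        raise NotImplementedError

    def safely_rewrites(node, tau, output_types):
        if node.is_function:
            return True    # its parameters were already handled by the bottom-up phase
        if not node.children:
            return True    # nothing to check for leaves
        word = [c.label for c in node.children]                 # children labels of the node
        return (safe_rewrite_word(word, tau[node.label], output_types)          # condition (1)
                and all(safely_rewrites(c, tau, output_types)                   # condition (2)
                        for c in node.children))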

4.3 Rewriting the Children of a Node n
This is the most intricate part of the algorithm. We are given a word w—the
sequence of labels of n’s children—and our goal is to rewrite w to fit the target
schema. Namely, we need to rewrite w so that it becomes a word in the regular
language R = τ (label(n)). The rewriting process invokes functions in w and
(recursively) new functions that are added to w by those invocations. Each
such invocation changes w, replacing the function occurrence by its returned
answer. The possible changes that the invocation of a function fi may cause
are determined by the output type R_fi = τout(fi) of fi.13 For instance, if w =
a1, a2, . . . , fi, . . . , am, invoking fi changes w into some w′ = a1, a2, . . . , b1, . . . ,
bk, . . . , am, where b1, . . . , bk ∈ lang(R_fi).
    Since the functions signatures, as well as the target schema, are given in
terms of regular expressions, it is convenient to reason about them, and about
the overall rewriting process, by analyzing the relationships between their cor-
responding finite state automata. We assume some basic knowledge of regular
languages and finite state automata, and use in our algorithm standard no-
tions such as the intersection and complement of regular languages and the
Cartesian product of automata. For basic material, see for instance Hopcroft
and Ullman [1979].

13 Recall from the discussion above that the input parameters can be ignored.





                           Fig. 3. Safe rewriting of w into R.


   Given the word w, the output types R_f1, . . . , R_fn of the available functions,
and the target regular language R, the algorithm in Figure 3 tests if w can be
safely rewritten into a word in R. Then, if the answer is positive, the algorithm
presented in Section 4.4 finds a safe rewriting sequence.
   We give the intuition behind the first algorithm next. To illustrate, we use
the newspaper document in Figure 2(a). Assume that we look at the root news-
paper node. Its children labels form the word w = title.date.Get Temp.TimeOut.
Assume that we want to find a safe rewriting for this word into a word in the
regular language τ′(newspaper) of the schema of (**), namely,

                     R = title.date.temp.(TimeOut | exhibit∗).

   The process of rewriting involves choosing some functions in w and replacing
them by a possible output; then choosing some other functions (which might
have been returned by the previous calls) and replacing them by their output,
and so on, up to depth k. For each function occurrence we have two choices:
either to leave it untouched, or to replace it by some word in its output type.




                    Fig. 4. The A_w^1 automaton from the newspaper document.




                      Fig. 5. The complement automaton Ā for schema (**).


The automaton A_w^k constructed in steps 5–10 of the algorithm represents pre-
cisely all the words that can be generated by such a k-depth rewriting process.
The fork nodes are the nodes where a choice (i.e., invoking the function or
not) exists, and the two fork options represent the possible consequent steps
in the automaton, depending on which of the two choices was made. Going
back to the above example, Figure 4 shows the 1-depth automaton A_w^1 for the
word w = title.date.Get Temp.TimeOut, with the signature of the Get Temp
and TimeOut functions defined as in Section 2. q2 and q3 are the fork nodes
and their two outgoing edges represent their fork options for Get Temp and
TimeOut, respectively. An ε edge represents the choice of invoking the function
while a function edge represents the choice not to invoke it.
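   To make this construction tangible, here is a sketch (ours, with illustrative names and a dictionary encoding of automata) of how A_w^1 could be built for k = 1, assuming the output types of the functions are given as already-constructed automata: every position of w contributes one transition of a chain, and every function position additionally becomes a fork node, with a function-labeled edge (do not invoke) and an ε edge through a private copy of the automaton of its output type (invoke).

    from itertools import count

    fresh = count()   # used to rename states when copying an output-type automaton

    def build_a1w(w, output_nfas):
        """Build A_w^1. An automaton is a dict with keys states/delta/initial/finals,
        where delta maps (state, symbol) to a set of states and "" stands for epsilon."""
        chain = [f"q{i}" for i in range(len(w) + 1)]
        states, delta, fork_nodes = set(chain), {}, set()

        def add(src, sym, dst):
            delta.setdefault((src, sym), set()).add(dst)

        for i, symbol in enumerate(w):
            src, dst = chain[i], chain[i + 1]
            add(src, symbol, dst)                     # option 1: leave the call un-invoked
            if symbol in output_nfas:                 # a function position: src is a fork node
                fork_nodes.add(src)
                nfa = output_nfas[symbol]
                copy = {s: f"{s}#{next(fresh)}" for s in nfa["states"]}   # private copy
                states |= set(copy.values())
                add(src, "", copy[nfa["initial"]])    # option 2 (epsilon edge): invoke it
                for (s, a), targets in nfa["delta"].items():
                    for t in targets:
                        add(copy[s], a, copy[t])
                for f in nfa["finals"]:
                    add(copy[f], "", dst)             # splice the output back into the chain
        return {"states": states, "delta": delta, "initial": chain[0],
                "finals": {chain[-1]}, "fork_nodes": fork_nodes}

    # Example: the output type of Get Temp is a single temp element (TimeOut omitted here).
    get_temp_out = {"states": {"s0", "s1"}, "initial": "s0", "finals": {"s1"},
                    "delta": {("s0", "temp"): {"s1"}}}
    a1w = build_a1w(["title", "date", "Get_Temp", "TimeOut"], {"Get_Temp": get_temp_out})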
   Suppose first that we want to verify that all possible rewritings lead to a
“good” word, that is, that they belong to the target language R. To put things
in regular language terms, the intersection of the language of A_w^k, consisting of
these words, with the complement of the target language R should be empty.
A standard way to test that the intersection of two regular languages is empty
is to (i) construct an automaton Ā for the complement of the language R, (ii)
build a Cartesian product automaton A^× = A_w^k × Ā for the two automata A_w^k
and Ā, and (iii) check whether it accepts no words.
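   As an aside, steps (ii) and (iii) admit a very short generic implementation (our sketch, using the same dictionary representation as above and assuming ε transitions have been eliminated first): explore the product by a breadth-first search; the intersection is nonempty exactly when a pair of accepting states is reachable.

    from collections import deque

    def intersection_is_empty(a, b):
        """Is L(a) ∩ L(b) empty? a and b are epsilon-free automata given as dicts
        with keys states/delta/initial/finals (delta: (state, symbol) -> set of states)."""
        start = (a["initial"], b["initial"])
        seen, queue = {start}, deque([start])
        while queue:
            qa, qb = queue.popleft()
            if qa in a["finals"] and qb in b["finals"]:
                return False                   # some word is accepted by both automata
            for (src, sym), targets in a["delta"].items():
                if src != qa:
                    continue
                for ta in targets:
                    for tb in b["delta"].get((qb, sym), set()):
                        if (ta, tb) not in seen:
                            seen.add((ta, tb))
                            queue.append((ta, tb))
        return True                            # no accepting pair of states is reachable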
   The Cartesian product automaton of A_w^k and Ā is built in step 11 of the
algorithm. To continue with the above example, the complement automaton
for the regular language R = τ′(newspaper) of the schema of (**) is given in
Figure 5. The accepting states are p0, p1, p2, and p6. For brevity we use “*”
to denote all possible alphabet transitions besides those appearing in other
outgoing edges. The Cartesian product automaton A^× = A_w^1 × Ā (where A_w^1 and
Ā are the automata of Figures 4 and 5, respectively) is given in Figure 6. The
initial state is [q0, p0] and the final accepting one is [q4, p6].




                      Fig. 6. The Cartesian product automaton A^×.




                  Fig. 7. The complement automaton Ā for schema (***).


   Note, however, that, when searching for a safe rewriting, one does not need to
verify that all possible rewritings lead to a “good word,” that is, that none of the
words in A_w^k belongs to Ā. We only have to verify that for each function, there
is some fork option (i.e., invoking the function or not) that, if taken, will not
lead to an accepting state. Since we are looking for left-to-right safe rewritings,
we need to check that, traversing the input from left to right, at least one such
“good” fork option exists for each function call on the way. The marking of nodes
in steps 15–17 of the algorithm achieves just that. Recall that we required in
step 4 that the complement automaton Ā be complete. This is precisely what
guarantees that all the fork nodes/options of A_w^k are recorded in A^× and makes
the above marking possible.
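   The marking itself is a simple fixpoint computation, sketched below with our own encoding (successors maps every state of A^× to its successor states, and for a fork node these successors are exactly its two fork options): accepting states start out marked, a regular state becomes marked when some successor is marked, a fork node becomes marked only when all of its options are marked, and a safe rewriting exists iff the initial state remains unmarked. In effect this solves a small two-player game, where the rewriter picks the fork options and the possible outputs of the functions pick the remaining moves.

    def marked_states(states, successors, fork_nodes, accepting):
        """Fixpoint of the marking rules described above (steps 15-17, our encoding)."""
        marked = set(accepting)
        changed = True
        while changed:
            changed = False
            for q in states:
                if q in marked:
                    continue
                succ = successors.get(q, set())
                if q in fork_nodes:
                    bad = bool(succ) and succ <= marked   # every fork option is already bad
                else:
                    bad = bool(succ & marked)             # some successor is already bad
                if bad:
                    marked.add(q)
                    changed = True
        return marked

    def safe_rewriting_exists(initial, states, successors, fork_nodes, accepting):
        return initial not in marked_states(states, successors, fork_nodes, accepting)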
   The marking for our particular example is illustrated in Figure 6. The colored
nodes are the marked ones. As can be seen, the fork nodes [q2, p2] and [q3, p3] are
not marked. For the first node, this is because its ε fork option is not marked. For
the second one, it is due to the unmarked TimeOut fork option. Consequently,
the initial state is not marked either, and there is a safe rewriting of the
newspaper element to the schema of (**). We will see in Section 4.4 how to find
this rewriting.
   For another example, consider the schema of (***). Here, a newspaper
is required to have the structure conforming to the regular expression
title.date.temp.exhibit∗. The complement automaton Ā for this language is
given in Figure 7. To test whether it is possible to safely rewrite our news-
paper document into this schema, we construct a Cartesian product automaton
A^× = A_w^1 × Ā (with A_w^1 as in Figure 4 and Ā as in Figure 7). A^× is given in
Figure 8.




                           Fig. 8. The Cartesian product automaton A^×.

   As one can see, in this case, the two fork nodes [q2, p2] and [q3, p3] have
both their fork options marked. Consequently, the initial state is marked as
well and there is no safe rewriting of w into the schema of (***). Note that
this is precisely what our intuitive discussion from Section 2 indicated: the
invocation of TimeOut may return performance elements, hence the result
may not conform to the desired structure.
   The following theorem states the correctness of our algorithm.
   THEOREM 4.1. The above algorithm returns true if and only if a k-depth left-
to-right safe rewriting exists.
   PROOF. To prove correctness we have to show that (i) when the algorithm
returns a negative answer, a safe rewriting indeed does not exist, and (ii) when
the answer is positive, there exists a safe rewriting.

     Notations. We will use the following notations:
— For an automaton X and a state q ∈ X, we denote by L(X, q) the language
  accepted by X when making q the initial state. This is a subset of all suffixes
  of words accepted by the original automaton X.
— In the automaton A×, we use A0 to denote the subautomaton “originating”
  from Aw, and use A j, 0 < j ≤ k, to denote the subautomaton “originating”
  from some A f added to A^k_w to represent the possible outputs of f, at the jth
  iteration of its construction. More formally, these are the projections of A×
  on nodes [q, q′] such that q belongs to Aw for A0, and to A f for A j. Note that,
  in general, several automata are added in the jth iteration. To simplify the
  notation we use A j to denote any representative of this set.
— Given a subautomaton A j, an initial (respectively final) state of A j is a state
  [q, q′] such that q is an initial (respectively final) state of Aw/A f i.

  Completeness. We start by proving that the algorithm is complete, that is,
that if it answers negatively, no k-depth left-to-right safe rewriting of Aw exists.
  We first number the nodes of A× based on the order in which they got marked
by the algorithm. For a regular node, its assigned number should be greater
than the one of its marked successor that caused its marking. For a fork node
the assigned number should be greater than all the numbers of the nodes that
cause its marking. It is easy to come up with such a numbering by following

the algorithm and using a counter that is incremented by one each time a node
gets marked.
  We also need the following lemma.
     LEMMA 4.2. If a state [q, q′] of A j is marked, then there exists a finite path
from [q, q′] to a marked final state [p, p′] of A j, such that all the nodes on the
path belong to A j, are marked, and have decreasing numbers. Moreover, either
[p, p′] is a final state of A0, or there is an outgoing edge from it to a marked
state of some A j−1, with a smaller number.
   PROOF. First, by definition of marking, there exists a finite marked path
from [q, q′] to a final state of A×, where the nodes have decreasing numbers.
If [q, q′] is in A0, note that the final state of A× is also a final state of A0.
Otherwise, by construction of A×, such a path must go through a marked
final state of A j and continue, via an edge, to a marked state in A j−1 with a
smaller number.
   Among such paths, let’s look at the one that has the longest marked prefix in
A j.14 We denote by [p, p′] the last node of the prefix. If [p, p′] is a final state of A j
that exits via an edge to a marked A j−1 node with a smaller number, we are
done. Let’s suppose it’s not.
   If [p, p′] is a fork node, then it has at least two outgoing edges that lead to
marked states: a function transition, which stays in A j, and the corresponding
ε transition, which leads to some A j+1. By the definition of our numbering, both
successor nodes have smaller numbers than [p, p′]. Thus, we can extend our
prefix in A j by following the function transition, which contradicts the fact that
we were on the path with the longest prefix in A j.
   If [p, p′] is not a fork node, then it must have a marked successor with a
smaller number (as otherwise it would not be marked). Its successors can either
be in A j or be transitions to some A j−1 (if it is a final state of A j). As we
assumed above that the transitions did not lead to marked nodes with lower
number, [p, p′] must have a marked successor in A j with a lower number, that
caused its marking. Note, however, that by adding this marked node to the
previous prefix, we can build a marked path with a longer prefix in A j, having
nodes with decreasing numbers. Again, a contradiction.
   We are now ready to prove direction (i). We do this again by contradiction.
Assume that our algorithm returns a negative answer, that is, that the initial
state of A× is marked, but a k-depth left-to-right safe rewriting from Aw does
exist. Recall that such rewritings discover the input word and the answers of
functions from left to right, and make their decisions (namely, to invoke func-
tions or not) as they proceed. Therefore, we can construct the counterexample
incrementally. We do not need to provide the full input word (or the functions’
output) as a whole, but only “letter by letter,” as the rewriting process is going on.
   Also recall that, since the rewriting is supposed to be safe, we are free to
choose any answer we want for a function call, as long as it matches its output
type. The rewriting should succeed anyway.

14 Note   that it is not necessarily unique.


   We will show that we can provide a finite sequence of letters (consisting
of an initial word and answers to the function calls the rewriting decides to
invoke) that stays on a marked path in A× and eventually reaches one of its
final states. This means, on the one hand, that the sequence represents a legal
k-depth rewriting (of a word accepted by Aw ) and, on the other hand, that it
does not belong to the target type. Consequently the rewriting is not safe.
   The sequence is constructed as follows. We begin at the initial (marked)
state of A× and start following some finite marked path in A0 leading to a
marked final state where the nodes on the path have decreasing numbers.
Such a path must exist, by Lemma 4.2.
   At each step, when we traverse an edge with a label in L, we simply output
its label. If the edge is labeled by a function name in F, we also output the label,
but our action depends on whether the rewriting process decides to invoke
the function or not: if the function is not invoked, we stay on the same path.
Otherwise, we follow an edge from the current automaton Ai (initially i = 0)
to a marked state of the next level automaton Ai+1 . Note that such a marked
state must exist since the previous fork node was marked. Also, by definition
of the numbering, its assigned number is smaller than the one of the fork node.
   Then we continue the same process at Ai+1 following a finite path, with
nodes having decreasing numbers, to its final state, and on the way possibly
moving to higher level automata, as described above.
   Since all these paths are constructed as in Lemma 4.2, they end on a final
node of A× (for A0 ), or on a final node of A j with an ε transition to a marked
node of A j−1, with a smaller number. In the latter case, we simply follow this
transition, which corresponds to ending the answer of a function call.
   Observe that, by the above arguments, we follow a path to a final state of A×
that consists only of marked nodes and corresponds to a decreasing sequence of
numbers, which means that it is finite. Feeding the letters on this path to the
safe rewriting makes it end on a word that is not in the target language R, a
contradiction.

  Soundness. We now turn to direction (ii), which states the soundness of
our algorithm—namely, that if the initial state is not marked, then every word
accepted by Aw can be rewritten to match the target schema. We start by
proving that if the algorithm succeeds, then the following proposition holds:
   PROPOSITION 4.3. Let A j be a subautomaton of A× corresponding to some
function automaton A f i (or to Aw if j = 0).
   For every nonmarked state [q, q′] of A j originating from A f i (respectively
A0 ), every word in L(A f i , q) (respectively L(Aw , q)) has a “safe rewriting” into
a word w′ such that w′ corresponds to a nonmarked path in A× leading to a
(nonmarked) final state of A j .15
   PROOF. We use induction on j , starting from j = k and going down to j = 0,
to show that every word in L(A f i , q) (respectively L(Aw , q)) that contains only

15 We overload here, in a natural manner, the notion of safe rewriting, meaning that the above
property holds no matter what answer the function invocations return.


function nodes of depth ≥ j can be safely rewritten. For j = k, A j contains no
fork nodes. A state [q, q′] is thus not marked iff all states reachable from it are
not marked and the property trivially holds.
   We suppose now that the hypothesis holds for j + 1, and consider a word w
that contains function nodes of depth ≥ j. We follow its corresponding path in
A j . If all nodes are nonmarked, we get to a nonmarked accepting state of A j ,
which means we are done. Otherwise, if the path contains marked nodes, we
show, by a second induction on the number of function symbols in w, how to
rewrite w to a “good” path.
   The base of the induction is for a word w that doesn’t contain function nodes.
Then, clearly, all nodes on the corresponding path must be nonmarked, or else
the first one, [q, q′], would be marked as well. Suppose we know how to deal
with a word containing l function nodes, and consider a word w that contains
l + 1 function nodes. We look at the first edge e = ([v, v′], [u, u′]) on the path
where [v, v′] is not marked but [u, u′] is marked. By definition of marking, [v, v′]
must be a fork node with e labeled by some function name f i . This splits w into
subwords w = w1 . f i .w2 . Since [v, v′] is not marked, its other fork option (the
ε edge corresponding to f i ) must lead to a nonmarked initial state of the subau-
tomaton A j+1 corresponding to A f i . We choose to invoke this function. By the
first induction hypothesis, there is a safe rewriting of the returned result into
a word w′ whose corresponding path is not marked and leads to a nonmarked
final state of A j+1 . By the construction of A× this final state must have an
outgoing edge leading to a state [u, u″] of A j . Furthermore, observe that the
latter is not marked (or otherwise the final state of A j+1 would also be marked).
   Finally, since w2 has l function nodes, by the second induction hypothesis
it can be safely rewritten into some word w″ whose corresponding nonmarked
path leads to a final state of A j . It follows that the rewritten word w1 .w′ .w″ ,
and its corresponding nonmarked path, leads from [q, q′] to a nonmarked final
state of A j via a path consisting only of nonmarked nodes.

   We are now ready to prove direction (ii). If the algorithm answers positively,
a safe rewriting can be found by essentially the same construction as in the
proof of the above proposition.
   Given any word w accepted by Aw , our goal is to find a safe rewriting that
yields a word w′ whose corresponding path in A× leads to a nonmarked state
[q, p], where q is an accepting state of Aw (namely, an accepting state of A0 ).
Note that since the final state is not marked, p is not an accepting state of A.
And since A is deterministic this implies that w′ is a “good” word that belongs
to the target language R.
   The fact that such a rewriting indeed exists follows immediately from the
above proposition, taking the node [q, q′] of the proposition to be the initial
state of A× , and w as a particular input word. The actual rewriting can be
found as described in the proof. This concludes the proof of Theorem 4.1.

  Complexity. We now briefly discuss the complexity of the algorithm. Recall
that we use s0 to denote the schema of the sender and s to denote the agreed
data exchange schema. The complexity of deciding whether a safe rewriting




                                Fig. 9. Finding the rewriting of w into R.


exists is determined by the size of the Cartesian product automaton: we need
to construct it and then traverse and mark its nodes. More precisely, the com-
plexity is bounded by O(|A×|^2) = O((|A^k_w| × |A|)^2). The size of A^k_w is at most
O((|s0| + |w|)^k) and the size of the complement automaton A is at most expo-
nential in the automaton being complemented [Hopcroft and Ullman 1979],
namely, at most exponential in the size of the target schema s. This exponen-
tial blowup may happen, however, only when s uses nondeterministic regular
expressions (i.e., regular expressions whose corresponding finite state automa-
ton is nondeterministic). Note, however, that XML Schema enforces the usage
of deterministic regular expressions. Hence, for most practical cases, the com-
plexity is polynomial in the size of the schemas s0 and s (with the exponent
determined by k).

4.4 Finding a Rewriting
The algorithm of Figure 3 checks if a safe rewriting exists. The constructive
proof we used to show its soundness entails a way to find a rewriting sequence
when a safe rewriting exists, which corresponds to the algorithm of Figure 9.
   This algorithm finds the safe rewriting sequence by following a nonmarked
path. Each fork node on the path, together with its nonmarked fork option,
determines what needs to be done with the corresponding function: an ε edge
means “invoke the function” while a function edge means “do not invoke.” In
the example previously described, which corresponds to Figure 6, it is easy to
see (following the path with colored background) that Get Temp needs to be
invoked while TimeOut should not.
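   For illustration, the decision taken at a fork node can be read off the marking with
a small Python sketch (func_target and eps_target are hypothetical names for the nodes
reached by the function edge and the ε edge of the fork, and marked is the set computed
by the marking step):

# Returns the action associated with a nonmarked fork option of a fork node.
def decision_at_fork(func_target, eps_target, marked):
    if func_target not in marked:
        return "do not invoke"   # keep the function call as is
    if eps_target not in marked:
        return "invoke"          # follow the ε edge into the function automaton
    raise ValueError("both fork options marked: the node cannot lie on a nonmarked path")
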
   The complexity of actually performing the rewriting depends on the size of
the answers returned by the called functions. If x is the maximal answer size,
the length of the generated word is bounded by |w| × x^k.

4.5 A Mixed Approach
As seen above, much of the work in searching for a safe rewriting comes from
the size of the automaton A^k_w that accounts for all possible outputs of function
invocation. A useful heuristic is to adopt a mixed approach, that starts by in-
voking some of the functions (e.g., the ones with no side effects or low price) to




                          Fig. 10. Possible rewriting of w into R.

get their actual output, and then tries to safely rewrite the document. In terms
of the algorithm of Figure 3, rather than using the full function signature au-
tomaton A f i , we will use a smaller one that describes just (the type of) the actual
returned result. This may greatly simplify the resulting automaton A^k_w. More-
over, the output of the already invoked calls can be reused when performing
the actual rewriting, instead of reissuing these calls.
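   As an illustration of this heuristic, the signature automaton A f i of an already-invoked
call can be replaced by a tiny automaton that accepts exactly the sequence of labels that
was actually returned. A minimal sketch, using a hypothetical dictionary-based encoding
of automata, is:

# Build an automaton accepting exactly one word (the labels actually returned by a call).
def word_automaton(labels):
    delta = {(i, lab): i + 1 for i, lab in enumerate(labels)}
    return {"states": list(range(len(labels) + 1)),
            "delta": delta,            # (state, label) -> next state
            "initial": 0,
            "final": {len(labels)}}

For instance, word_automaton(["temp"]) describes an answer consisting of a single temp
element.
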

5. POSSIBLE REWRITING
We considered safe rewritings in the previous section. We now turn to possible
rewritings. While function signatures provide an “upper bound” of the possible
output, when invoked with the actual given parameters they may return a
restricted “appropriate” output, so a rewriting that looked nonfeasible (unsafe)
may turn out to be possible after some function calls. To test if a rewriting may
exist, we follow a similar three-step procedure as for safe rewriting: (1) test
function parameters first, (2) traverse the tree top down, and (3) check each
node individually, trying to rewrite the word w consisting of the labels of its
direct children.
   Steps (1) and (2) are exactly as before. For step (3), Figure 10 provides an
algorithm to test if the children of a given node may rewrite to the target schema.
As before, we use the automaton A^k_w that describes all the words that may be
derived from the word w in a k-depth rewriting. w may rewrite to a word in
the target language R iff some of these derived words belong to R, namely, if
the intersection of the two languages, A^k_w and R, is not empty. To test this, we
construct (in step 4 of the algorithm) the Cartesian product automaton for these
two languages, and test (in step 5) whether the final state is reachable from
the initial one. This is done by a standard marking process, that starts from
the final nodes, and marks all nodes that have some edge leading to a marked
node. If the initial state is marked, this means that the intersection of the two
languages is not empty [Hopcroft and Ullman 1979].
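   The test of step 5 is the standard non-emptiness check; a minimal Python sketch over
a hypothetical successor-list representation of the Cartesian product automaton is:

# The intersection of A^k_w and R is not empty (a possible rewriting exists) iff the
# initial state is marked after propagating marks backwards from the final states.
def may_rewrite(nodes, succ, final, initial):
    marked = set(final)
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n not in marked and any(s in marked for s in succ.get(n, [])):
                marked.add(n)
                changed = True
    return initial in marked
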
   For instance, consider the automaton A for the schema of (***) with newspa-
per structure title.date.temp.exhibit ∗ given in Figure 11. The initial state is p0
and the final accepting states are p3 and p4 . The Cartesian product automaton
A× = A^1_w × A (for A^1_w as in Figure 4 and A as in Figure 11) is given in Figure 12.
The initial state is [q0, p0]. The final accepting states are [q4, p3] and [q4, p4],
and all states (including the initial one) have an outgoing path to a final state.
The only possible fork options left in the automaton, and which may lead to
a possible rewriting, are the ones requiring the invocation of both Get Temp




                             Fig. 11. An automaton A for schema (***).




                   Fig. 12. Cartesian product automaton for possible rewriting.




                         Fig. 13. Finding a possible rewriting of w into R.

and TimeOut functions. If TimeOut returns nothing but exhibits, the rewriting
succeeds.
  The correctness of this algorithm is stated below.

  PROPOSITION 5.1.          The above algorithm returns true iff a k-depth possible
rewriting exists.

   PROOF. Since A^k_w accepts the language of all possible words obtainable by a
k-depth rewriting, the rewriting is possible iff the intersection of the language
accepted by A^k_w with the target language is not empty. This is classically checked
by computing the cross-product of the corresponding automata, and marking
nodes as described, to check whether a final state is reachable from the initial
state.

   The complexity here is again determined by the size of the Cartesian product
automaton. However, in this case, it uses the schema automaton A (rather
than its complement, as for safe rewriting). Hence, the complexity of checking
whether a rewriting may exist is polynomial in the size of the schemas s0 and
s (with the exponent determined by k).
   Finding an actual rewriting is done through a heuristic described by the al-
gorithm of Figure 13. We follow a marked path, and invoke functions or not, as

indicated by the fork options on the path. We have to backtrack when failing
(i.e., when the function returns a value that does not correspond to an accept-
ing path). This process ends either because we reached a final state, which
means that a rewriting was found, or because all choices were explored without
success.
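   A rough, heavily simplified sketch of this backtracking search is given below; the
callback options(node) is hypothetical and is assumed to yield the marked successors to
try, performing an actual function invocation when the chosen fork option requires it:

# Depth-first search with backtracking, in the spirit of Figure 13: stop as soon as a
# final state is reached, otherwise try the remaining choices.
def find_possible(node, final, options, visited=None):
    visited = set() if visited is None else visited
    if node in final:
        return True                  # a possible rewriting was found
    visited.add(node)
    for nxt in options(node):
        if nxt not in visited and find_possible(nxt, final, options, visited):
            return True
    return False                     # all choices explored without success
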

6. SCHEMA REWRITING
So far, we considered the rewriting of a single document. At a higher level, to
check compatibility between applications, the sender may wish to verify that
all documents generated by her application can indeed be sent to the target
receiver. Given a schema s0 for the sender documents, and some distinguished
root label r, we want to verify that all instances of s0 with root r can be safely
rewritten to the schema s. Interestingly, it turns out that safe rewriting for
schemas is not more difficult than for documents. We decompose the algorithm
we propose for schema rewriting into two parts: first, how to check the initial
schema, by traversing it top down, and second, for each type in this schema,
how to check that the corresponding regular expression safely rewrites into the
target schema.
   We first show how safe rewriting can be checked for DTDs, by checking all
the element definitions of s0 . Then, we sketch a top-down algorithm for checking
safe rewriting for XML Schema-like schemas. Finally, we explain how it can be
checked that any instance of a regular expression can be safely rewritten into
a target regular expression.

6.1 Rewriting DTDs
In the simple DTD-like schemas we used so far, checking that s0 safely rewrites
to s amounts to checking that, for every element definition τ0(l0) = r0 in s0,
(a) there exists an element definition for the element label l0 in s and that
(b) every instance of the regular expression r0 can be safely rewritten into the
corresponding regular expression in s, namely τ(l0). We term this last step
language safe rewriting, and give an algorithm for it in Section 6.3.
   Notice that, for such simple schemas, the element definitions can be checked
independently from each other, in any order. s0 safely rewrites into s iff the
language safe rewriting succeeds for all element definitions.
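   A minimal sketch of this check, assuming hypothetical dictionaries tau0 and tau that
map element labels to their content-model regular expressions, and a function
language_safe_rewrite implementing the test of Section 6.3, is:

# s0 safely rewrites into s iff every element definition of s0 passes tests (a) and (b).
def dtd_safely_rewrites(tau0, tau, language_safe_rewrite):
    for label, r0 in tau0.items():
        if label not in tau:                            # (a) the label must be defined in s
            return False
        if not language_safe_rewrite(r0, tau[label]):   # (b) r0 must rewrite into tau(l0)
            return False
    return True
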

6.2 Rewriting XML Schemas
Things are more involved when we consider more expressive schema languages,
in the style of XML Schema. Types are allowed to be decoupled from element
labels, but it holds that the type of an element is unambiguously determined by
its label and the type of its parent. In this case, schema rewriting can be checked
by a top-down analysis of the initial schema s0 , starting from the root. The type
of the root determines the regular expression that has to be matched by its
children, and the type of the root of s determines the target regular expression
for the safe rewriting of types.
   Then, recursively moving down, the types corresponding to the labels of
the children on both sides are unambiguously determined, and so are their




                          Fig. 14. Language safe rewriting of R0 into R.


corresponding regular expressions. Therefore safe rewriting of types can be
checked at the next level, and so on.
   Notice that, while proceeding this way, only pairs of types for which safe
rewriting hasn’t been tested yet need to be processed. This ensures that the
algorithm terminates, even if schemas are recursive.
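   A sketch of this top-down traversal is given below, under hypothetical accessors:
content_model(t) returns the regular expression associated with type t, and
child_pairs(t0, t) returns the pairs of types assigned to matching child labels on both
sides. The set of already-tested pairs is what guarantees termination on recursive
schemas:

# Check that every instance of the sender schema (rooted at root0) safely rewrites
# into the receiver schema (rooted at root), one pair of types at a time.
def schemas_safely_rewrite(root0, root, content_model, child_pairs, language_safe_rewrite):
    tested, todo = set(), [(root0, root)]
    while todo:
        pair = todo.pop()
        if pair in tested:
            continue                       # this pair of types was already checked
        tested.add(pair)
        t0, t = pair
        if not language_safe_rewrite(content_model(t0), content_model(t)):
            return False
        todo.extend(child_pairs(t0, t))    # recurse on the types of the children
    return True
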

6.3 Language Safe Rewriting
We explain now how to check for language safe rewriting. Given two regular
expressions R0 and R, we want to check that all words in the language of
R0 have a safe rewriting into a word in the language of R. The algorithm of
Figure 14 checks just that.
  This algorithm is almost identical to the one presented in Section 4, except
that the initial automaton is built to accept the language R0 instead of a single
word. The following proposition states its correctness.

   PROPOSITION 6.1. The above algorithm returns true if and only if every word
in the language R0 has a k-depth left-to-right safe rewriting into a word of R.
   PROOF. The proof of the algorithm of Section 4 naturally extends to language
safe rewriting. There, the completeness of the algorithm was shown by build-
ing a counterexample to the assumption that there might be a safe rewriting
although the algorithm answers negatively. The same construction holds for
language rewriting, since it suffices to show that one word in the language R0
does not safely rewrite into R to conclude that R0 does not safely rewrite into R.
The soundness of the algorithm of Section 4 was shown in a constructive manner,
by building a word corresponding to a nonmarked path in A× . The same con-
struction applies to each word accepted by A R0 , that is, for each word in the
language R0 , which establishes the correctness of this algorithm.

7. IMPLEMENTATION
The ideas and algorithms presented in the previous sections have been imple-
mented and used in the Schema Enforcement module of the Active XML system
[Abiteboul et al. 2002] (also see the Active XML homepage of Web site http://
www.rocq.inria.fr/verso/Gemo/Projects/axml). We next present how the in-
tensional data model and schema language of the previous sections map to XML,
XML Schema, SOAP, and WSDL. Then, we briefly describe the ActiveXML sys-
tem and the Schema Enforcement module.

7.1 Using the Standards
In the implementation, an intensional XML document is a syntactically well-
formed XML document. This is because we also use an XML-based syntax to
express the intensional parts in it. To distinguish these parts from the rest of
the document, we rely on the mechanism of XML namespaces (see footnote 10).
More precisely, the namespace http://www.activexml.com/ns/int is defined
for service calls. These calls can appear at any place where XML elements are
allowed. The following example corresponds to the document of Figure 2(a):
<?xml version="1.0"?>
  <newspaper xmlns:int="http://www.activexml.com/ns/int">
    <title> The Sun </title>
    <date> 04/10/2002 </date>
    <int:fun endpointURL="http://www.forecast.com/soap"
             methodName="Get_Temp"
             namespaceURI="urn:xmethods-weather">
      <int:params>
        <int:param>
          <city>Paris</city>
        </int:param>
      </int:params>
    </int:fun>
    <int:fun endpointURL="http://www.timeout.com/paris"
             methodName="TimeOut"
             namespaceURI="urn:timeout-program">

         <int:params>
           <int:param> exhibits </int:param>
         </int:params>
       </int:fun>
     </newspaper>

   Function nodes have three attributes that provide the necessary information
to call a service using the SOAP protocol: the URL of the server, the method
name, and the associated namespace. These attributes uniquely identify the
called function, and are isomorphic to the function name in the abstract model.
   In order to define schemas for intensional documents, we use XML Schemaint ,
which is an extension of XML Schema. To describe intensional data, XML
Schemaint introduces functions and function patterns. These are declared and
used like element definitions in the standard XML Schema language. In par-
ticular, it is possible to declare functions and function patterns globally, and
reference them inside complex type definitions (e.g., sequence, choice, all). We
give next the XML representation of function patterns that are described by a
combination of five optional attributes and two optional subelements: params
and return:

<functionPattern
    id = NCName             methodName = token
    endpointURL = anyURI    namespaceURI = anyURI
    WSDLSignature = anyURI ref = NCName>
  Contents: (params?, return?)
</functionPattern>

   The id attribute identifies the function pattern, which can then be referenced
by another function pattern using the ref attribute. Attributes methodName,
endpointURL, and namespaceURI designate the SOAP Web service that im-
plements the Boolean predicate used to check whether a particular function
matches the function pattern. It takes as input parameter the SOAP identi-
fiers of the function to validate. As a convention, when these parameters are
omitted, the predicate returns true for all functions. The Contents detail the
function signature, that is, the expected types for the input parameters and the
result of the function. These types are also defined using XML Schemaint , and
may contain intensional parts.
   To illustrate this syntax, consider the function pattern Forecast, which cap-
tures any function with one input parameter of element type city, returning an
element of type temp. It is simply described by

<functionPattern id="Forecast">
  <params>
    <param> <element ref="city"/> </param>
  </params>
  <result> <element ref="temp"/> </result>
</functionPattern>

   Functions are declared in a similar way to function patterns, by using el-
ements of type function. The main difference is that the three attributes
methodName, endpointURL, and namespaceURI directly identify the function that
can be used.
   As mentioned already, function and function pattern declarations may be
used at any place where regular element and type declarations are allowed.
For example, a newspaper element with structure title.date.(Forecast | temp).
(TimeOut | exhibit ∗ ) may be defined in XML Schemaint as

<xsd:element name="newspaper">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="title"/>
      <xsd:element ref="date"/>
      <xsd:choice>
        <xsi:functionPattern ref="Forecast"/>
        <xsd:element ref="temp"/>
      </xsd:choice>
      <xsd:choice>
        <xsi:functionPattern ref="TimeOut"/>
        <xsd:element ref="exhibit" minOccurs="0"
                 maxOccurs="unbounded"/>
      </xsd:choice>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

   Note that just as for documents, we use a different namespace (embodied
here by the use of the prefix xsi) to differentiate the intensional part of the
schema from the rest of the declarations.
   Similarly to XML Schema, we require definitions to be unambiguous (see
footnote 10)—namely, when parsing a document, for each element and each
function node, the subelements can be sequentially assigned a correspond-
ing type/function pattern in a deterministic way by looking only at the ele-
ment/function name.
   One of the major features of the WSDL language is to describe the input
and output types of Web services functions using XML Schema. We extend
WSDL in the obvious way, by simply allowing these types to describe intensional
data, using XML Schemaint . Finally, XML Schemaint allows WSDL or WSDLint
descriptions to be referenced in the definition of a function or function pattern,
instead of defining the signature explicitly (using the WSDLSignature attribute).

7.2 The ActiveXML System
ActiveXML is a peer-to-peer system that is centered around intensional XML
documents. Each peer contains a repository of intensional documents, and pro-
vides some active features to enrich them by automatically triggering the func-
tion calls they contain. It also provides some Web services, defined declara-
tively as queries/updates on top of the repository documents. All the exchanges

between the ActiveXML peers, and with other Web service providers/consumers
use the SOAP protocol.
   The important point here is that both the services that an ActiveXML peer
invokes and those that it provides potentially accept intensional input param-
eters and return intensional results. Calls to “regular” Web services should
comply with the input and output types defined in their WSDL description.
Similarly, when calling an ActiveXML peer, the parameters of the call should
comply with its WSDL. The role of the Schema Enforcement module is (i) to
verify whether the call parameters conform to the WSDLint description of the
service, (ii) if not, to try to rewrite them into the required structure and (iii) if
this fails, to report an error. Similarly, before an ActiveXML service returns its
answer, the module performs the same three steps on the returned data.

7.3 The Schema Enforcement Module
To implement this module, we needed a parser of XML Schemaint . We had
the choice between extending an existing XML Schema parser based on DOM
level 3 or developing an implementation from scratch [Ngoc 2002]. Whereas the
first solution seems preferable, we followed the second one because, at the time
we started the implementation, the available (free) software we tried (Apache
Xerces16 and Oracle Schema Processor17 ) appeared to have limited extensibility.
Our parser relies on a standard event-based SAX parser.16 It does not cover all
the features of XML Schema, but implements the important ones such as com-
plex types, element/type references, and schema import. It does not check the
validity of all simple types, nor does it deal with inheritance or keys. However,
these features could be added rather easily to our code.
   The schema enforcement algorithm we implemented in the module follows
the main lines of the algorithm in Section 4, and in particular the three same
stages:

(1) checking function parameters recursively, starting from the most inner ones
    and going out,
(2) traversing, in each iteration, the tree top down, and
(3) rewriting the children of every node encountered in this traversal.

Steps (1) and (2) are done as described in Section 4. For step (2), recall from
above that XML Schemaint definitions are deterministic. This is precisely what enables
the top-down traversal since the possible type of elements/functions can be
determined locally. For step (3), our implementation uses an efficient variant
of the algorithm of Section 4. While the latter starts by constructing all the
required automata and only then analyzes the resulting graph, our implemen-
tation builds the automaton A× in a lazy manner, starting from the initial state,
and constructing only the needed parts. The construction is pruned whenever
a node can be marked directly, without looking at the remaining, unexplored,

16 The   Xerces Java parser. Go online to http://xml.apache.org/xerces-j/.
17 The   Oracle XML developer’s kit for Java. Go online to http://otn.oracle.com/tech/xml/.





                                 Fig. 15. The pruned automaton.

branches. The two main ideas that guide this process are the following:
— Sink nodes. Some accepting states in A are “sink” nodes: once you get there,
  you cannot get out (e.g., p6 in Figures 5 and 7). For the Cartesian product au-
  tomaton A× , this means that all paths starting from such nodes are marked.
  When such a node is reached in the construction of A× , we can immediately
  mark it and prune all its outgoing branches. For example, in Figure 15, the
  top left shaded area illustrates which parts of the Cartesian product au-
  tomaton of Figure 6 can be pruned. Nodes [q3 , p6 ] and [q7 , p6 ] contain the
  sink node p6 . They can immediately be declared as marked, and the rest
  of the construction (the left shaded area) need not be constructed.
—Marked nodes. Once a node is known to be marked, there is no point in explor-
  ing its outgoing branches any further. To continue with the above example,
  once the node [q7 , p6 ] gets marked, so does [q7 , p3 ] that points to it. Hence,
  there is no need to explore the other outgoing branches of [q7 , p3 ] (the shaded
  area on the right).
   While this dynamic variant of the algorithm has the same worst-case com-
plexity as the algorithm of Figure 3, it saves a lot of unnecessary computation
in practice. Details are available in Ngoc [2002].
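   The two pruning rules amount to a simple test applied before a product node [q, p] is
expanded; a small sketch (is_sink(p) is a hypothetical helper recognizing accepting sink
states of the complement automaton, and marked is the set of nodes already known to be
marked) is:

# Decide whether the lazy construction can avoid expanding the product node [q, p].
def can_prune(q, p, marked, is_sink):
    if is_sink(p):
        marked.add((q, p))   # sink rule: mark the node right away, skip its successors
        return True
    if (q, p) in marked:     # marked-node rule: its outgoing branches need not be explored
        return True
    return False
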

8. PEER-TO-PEER NEWS SYNDICATION
In this section, we will illustrate the exchange of intensional documents, and
the usefulness of our schema-based rewriting techniques through a real-life ap-
plication: peer-to-peer news syndication. This application was recently demon-
strated in Abiteboul et al. [2003a].
   The setting is the one shown on Figure 16. We consider a number of news
sources (newspaper Web sites, or individual “Weblogs”) that regularly publish
news stories. They share this information with others in a standard XML for-
mat, called RSS.18 Clients can periodically query/retrieve news from the sources
they are interested in, or subscribe to news feeds. News aggregators are special
peers that know of several news sources and let other clients ask queries to
and/or discover the news sources they know.

18 RSS   1.0 specification. Go online to http://purl.org/rss/1.0.





                                Fig. 16. Peer-to-peer news exchange.

   All interactions between news sources, aggregators, and clients are done
through calls to Web services they provide. Intensional documents can be ex-
changed both when passing parameters to these Web services, and in the an-
swers they return. These exchanges are controlled by XML schemas, and docu-
ments are rewritten to match these schemas, using the safe/possible rewriting
algorithms detailed in the previous sections.
   This mechanism is used to provide several versions of a service, without
changing its implementation, merely by using different schemas for its in-
put parameters and results. For instance, the same querying service is eas-
ily customized to be used by distinct kinds of participants, for example, various
client types or aggregators, with different requirements on the type of its input/
output.
   More specifically, for each kind of peer we consider (namely, news sources and
aggregators), we propose a set of basic Web services, with intensional output
and input parameters, and show how they can be customized for different clients
via schema-based rewriting. We first consider the customization of intensional
outputs, then the one of intensional inputs.

8.1 Customizing Intensional Outputs
News sources provide news stories, using a basic Web service named getStory,
which retrieves a story based on its identifier, and has the following signature:
<function id="GetStory">
  <params>
    <param>
      <xsd:simpleType ref="xsd:string" />
    </param>
  </params>

  <result>
    <xsd:element name="story" type="xsd:string" />
  </result>
</function>
 Note that the output of this service is fully extensional. News sources also
allow users to search for news items by keywords,19 using the following service:
<function id="GetNewsAbout">
  <params>
    <param>
      <xsd:simpleType ref="xsd:string" />
    </param>
  </params>
  <result>
    <xsd:complexType ref="ItemList2" />
  </result>
</function>
   This service returns an RSS list of news items, of type ItemList2, where the
items are given extensionally, except for the story, which can be intensional. The
definition of the corresponding function pattern, intensionalStory, is omitted.
<xsd:complexType name="ItemList2">
  <xsd:sequence>
    <xsd:element name="item" type="Item"/>
  </xsd:sequence>
</xsd:complexType>

<xsd:complexType name="Item">
  <xsd:sequence>
    <xsd:element name="title" type="xsd:string"/>
    <xsd:element name="pubDate" type="xsd:dateTime"/>
    <xsd:element name="description" type="xsd:string"/>
    <xsd:choice>
      <xsi:functionPattern ref="intensionalStory"/>
      <xsd:element name="story" type="xsd:string"/>
    </xsd:choice>
  </xsd:sequence>
  <xsd:attribute name="id" type="xsd:NMTOKEN"/>
</xsd:complexType>
   A fully extensional variant of this service, aimed for instance at PDAs that
download news for offline reading, is easily provided by employing the Schema
Enforcement module to rewrite the previous output to one that complies with a
fully extensional ItemList3 type, similar to the one above, except for the story
that has to be extensional.
19 More complex query languages, such as the one proposed by Edutella could also be used (go online

to http://edutella.jxta.org).


   A more complex scenario allows readers to specify a desired output type at
call time, as a parameter of the service call. If there exists a rewriting of the
output that matches this schema, it will be applied before sending the result,
otherwise an error message will be returned.
   Aggregators act as “superpeers” in the network. They know a number of news
sources they can use to answer user queries. They also know other aggregators,
which can relay the queries to additional news sources and other aggregators,
transitively. Like news sources, they provide a getNewsAbout Web service, but
allow for a more intensional output, of type ItemList, where news items can
be either extensional or intensional. In the latter case they must match the
intensionalNews function pattern, whose definition is omitted.
<xsd:complexType name="ItemList">
  <xsd:sequence>
    <xsd:choice>
      <xsi:functionPattern ref="intensionalNews"/>
      <xsd:element name="item" type="Item"/>
    </xsd:choice>
  </xsd:sequence>
</xsd:complexType>
   When queried by simple news readers, the answer is rewritten, depending on
whether the reader is an RSS customer or a PDA, into an ItemList2 or ItemList3
version, respectively. On the other hand, when queried by other aggregators that
prefer compact intensional answers which can be easily forwarded to other aggrega-
tors, no rewriting is performed, with the answer remaining as intensional as
possible, preferably complying with the type below, which requires the information
to be intensional.
<xsd:complexType name="ItemList4">
  <xsd:sequence>
    <xsi:functionPattern ref="intensionalNews"/>
  </xsd:sequence>
</xsd:complexType>
   Note also that aggregators may have different capabilities. For instance,
some of them may not be able to recursively invoke the service calls they get
in intensional answers. This is captured by having them supply, as an input
parameter, a precise type for the answer of getNewsAbout, that matches their
capabilities (e.g., return me only service calls that return extensional data).

8.2 Intensional Input
So far, we considered the intensional output of services. To illustrate the
power of intensional input parameters, we define a continuous version of the
getNewsAbout service provided by news sources and aggregators.
   Clients call this service only once, to subscribe to a news feed. Then, they
periodically get new information that matches their query (a dual service exists,
to unsubscribe). Here, the input parameter is allowed to be given intensionally,

so that the service provider can probe it, adjusting the answer to the parameter’s
current value. For instance, consider a mobile user whose physical location
changes, and wants to get news about the town she is visiting. The zip code
of this town can be provided by a Web service running on her device, namely
a GPS service. A call to this service will be passed as an intensional query
parameter, and will be invoked by the news source in order to periodically send
her the relevant local information.
   This continuous news service is actually implemented using a wrapper
around a noncontinuous getNewsAbout service, calling the latter periodically
with the keyword parameter it received in the subscription. Since getNewsAbout
doesn’t accept an intensional input parameter, the schema enforcement module
rewrites the intensional parameter given in the subscription every time it has
to be called.

8.3 Demonstration Setting
To demonstrate this application [Abiteboul et al. 2003a], news sources were
built as simple wrappers around RSS files provided by news websites such
as Yahoo!News, BBC World, the New York Times, and CNN. The news from
these sources could also be queried through two aggregators providing the
GetNewsAbout service, but customized with different output schemas. The cus-
tomization of intensional input parameters was demonstrated using a contin-
uous service, as explained above, by providing a call to a getFavoriteKeyword
service as a parameter for the subscription.

9. CONCLUSION AND RELATED WORK
As mentioned in the Introduction, XML documents with embedded calls to Web
services are already present in several existing products. The idea of including
function calls in data is certainly not a new one. Functions embedded in data
were already present in relational systems [Molina et al. 2002] as stored pro-
cedures. Also, method calls form a key component of object-oriented databases
[Cattell 1996]. In the Web context, scripting languages such as PHP (see foot-
note 2) or JSP (see footnote 1) have made popular the integration of processing
inside HTML or XML documents. Combined with standard database interfaces
such as JDBC and ODBC, functions are used to integrate results of queries (e.g.,
SQL queries) into documents. A representative example for this is Oracle XSQL
(see footnote 17). Embedding Web service calls in XML documents is also done
in popular products such as Microsoft Office (Smart Tags) and Macromedia MX.
   While the static structure of such documents can be described by some DTD
or XML Schema, our extension of XML Schema with function types is a first
step toward a more precise description of XML documents embedding compu-
tation. Further work in that direction is clearly needed to better understand
this powerful paradigm. There are a number of other proposals for typing XML
documents, for example, Makoto [2001], Hosoya and Pierce [2000], and Cluet
et al. [1998]. We selected XML Schema (see footnote 10) for several reasons.
First, it is the standard recommended by the W3C for describing the struc-
ture of XML documents. Furthermore, it is the typing language used in WSDL

to define the signatures of Web services (see footnote 3). By extending XML
Schema, we naturally introduce function types/patterns in WSDL service sig-
natures. Finally, one aspect of XML Schema simplifies the problem we study,
namely, the unambiguity of XML Schema grammars.
    In many applications, it is necessary to screen queries and/or results ac-
cording to specific user groups [Candan et al. 1996]. More specifically for us,
embedded Web service calls in documents that are exchanged may be a se-
rious cause of security violation. Indeed, this was one of the original mo-
tivations for the work presented here. Controlling these calls by enforcing
schemas for exchanged documents appeared to us as useful for building se-
cure applications, and can be combined with other security and access models
that were proposed for XML and Web services, for example, in Damiani et al.
[2001] and WS-Security.20 However, further work is needed to investigate this
aspect.
    The work presented here is part of the ActiveXML [Abiteboul et al. 2002,
2003b] (see also the Active XML homepage of the Web site: http://www.rocq.
inria.fr/verso/Gemo/Projects/axml) project based on XML and Web services.
We presented in this article what forms the core of the module that, in a peer,
supports and controls the dialogue (via Web services) with the rest of the world.
This particular module may be extended in several ways. First, one may intro-
duce “automatic converters” capable of restructuring the data that is received
to the format that was expected, and similarly for the data that is sent. Also,
this module may be extended to act as a “negotiator” who could speak to other
peers to agree with them on the intensional XML Schemas that should be used
to exchange data. Finally, the module may be extended to include search capa-
bilities, for example, UDDI style search (see footnote 4) to try to find services
on the Web that provide some particular information.
    In the global ActiveXML project, research is going on to extend the frame-
work in various directions. In particular, we are working on distribution and
replication of XML data and Web services [Abiteboul et al. 2003a]. Note that
when some data may be found in different places and a service may be per-
formed at different sites, the choice of which data to use and where to perform
the service becomes an optimization issue. This is related to work on distributed
database systems [Ozsu and Valduriez 1999] and to distributed computing at
large. The novel aspect is the ability to exchange intensional information. This
is in the spirit of Jim and Suciu [2001], which also considers the exchange of inten-
sional information in a distributed query processing setting.
    Intensional XML documents nicely fit in the context of data integration, since
an intensional part of an XML document may be seen as a view on some data
source. Calls to Web services in XML data may be used to wrap Web sources
[Garcia-Molina et al. 1997] or to propagate changes for warehouse maintenance
[Zhuge et al. 1995]. Note that the control of whether to materialize data or not
(studied here) provides some flexible form of integration that is a hybrid of
the warehouse model (all is materialized) and the mediator model (nothing is).

20 The WS-Security specification. Go online to http://www.ibm.com/webservices/library/
ws-secure/.


On the other hand, this is orthogonal to the issue of selecting the views to
materialize in a warehouse, studied in, for example, Gupta [1997] and Yang
et al. [1997].
   To conclude, we mention some fundamental aspects of the problem we stud-
ied. Although the k-depth/left-to-right restriction is not limiting in practice and
the algorithm we implemented is fast enough, it would be interesting to under-
stand the complexity and decidability barriers of (variants of) the problem.
As we mentioned already, many results were found by Muscholl et al. [2004].
Namely, they proved the undecidability of the general safe rewriting problem
for a context-free target language, and provided tight complexity bounds for
several restricted cases.
   We already mentioned the connection to type theory and the novelty of our
work in that setting, coming from the regular expressions in XML Schemas.
Typing issues in XML Schema have recently motivated a number of interesting
works such as Milo et al. [2000], which are based on tree automata.

REFERENCES

ABITEBOUL, S., AMANN, B., BAUMGARTEN, J., BENJELLOUN, O., NGOC, F. D., AND MILO, T. 2003a. Schema-
  driven customization of Web services. In Proceedings of VLDB.
ABITEBOUL, S., BENJELLOUN, O., MANOLESCU, I., MILO, T., AND WEBER, R. 2002. Active XML: Peer-to-
  peer data and Web services integration (demo). In Proceedings of VLDB.
ABITEBOUL, S., BONIFATI, A., COBENA, G., MANOLESCU, I., AND MILO, T. 2003b. Dynamic XML docu-
  ments with distribution and replication. In Proceedings of ACM SIGMOD.
CANDAN, K. S., JAJODIA, S., AND SUBRAHMANIAN, V. S. 1996. Secure mediated databases. In Proceed-
  ings of ICDE. 28–37.
CATTELL, R., Ed. 1996. The Object Database Standard: ODMG-93. Morgan Kaufman, San
  Francisco, CA.
CLUET, S., DELOBEL, C., SIMÉON, J., AND SMAGA, K. 1998. Your mediators need data conversion! In
  Proceedings of ACM SIGMOD. 177–188.
DAMIANI, E., DI VIMERCATI, S. D. C., PARABOSCHI, S., AND SAMARATI, P. 2001. Securing XML docu-
  ments. In Proceedings of EDBT.
DOAN, A., DOMINGOS, P., AND HALEVY, A. Y. 2001. Reconciling schemas of disparate data sources:
  a machine-learning approach. In Proceedings of ACM SIGMOD. ACM Press, New York, NY,
  509–520.
GARCIA-MOLINA, H., PAPAKONSTANTINOU, Y., QUASS, D., RAJARAMAN, A., SAGIV, Y., ULLMAN, J., AND WIDOM,
  J. 1997. The TSIMMIS approach to mediation: Data models and languages. J. Intel. Inform.
  Syst. 8, 117–132.
GUPTA, H. 1997. Selection of views to materialize in a data warehouse. In Proceedings of ICDT.
  98–112.
HOPCROFT, J. E. AND ULLMAN, J. D. 1979. Introduction to Automata Theory, Languages and Com-
  putation. Addison-Wesley, Reading, MA.
HOSOYA, H. AND PIERCE, B. C. 2000. XDuce: A typed XML processing language. In Proceedings of
  WebDB (Dallas, TX).
JIM, T. AND SUCIU, D. 2001. Dynamically distributed query evaluation. In Proceedings of ACM
  PODS. 413–424.
MAKOTO, M. 2001. RELAX (Regular Language description for XML). ISO/IEC Tech. Rep.
  ISO/IEC, Geneva, Switzerland.
MILO, T., SUCIU, D., AND VIANU, V. 2000. Typechecking for XML transformers. In Proceedings of
  ACM PODS. 11–22.
MITCHELL, J. C. 1990. Type systems for programming languages. In Handbook of Theoretical
  Computer Science: Volume B: Formal Models and Semantics, J. van Leeuwen, Ed. Elsevier,
  Amsterdam, The Netherlands, 365–458.


MOLINA, H., ULLMAN, J., AND WIDOM, J. 2002. Database Systems: The Complete Book. Prentice
  Hall, Englewood Cliffs, NJ.
MUSCHOLL, A., SCHWENTICK, T., AND SEGOUFIN, L. 2004. Active context-free games. In Proceed-
  ings of the 21st Symposium on Theoretical Aspects of Computer Science (STACS ’04; Le Corum,
  Montpellier, France, Mar. 25–27).
NGOC, F. D. 2002. Validation de documents XML contenant des appels de services. M.S. thesis.
  CNAM. DEA SIR (in French) University of Paris VI, Paris, France.
OZSU, T. AND VALDURIEZ, P. 1999. Principles of Distributed Database Systems (2nd ed.). Prentice-
  Hall, Englewood Cliffs, NJ.
SEGOUFIN, L. 2003. Personal communication.
YANG, J., KARLAPALEM, K., AND LI, Q. 1997. Algorithms for materialized view design in data ware-
  housing environment. In VLDB ’97: Proceedings of the 23rd International Conference on Very
  Large Data Bases. Morgan Kaufman Publishers, San Francisco, CA, 136–145.
ZHUGE, Y., GARCÍA-MOLINA, H., HAMMER, J., AND WIDOM, J. 1995. View maintenance in a warehous-
  ing environment. In Proceedings of ACM SIGMOD. 316–327.

Received October 2003; accepted March 2004




Progressive Skyline Computation in
Database Systems
DIMITRIS PAPADIAS
Hong Kong University of Science and Technology
YUFEI TAO
City University of Hong Kong
GREG FU
JP Morgan Chase
and
BERNHARD SEEGER
Philipps University


The skyline of a d -dimensional dataset contains the points that are not dominated by any other
point on all dimensions. Skyline computation has recently received considerable attention in the
database community, especially for progressive methods that can quickly return the initial re-
sults without reading the entire database. All the existing algorithms, however, have some serious
shortcomings which limit their applicability in practice. In this article we develop branch-and-
bound skyline (BBS), an algorithm based on nearest-neighbor search, which is I/O optimal, that
is, it performs a single access only to those nodes that may contain skyline points. BBS is simple
to implement and supports all types of progressive processing (e.g., user preferences, arbitrary di-
mensionality, etc). Furthermore, we propose several interesting variations of skyline computation,
and show how BBS can be applied for their efficient processing.
Categories and Subject Descriptors: H.2 [Database Management]; H.3.3 [Information Storage
and Retrieval]: Information Search and Retrieval
General Terms: Algorithms, Experimentation
Additional Key Words and Phrases: Skyline query, branch-and-bound algorithms, multidimen-
sional access methods




This research was supported by the grants HKUST 6180/03E and CityU 1163/04E from Hong Kong
RGC and Se 553/3-1 from DFG.
Authors’ addresses: D. Papadias, Department of Computer Science, Hong Kong University of Sci-
ence and Technology, Clear Water Bay, Hong Kong; email: dimitris@cs.ust.hk; Y. Tao, Depart-
ment of Computer Science, City University of Hong Kong, Tat Chee Avenue, Hong Kong; email:
taoyf@cs.cityu.edu.hk; G. Fu, JP Morgan Chase, 277 Park Avenue, New York, NY 10172-0002; email:
gregory.c.fu@jpmchase.com; B. Seeger, Department of Mathematics and Computer Science, Philipps
University, Hans-Meerwein-Strasse, Marburg, Germany 35032; email: seeger@mathematik.uni-
marburg.de.




                                Fig. 1. Example dataset and skyline.

1. INTRODUCTION
The skyline operator is important for several applications involving multicrite-
ria decision making. Given a set of objects p1 , p2 , . . . , pN , the operator returns
all objects pi such that pi is not dominated by another object p j . Using the
common example in the literature, assume in Figure 1 that we have a set of
hotels and for each hotel we store its distance from the beach (x axis) and its
price ( y axis). The most interesting hotels are a, i, and k, for which there is no
point that is better in both dimensions. Borzsonyi et al. [2001] proposed an SQL
syntax for the skyline operator, according to which the above query would be
expressed as: [Select *, From Hotels, Skyline of Price min, Distance min], where
min indicates that the price and the distance attributes should be minimized.
The syntax can also capture different conditions (such as max), joins, group-by,
and so on.
   For simplicity, we assume that skylines are computed with respect to min con-
ditions on all dimensions; however, all methods discussed can be applied with
any combination of conditions. Using the min condition, a point pi dominates1
another point p j if and only if the coordinate of pi on any axis is not larger than
the corresponding coordinate of p j . Informally, this implies that pi is preferable
to p j according to any preference (scoring) function which is monotone on all
attributes. For instance, hotel a in Figure 1 is better than hotels b and e since it
is closer to the beach and cheaper (independently of the relative importance of
the distance and price attributes). Furthermore, for every point p in the skyline
there exists a monotone function f such that p minimizes f [Borzsonyi et al.
2001].
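To make the dominance test concrete, here is a minimal Python sketch (ours, not code from the paper) over the hotel coordinates of Figure 1/Table I; it requires a strict improvement on at least one axis, so that coinciding points may both remain in the skyline (cf. footnote 1).

```python
# A minimal sketch (ours) of dominance under min conditions, using the hotel
# coordinates of Figure 1 / Table I.
HOTELS = {  # id -> (distance, price)
    'a': (1, 9), 'b': (2, 10), 'c': (4, 8), 'd': (6, 7), 'e': (9, 10),
    'f': (7, 5), 'g': (5, 6), 'h': (4, 3), 'i': (3, 2), 'k': (9, 1),
    'l': (10, 4), 'm': (6, 2), 'n': (8, 3),
}

def dominates(p, q):
    """p dominates q iff p is <= q on every axis and < q on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def brute_force_skyline(points):
    """O(n^2) reference: keep exactly the points no other point dominates."""
    return {pid for pid, p in points.items()
            if not any(dominates(q, p) for qid, q in points.items() if qid != pid)}

print(sorted(brute_force_skyline(HOTELS)))   # ['a', 'i', 'k']
```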
   Skylines are related to several other well-known problems, including convex
hulls, top-K queries, and nearest-neighbor search. In particular, the convex hull
contains the subset of skyline points that may be optimal only for linear preference
functions (as opposed to any monotone function). Böhm and Kriegel
[2001] proposed an algorithm for convex hulls, which applies branch-and-
bound search on datasets indexed by R-trees. In addition, several main-memory

1 According to this definition, two or more points with the same coordinates can be part of the
skyline.


algorithms have been proposed for the case that the whole dataset fits in mem-
ory [Preparata and Shamos 1985].
   Top-K (or ranked) queries retrieve the best K objects that minimize a specific
preference function. As an example, given the preference function f (x, y) =
x + y, the top-3 query, for the dataset in Figure 1, retrieves <i, 5>, <h, 7>,
<m, 8> (in this order), where the number with each point indicates its score.
The difference from skyline queries is that the output changes according to the
input function and the retrieved points are not guaranteed to be part of the
skyline (h and m are dominated by i). Database techniques for top-K queries
include Prefer [Hristidis et al. 2001] and Onion [Chang et al. 2000], which are
based on prematerialization and convex hulls, respectively. Several methods
have been proposed for combining the results of multiple top-K queries [Fagin
et al. 2001; Natsev et al. 2001].
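Continuing with the same data, the following sketch (ours; it simply sorts by the given preference function) reproduces the top-3 result and makes the contrast with the skyline explicit: h and m score well under f(x, y) = x + y but are dominated by i.

```python
# Top-K vs. skyline on the same data (our illustration): a top-K query is a
# sort by the preference score, and its answers need not be skyline points.
HOTELS = {'a': (1, 9), 'b': (2, 10), 'c': (4, 8), 'd': (6, 7), 'e': (9, 10),
          'f': (7, 5), 'g': (5, 6), 'h': (4, 3), 'i': (3, 2), 'k': (9, 1),
          'l': (10, 4), 'm': (6, 2), 'n': (8, 3)}

def top_k(points, f, k):
    """Return the k ids with the smallest score under the preference function f."""
    return sorted(points, key=lambda pid: f(*points[pid]))[:k]

best3 = top_k(HOTELS, lambda x, y: x + y, 3)
print([(pid, sum(HOTELS[pid])) for pid in best3])
# [('i', 5), ('h', 7), ('m', 8)] -- h and m are not in the skyline {a, i, k}
```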
   Nearest-neighbor queries specify a query point q and output the objects clos-
est to q, in increasing order of their distance. Existing database algorithms as-
sume that the objects are indexed by an R-tree (or some other data-partitioning
method) and apply branch-and-bound search. In particular, the depth-first al-
gorithm of Roussopoulos et al. [1995] starts from the root of the R-tree and re-
cursively visits the entry closest to the query point. Entries, which are farther
than the nearest neighbor already found, are pruned. The best-first algorithm
of Henrich [1994] and Hjaltason and Samet [1999] inserts the entries of the
visited nodes in a heap, and follows the one closest to the query point. The re-
lation between skyline queries and nearest-neighbor search has been exploited
by previous skyline algorithms and will be discussed in Section 2.
   Skylines, and other directly related problems such as multiobjective opti-
mization [Steuer 1986], maximum vectors [Kung et al. 1975; Matousek 1991],
and the contour problem [McLain 1974], have been extensively studied and nu-
merous algorithms have been proposed for main-memory processing. To the best
of our knowledge, however, the first work addressing skylines in the context of
databases was Borzsonyi et al. [2001], which develops algorithms based on block
nested loops, divide-and-conquer, and index scanning. An improved version of
block nested loops is presented in Chomicki et al. [2003]. Tan et al. [2001] pro-
posed progressive (or on-line) algorithms that can output skyline points without
having to scan the entire data input. Kossmann et al. [2002] presented an algo-
rithm, called NN due to its reliance on nearest-neighbor search, which applies
the divide-and-conquer framework on datasets indexed by R-trees. The exper-
imental evaluation of Kossmann et al. [2002] showed that NN outperforms
previous algorithms in terms of overall performance and general applicability
independently of the dataset characteristics, while it supports on-line process-
ing efficiently.
   Despite its advantages, NN has also some serious shortcomings such as
need for duplicate elimination, multiple node visits, and large space require-
ments. Motivated by this fact, we propose a progressive algorithm called branch
and bound skyline (BBS), which, like NN, is based on nearest-neighbor search
on multidimensional access methods, but (unlike NN) is optimal in terms of
node accesses. We experimentally and analytically show that BBS outper-
forms NN (usually by orders of magnitude) for all problem instances, while




                                     Fig. 2. Divide-and-conquer.

incurring less space overhead. In addition to its efficiency, the proposed algo-
rithm is simple and easily extendible to several practical variations of skyline
queries.
   The rest of the article is organized as follows: Section 2 reviews previous
secondary-memory algorithms for skyline computation, discussing their advan-
tages and limitations. Section 3 introduces BBS, proves its optimality, and an-
alyzes its performance and space consumption. Section 4 proposes alternative
skyline queries and illustrates their processing using BBS. Section 5 introduces
the concept of approximate skylines, and Section 6 experimentally evaluates
BBS, comparing it against NN under a variety of settings. Finally, Section 7
concludes the article and describes directions for future work.

2. RELATED WORK
This section surveys existing secondary-memory algorithms for computing sky-
lines, namely: (1) divide-and-conquer, (2) block nested loop, (3) sort first skyline,
(4) bitmap, (5) index, and (6) nearest neighbor. Specifically, (1) and (2) were pro-
posed in Borzsonyi et al. [2001], (3) in Chomicki et al. [2003], (4) and (5) in Tan
et al. [2001], and (6) in Kossmann et al. [2002]. We do not consider the sorted list
scan, and the B-tree algorithms of Borzsonyi et al. [2001] due to their limited
applicability (only for two dimensions) and poor performance, respectively.

2.1 Divide-and-Conquer
The divide-and-conquer (D&C) approach divides the dataset into several par-
titions so that each partition fits in memory. Then, the partial skyline of the
points in every partition is computed using a main-memory algorithm (e.g.,
Matousek [1991]), and the final skyline is obtained by merging the partial ones.
Figure 2 shows an example using the dataset of Figure 1. The data space is di-
vided into four partitions s1 , s2 , s3 , s4 , with partial skylines {a, c, g }, {d }, {i},
{m, k}, respectively. In order to obtain the final skyline, we need to remove
those points that are dominated by some point in other partitions. Obviously
all points in the skyline of s3 must appear in the final skyline, while those in s2

are discarded immediately because they are dominated by any point in s3 (in
fact s2 needs to be considered only if s3 is empty). Each skyline point in s1 is
compared only with points in s3 , because no point in s2 or s4 can dominate those
in s1 . In this example, points c, g are removed because they are dominated by
i. Similarly, the skyline of s4 is also compared with points in s3 , which results in
the removal of m. Finally, the algorithm terminates with the remaining points
{a, i, k}. D&C is efficient only for small datasets (e.g., if the entire dataset fits
in memory then the algorithm requires only one application of a main-memory
skyline algorithm). For large datasets, the partitioning process requires read-
ing and writing the entire dataset at least once, thus incurring significant I/O
cost. Further, this approach is not suitable for on-line processing because it
cannot report any skyline until the partitioning phase completes.
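A toy in-memory rendering of the D&C idea is sketched below (ours; a real implementation partitions the data so that every piece fits in memory and merges partial skylines pairwise, as described above): split at the coordinate medians, compute each partial skyline, and filter the union of the partial skylines.

```python
# A toy divide-and-conquer sketch (ours): split the plane at the coordinate
# medians into four partitions, compute each partial skyline in memory, then
# merge by filtering the union of the partial skylines.
from statistics import median

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):                       # in-memory O(n^2) skyline
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def dc_skyline(points):
    mx, my = median(x for x, _ in points), median(y for _, y in points)
    parts = {}
    for p in points:                       # four partitions (cf. Figure 2)
        parts.setdefault((p[0] <= mx, p[1] <= my), []).append(p)
    partial = [q for part in parts.values() for q in skyline(part)]
    return skyline(partial)                # final merge over the partial skylines

data = [(1, 9), (2, 10), (4, 8), (6, 7), (9, 10), (7, 5), (5, 6),
        (4, 3), (3, 2), (9, 1), (10, 4), (6, 2), (8, 3)]
print(sorted(dc_skyline(data)))            # [(1, 9), (3, 2), (9, 1)] -> a, i, k
```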


2.2 Block Nested Loop and Sort First Skyline
A straightforward approach to compute the skyline is to compare each point p
with every other point, and report p as part of the skyline if it is not dominated.
Block nested loop (BNL) builds on this concept by scanning the data file and
keeping a list of candidate skyline points in main memory. At the beginning,
the list contains the first data point, while for each subsequent point p, there
are three cases: (i) if p is dominated by any point in the list, it is discarded as it
is not part of the skyline; (ii) if p dominates any point in the list, it is inserted,
and all points in the list dominated by p are dropped; and (iii) if p is neither
dominated by, nor dominates, any point in the list, it is simply inserted without
dropping any point.
   The list is self-organizing because every point found dominating other points
is moved to the top. This reduces the number of comparisons as points that
dominate multiple other points are likely to be checked first. A problem of BNL
is that the list may become larger than the main memory. When this happens,
all points falling in the third case (cases (i) and (ii) do not increase the list size)
are added to a temporary file. This fact necessitates multiple passes of BNL. In
particular, after the algorithm finishes scanning the data file, only points that
were inserted in the list before the creation of the temporary file are guaranteed
to be in the skyline and are output. The remaining points must be compared
against the ones in the temporary file. Thus, BNL has to be executed again,
this time using the temporary (instead of the data) file as input.
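The following sketch (ours; the candidate list is kept entirely in memory, so the temporary-file overflow machinery described above is omitted) captures the three BNL cases and the self-organizing move-to-front heuristic.

```python
# Block-nested-loop sketch (ours; no temporary-file handling).
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def bnl_skyline(stream):
    window = []                                       # candidate skyline points
    for p in stream:
        discarded = False
        for w in list(window):
            if dominates(w, p):                       # case (i): discard p
                window.remove(w)
                window.insert(0, w)                   # self-organizing move-to-front
                discarded = True
                break
            if dominates(p, w):                       # case (ii): drop w
                window.remove(w)
        if not discarded:
            window.append(p)                          # cases (ii)/(iii): keep p
    return window

data = [(1, 9), (2, 10), (4, 8), (6, 7), (9, 10), (7, 5), (5, 6),
        (4, 3), (3, 2), (9, 1), (10, 4), (6, 2), (8, 3)]
print(sorted(bnl_skyline(data)))                      # [(1, 9), (3, 2), (9, 1)]
```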
   The advantage of BNL is its wide applicability, since it can be used for any
dimensionality without indexing or sorting the data file. Its main problems are
the reliance on main memory (a small memory may lead to numerous iterations)
and its inadequacy for progressive processing (it has to read the entire data file
before it returns the first skyline point). The sort first skyline (SFS) variation
of BNL alleviates these problems by first sorting the entire dataset according
to a (monotone) preference function. Candidate points are inserted into the list
in ascending order of their scores, because points with lower scores are likely to
dominate a large number of points, thus rendering the pruning more effective.
SFS exhibits progressive behavior because the presorting ensures that a point
p dominating another p' must be visited before p'; hence we can immediately

                                    Table I. The Bitmap Approach
                          id    Coordinate        Bitmap Representation
                          a        (1, 9)       (1111111111, 1100000000)
                           b      (2, 10)       (1111111110, 1000000000)
                           c       (4, 8)       (1111111000, 1110000000)
                          d        (6, 7)       (1111100000, 1111000000)
                           e      (9, 10)       (1100000000, 1000000000)
                           f       (7, 5)       (1111000000, 1111110000)
                           g       (5, 6)       (1111110000, 1111100000)
                          h        (4, 3)       (1111111000, 1111111100)
                           i       (3, 2)       (1111111100, 1111111110)
                           k       (9, 1)       (1100000000, 1111111111)
                           l      (10, 4)       (1000000000, 1111111000)
                          m        (6, 2)       (1111100000, 1111111110)
                          n        (8, 3)       (1110000000, 1111111100)



output the points inserted to the list as skyline points. Nevertheless, SFS has
to scan the entire data file to return a complete skyline, because even a skyline
point may have a very large score and thus appear at the end of the sorted list
(e.g., in Figure 1, point a has the third largest score for the preference function
0 · distance + 1 · price). Another problem of SFS (and BNL) is that the order in
which the skyline points are reported is fixed (and decided by the sort order),
while as discussed in Section 2.6, a progressive skyline algorithm should be
able to report points according to user-specified scoring functions.

2.3 Bitmap
This technique encodes in bitmaps all the information needed to decide whether
a point is in the skyline. Toward this, a data point p = ( p1 , p2 , . . . , pd ), where
d is the number of dimensions, is mapped to an m-bit vector, where m is the
total number of distinct values over all dimensions. Let ki be the total number
of distinct values on the ith dimension (i.e., m = k1 + k2 + · · · + kd). In Figure 1, for
example, there are k1 = k2 = 10 distinct values on the x, y dimensions and
m = 20. Assume that pi is the ji th smallest number on the ith axis; then it
is represented by ki bits, where the leftmost (ki − ji + 1) bits are 1, and the
remaining ones 0. Table I shows the bitmaps for points in Figure 1. Since point
a has the smallest value (1) on the x axis, all bits of a1 are 1. Similarly, since
a2 (= 9) is the ninth smallest on the y axis, the first 10 − 9 + 1 = 2 bits of its
representation are 1, while the remaining ones are 0.
    Consider that we want to decide whether a point, for example, c with bitmap
representation (1111111000, 1110000000), belongs to the skyline. The right-
most bits equal to 1 are the fourth and the eighth (counting bit positions from
the right), on dimensions x and y, respectively. The algorithm creates two bit-strings,
cX = 1110000110000 and cY = 0011011111111, by juxtaposing the corresponding bits (i.e., the fourth
and eighth) of every point. In Table I, these bit-strings (shown in bold) contain
13 bits (one from each object, starting from a and ending with n). The 1s in the
result of cX & cY = 0010000110000 indicate the points that dominate c, that
is, c, h, and i. Obviously, if there is more than a single 1, the considered point

                                 Table II. The Index Approach
                               List 1                      List 2
                      a (1, 9)     minC = 1   k (9, 1)              minC = 1
                      b (2, 10)    minC = 2   i (3, 2), m (6, 2)    minC = 2
                      c (4, 8)     minC = 4   h (4, 3), n (8, 3)    minC = 3
                      g (5, 6)     minC = 5   l (10, 4)             minC = 4
                      d (6, 7)     minC = 6   f (7, 5)              minC = 5
                      e (9, 10)    minC = 9


is not in the skyline.2 The same operations are repeated for every point in the
dataset to obtain the entire skyline.
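The sketch below (ours; it uses plain Python lists of 0/1 rather than packed bit-strings, and ignores the coinciding-points case of footnote 2) builds the rank-based bitmaps and performs the juxtaposition-and-AND test for each point.

```python
# Bitmap sketch (ours). A point is in the skyline iff the AND of its
# juxtaposed bit-slices contains exactly one 1 (the point itself).
def build_bitmaps(points):
    dims = len(points[0])
    ranks = []                                   # ranks[i][v] = rank of value v on axis i
    for i in range(dims):
        distinct = sorted({p[i] for p in points})
        ranks.append({v: j + 1 for j, v in enumerate(distinct)})
    bitmaps = []
    for p in points:
        vec = []
        for i in range(dims):
            k, j = len(ranks[i]), ranks[i][p[i]]
            vec.append([1] * (k - j + 1) + [0] * (j - 1))   # leftmost k-j+1 bits are 1
        bitmaps.append(vec)
    return ranks, bitmaps

def in_skyline(idx, points, ranks, bitmaps):
    slices = []
    for i in range(len(points[0])):
        pos = len(ranks[i]) - ranks[i][points[idx][i]]      # rightmost 1 of this point
        slices.append([bm[i][pos] for bm in bitmaps])       # juxtaposed bits
    conj = [all(bits) for bits in zip(*slices)]             # bitwise AND across axes
    return sum(conj) == 1                                   # only the point itself

data = [(1, 9), (2, 10), (4, 8), (6, 7), (9, 10), (7, 5), (5, 6),
        (4, 3), (3, 2), (9, 1), (10, 4), (6, 2), (8, 3)]
ranks, bitmaps = build_bitmaps(data)
print([p for n, p in enumerate(data) if in_skyline(n, data, ranks, bitmaps)])
# [(1, 9), (3, 2), (9, 1)]
```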
   The efficiency of bitmap relies on the speed of bit-wise operations. The ap-
proach can quickly return the first few skyline points according to their inser-
tion order (e.g., alphabetical order in Table I), but, as with BNL and SFS, it
cannot adapt to different user preferences. Furthermore, the computation of
the entire skyline is expensive because, for each point inspected, it must re-
trieve the bitmaps of all points in order to obtain the juxtapositions. Also the
space consumption may be prohibitive, if the number of distinct values is large.
Finally, the technique is not suitable for dynamic datasets where insertions
may alter the rankings of attribute values.

2.4 Index
The index approach organizes a set of d -dimensional points into d lists such
that a point p = ( p1 , p2 , . . . , pd ) is assigned to the ith list (1 ≤ i ≤ d ), if and
only if its coordinate pi on the ith axis is the minimum among all dimensions, or
formally, pi ≤ pj for all j ≠ i. Table II shows the lists for the dataset of Figure 1.
Points in each list are sorted in ascending order of their minimum coordinate
(minC, for short) and indexed by a B-tree. A batch in the ith list consists of
points that have the same ith coordinate (i.e., minC). In Table II, every point
of list 1 constitutes an individual batch because all x coordinates are different.
Points in list 2 are divided into five batches {k}, {i, m}, {h, n}, {l }, and { f }.
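A short sketch of the list construction (ours; coordinate ties are broken toward the first axis, which is one of the assignments permitted by the definition):

```python
# Index-list construction sketch (ours).
from collections import defaultdict
from itertools import groupby

def build_lists(points):
    lists = defaultdict(list)
    for p in points:
        i = min(range(len(p)), key=lambda d: p[d])     # axis of the minimum coordinate
        lists[i].append((p[i], p))                     # (minC, point)
    batches = {}
    for i, entries in lists.items():
        entries.sort(key=lambda e: e[0])               # ascending minC (B-tree order)
        batches[i] = [(minc, [p for _, p in grp])      # batch = equal-minC group
                      for minc, grp in groupby(entries, key=lambda e: e[0])]
    return batches

data = [(1, 9), (2, 10), (4, 8), (6, 7), (9, 10), (7, 5), (5, 6),
        (4, 3), (3, 2), (9, 1), (10, 4), (6, 2), (8, 3)]
for axis, bs in sorted(build_lists(data).items()):
    print("list", axis + 1, ":", bs)                   # matches the batches of Table II
```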
    Initially, the algorithm loads the first batch of each list, and handles the one
with the minimum minC. In Table II, the first batches {a}, {k} have identical
minC = 1, in which case the algorithm handles the batch from list 1. Processing
a batch involves (i) computing the skyline inside the batch, and (ii) among the
computed points, it adds the ones not dominated by any of the already-found
skyline points into the skyline list. Continuing the example, since batch {a}
contains a single point and no skyline point is found so far, a is added to the
skyline list. The next batch {b} in list 1 has minC = 2; thus, the algorithm
handles batch {k} from list 2. Since k is not dominated by a, it is inserted in
the skyline. Similarly, the next batch handled is {b} from list 1, where b is
dominated by point a (already in the skyline). The algorithm proceeds with
batch {i, m}, computes the skyline inside the batch that contains a single point
i (i.e., i dominates m), and adds i to the skyline. At this step, the algorithm does

2 The result of “&” will contain several 1s if multiple skyline points coincide. This case can be
handled with an additional “or” operation [Tan et al. 2001].





                                       Fig. 3. Example of NN.
not need to proceed further, because both coordinates of i are smaller than or
equal to the minC (i.e., 4, 3) of the next batches (i.e., {c}, {h, n}) of lists 1 and
2. This means that all the remaining points (in both lists) are dominated by i,
and the algorithm terminates with {a, i, k}.
   Although this technique can quickly return skyline points at the top of the
lists, the order in which the skyline points are returned is fixed, not supporting
user-defined preferences. Furthermore, as indicated in Kossmann et al. [2002],
the lists computed for d dimensions cannot be used to retrieve the skyline on any
subset of the dimensions because the list that an element belongs to may change
according the subset of selected dimensions. In general, for supporting queries
on arbitrary dimensions, an exponential number of lists must be precomputed.

2.5 Nearest Neighbor
NN uses the results of nearest-neighbor search to partition the data universe
recursively. As an example, consider the application of the algorithm to the
dataset of Figure 1, which is indexed by an R-tree [Guttman 1984; Sellis et al.
1987; Beckmann et al. 1990]. NN performs a nearest-neighbor query (using an
existing algorithm such as one of those proposed by Roussopoulos et al. [1995] or
Hjaltason and Samet [1999]) on the R-tree, to find the point with the minimum
distance (mindist) from the beginning of the axes (point o). Without loss of
generality,3 we assume that distances are computed according to the L1 norm,
that is, the mindist of a point p from the beginning of the axes equals the sum
of the coordinates of p. It can be shown that the first nearest neighbor (point
i with mindist 5) is part of the skyline. On the other hand, all the points in
the dominance region of i (shaded area in Figure 3(a)) can be pruned from
further consideration. The remaining space is split in two partitions based on
the coordinates (ix, iy) of point i: (i) [0, ix) × [0, ∞) and (ii) [0, ∞) × [0, iy). In
Figure 3(a), the first partition contains subdivisions 1 and 3, while the second
one contains subdivisions 1 and 2.
   The partitions resulting after the discovery of a skyline point are inserted in
a to-do list. While the to-do list is not empty, NN removes one of the partitions

3 NN (and BBS) can be applied with any monotone function; the skyline points are the same, but
the order in which they are discovered may be different.





                       Fig. 4. NN partitioning for three-dimensions.

from the list and recursively repeats the same process. For instance, point a is
the nearest neighbor in partition [0, ix) × [0, ∞), which causes the insertion of
partitions [0, ax) × [0, ∞) (subdivisions 5 and 7 in Figure 3(b)) and [0, ix) × [0, ay)
(subdivisions 5 and 6 in Figure 3(b)) in the to-do list. If a partition is empty, it is
not subdivided further. In general, if d is the dimensionality of the data-space,
a new skyline point causes d recursive applications of NN. In particular, each
coordinate of the discovered point splits the corresponding axis, introducing a
new search region towards the origin of the axis.
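The partition bookkeeping is easy to sketch (ours; partitions are kept as per-axis half-open ranges, and only the axis of the splitting coordinate is restricted in each new partition):

```python
# Sketch (ours) of how a newly found skyline point splits the current search
# partition: each coordinate closes the corresponding axis, yielding d new
# partitions for the to-do list.
import math

def split_partitions(partition, point):
    new_parts = []
    for axis, coord in enumerate(point):
        ranges = list(partition)
        low, high = ranges[axis]
        ranges[axis] = (low, min(high, coord))   # restrict only this axis to [low, coord)
        new_parts.append(ranges)
    return new_parts

universe = [(0, math.inf), (0, math.inf)]
for part in split_partitions(universe, (3, 2)):  # first skyline point i = (3, 2)
    print(part)
# [(0, 3), (0, inf)]  and  [(0, inf), (0, 2)], as in Figure 3(a)
```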
   Figure 4(a) shows a three-dimensional (3D) example, where point n with
coordinates (nx, ny, nz) is the first nearest neighbor (i.e., skyline point). The NN
algorithm will be recursively called for the partitions (i) [0, nx) × [0, ∞) × [0, ∞)
(Figure 4(b)), (ii) [0, ∞) × [0, ny) × [0, ∞) (Figure 4(c)), and (iii) [0, ∞) × [0, ∞) × [0, nz)
(Figure 4(d)). Among the eight space subdivisions shown in Figure 4, the eighth
one will not be searched by any query since it is dominated by point n. Each
of the remaining subdivisions, however, will be searched by two queries, for
example, a skyline point in subdivision 2 will be discovered by both the second
and third queries.
   In general, for d > 2, the overlapping of the partitions necessitates dupli-
cate elimination. Kossmann et al. [2002] proposed the following elimination
methods:

—Laisser-faire: A main memory hash table stores the skyline points found so
 far. When a point p is discovered, it is probed and, if it already exists in the
 hash table, p is discarded; otherwise, p is inserted into the hash table. The
 technique is straightforward and incurs minimum CPU overhead, but results
 in very high I/O cost since large parts of the space will be accessed by multiple
 queries.
—Propagate: When a point p is found, all the partitions in the to-do list that
 contain p are removed and repartitioned according to p. The new partitions
 are inserted into the to-do list. Although propagate does not discover the same

 skyline point twice, it incurs high CPU cost because the to-do list is scanned
 every time a skyline point is discovered.
—Merge: The main idea is to merge partitions in to-do, thus reducing the num-
 ber of queries that have to be performed. Partitions that are contained in
 other ones can be eliminated in the process. Like propagate, merge also in-
 curs high CPU cost since it is expensive to find good candidates for merging.
—Fine-grained partitioning: The original NN algorithm generates d partitions
 after a skyline point is found. An alternative approach is to generate 2d
 nonoverlapping subdivisions. In Figure 4, for instance, the discovery of point
n will lead to six new queries (i.e., 2^3 − 2, since subdivisions 1 and 8 cannot
 contain any skyline points). Although fine-grained partitioning avoids dupli-
 cates, it generates the more complex problem of false hits, that is, it is possible
 that points in one subdivision (e.g., subdivision 4) are dominated by points
 in another (e.g., subdivision 2) and should be eliminated.
  According to the experimental evaluation of Kossmann et al. [2002], the
performance of laisser-faire and merge was unacceptable, while fine-grained
partitioning was not implemented due to the false hits problem. Propagate
was significantly more efficient, but the best results were achieved by a hybrid
method combining propagate and laisser-faire.

2.6 Discussion About the Existing Algorithms
We summarize this section with a comparison of the existing methods, based
on the experiments of Tan et al. [2001], Kossmann et al. [2002], and Chomicki
et al. [2003]. Tan et al. [2001] examined BNL, D&C, bitmap, and index, and
suggested that index is the fastest algorithm for producing the entire skyline
under all settings. D&C and bitmap are not favored by correlated datasets
(where the skyline is small) as the overhead of partition-merging and bitmap-
loading, respectively, does not pay-off. BNL performs well for small skylines,
but its cost increases fast with the skyline size (e.g., for anticorrelated datasets,
high dimensionality, etc.) due to the large number of iterations that must be
performed. Tan et al. [2001] also showed that index has the best performance in
returning skyline points progressively, followed by bitmap. The experiments of
Chomicki et al. [2003] demonstrated that SFS is in most cases faster than BNL
without, however, comparing it with other algorithms. According to the eval-
uation of Kossmann et al. [2002], NN returns the entire skyline more quickly
than index (hence also more quickly than BNL, D&C, and bitmap) for up to four
dimensions, and their difference increases (sometimes to orders of magnitudes)
with the skyline size. Although index can produce the first few skyline points in
shorter time, these points are not representative of the whole skyline (as they
are good on only one axis while having large coordinates on the others).
   Kossmann et al. [2002] also suggested a set of criteria (adopted from Heller-
stein et al. [1999]) for evaluating the behavior and applicability of progressive
skyline algorithms:
 (i) Progressiveness: the first results should be reported to the user almost
     instantly and the output size should gradually increase.

 (ii) Absence of false misses: given enough time, the algorithm should generate
      the entire skyline.
(iii) Absence of false hits: the algorithm should not discover temporary skyline
      points that will be later replaced.
(iv) Fairness: the algorithm should not favor points that are particularly good
      in one dimension.
 (v) Incorporation of preferences: the users should be able to determine the
      order according to which skyline points are reported.
(vi) Universality: the algorithm should be applicable to any dataset distribu-
      tion and dimensionality, using some standard index structure.
    All the methods satisfy criterion (ii), as they deal with exact (as opposed to
approximate) skyline computation. Criteria (i) and (iii) are violated by D&C and
BNL since they require at least a scan of the data file before reporting skyline
points and they both insert points (in partial skylines or the self-organizing
list) that are later removed. Furthermore, SFS and bitmap need to read the
entire file before termination, while index and NN can terminate as soon as all
skyline points are discovered. Criteria (iv) and (vi) are violated by index because
it outputs the points according to their minimum coordinates in some dimension
and cannot handle skylines in some subset of the original dimensionality. All
algorithms, except NN, defy criterion (v); NN can incorporate preferences by
simply changing the distance definition according to the input scoring function.
    Finally, note that progressive behavior requires some form of preprocessing,
that is, index creation (index, NN), sorting (SFS), or bitmap creation (bitmap).
This preprocessing is a one-time effort since it can be used by all subsequent
queries provided that the corresponding structure is updateable in the presence
of record insertions and deletions. The maintenance of the sorted list in SFS can
be performed by building a B+-tree on top of the list. The insertion of a record
in index simply adds the record in the list that corresponds to its minimum
coordinate; similarly, deletion removes the record from the list. NN can also
be updated incrementally as it is based on a fully dynamic structure (i.e., the
R-tree). On the other hand, bitmap is aimed at static datasets because a record
insertion/deletion may alter the bitmap representation of numerous (in the
worst case, of all) records.

3. BRANCH-AND-BOUND SKYLINE ALGORITHM
Despite its general applicability and performance advantages compared to ex-
isting skyline algorithms, NN has some serious shortcomings, which are de-
scribed in Section 3.1. Then Section 3.2 proposes the BBS algorithm and proves
its correctness. Section 3.3 analyzes the performance of BBS and illustrates its
I/O optimality. Finally, Section 3.4 discusses the incremental maintenance of
skylines in the presence of database updates.

3.1 Motivation
A recursive call of the NN algorithm terminates when the corresponding
nearest-neighbor query does not retrieve any point within the corresponding




                                       Fig. 5. Recursion tree.


space. Let us call such a query empty, to distinguish it from nonempty queries
that return results, each spawning d new recursive applications of the algo-
rithm (where d is the dimensionality of the data space). Figure 5 shows a
query processing tree, where empty queries are illustrated as transparent
circles. For the second level of recursion, for instance, the second query does not
return any results, in which case the recursion will not proceed further. Some
of the nonempty queries may be redundant, meaning that they return sky-
line points already found by previous queries. Let s be the number of skyline
points in the result, e the number of empty queries, ne the number of nonempty
ones, and r the number of redundant queries. Since every nonempty query
either retrieves a skyline point, or is redundant, we have ne = s + r. Fur-
thermore, the number of empty queries in Figure 5 equals the number of leaf
nodes in the recursion tree, that is, e = ne · (d − 1) + 1. By combining the two
equations, we get e = (s + r) · (d − 1) + 1. Each query must traverse a whole
path from the root to the leaf level of the R-tree before it terminates; there-
fore, its I/O cost is at least h node accesses, where h is the height of the tree.
Summarizing the above observations, the total number of accesses for NN is:
NA_NN ≥ (e + s + r) · h = (s + r) · h · d + h > s · h · d. The value s · h · d is a rather
optimistic lower bound since, for d > 2, the number r of redundant queries
may be very high (depending on the duplicate elimination method used), and
queries normally incur more than h node accesses.
    Another problem of NN concerns the to-do list size, which can exceed that of
the dataset for as low as three dimensions, even without considering redundant
queries. Assume, for instance, a 3D uniform dataset (cardinality N ) and a sky-
line query with the preference function f (x, y, z) = x. The first skyline point
n (nx, ny, nz) has the smallest x coordinate among all data points, and adds
partitions Px = [0, nx) × [0, ∞) × [0, ∞), Py = [0, ∞) × [0, ny) × [0, ∞), Pz = [0, ∞) ×
[0, ∞) × [0, nz) in the to-do list. Note that the NN query in Px is empty because
there is no other point whose x coordinate is below nx. On the other hand, the
expected volume of Py (Pz) is 1/2 (assuming unit axis length on all dimensions),
because the nearest neighbor is decided solely on x coordinates, and hence ny
(nz) distributes uniformly in [0, 1]. Following the same reasoning, a NN in Py
finds the second skyline point that introduces three new partitions such that
one partition leads to an empty query, while the volumes of the other two are
1/4. Pz is handled similarly, after which the to-do list contains four partitions
with volumes 1/4, and 2 empty partitions. In general, after the ith level of re-
cursion, the to-do list contains 2^i partitions with volume 1/2^i, and 2^(i−1) empty




                               Fig. 6. R-tree example.

partitions. The algorithm terminates when 1/2^i < 1/N (i.e., i > log N) so that
all partitions in the to-do list are empty. Assuming that the empty queries are
performed at the end, the size of the to-do list can be obtained by summing the
number e of empty queries at each recursion level i:

                                 Σ_{i=1}^{log N} 2^(i−1) = N − 1.

The implication of the above equation is that, even in 3D, NN may behave
like a main-memory algorithm (since the to-do list, which resides in memory,
is the same order of size as the input dataset). Using the same reasoning, for
arbitrary dimensionality d > 2, e = Θ((d − 1)^(log N)), that is, the to-do list may
become orders of magnitude larger than the dataset, which seriously limits
the applicability of NN. In fact, as shown in Section 6, the algorithm does not
terminate in the majority of experiments involving four and five dimensions.

3.2 Description of BBS
Like NN, BBS is also based on nearest-neighbor search. Although both algo-
rithms can be used with any data-partitioning method, in this article we use
R-trees due to their simplicity and popularity. The same concepts can be ap-
plied with other multidimensional access methods for high-dimensional spaces,
where the performance of R-trees is known to deteriorate. Furthermore, as
claimed in Kossmann et al. [2002], most applications involve up to five di-
mensions, for which R-trees are still efficient. For the following discussion, we
use the set of 2D data points of Figure 1, organized in the R-tree of Figure 6
with node capacity = 3. An intermediate entry ei corresponds to the minimum
bounding rectangle (MBR) of a node Ni at the lower level, while a leaf entry
corresponds to a data point. Distances are computed according to L1 norm, that
is, the mindist of a point equals the sum of its coordinates and the mindist of a
MBR (i.e., intermediate entry) equals the mindist of its lower-left corner point.
    BBS, similar to the previous algorithms for nearest neighbors [Roussopoulos
                                                                  o
et al. 1995; Hjaltason and Samet 1999] and convex hulls [B¨ hm and Kriegel
2001], adopts the branch-and-bound paradigm. Specifically, it starts from the
root node of the R-tree and inserts all its entries (e6 , e7 ) in a heap sorted ac-
cording to their mindist. Then, the entry with the minimum mindist (e7 ) is
“expanded”. This expansion removes the entry (e7 ) from the heap and inserts

                                      Table III. Heap Contents
                Action                       Heap Contents                       S
                Access root    <e7, 4><e6, 6>                                   Ø
                Expand e7      <e3, 5><e6, 6><e5, 8><e4, 10>                    Ø
                Expand e3      <i, 5><e6, 6><h, 7><e5, 8><e4, 10><g, 11>        {i}
                Expand e6      <h, 7><e5, 8><e1, 9><e4, 10><g, 11>              {i}
                Expand e1      <a, 10><e4, 10><g, 11><b, 12><c, 12>           {i, a}
                Expand e4      <k, 10><g, 11><b, 12><c, 12><l, 14>          {i, a, k}




                                       Fig. 7. BBS algorithm.


its children (e3 , e4 , e5 ). The next expanded entry is again the one with the min-
imum mindist (e3 ), in which the first nearest neighbor (i) is found. This point
(i) belongs to the skyline, and is inserted to the list S of skyline points.
    Notice that up to this step BBS behaves like the best-first nearest-neighbor
algorithm of Hjaltason and Samet [1999]. The next entry to be expanded is
e6 . Although the nearest-neighbor algorithm would now terminate since the
mindist (6) of e6 is greater than the distance (5) of the nearest neighbor (i)
already found, BBS will proceed because node N6 may contain skyline points
(e.g., a). Among the children of e6 , however, only the ones that are not dominated
by some point in S are inserted into the heap. In this case, e2 is pruned because
it is dominated by point i. The next entry considered (h) is also pruned as it
also is dominated by point i. The algorithm proceeds in the same manner until
the heap becomes empty. Table III shows the ids and the mindist of the entries
inserted in the heap (skyline points are bold).
    The pseudocode for BBS is shown in Figure 7. Notice that an entry is checked
for dominance twice: before it is inserted in the heap and before it is expanded.
The second check is necessary because an entry (e.g., e5 ) in the heap may become
dominated by some skyline point discovered after its insertion (therefore, the
entry does not need to be visited).
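The main loop is short enough to sketch in Python (ours; the leaf groupings of Figure 6 are reconstructed from the mindist values of Tables III and IV, and a real implementation of course operates on a disk-based R-tree):

```python
# BBS sketch (ours), run on a hand-built copy of the tree of Figure 6.
# mindist is the L1 distance of an entry's lower-left corner from the origin.
import heapq

def dominated(corner, skyline):
    """Weak dominance of an entry's lower-left corner by some skyline point."""
    return any(all(s <= c for s, c in zip(p, corner)) for p in skyline)

def bbs(root_entries):
    skyline, heap, tie = [], [], 0
    for e in root_entries:
        tie += 1
        heapq.heappush(heap, (sum(e["mbr"]), tie, e))
    while heap:
        _, _, e = heapq.heappop(heap)
        if dominated(e["mbr"], skyline):              # second dominance check
            continue
        if "point" in e:                              # leaf entry: a data point
            skyline.append(e["point"])
        else:                                         # intermediate entry: expand it
            for child in e["children"]:
                if not dominated(child["mbr"], skyline):   # first dominance check
                    tie += 1
                    heapq.heappush(heap, (sum(child["mbr"]), tie, child))
    return skyline

def leaf(*pts):                                       # leaf node holding data points
    return [{"mbr": p, "point": p} for p in pts]

def node(children):   # only the lower-left corner is needed for the checks here
    low = tuple(min(c["mbr"][d] for c in children) for d in (0, 1))
    return {"mbr": low, "children": children}

e1 = node(leaf((1, 9), (2, 10), (4, 8)))              # {a, b, c}
e2 = node(leaf((6, 7), (9, 10), (7, 5)))              # {d, e, f}
e3 = node(leaf((5, 6), (4, 3), (3, 2)))               # {g, h, i}
e4 = node(leaf((9, 1), (10, 4)))                      # {k, l}
e5 = node(leaf((6, 2), (8, 3)))                       # {m, n}
root = [node([e3, e4, e5]), node([e1, e2])]           # e7 and e6

print(bbs(root))   # [(3, 2), (1, 9), (9, 1)] -> i, a, k, in ascending mindist order
```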
    Next we prove the correctness for BBS.

  LEMMA 1. BBS visits (leaf and intermediate) entries of an R-tree in ascend-
ing order of their distance to the origin of the axis.





                        Fig. 8. Entries of the main-memory R-tree.

   PROOF. The proof is straightforward since the algorithm always visits en-
tries according to their mindist order preserved by the heap.
   LEMMA 2. Any data point added to S during the execution of the algorithm
is guaranteed to be a final skyline point.
    PROOF. Assume, on the contrary, that point p j was added into S, but it is not
a final skyline point. Then p j must be dominated by a (final) skyline point, say,
pi , whose coordinate on any axis is not larger than the corresponding coordinate
of p j , and at least one coordinate is smaller (since pi and p j are different points).
This in turn means that mindist( pi ) < mindist( p j ). By Lemma 1, pi must be
visited before p j . In other words, at the time p j is processed, pi must have
already appeared in the skyline list, and hence p j should be pruned, which
contradicts the fact that p j was added in the list.
  LEMMA 3. Every data point will be examined, unless one of its ancestor nodes
has been pruned.

   PROOF. The proof is obvious since all entries that are not pruned by an
existing skyline point are inserted into the heap and examined.
   Lemmas 2 and 3 guarantee that, if BBS is allowed to execute until its ter-
mination, it will correctly return all skyline points, without reporting any false
hits. An important issue regards the dominance checking, which can be expen-
sive if the skyline contains numerous points. In order to speed up this process
we insert the skyline points found in a main-memory R-tree. Continuing the
example of Figure 6, for instance, only points i, a, k will be inserted (in this
order) to the main-memory R-tree. Checking for dominance can now be per-
formed in a way similar to traditional window queries. An entry (i.e., node
MBR or data point) is dominated by a skyline point p, if its lower left point
falls inside the dominance region of p, that is, the rectangle defined by p and
the edge of the universe. Figure 8 shows the dominance regions for points i,
a, k and two entries; e is dominated by i and k, while e' is not dominated by
any point (therefore it should be expanded). Note that, in general, most domi-
nance regions will cover a large part of the data space, in which case there will
be significant overlap between the intermediate nodes of the main-memory

R-tree. Unlike traditional window queries that must retrieve all results, this
is not a problem here because we only need to retrieve a single dominance re-
gion in order to determine that the entry is dominated (by at least one skyline
point).
   To conclude this section, we informally evaluate BBS with respect to the
criteria of Hellerstein et al. [1999] and Kossmann et al. [2002], presented in
Section 2.6. BBS satisfies property (i) as it returns skyline points instantly in
ascending order of their distance to the origin, without having to visit a large
part of the R-tree. Lemma 3 ensures property (ii), since every data point is
examined unless some of its ancestors is dominated (in which case the point is
dominated too). Lemma 2 guarantees property (iii). Property (iv) is also fulfilled
because BBS outputs points according to their mindist, which takes into account
all dimensions. Regarding user preferences (v), as we discuss in Section 4.1,
the user can specify the order of skyline points to be returned by appropriate
preference functions. Furthermore, BBS also satisfies property (vi) since it does
not require any specialized indexing structure, but (like NN) it can be applied
with R-trees or any other data-partitioning method. Furthermore, the same
index can be used for any subset of the d dimensions that may be relevant to
different users.

3.3 Analysis of BBS
In this section, we first prove that BBS is I/O optimal, meaning that (i) it visits
only the nodes that may contain skyline points, and (ii) it does not access the
same node twice. Then we provide a theoretical comparison with NN in terms
of the number of node accesses and memory consumption (i.e., the heap versus
the to-do list sizes). Central to the analysis of BBS is the concept of the skyline
search region (SSR), that is, the part of the data space that is not dominated
by any skyline point. Consider for instance the running example (with skyline
points i, a, k). The SSR is the shaded area in Figure 8 defined by the skyline
and the two axes. We start with the following observation.

  LEMMA 4. Any skyline algorithm based on R-trees must access all the nodes
whose MBRs intersect the SSR.

For instance, although entry e' in Figure 8 does not contain any skyline points,
this cannot be determined unless the child node of e' is visited.

   LEMMA 5. If an entry e does not intersect the SSR, then there is a skyline
point p whose distance from the origin of the axes is smaller than the mindist
of e.

   PROOF. Since e does not intersect the SSR, it must be dominated by at
least one skyline point p, meaning that p dominates the lower-left corner of
e. This implies that the distance of p to the origin is smaller than the mindist
of e.

     THEOREM 6.       The number of node accesses performed by BBS is optimal.


   PROOF. First we prove that BBS only accesses nodes that may contain sky-
line points. Assume, to the contrary, that the algorithm also visits an entry
(let it be e in Figure 8) that does not intersect the SSR. Clearly, e should not
be accessed because it cannot contain skyline points. Consider a skyline point
that dominates e (e.g., k). Then, by Lemma 5, the distance of k to the origin is
smaller than the mindist of e. According to Lemma 1, BBS visits the entries of
the R-tree in ascending order of their mindist to the origin. Hence, k must be
processed before e, meaning that e will be pruned by k, which contradicts the
fact that e is visited.
   In order to complete the proof, we need to show that an entry is not visited
multiple times. This is straightforward because entries are inserted into the
heap (and expanded) at most once, according to their mindist.

   Assuming that each leaf node visited contains exactly one skyline point, the
number NA_BBS of node accesses performed by BBS is at most s · h (where s
is the number of skyline points, and h the height of the R-tree). This bound
corresponds to a rather pessimistic case, where BBS has to access a complete
path for each skyline point. Many skyline points, however, may be found in the
same leaf nodes, or in the same branch of a nonleaf node (e.g., the root of the
tree!), so that these nodes only need to be accessed once (our experiments show
that in most cases the number of node accesses at each level of the tree is much
smaller than s). Therefore, BBS is at least d (= s · h · d / (s · h)) times faster than NN
(as explained in Section 3.1, the cost NA_NN of NN is at least s · h · d). In practice,
for d > 2, the speedup is much larger than d (several orders of magnitude) as
NA_NN = s · h · d does not take into account the number r of redundant queries.
   Regarding the memory overhead, the number of entries n_heap in the heap of
BBS is at most (f − 1) · NA_BBS. This is a pessimistic upper bound, because it
assumes that a node expansion removes from the heap the expanded entry and
inserts all its f children (in practice, most children will be dominated by some
discovered skyline point and pruned). Since for independent dimensions the
expected number of skyline points is s = Θ((ln N)^(d−1)/(d − 1)!) (Buchta [1989]),
n_heap ≤ (f − 1) · NA_BBS ≈ (f − 1) · h · s ≈ (f − 1) · h · (ln N)^(d−1)/(d − 1)!. For
d ≥ 3 and typical values of N and f (e.g., N = 10^5 and f ≈ 100), the heap
size is much smaller than the corresponding to-do list size, which as discussed
in Section 3.1 can be in the order of (d − 1)^(log N). Furthermore, a heap entry
stores d + 2 numbers (i.e., entry id, mindist, and the coordinates of the lower-
left corner), as opposed to 2d numbers for to-do list entries (i.e., d-dimensional
ranges).
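Plugging typical values into these bounds gives a feel for the gap (a back-of-the-envelope check, ours, assuming the independence model above, a tree of height h = 3, and log taken base 2 as in the 3D derivation of Section 3.1):

```python
# Back-of-the-envelope comparison (ours) of the heap bound
# (f - 1) * h * (ln N)^(d-1) / (d-1)!  with the (d - 1)^(log N) order of the
# NN to-do list, for N = 10^5, fanout f = 100, tree height h = 3.
import math

N, f, h = 10**5, 100, 3
for d in (3, 4, 5):
    s = math.log(N) ** (d - 1) / math.factorial(d - 1)   # expected skyline size
    heap_bound = (f - 1) * h * s
    todo_order = (d - 1) ** math.log2(N)                 # to-do list order of growth
    print(f"d={d}: skyline ~{s:,.0f}, heap <= ~{heap_bound:,.0f}, to-do ~{todo_order:,.0f}")
```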
   In summary, the main-memory requirement of BBS is at the same order
as the size of the skyline, since both the heap and the main-memory R-tree
sizes are at this order. This is a reasonable assumption because (i) skylines
are normally small and (ii) previous algorithms, such as index, are based on
the same principle. Nevertheless, the size of the heap can be further reduced.
Consider that in Figure 9 intermediate node e is visited first and its children
(e.g., e1) are inserted into the heap. When e' is visited afterward (e and e' have
the same mindist), e'1 can be immediately pruned, because there must exist at
least a (not yet discovered) point in the bottom edge of e1 that dominates e'1. A




                                Fig. 9. Reducing the size of the heap.

similar situation happens if node e' is accessed first. In this case e'1 is inserted
into the heap, but it is removed (before its expansion) when e1 is added. BBS
can easily incorporate this mechanism by checking the contents of the heap
before the insertion of an entry e: (i) all entries dominated by e are removed;
(ii) if e is dominated by some entry, it is not inserted. We chose not to implement
this optimization because it induces some CPU overhead without affecting the
number of node accesses, which is optimal (in the above example e'1 would be
pruned during its expansion since by that time e1 will have been visited).

3.4 Incremental Maintenance of the Skyline
The skyline may change due to subsequent updates (i.e., insertions and dele-
tions) to the database, and hence should be incrementally maintained to avoid
recomputation. Given a new point p (e.g., a hotel added to the database), our
incremental maintenance algorithm first performs a dominance check on the
main-memory R-tree. If p is dominated (by an existing skyline point), it is sim-
ply discarded (i.e., it does not affect the skyline); otherwise, BBS performs a
window query (on the main-memory R-tree), using the dominance region of p,
to retrieve the skyline points that will become obsolete (i.e., those dominated by
p). This query may not retrieve anything (e.g., Figure 10(a)), in which case the
number of skyline points increases by one. Figure 10(b) shows another case,
where the dominance region of p covers two points i, k, which are removed
(from the main-memory R-tree). The final skyline consists of only points a, p.
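A sketch of the insertion path (ours; a plain list stands in for the main-memory R-tree, and the coordinates of the new hotel p are hypothetical, chosen so that it dominates i and k as in the scenario of Figure 10(b)):

```python
# Insertion maintenance sketch (ours): dominance check, then a "window query"
# over the current skyline using p's dominance region.
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def insert_point(skyline, p):
    if any(dominates(s, p) for s in skyline):       # p is dominated: skyline unchanged
        return skyline
    # drop the skyline points that fall in p's dominance region, then add p
    return [s for s in skyline if not dominates(p, s)] + [p]

sky = [(1, 9), (3, 2), (9, 1)]                      # current skyline {a, i, k}
print(insert_point(sky, (2, 1)))                    # hypothetical new hotel p = (2, 1)
# [(1, 9), (2, 1)] -- p dominates i and k; the skyline becomes {a, p}
```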
   Handling deletions is more complex. First, if the point removed is not in
the skyline (which can be easily checked by the main-memory R-tree using
the point’s coordinates), no further processing is necessary. Otherwise, part
of the skyline must be reconstructed. To illustrate this, assume that point i in
Figure 11(a) is deleted. For incremental maintenance, we need to compute the
skyline with respect only to the points in the constrained (shaded) area, which
is the region exclusively dominated by i (i.e., not including areas dominated by
other skyline points). This is because points (e.g., e, l ) outside the shaded area
cannot appear in the new skyline, as they are dominated by at least one other
point (i.e., a or k). As shown in Figure 11(b), the skyline within the exclusive
dominance region of i contains two points h and m, which substitute i in the final




                 Fig. 10. Incremental skyline maintenance for insertion.




                 Fig. 11. Incremental skyline maintenance for deletion.



skyline (of the whole dataset). In Section 4.1, we discuss skyline computation
in a constrained region of the data space.
   Except for the above case of deletion, incremental skyline maintenance in-
volves only main-memory operations. Given that the skyline points constitute
only a small fraction of the database, the probability of deleting a skyline point
is expected to be very low. In extreme cases (e.g., bulk updates, large num-
ber of skyline points) where insertions/deletions frequently affect the skyline,
we may adopt the following “lazy” strategy to minimize the number of disk
accesses: after deleting a skyline point p, we do not compute the constrained
skyline immediately, but add p to a buffer. For each subsequent insertion, if p
is dominated by a new point p , we remove it from the buffer because all the
points potentially replacing p would become obsolete anyway as they are dom-
inated by p (the insertion of p may also render other skyline points obsolete).
When there are no more updates or a user issues a skyline query, we perform
a single constrained skyline search, setting the constraint region to the union
of the exclusive dominance regions of the remaining points in the buffer, which
is emptied afterward.




                                Fig. 12. Constrained query example.


4. VARIATIONS OF SKYLINE QUERIES
In this section we propose novel variations of skyline search, and illustrate how
BBS can be applied for their processing. In particular, Section 4.1 discusses
constrained skylines, Section 4.2 ranked skylines, Section 4.3 group-by sky-
lines, Section 4.4 dynamic skylines, Section 4.5 enumerating and K -dominating
queries, and Section 4.6 skybands.

4.1 Constrained Skyline
Given a set of constraints, a constrained skyline query returns the most in-
teresting points in the data space defined by the constraints. Typically, each
constraint is expressed as a range along a dimension and the conjunction of all
constraints forms a hyperrectangle (referred to as the constraint region) in the
d -dimensional attribute space. Consider the hotel example, where a user is in-
terested only in hotels whose prices ( y axis) are in the range [4, 7]. The skyline
in this case contains points g , f , and l (Figure 12), as they are the most inter-
esting hotels in the specified price range. Note that d (which also satisfies the
constraints) is not included as it is dominated by g . The constrained query can
be expressed using the syntax of Borzsonyi et al. [2001] and the where clause:
Select *, From Hotels, Where Price∈[4, 7], Skyline of Price min, Distance min.
In addition, constrained queries are useful for incremental maintenance of the
skyline in the presence of deletions (as discussed in Section 3.4).
   BBS can easily process such queries. The only difference with respect to the
original algorithm is that entries not intersecting the constraint region are
pruned (i.e., not inserted in the heap). Table IV shows the contents of the heap
during the processing of the query in Figure 12. The same concept can also be
applied when the constraint region is not a (hyper-) rectangle, but an arbitrary
area in the data space.
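The extra pruning test, and a brute-force check that reproduces the constrained skyline {g, f, l} of Figure 12, are sketched below (ours; in BBS the test is applied to every entry before it enters the heap):

```python
# Constrained-skyline sketch (ours).
import math

def intersects(mbr_low, mbr_high, region):
    """True iff the MBR overlaps the constraint hyperrectangle; otherwise prune."""
    return all(lo <= r_hi and hi >= r_lo
               for lo, hi, (r_lo, r_hi) in zip(mbr_low, mbr_high, region))

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def constrained_skyline(points, region):
    inside = [p for p in points if all(lo <= v <= hi for v, (lo, hi) in zip(p, region))]
    return [p for p in inside if not any(dominates(q, p) for q in inside if q != p)]

hotels = [(1, 9), (2, 10), (4, 8), (6, 7), (9, 10), (7, 5), (5, 6),
          (4, 3), (3, 2), (9, 1), (10, 4), (6, 2), (8, 3)]
region = [(0, math.inf), (4, 7)]                     # any distance, price in [4, 7]
print(constrained_skyline(hotels, region))           # [(7, 5), (5, 6), (10, 4)] -> f, g, l
print(intersects((1, 8), (4, 10), region))           # e1 = {a, b, c}: False, so pruned
```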
   The NN algorithm can also support constrained skylines with a similar
modification. In particular, the first nearest neighbor (e.g., g ) is retrieved in
the constraint region using constrained nearest-neighbor search [Ferhatosman-
oglu et al. 2001]. Then, each space subdivision is the intersection of the origi-
nal subdivision (area to be searched by NN for the unconstrained query) and
the constraint region. The index method can benefit from the constraints, by

                     Table IV. Heap Contents for Constrained Query
                     Action                Heap Contents             S
                     Access root     <e7 , 4><e6 , 6>                Ø
                     Expand e7       <e3 , 5><e6 , 6><e4 , 10>       Ø
                     Expand e3       <e6 , 6> <e4 , 10><g, 11>       Ø
                     Expand e6       <e4 , 10><g, 11><e2 , 11>       Ø
                     Expand e4       <g, 11><e2 , 11><l, 14>        {g}
                     Expand e2       <f, 12><d, 13><l, 14>        {g, f, l}


starting with the batches at the beginning of the constraint ranges (instead of
the top of the lists). Bitmap can avoid loading the juxtapositions (see Section
2.3) for points that do not satisfy the query constraints, and D&C may discard,
during the partitioning step, points that do not belong to the constraint region.
For BNL and SFS, the only difference with respect to regular skyline retrieval is
that only points in the constraint region are inserted in the self-organizing list.

4.2 Ranked Skyline
Given a set of points in the d-dimensional space [0, 1]^d, a ranked (top-K) sky-
line query (i) specifies a parameter K and a preference function f which is
monotone on each attribute, and (ii) returns the K skyline points p that have
the minimum score according to the input function. Consider the running exam-
ple, where K = 2 and the preference function is f(x, y) = x + 3y^2. The output
skyline points should be <k, 12>, <i, 15> in this order (the number with
each point indicates its score). Such ranked skyline queries can be expressed
using the syntax of Borzsonyi et al. [2001] combined with the order by and stop
after clauses: Select *, From Hotels, Skyline of Price min, Distance min, order
by Price + 3·sqr(Distance), stop after 2.
   BBS can easily handle such queries by modifying the mindist definition to
reflect the preference function (i.e., the mindist of a point with coordinates x
and y equals x + 3y^2). The mindist of an intermediate entry equals the score
of its lower-left point. Furthermore, the algorithm terminates after exactly K
points have been reported. Due to the monotonicity of f , it is easy to prove that
the output points are indeed skyline points. The only change with respect to
the original algorithm is the order of entries visited, which does not affect the
correctness or optimality of BBS because in any case an entry will be considered
after all entries that dominate it.
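   The idea can be illustrated with a short Python sketch; to keep it self-contained it
works on a flat list of points instead of an R-tree (with a tree, the same score would be
applied to the lower-left corner of every MBR). The helper names and the hotel coordinates
below are illustrative assumptions, not the paper's implementation:

import heapq

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and p != q

def top_k_skyline(points, f, k):
    # pop candidates in ascending preference score; since f is monotone, every
    # dominator of a point is popped before it, so a popped point that is not
    # dominated by an already reported point is a skyline point
    heap = [(f(*p), p) for p in points]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        _, p = heapq.heappop(heap)
        if not any(dominates(s, p) for s in result):
            result.append(p)          # reported in ascending order of score
    return result

# running-example preference function f(x, y) = x + 3*y^2, K = 2
hotels = [(4, 4), (2, 5), (5, 3), (9, 1), (1, 9)]     # illustrative coordinates
print(top_k_skyline(hotels, lambda x, y: x + 3 * y * y, 2))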
   None of the other algorithms can answer this query efficiently. Specifically,
BNL, D&C, bitmap, and index (as well as SFS if the scoring function is different
from the sorting one) require first retrieving the entire skyline, sorting the
skyline points by their scores, and then outputting the best K ones. On the other
hand, although NN can be used with all monotone functions, its application to
ranked skyline may incur almost the same cost as that of a complete skyline.
This is because, due to its divide-and-conquer nature, it is difficult to establish
the termination criterion. If, for instance, K = 2, NN must perform d queries
after the first nearest neighbor (skyline point) is found, compare their results,
and return the one with the minimum score. The situation is more complicated
when K is large where the output of numerous queries must be compared.

4.3 Group-By Skyline
Assume that for each hotel, in addition to the price and distance, we also store
its class (i.e., 1-star, 2-star, . . . , 5-star). Instead of a single skyline covering all
three attributes, a user may wish to find the individual skyline in each class.
Conceptually, this is equivalent to grouping the hotels by their classes, and then
computing the skyline for each group; that is, the number of skylines equals
the cardinality of the group-by attribute domain. Using the syntax of Borzsonyi
et al. [2001], the query can be expressed as Select *, From Hotels, Skyline of
Price min, Distance min, Class diff (i.e., the group-by attribute is specified by
the keyword diff).
    One straightforward way to support group-by skylines is to create a sepa-
rate R-tree for the hotels in the same class, and then invoke BBS in each tree.
Separating one attribute (i.e., class) from the others, however, would compro-
mise the performance of queries involving all the attributes.4 In the following,
we present a variation of BBS which operates on a single R-tree that indexes
all the attributes. For the above example, the algorithm (i) stores the skyline
points already found for each class in a separate main-memory 2D R-tree and
(ii) maintains a single heap containing all the visited entries. The difference is
that the sorting key is computed based only on price and distance (i.e., exclud-
ing the group-by attribute). Whenever a data point is retrieved, we perform the
dominance check at the corresponding main-memory R-tree (i.e., for its class),
and insert it into the tree only if it is not dominated by any existing point.
    On the other hand, the dominance check for each intermediate entry e (per-
formed before its insertion into the heap, and during its expansion) is more com-
plicated, because e is likely to contain hotels of several classes (we can identify
the potential classes included in e by its projection on the corresponding axis).
First, its MBR (i.e., a 3D box) is projected onto the price-distance plane and
the lower-left corner c is obtained. We need to visit e only if c is not dominated
in some main-memory R-tree corresponding to a class covered by e. Consider,
for instance, that the projection of e on the class dimension is [2, 4] (i.e., e may
contain only hotels with 2, 3, and 4 stars). If the lower-left point of e (on the
price-distance plane) is dominated in all three classes, e cannot contribute any
skyline point. When the number of distinct values of the group-by attribute
is large, the skylines may not fit in memory. In this case, we can perform the
algorithm in several passes, each pass covering a number of continuous values.
The processing cost will be higher as some nodes (e.g., the root) may be visited
several times.
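   The two dominance checks that distinguish this variant from plain BBS can be sketched
in Python as follows; a plain dictionary of per-class point lists stands in for the
main-memory 2D R-trees, and the assumption that the class is the last coordinate is ours:

from collections import defaultdict

per_class_skyline = defaultdict(list)     # class value -> skyline points (price, distance)

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and p != q

def data_point_pruned(point):
    # a data point (price, distance, cls) is discarded iff it is dominated within its own class
    price, distance, cls = point
    return any(dominates(s, (price, distance)) for s in per_class_skyline[cls])

def intermediate_entry_pruned(low, high):
    # an entry with MBR corners low/high (class on the last axis) can be pruned only if its
    # lower-left price-distance corner is dominated in every class the entry may contain
    corner = (low[0], low[1])
    classes = range(int(low[2]), int(high[2]) + 1)
    return all(any(dominates(s, corner) for s in per_class_skyline[c]) for c in classes)

As stated above, the heap key would be computed from the price and distance coordinates only.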
    It is not clear how to extend NN, D&C, index, or bitmap for group-by skylines
beyond the naïve approach, that is, invoke the algorithms for every value of the
group-by attribute (e.g., each time focusing on points belonging to a specific
group), which, however, would lead to high processing cost. BNL and SFS can
be applied in this case by maintaining separate temporary skylines for each
class value (similar to the main memory R-trees of BBS).


4A 3D skyline in this case should maximize the value of the class (e.g., given two hotels with the
same price and distance, the one with more stars is preferable).


4.4 Dynamic Skyline
Assume a database containing points in a d -dimensional space with axes
d 1 , d 2 , . . . , d d . A dynamic skyline query specifies m dimension functions f 1 ,
 f 2 , . . . , f m such that each function f i (1 ≤ i ≤ m) takes as parameters the co-
ordinates of the data points along a subset of the d axes. The goal is to return
the skyline in the new data space with dimensions defined by f 1 , f 2 , . . . , f m .
Consider, for instance, a database that stores the following information for each
hotel: (i) its x and (ii) y coordinates, and (iii) its price (i.e., the database contains
three dimensions). Then, a user specifies his/her current location (ux , u y ), and
requests the most interesting hotels, where preference must take into consid-
eration the hotels’ proximity to the user (in terms of Euclidean distance) and
the price. Each point p with coordinates (px, py, pz) in the original 3D space is
transformed to a point p′ in the 2D space with coordinates (f1(px, py), f2(pz)),
where the dimension functions f1 and f2 are defined as

           f1(px, py) = √((px − ux)^2 + (py − uy)^2),   and   f2(pz) = pz.


   The terms original and dynamic space refer to the original d -dimensional
data space and the space with computed dimensions (from f 1 , f 2 , . . . , f m ), re-
spectively. Correspondingly, we refer to the coordinates of a point in the original
space as original coordinates, while to those of the point in the dynamic space
as dynamic coordinates.
   BBS is applicable to dynamic skylines by expanding entries in the heap ac-
cording to their mindist in the dynamic space (which is computed on-the-fly
when the entry is considered for the first time). In particular, the mindist
of a leaf entry (data point) e with original coordinates (ex, ey, ez) equals
√((ex − ux)^2 + (ey − uy)^2) + ez. The mindist of an intermediate entry e whose
MBR has ranges [ex0, ex1] × [ey0, ey1] × [ez0, ez1] is computed as mindist([ex0, ex1]
× [ey0, ey1], (ux, uy)) + ez0, where the first term equals the mindist between the point
(ux, uy) and the 2D rectangle [ex0, ex1] × [ey0, ey1]. Furthermore, notice that the
concept of dynamic skylines can be employed in conjunction with ranked and
constraint queries (i.e., find the top five hotels within 1 km, given that the price
is twice as important as the distance). BBS can process such queries by ap-
propriate modification of the mindist definition (the z coordinate is multiplied
by 2) and by constraining the search region ( f 1 (x, y) ≤ 1 km).
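   A small Python sketch of the two on-the-fly mindist computations for this example
follows; (ux, uy) denotes the user's location and the helper names are our own, not part
of any library:

from math import sqrt, hypot

def dynamic_mindist_point(ex, ey, ez, ux, uy):
    # leaf entry (data point): its score in the dynamic space f1 (distance), f2 (price)
    return hypot(ex - ux, ey - uy) + ez

def dynamic_mindist_mbr(ex0, ex1, ey0, ey1, ez0, ux, uy):
    # intermediate entry: mindist of (ux, uy) to the 2D rectangle, plus the lowest price ez0
    dx = max(ex0 - ux, 0, ux - ex1)
    dy = max(ey0 - uy, 0, uy - ey1)
    return sqrt(dx * dx + dy * dy) + ez0

For the combined ranked/constrained query mentioned above, ez (or ez0) would be multiplied
by 2 and entries whose distance term exceeds 1 km would be pruned.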
    Regarding the applicability of the previous methods, BNL still applies be-
cause it evaluates every point, whose dynamic coordinates can be computed
on-the-fly. The optimizations of SFS, however, are now useless since the order
of points in the dynamic space may be different from that in the original space.
D&C and NN can also be modified for dynamic queries with the transformations
described above, suffering, however, from the same problems as the original al-
gorithms. Bitmap and index are not applicable because these methods rely on
pre-computation, which provides little help when the dimensions are defined
dynamically.

4.5 Enumerating and K -Dominating Queries
Enumerating queries return, for each skyline point p, the number of points
dominated by p. This information provides some measure of “goodness” for the
skyline points. In the running example, for instance, hotel i may be more inter-
esting than the other skyline points since it dominates nine hotels as opposed
to two for hotels a and k. Let’s call num( p) the number of points dominated by
point p. A straightforward approach to process such queries involves two steps:
(i) first compute the skyline and (ii) for each skyline point p apply a query win-
dow in the data R-tree and count the number of points num( p) falling inside the
dominance region of p. Notice that since all (except for the skyline) points are
dominated, all the nodes of the R-tree will be accessed by some query. Further-
more, due to the large size of the dominance regions, numerous R-tree nodes
will be accessed by several window queries. In order to avoid multiple node vis-
its, we apply the inverse procedure, that is, we scan the data file and for each
point we perform a query in the main-memory R-tree to find the dominance re-
gions that contain it. The corresponding counters num( p) of the skyline points
are then increased accordingly.
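   A minimal Python rendering of this inverse, single-scan counting step is given below;
a linear list stands in for the main-memory R-tree, and all names are illustrative:

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and p != q

def enumerate_counts(data, skyline):
    # one pass over the data file; num(s) counts the points inside the dominance region of s
    num = {s: 0 for s in skyline}       # skyline points given as tuples
    for p in data:
        for s in skyline:               # an R-tree over the skyline would prune this inner loop
            if dominates(s, p):
                num[s] += 1
    return num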
    An interesting variation of the problem is the K -dominating query, which
retrieves the K points that dominate the largest number of other points. Strictly
speaking, this is not a skyline query, since the result does not necessarily contain
skyline points. If K = 3, for instance, the output should include hotels i, h, and
m, with num(i) = 9, num(h) = 7, and num(m) = 5. In order to obtain the
result, we first perform an enumerating query that returns the skyline points
and the number of points that they dominate. This information for the first
K = 3 points is inserted into a list sorted according to num( p), that is, list =
< i, 9 >, < a, 2 >, < k, 2 >. The first element of the list (point i) is the first result
of the 3-dominating query. Any other point potentially in the result should be
in the (exclusive) dominance region of i, but not in the dominance region of a or
k (i.e., in the shaded area of Figure 13(a)); otherwise, it would dominate fewer
points than a or k. In order to retrieve the candidate points, we perform a local
skyline query S′ in this region (i.e., a constrained query), after removing i from
S and reporting it to the user. S′ contains points h and m. The new skyline
S1 = (S − {i}) ∪ S′ is shown in Figure 13(b).
    Since h and m do not dominate each other, they may each dominate at
most seven points (i.e., num(i) − 2), meaning that they are candidates for the
3-dominating query. In order to find the actual number of points dominated,
we perform a window query in the data R-tree using the dominance regions
of h and m as query windows. After this step, < h, 7 > and < m, 5 > replace
the previous candidates < a, 2 >, < k, 2 > in the list. Point h is the second
result of the 3-dominating query and is output to the user. Then, the process is
repeated for the points that belong to the dominance region of h, but not in the
dominance regions of other points in S1 (i.e., shaded area in Figure 13(c)). The
new skyline S2 = (S1 − {h}) ∪ {c, g } is shown in Figure 13(d). Points c and g may
dominate at most five points each (i.e., num(h) − 2), meaning that they cannot
outnumber m. Hence, the query terminates with < i, 9 >< h, 7 >< m, 5 > as
the final result. In general, the algorithm can be thought of as skyline “peeling,”
since it computes local skylines at the points that have the largest dominance.
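   The following Python sketch condenses this peeling strategy for an in-memory set of
points; the brute-force num() and skyline() helpers replace the window queries and
constrained BBS calls of the actual algorithm, so it should be read as an illustration of
the control flow only, under our own simplifying assumptions:

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and p != q

def skyline(pts):
    return [p for p in pts if not any(dominates(q, p) for q in pts)]

def k_dominating(points, K):
    def num(p):                                   # brute-force stand-in for a window query
        return sum(dominates(p, q) for q in points)

    S = skyline(points)                           # current skyline = candidate set
    counts = {p: num(p) for p in S}
    result = []
    while len(result) < K and counts:
        best = max(counts, key=counts.get)        # next answer of the K-dominating query
        result.append((best, counts.pop(best)))
        S = [p for p in S if p != best]
        # local skyline of the exclusive dominance region of best yields the new candidates
        region = [p for p in points
                  if dominates(best, p) and not any(dominates(s, p) for s in S)]
        for c in skyline(region):
            S.append(c)
            counts[c] = num(c)
    return result                                 # [(point, number of points it dominates), ...]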




                       Fig. 13. Example of 3-dominating query.




   Figure 14 shows the pseudocode for K -dominating queries. It is worth point-
ing out that the exclusive dominance region of a skyline point for d > 2 is
not necessarily a hyperrectangle (e.g., in 3D space it may correspond to an
“L-shaped” polyhedron derived by removing a cube from another cube). In
this case, the constraint region can be represented as a union of hyperrect-
angles (constrained BBS is still applicable). Furthermore, since we only care
about the number of points in the dominance regions (as opposed to their
ids), the performance of window queries can be improved by using aggre-
gate R-trees [Papadias et al. 2001] (or any other multidimensional aggregate
index).
   All existing algorithms can be employed for enumerating queries, since the
only difference with respect to regular skylines is the second step (i.e., counting
the number of points dominated by each skyline point). Actually, the bitmap
approach can avoid scanning the actual dataset, because information about
num( p) for each point p can be obtained directly by appropriate juxtapositions
of the bitmaps. K -dominating queries require an effective mechanism for sky-
line “peeling,” that is, discovery of skyline points in the exclusive dominance
region of the last point removed from the skyline. Since this requires the ap-
plication of a constrained query, all algorithms are applicable (as discussed in
Section 4.1).




                              Fig. 14. K -dominating BBS algorithm.




                               Fig. 15. Example of 2-skyband query.

4.6 Skyband Query
Similar to K nearest-neighbor queries (that return the K NNs of a point), a
K -skyband query reports the set of points which are dominated by at most K
points. Conceptually, K represents the thickness of the skyline; the case K = 0
corresponds to a conventional skyline. Figure 15 illustrates the result of a 2-
skyband query containing hotels {a, b, c, g, h, i, k, m}, each dominated by at
most two other hotels.
   A naïve approach to check if a point p with coordinates (p1, p2, . . . , pd) is
in the skyband would be to perform a window query in the R-tree and count
the number of points inside the range [0, p1) × [0, p2) × · · · × [0, pd). If this number
is smaller than or equal to K , then p belongs to the skyband. Obviously, the
approach is very inefficient, since the number of window queries equals the
cardinality of the dataset. On the other hand, BBS provides an efficient way for
processing skyband queries. The only difference with respect to conventional
skylines is that an entry is pruned only if it is dominated by more than K
discovered skyline points. Table V shows the contents of the heap during the
processing of the query in Figure 15. Note that the skyband points are reported

                        Table V. Heap Contents of 2-Skyband Query
  Action                              Heap Contents                                     S
  Access root   <e7 , 4><e6 , 6>                                                       Ø
  Expand e7     <e3 , 5><e6 , 6><e5 , 8><e4 , 10>                                      Ø
  Expand e3     <i, 5><e6 , 6><h, 7><e5 , 8> <e4 , 10><g, 11>                          {i}
  Expand e6     <h, 7><e5 , 8><e1 , 9><e4 , 10><e2 , 11><g, 11>                      {i, h}
  Expand e5     <m, 8><e1 , 9><e4 , 10><n, 11><e2 , 11><g, 11>                     {i, h, m}
  Expand e1     <a, 10><e4 , 10><n, 11><e2 , 11><g, 11><b, 12><c, 12>            {i, h, m, a}
  Expand e4     <k, 10><n, 11><e2 , 11><g, 11><b, 12><c, 12><l, 14>        {i, h, m, a, k, g, b, c}


                            Table VI. Applicability Comparison
                            D&C     BNL     SFS     Bitmap      Index    NN      BBS
           Constrained       Yes     Yes    Yes       Yes        Yes     Yes     Yes
           Ranked            No      No     No        No         No      No      Yes
           Group-by          No      Yes    Yes       No         No      No      Yes
           Dynamic           Yes     Yes    Yes       No         No      Yes     Yes
           K-dominating      Yes     Yes    Yes       Yes        Yes     Yes     Yes
           K-skyband         No      Yes    Yes       No         No      No      Yes


in ascending order of their scores, therefore maintaining the progressiveness of
the results. BNL and SFS can support K -skyband queries with similar modifi-
cations (i.e., insert a point in the list if it is dominated by no more than K other
points). None of the other algorithms is applicable, at least in an obvious way.
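   As a concrete illustration, the Python sketch below applies the same pruning rule to a
flat list of points: visiting points in ascending mindist guarantees that every dominator
of a point is seen before it, so keeping a point iff at most K already-kept points dominate
it yields exactly the K-skyband. The simplification to a point list (rather than R-tree
entries) is ours, not the paper's implementation:

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and p != q

def k_skyband(points, K):
    skyband = []
    for p in sorted(points, key=sum):                    # ascending mindist (sum of coordinates)
        if sum(dominates(s, p) for s in skyband) <= K:   # prune only if dominated by > K points
            skyband.append(p)
    return skyband                                       # K = 0 gives the conventional skyline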

4.7 Summary
Finally, we close this section with Table VI, which summarizes the applicability
of the existing algorithms for each skyline variation. A “no” means that the
technique is inapplicable, inefficient (e.g., it must perform a postprocessing step
on the basic algorithm), or its extension is nontrivial. Even if an algorithm (e.g.,
BNL) is applicable for a query type (group-by skylines), it does not necessarily
imply that it is progressive (the criteria of Section 2.6 also apply to the new
skyline queries). Clearly, BBS has the widest applicability since it can process
all query types effectively.

5. APPROXIMATE SKYLINES
In this section we introduce approximate skylines, which can be used to pro-
vide immediate feedback to the users (i) without any node accesses (using a
histogram on the dataset), or (ii) progressively, after the root visit of BBS. The
problem for computing approximate skylines is that, even for uniform data, we
cannot probabilistically estimate the shape of the skyline based only on the
dataset cardinality N . In fact, it is difficult to predict the actual number of sky-
line points (as opposed to their order of magnitude [Buchta 1989]). To illustrate
this, Figures 16(a) and 16(b) show two datasets that differ in the position of
a single point, but have different skyline cardinalities (1 and 4, respectively).
Thus, instead of obtaining the actual shape, we target a hypothetical point p
such that its x and y coordinates are the minimum among all the expected co-
ordinates in the dataset. We then define the approximate skyline using the two




                                     Fig. 16. Skylines of uniform data.

line segments enclosing the dominance region of p. As shown in Figure 16(c),
this approximation can be thought of as a “low-resolution” skyline.
    Next we compute the expected coordinates of p. First, for uniform distribu-
tion, it is reasonable to assume that p falls on the diagonal of the data space
(because the data characteristics above and below the diagonal are similar).
Assuming, for simplicity, that the data space has unit length on each axis, we
denote the coordinates of p as (λ, λ) with 0 ≤ λ ≤ 1. To derive the expected
value for λ, we need the probability P{λ ≤ ξ } that λ is no larger than a specific
value ξ . To calculate this, note that λ > ξ implies that all the points fall in
the dominance region of (ξ , ξ ) (i.e., a square with length 1 − ξ ). For uniform
data, a point has probability (1 − ξ)^2 to fall in this region, and thus P{λ > ξ}
(i.e., the probability that all points are in this region) equals [(1 − ξ)^2]^N. So,
P{λ ≤ ξ} = 1 − (1 − ξ)^(2N), and the expected value of λ is given by
                  E(λ) = ∫₀¹ ξ · (dP(λ ≤ ξ)/dξ) dξ = 2N ∫₀¹ ξ · (1 − ξ)^(2N−1) dξ.            (5.1)

Solving this integral, we have
                                          E(λ) = 1/(2N + 1).                                  (5.2)
   Following similar derivations for d -dimensional spaces, we obtain E(λ) =
1/(d · N + 1). If the dimensions of the data space have different lengths, then




              Fig. 17. Obtaining the approximate skyline for nonuniform data.


the expected coordinate of the hypothetical skyline point on dimension i equals
ALi /(d · N +1), where ALi is the length of the axis. Based on the above analysis,
we can obtain the approximate skyline for arbitrary data distribution using a
multidimensional histogram [Muralikrishna and DeWitt 1988; Acharya et al.
1999], which typically partitions the data space into a set of buckets and stores
for each bucket the number (called density) of points in it. Figure 17(a) shows the
extents of 6 buckets (b1 , . . . , b6 ) and their densities, for the dataset of Figure 1.
Treating each bucket as a uniform data space, we compute the hypothetical
skyline point based on its density. Then the approximate skyline of the original
dataset is the skyline of all the hypothetical points, as shown in Figure 17(b).
Since the number of hypothetical points is small (at most the number of buck-
ets), the approximate skyline can be computed using existing main-memory
algorithms (e.g., Kung et al. [1975]; Matousek [1991]). Due to the fact that his-
tograms are widely used for selectivity estimation and query optimization, the
extraction of approximate skylines does not incur additional requirements and
does not involve I/O cost.
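   The construction can be summarized by the short Python sketch below, where a bucket is
assumed to be given as its per-axis (low, high) extents plus its density; this input format
and the helper names are our assumptions for illustration:

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and p != q

def hypothetical_point(extents, count):
    # treat the bucket as a uniform data space: expected coordinate on axis i is
    # low_i + (high_i - low_i) / (d * count + 1), per the analysis above
    d = len(extents)
    return tuple(lo + (hi - lo) / (d * count + 1) for lo, hi in extents)

def approximate_skyline(buckets):
    pts = [hypothetical_point(extents, count) for extents, count in buckets]
    return [p for p in pts if not any(dominates(q, p) for q in pts)]

# e.g., two 2D buckets: approximate_skyline([(((0, 5), (3, 9)), 120), (((4, 8), (0, 4)), 60)])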
   Approximate skylines using histograms can provide some information about
the actual skyline in environments (e.g., data streams, on-line processing sys-
tems) where only limited statistics of the data distribution (instead of individual
data) can be maintained; thus, obtaining the exact skyline is impossible. When
the actual data are available, the concept of approximate skyline, combined
with BBS, enables the “drill-down” exploration of the actual one. Consider, for
instance, that we want to estimate the skyline (in the absence of histograms)
by performing a single node access. In this case, BBS retrieves the data R-tree
root and computes by Equation (5.2), for every entry MBR, a hypothetical sky-
line point (i) assuming that the distribution in each MBR is almost uniform
(a reasonable assumption for R-trees [Theodoridis et al. 2000]), and (ii) using
the average node capacity and the tree level to estimate the number of points
in the MBR. The skyline of the hypothetical points constitutes a rough esti-
mation of the actual skyline. Figure 18(a) shows the approximate skyline after
visiting the root entry as well as the real skyline (dashed line). The approx-
imation error corresponds to the difference of the SSRs of the two skylines,
that is, the area that is dominated by exactly one skyline (shaded region in
Figure 18(a)).




                   Fig. 18. Approximate skylines as a function of node accesses.

   The approximate version of BBS maintains, in addition to the actual skyline
S, a set HS consisting of points in the approximate skyline. HS is used just for
reporting the current skyline approximation and not to guide the search (the
order of node visits remains the same as the original algorithm). For each inter-
mediate entry found, if its hypothetical point p is not dominated by any point
in HS, it is added into the approximate skyline and all the points dominated
by p are removed from HS. Leaf entries correspond to actual data points and
are also inserted in HS (provided that they are not dominated). When an entry
is deheaped, we remove the corresponding (hypothetical or actual) point from
HS. If a data point is added to S, it is also inserted in HS. The approximate
skyline is progressively refined as more nodes are visited, for example, when
the second node N7 is deheaped, the hypothetical point of N7 is replaced with
those of its children and the new HS is computed as shown in Figure 18(b).
Similarly, the expansion of N3 will lead to the approximate skyline of Figure
18(c). At the termination of approximate BBS, the estimated skyline coincides
with the actual one. To show this, assume, on the contrary, that at the termi-
nation of the algorithm there still exists a hypothetical/actual point p in HS,
which does not belong to S. It follows that p is not dominated by the actual
skyline. In this case, the corresponding (intermediate or leaf) entry producing
p should be processed, contradicting the fact that the algorithm terminates.
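   One way this bookkeeping of HS could be organized is sketched below in Python; the class
and method names are illustrative, and the search itself (heap handling, node accesses) is
deliberately left out, since HS never influences it:

class ApproximateSkyline:
    """Auxiliary set HS of hypothetical/actual points, keyed by the entry that produced them."""
    def __init__(self):
        self.HS = {}

    @staticmethod
    def dominates(p, q):
        return all(a <= b for a, b in zip(p, q)) and p != q

    def insert(self, entry_id, point):
        # called when an entry is first met (hypothetical point) or a data point is kept
        if not any(self.dominates(q, point) for q in self.HS.values()):
            self.HS = {k: q for k, q in self.HS.items() if not self.dominates(point, q)}
            self.HS[entry_id] = point

    def remove(self, entry_id):
        # called when the corresponding entry is deheaped (its children are inserted separately)
        self.HS.pop(entry_id, None)

    def current_approximation(self):
        return list(self.HS.values())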
   Note that for computing the hypothetical point of each MBR we use Equa-
tion (5.2) because it (i) is simple and efficient (in terms of computation cost),




              Fig. 19. Alternative approximations after visiting root and N7 .

(ii) provides a uniform treatment of approximate skylines (i.e., the same as in
the case of histograms), and (iii) has high accuracy (as shown in Section 6.8).
Nevertheless, we may derive an alternative approximation based on the fact
that each MBR boundary contains a data point. Assuming a uniform distribu-
tion on the MBR projections and that no point is minimum on two different
dimensions, this approximation leads to d hypothetical points per MBR such
that the expected position of each point is 1/((d − 1) · N + 1). Figure 19(a) shows
the approximate skyline in this case after the first two node visits (root and N7 ).
Alternatively, BBS can output an envelope enclosing the actual skyline, where
the lower bound refers to the skyline obtained from the lower-left vertices of the
MBRs and the upper bound refers to the skyline obtained from the upper-right
vertices. Figure 19(b) illustrates the corresponding envelope (shaded region)
after the first two node visits. The volume of the envelope is an upper bound
for the actual approximation error, which shrinks as more nodes are accessed.
The concepts of skyline approximation or envelope permit the immediate visu-
alization of information about the skyline, enhancing the progressive behavior
of BBS. In addition, approximate BBS can be easily modified for processing the
query variations of Section 4 since the only difference is the maintenance of the
hypothetical points in HS for the entries encountered by the original algorithm.
The computation of hypothetical points depends on the skyline variation, for
example, for constrained skylines the points are computed by taking into ac-
count only the node area inside the constraint region. On the other hand, the
application of these concepts to NN is not possible (at least in an obvious way),
because of the duplicate elimination problem and the multiple accesses to the
same node(s).

6. EXPERIMENTAL EVALUATION
In this section we verify the effectiveness of BBS by comparing it against NN
which, according to the evaluation of Kossmann et al. [2002], is the most effi-
cient existing algorithm and exhibits progressive behavior. Our implementation
of NN combined laisser-faire and propagate because, as discussed in Section 2.5,
it gives the best results. Specifically, only the first 20% of the to-do list was
searched for duplicates using propagate and the rest of the duplicates were




                      Fig. 20. Node accesses vs. dimensionality d (N = 1M).

handled with laisser-faire. Following the common methodology in the literature,
we employed independent (uniform) and anticorrelated5 datasets (generated in
the same way as described in Borzsonyi et al. [2001]) with dimensionality d in
the range [2, 5] and cardinality N in the range [100K, 10M]. The length of each
axis was 10,000. Datasets were indexed by R*-trees [Beckmann et al. 1990]
with a page size of 4 kB, resulting in node capacities between 204 (d = 2)
and 94 (d = 5). For all experiments we measured the cost in terms of node
accesses since the diagrams for CPU-time are very similar (see Papadias et al.
[2003]).
   Sections 6.1 and 6.2 study the effects of dimensionality and cardinality for
conventional skyline queries, whereas Section 6.3 compares the progressive
behavior of the algorithms. Sections 6.4, 6.5, 6.6, and 6.7 evaluate constrained,
group-by skyline, K -dominating skyline, and K -skyband queries, respectively.
Finally, Section 6.8 focuses on approximate skylines. Ranked queries are not
included because NN is inapplicable, while the performance of BBS is the same
as in the experiments for progressive behavior. Similarly, the cost of dynamic
skylines is the same as that of conventional skylines in selected dimension
projections and omitted from the evaluation.

6.1 The Effect of Dimensionality
In order to study the effect of dimensionality, we used the datasets with cardi-
nality N = 1M and varied d between 2 and 5. Figure 20 shows the number of
node accesses as a function of dimensionality, for independent and anticorre-
lated datasets. NN could not terminate successfully for d > 4 in case of inde-
pendent, and for d > 3 in case of anticorrelated, datasets due to the prohibitive
size of the to-do list (to be discussed shortly). BBS clearly outperformed NN and
the difference increased fast with dimensionality. The degradation of NN was
caused mainly by the growth of the number of partitions (i.e., each skyline point
spawned d partitions), as well as the number of duplicates. The degradation of
BBS was due to the growth of the skyline and the poor performance of R-trees

5 For anticorrelated distribution, the dimensions are linearly correlated such that, if pi is smaller
than p j on one axis, then pi is likely to be larger on at least one other dimension (e.g., hotels near
the beach are typically more expensive). An anticorrelated dataset has fractal dimensionality close
to 1 (i.e., points lie near the antidiagonal of the space).





           Fig. 21. Heap and to-do list sizes versus dimensionality d (N = 1M).

in high dimensions. Note that these factors also influenced NN, but their effect
was small compared to the inherent deficiencies of the algorithm.
   Figure 21 shows the maximum sizes (in kbytes) of the heap, the to-do list,
and the dataset, as a function of dimensionality. For d = 2, the to-do list was
smaller than the heap, and both were negligible compared to the size of the
dataset. For d = 3, however, the to-do list surpassed the heap (for independent
data) and the dataset (for anticorrelated data). Clearly, the maximum size of
the to-do list exceeded the main-memory of most existing systems for d ≥ 4
(anticorrelated data), which explains the missing numbers about NN in the
diagrams for high dimensions. Notice that Kossmann et al. [2002] reported the
cost of NN for returning up to the first 500 skyline points using anticorrelated
data in five dimensions. NN can return a number of skyline points (but not
the complete skyline), because the to-do list does not reach its maximum size
until a sufficient number of skyline points have been found (and a large number
of partitions have been added). This issue is discussed further in Section 6.3,
where we study the sizes of the heap and to-do lists as a function of the points
returned.

6.2 The Effect of Cardinality
Figure 22 shows the number of node accesses versus the cardinality for 3D
datasets. Although the effect of cardinality was not as important as that of
dimensionality, in all cases BBS was several orders of magnitude faster than
NN. For anticorrelated data, NN did not terminate successfully for N ≥ 5M,
again due to the prohibitive size of the to-do list. Some irregularities in the
diagrams (a small dataset may be more expensive than a larger one) are due to
the positions of the skyline points and the order in which they were discovered.
If, for instance, the first nearest neighbor is very close to the origin of the axes,
both BBS and NN will prune a large part of their respective search spaces.

6.3 Progressive Behavior
Next we compare the speed of the algorithms in returning skyline points incre-
mentally. Figure 23 shows the node accesses of BBS and NN as a function of the
points returned for datasets with N = 1M and d = 3 (the number of points in
the final skyline was 119 and 977, for independent and anticorrelated datasets,




                       Fig. 22. Node accesses versus cardinality N (d = 3).




            Fig. 23. Node accesses versus number of points reported (N = 1M, d = 3).


respectively). Both algorithms return the first point with the same cost (since
they both apply nearest neighbor search to locate it). Then, BBS starts to grad-
ually outperform NN and the difference increases with the number of points
returned.
   To evaluate the quality of the results, Figure 24 shows the distribution of the
first 50 skyline points (out of 977) returned by each algorithm for the anticor-
related dataset with N = 1M and d = 3. The initial skyline points of BBS are
evenly distributed in the whole skyline, since they were discovered in the order
of their mindist (which was independent of the algorithm). On the other hand,
NN produced points concentrated in the middle of the data universe because
the partitioned regions, created by new skyline points, were inserted at the end
of the to-do list, and thus nearby points were subsequently discovered.
   Figure 25 compares the sizes of the heap and to-do lists as a function of the
points returned. The heap reaches its maximum size at the beginning of BBS,
whereas the to-do list reaches it toward the end of NN. This happens because
before BBS discovered the first skyline point, it inserted all the entries of the
visited nodes in the heap (since no entry can be pruned by existing skyline
points). The more skyline points were discovered, the more heap entries were
pruned, until the heap eventually became empty. On the other hand, the to-do
list size is dominated by empty queries, which occurred toward the late phases
of NN when the space subdivisions became too small to contain any points.
Thus, NN could still be used to return a number of skyline points (but not the
complete skyline) even for relatively high dimensionality.




      Fig. 24. Distribution of the first 50 skyline points (anticorrelated, N = 1M, d = 3).




  Fig. 25. Sizes of the heap and to-do list versus number of points reported (N = 1M, d = 3).


6.4 Constrained Skyline
Having confirmed the efficiency of BBS for conventional skyline retrieval, we
present a comparison between BBS and NN on constrained skylines. Figure 26
shows the node accesses of BBS and NN as a function of the constraint region
volume (N = 1M, d = 3), which is measured as a percentage of the volume of
the data universe. The locations of constraint regions were uniformly generated
and the results were computed by taking the average of 50 queries. Again BBS
was several orders of magnitude faster than NN.
   The counterintuitive observation here is that constraint regions covering
more than 8% of the data space are usually more expensive than regular sky-
lines. Figure 27(a) verifies the observation by illustrating the node accesses of
BBS on independent data, when the volume of the constraint region ranges
between 98% and 100% (i.e., regular skyline). Even a range very close to 100%
is much more expensive than a conventional skyline. Similar results hold for
NN (see Figure 27(b)) and anticorrelated data.
   To explain this, consider Figure 28(a), which shows a skyline S in a constraint
region. The nodes that must be visited intersect the constrained skyline search
region (shaded area) defined by S and the constraint region. In this example,
all four nodes e1 , e2 , e3 , e4 may contain skyline points and should be accessed.
On the other hand, if S were a conventional skyline, as in Figure 28(b), nodes




            Fig. 26. Node accesses versus volume of constraint region (N = 1M, d = 3).




Fig. 27. Node accesses versus volume of constraint region 98–100% (independent, N = 1M, d = 3).

e2 , e3 , and e4 could not exist because they should contain at least a point that
dominates S. In general, the only data points of the conventional SSR (shaded
area in Figure 28(b)) lie on the skyline, implying that, for any node MBR, at
most one of its vertices can be inside the SSR. For constrained skylines there is
no such restriction and the number of nodes intersecting the constrained SSR
can be arbitrarily large.
    It is important to note that the constrained queries issued when a skyline
point is removed during incremental maintenance (see Section 3.4) are always
cheaper than computing the entire skyline from scratch. Consider, for instance,
that the partial skyline of Figure 28(a) is computed for the exclusive dominance
area of a deleted skyline point p on the lower-left corner of the constraint region.
In this case nodes such as e2 , e3 , e4 cannot exist because otherwise they would
have to contain skyline points, contradicting the fact that the constraint region
corresponds to the exclusive dominance area of p.

6.5 Group-By Skyline
Next we consider group-by skyline retrieval, including only BBS because, as dis-
cussed in Section 4, NN is inapplicable in this case. Toward this, we generate
datasets (with cardinality 1M) in a 3D space that involves two numerical di-
mensions and one categorical axis. In particular, the number cnum of categories
is a parameter ranging from 2 to 64 (cnum is also the number of 2D skylines
returned by a group-by skyline query). Every data point has equal probability




                      Fig. 28. Nodes potentially intersecting the SSR.




   Fig. 29. BBS node accesses versus cardinality of categorical axis cnum (N = 1M, d = 3).

to fall in each category, and, for all the points in the same category, their dis-
tribution (on the two numerical axes) is either independent or anticorrelated.
Figure 29 demonstrates the number of node accesses as a function of cnum . The
cost of BBS increases with cnum because the total number of skyline points (in
all 2D skylines) and the probability that a node may contain qualifying points
in some category (and therefore should be expanded) are both proportional to
the size of the categorical domain.

6.6 K -Dominating Skyline
This section measures the performance of NN and BBS on K -dominating
queries. Recall that each K -dominating query involves an enumerating query
(i.e., a file scan), which retrieves the number of points dominated by each sky-
line point. The K skyline points with the largest counts are found and the
top-1 is immediately reported. Whenever an object is reported, a constrained
skyline is executed to find potential candidates in its exclusive dominance re-
gion (see Figure 13). For each such candidate, the number of dominated points
is retrieved using a window query on the data R-tree. After this process, the
object with the largest count is reported (i.e., the second best object), another
constrained query is performed, and so on. Therefore, the total number of con-
strained queries is K − 1, and each such query may trigger multiple window
queries. Figure 30 demonstrates the cost of BBS and NN as a function of K .
The overhead of the enumerating and (multiple) window queries dominates the
total cost, and consequently BBS and NN have a very similar performance.
    Interestingly, the overhead of the anticorrelated data is lower (than the in-
dependent distribution) because each skyline point dominates fewer points




Fig. 30. NN and BBS node accesses versus number of objects to be reported for K -dominating
queries (N = 1M, d = 2).




Fig. 31. BBS node accesses versus “thickness” of the skyline for K -skyband queries (N = 1M,
d = 3).


(therefore, the number of window queries is smaller). The high cost of
K -dominating queries (compared to other skyline variations) is due to the com-
plexity of the problem itself (and not the proposed algorithm). In particular, a
K -dominating query is similar to a semijoin and could be processed accordingly.
For instance a nested-loops algorithm would (i) count, for each data point, the
number of dominated points by scanning the entire database, (ii) sort all the
points in descending order of the counts, and (iii) report the K points with the
highest counts. Since in our case the database occupies more than 6K nodes, this
algorithm would need to access 36E+6 nodes (for any K ), which is significantly
higher than the costs in Figure 30 (especially for low K ).

6.7 K -Skyband
Next, we evaluate the performance of BBS on K -skyband queries (NN is inap-
plicable). Figure 31 shows the node accesses as a function of K ranging from
0 (conventional skyline) to 9. As expected, the performance degrades as K in-
creases because a node can be pruned only if it is dominated by more than K
discovered skyline points, which becomes more difficult for higher K . Further-
more, the number of skyband points is significantly larger for anticorrelated
data, for example, for K = 9, the number is 788 (6778) in the independent
(anticorrelated) case, which explains the higher costs in Figure 31(b).




      Fig. 32. Approximation error versus number of minskew buckets (N = 1M, d = 3).

6.8 Approximate Skylines
This section evaluates the quality of the approximate skyline using a hypothet-
ical point per bucket or visited node (as shown in the examples of Figures 17
and 18, respectively). Given an estimated and an actual skyline, the approx-
imation error corresponds to their SSR difference (see Section 5). In order to
measure this error, we used a numerical approach: (i) we first generated a large
number α of points (α = 10^4) uniformly distributed in the data space, and (ii)
counted the number β of points that are dominated by exactly one skyline. The
error equals β/α, which approximates the volume of the SSR difference divided
by the volume of the entire data space. We did not use a relative error (e.g.,
volume of the SSR difference divided by the volume of the actual SSR) because
such a definition is sensitive to the position of the actual skyline (i.e., a skyline
near the origin of the axes would lead to higher error even if the SSR difference
remains constant).
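   For reproducibility, the measure can be coded in a few lines of Python; the uniform
sampling over a normalized data space and the function names below are our assumptions:

import random

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and p != q

def dominated_by(point, skyline):
    return any(dominates(s, point) for s in skyline)

def approximation_error(estimated, actual, d=2, alpha=10_000, seed=0):
    rng = random.Random(seed)
    beta = 0
    for _ in range(alpha):
        p = tuple(rng.random() for _ in range(d))        # uniform point in the unit data space
        if dominated_by(p, estimated) != dominated_by(p, actual):
            beta += 1                                    # dominated by exactly one skyline
    return beta / alpha                                  # approximates the SSR-difference volume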
   In the first experiment, we built a minskew [Acharya et al. 1999] histogram
on the 3D datasets by varying the number of buckets from 100 to 1000, resulting
in main-memory consumption in the range of 3K bytes (100 buckets) to 30K bytes
(1000 buckets). Figure 32 illustrates the error as a function of the bucket number. For
independent distribution, the error is very small (less than 0.01%) even with
the smallest number of buckets because the rough “shape” of the skyline for a
uniform dataset can be accurately predicted using Equation (5.2). On the other
hand, anticorrelated data were skewed and required a large number of buckets
for achieving high accuracy.
   Figure 33 evaluates the quality of the approximation as a function of node
accesses (without using a histogram). As discussed in Section 5, the first rough
estimate of the skyline is produced when BBS visits the root entry and then
the approximation is refined as more nodes are accessed. For independent data,
extremely accurate approximation (with error 0.01%) can be obtained immedi-
ately after retrieving the root, a phenomenon similar to that in Figure 32(a).
For anticorrelated data, the error is initially large (around 15% after the root
visit), but decreases considerably with only a few additional node accesses. Par-
ticularly, the error is less than 3% after visiting 30 nodes, and close to zero with
around 100 accesses (i.e., the estimated skyline is almost identical to the actual




      Fig. 33. BBS approximation error versus number of node accesses (N = 1M, d = 3).

one with about 25% of the node accesses required for the discovery of the actual
skyline).

7. CONCLUSION
The importance of skyline computation in database systems increases with
the number of emerging applications requiring efficient processing of prefer-
ence queries and the amount of available data. Consider, for instance, a bank
information system monitoring the attribute values of stock records and an-
swering queries from multiple users. Assuming that the user scoring functions
are monotonic, the top-1 result of all queries is always a part of the skyline.
Similarly, the top-K result is always a part of the K -skyband. Thus, the system
could maintain only the skyline (or K -skyband) and avoid searching a poten-
tially very large number of records. However, all existing database algorithms
for skyline computation have several deficiencies, which severely limit their
applicability. BNL and D&C are not progressive. Bitmap is applicable only for
datasets with small attribute domains and cannot efficiently handle updates.
Index cannot be used for skyline queries on a subset of the dimensions. SFS,
like all above algorithms, does not support user-defined preferences. Although
NN was presented as a solution to these problems, it introduces new ones,
namely, poor performance and prohibitive space requirements for more than
three dimensions. This article proposes BBS, a novel algorithm that overcomes
all these shortcomings since (i) it is efficient for both progressive and com-
plete skyline computation, independently of the data characteristics (dimen-
sionality, distribution), (ii) it can easily handle user preferences and process
numerous alternative skyline queries (e.g., ranked, constrained, approximate
skylines), (iii) it does not require any precomputation (besides building the
R-tree), (iv) it can be used for any subset of the dimensions, and (v) it has
limited main-memory requirements.
   Although in this implementation of BBS we used R-trees in order to perform
a direct comparison with NN, the same concepts are applicable to any data-
partitioning access method. In the future, we plan to investigate alternatives
(e.g., X-trees [Berchtold et al. 1996] and A-trees [Sakurai et al. 2000]) for high-
dimensional spaces, where R-trees are inefficient. Another possible solution for

high dimensionality would include (i) converting the data points to subspaces
with lower dimensionalities, (ii) computing the skyline in each subspace, and
(iii) merging the partial skylines. Finally, a topic worth studying concerns sky-
line retrieval in other application domains. For instance, Balke et al. [2004]
studied skyline computation for Web information systems considering that the
records are partitioned in several lists, each residing at a distributed server.
The tuples in every list are sorted in ascending order of a scoring function,
which is monotonic on all attributes. Their processing method uses the main
concept of the threshold algorithm [Fagin et al. 2001] to compute the entire
skyline by reading the minimum number of records in each list. Another inter-
esting direction concerns skylines in temporal databases [Salzberg and Tsotras
1999] that retain historical information. In this case, a query could ask for the
most interesting objects at a past timestamp or interval.

REFERENCES

ACHARYA, S., POOSALA, V., AND RAMASWAMY, S. 1999. Selectivity estimation in spatial databases.
  In Proceedings of the ACM Conference on the Management of Data (SIGMOD; Philadelphia, PA,
  June 1–3). 13–24.
BALKE, W., GÜNTZER, U., AND ZHENG, J. 2004. Efficient distributed skylining for Web information sys-
  tems. In Proceedings of the International Conference on Extending Database Technology (EDBT;
  Heraklio, Greece, Mar. 14–18). 256–273.
BECKMANN, N., KRIEGEL, H., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and
  robust access method for points and rectangles. In Proceedings of the ACM Conference on the
  Management of Data (SIGMOD; Atlantic City, NJ, May 23–25). 322–331.
BERCHTOLD, S., KEIM, D., AND KRIEGEL, H. 1996. The X-tree: An index structure for high-
  dimensional data. In Proceedings of the Very Large Data Bases Conference (VLDB; Mumbai,
  India, Sep. 3–6). 28–39.
BÖHM, C. AND KRIEGEL, H. 2001. Determining the convex hull in large multidimensional
  databases. In Proceedings of the International Conference on Data Warehousing and Knowledge
  Discovery (DaWaK; Munich, Germany, Sep. 5–7). 294–306.
BORZSONYI, S., KOSSMANN, D., AND STOCKER, K. 2001. The skyline operator. In Proceedings of the
  IEEE International Conference on Data Engineering (ICDE; Heidelberg, Germany, Apr. 2–6).
  421–430.
BUCHTA, C. 1989. On the average number of maxima in a set of vectors. Inform. Process. Lett.,
  33, 2, 63–65.
CHANG, Y., BERGMAN, L., CASTELLI, V., LI, C., LO, M., AND SMITH, J. 2000. The Onion technique: In-
  dexing for linear optimization queries. In Proceedings of the ACM Conference on the Management
  of data (SIGMOD; Dallas, TX, May 16–18). 391–402.
CHOMICKI, J., GODFREY, P., GRYZ, J., AND LIANG, D. 2003. Skyline with pre-sorting. In Proceedings
  of the IEEE International Conference on Data Engineering (ICDE; Bangalore, India, Mar. 5–8).
  717–719.
FAGIN, R., LOTEM, A., AND NAOR, M. 2001. Optimal aggregation algorithms for middleware. In Pro-
  ceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
  (PODS; Santa Barbara, CA, May 21–23). 102–113.
FERHATOSMANOGLU, H., STANOI, I., AGRAWAL, D., AND ABBADI, A. 2001. Constrained nearest neighbor
  queries. In Proceedings of the International Symposium on Spatial and Temporal Databases
  (SSTD; Redondo Beach, CA, July 12–15). 257–278.
GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings
  of the ACM Conference on the Management of Data (SIGMOD; Boston, MA, June 18–21). 47–
  57.
HELLERSTEIN, J., AVNUR, R., CHOU, A., HIDBER, C., OLSTON, C., RAMAN, V., ROTH, T., AND
  HAAS, P. 1999. Interactive data analysis: The control project. IEEE Comput. 32, 8, 51–
  59.


HENRICH, A. 1994. A distance scan algorithm for spatial access structures. In Proceedings of
  the ACM Workshop on Geographic Information Systems (ACM GIS; Gaithersburg, MD, Dec.).
  136–143.
HJALTASON, G. AND SAMET, H. 1999. Distance browsing in spatial databases. ACM Trans. Database
  Syst. 24, 2, 265–318.
HRISTIDIS, V., KOUDAS, N., AND PAPAKONSTANTINOU, Y. 2001. PREFER: A system for the efficient
  execution of multi-parametric ranked queries. In Proceedings of the ACM Conference on the
  Management of Data (SIGMOD; May 21–24). 259–270.
KOSSMANN, D., RAMSAK, F., AND ROST, S. 2002. Shooting stars in the sky: An online algorithm for
  skyline queries. In Proceedings of the Very Large Data Bases Conference (VLDB; Hong Kong,
  China, Aug. 20–23). 275–286.
KUNG, H., LUCCIO, F., AND PREPARATA, F. 1975. On finding the maxima of a set of vectors. J. Assoc.
  Comput. Mach., 22, 4, 469–476.
MATOUSEK, J. 1991. Computing dominances in E^n. Inform. Process. Lett. 38, 5, 277–278.
MCLAIN, D. 1974. Drawing contours from arbitrary data points. Comput. J. 17, 4, 318–324.
MURALIKRISHNA, M. AND DEWITT, D. 1988. Equi-depth histograms for estimating selectivity factors
  for multi-dimensional queries. In Proceedings of the ACM Conference on the Management of Data
  (SIGMOD; Chicago, IL, June 1–3). 28–36.
NATSEV, A., CHANG, Y., SMITH, J., LI, C., AND VITTER, J. 2001. Supporting incremental join queries
  on ranked inputs. In Proceedings of the Very Large Data Bases Conference (VLDB; Rome, Italy,
  Sep. 11–14). 281–290.
PAPADIAS, D., TAO, Y., FU, G., AND SEEGER, B. 2003. An optimal and progressive algorithm for
  skyline queries. In Proceedings of the ACM Conference on the Management of Data (SIGMOD;
  San Diego, CA, June 9–12). 443–454.
PAPADIAS, D., KALNIS, P., ZHANG, J., AND TAO, Y. 2001. Efficient OLAP operations in spatial data
  warehouses. In Proceedings of International Symposium on Spatial and Temporal Databases
  (SSTD; Redondo Beach, CA, July 12–15). 443–459.
PREPARATA, F. AND SHAMOS, M. 1985. Computational Geometry—An Introduction. Springer, Berlin,
  Germany.
ROUSSOPOULOS, N., KELLY, S., AND VINCENT, F. 1995. Nearest neighbor queries. In Proceedings of
  the ACM Conference on the Management of Data (SIGMOD; San Jose, CA, May 22–25). 71–79.
SAKURAI, Y., YOSHIKAWA, M., UEMURA, S., AND KOJIMA, H. 2000. The A-tree: An index structure for
  high-dimensional spaces using relative approximation. In Proceedings of the Very Large Data
  Bases Conference (VLDB; Cairo, Egypt, Sep. 10–14). 516–526.
SALZBERG, B. AND TSOTRAS, V. 1999. A comparison of access methods for temporal data. ACM
  Comput. Surv. 31, 2, 158–221.
SELLIS, T., ROUSSOPOULOS, N., AND FALOUTSOS, C. 1987. The R+-tree: A dynamic index for multi-
  dimensional objects. In Proceedings of the Very Large Data Bases Conference (VLDB; Brighton,
  England, Sep. 1–4). 507–518.
STEUER, R. 1986. Multiple Criteria Optimization. Wiley, New York, NY.
TAN, K., ENG, P., AND OOI, B. 2001. Efficient progressive skyline computation. In Proceedings of
  the Very Large Data Bases Conference (VLDB; Rome, Italy, Sep. 11–14). 301–310.
THEODORIDIS, Y., STEFANAKIS, E., AND SELLIS, T. 2000. Efficient cost models for spatial queries using
  R-trees. IEEE Trans. Knowl. Data Eng. 12, 1, 19–32.

Received October 2003; revised April 2004; accepted June 2004




Advanced SQL Modeling in RDBMS
ANDREW WITKOWSKI, SRIKANTH BELLAMKONDA, TOLGA BOZKAYA,
NATHAN FOLKERT, ABHINAV GUPTA, JOHN HAYDU, LEI SHENG, and
SANKAR SUBRAMANIAN
Oracle Corporation


Commercial relational database systems lack support for complex business modeling. ANSI SQL
cannot treat relations as multidimensional arrays and define multiple, interrelated formulas over
them, operations which are needed for business modeling. Relational OLAP (ROLAP) applications
have to perform such tasks using joins, SQL Window Functions, complex CASE expressions, and
the GROUP BY operator simulating the pivot operation. The designated place in SQL for calcula-
tions is the SELECT clause, which is extremely limiting and forces the user to generate queries
with nested views, subqueries and complex joins. Furthermore, SQL query optimizers are pre-
occupied with determining efficient join orders and choosing optimal access methods and largely
disregard optimization of multiple, interrelated formulas. Research into execution methods has
thus far concentrated on efficient computation of data cubes and cube compression rather than on
access structures for random, interrow calculations. This has created a gap that has been filled
by spreadsheets and specialized MOLAP engines, which are good at specification of formulas for
modeling but lack the formalism of the relational model, are difficult to coordinate across large user
groups, exhibit scalability problems, and require replication of data between the tool and RDBMS.
This article presents an SQL extension called SQL Spreadsheet to provide array calculations
over relations for complex modeling. We present optimizations, access structures, and execution
models for processing them efficiently. Special attention is paid to compile-time optimization for
expensive operations like aggregation. Furthermore, ANSI SQL does not provide a good separation
between data and computation and hence cannot support parameterization for SQL Spreadsheet
models. We propose two parameterization methods for SQL. One parameterizes ANSI SQL views
using subqueries and scalars, which allows passing data to SQL Spreadsheets. Another method
provides parameterization of the SQL Spreadsheet formulas; this supports building stand-alone
SQL Spreadsheet libraries. These models are then subject to the SQL Spreadsheet optimizations
at model invocation time.
Categories and Subject Descriptors: H.2.3. [Database Management]: Languages—Data manip-
ulation languages (DML); query languages; H.2.4. [Database Management]: Systems—Query
processing
General Terms: Design, Languages
Additional Key Words and Phrases: Excel, analytic computations, OLAP, spreadsheet


1. INTRODUCTION
One of the most successful analytical tools for business data is the spreadsheet.
A user can enter business data, define formulas over it using two-dimensional

Authors’ addresses: Oracle Corporation, 500 Oracle Parkway, Redwood Shores, CA 94065;
email: {andrew.witkowski,srikanth.bellamkonda,tolga.bozkaya,nathan.folkert,abhinav.gupta,john.
haydu,lei.sheng,sankar.subramanian}@oracle.com.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is
granted without fee provided that the copies are not made or distributed for profit or commercial
advantage, the copyright notice, the title of the publication, and its date appear, and notice is given
that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to
redistribute to lists requires prior specific permission and/or a fee.
C 2005 ACM 0362-5915/05/0300-0083 $5.00


                         ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 83–121.

array abstractions, construct simultaneous equations with recursive models,
pivot data and compute aggregates for selected cells, apply a rich set of business
functions, etc. Spreadsheets also provide flexible user interfaces like graphs and
reports.
   Unfortunately, the analytical usefulness of the RDBMS has not measured up to
that of spreadsheets [Blattner 1999; Simon 2000] or specialized MOLAP tools
like Microsoft Analytical Services [Peterson and Pinkelman 2000; Thomsen
et al. 1999], Oracle Analytic Workspaces [OLAP Application Developer’s Guide
2004], and others [Balmin et al. 2000; Howson 2002]. It is cumbersome and
in most cases inefficient to perform array calculations in SQL; this is a fundamental
problem resulting from the lack of language constructs to treat relations as arrays
and the lack of efficient random-access methods for them. To simulate array
computations on a relation, SQL users must resort to multiple self-joins to
align different rows, use ANSI SQL window functions to reach from one
row into another, or use the ANSI SQL GROUP BY operator to pivot a table
and simulate interrow computations with intercolumn ones. None of these operations
is natural or efficient for array computations with multiple formulas found in
spreadsheets.
   Spreadsheets, for example Microsoft Excel [Simon 2000], provide an excel-
lent user interface but have their own problems. They offer two-dimensional
“row-column” addressing. Hence, it is hard to build a model where formulas
reference data via symbolic references. In addition, they do not scale well when
the data set is large. For example, a single sheet in a spreadsheet typically
supports up to 64K rows with about 200 columns, and handling terabytes of
sales data is practically impossible even when using multiple sheets. Further-
more, spreadsheets do not support the parallel processing necessary to pro-
cess terabytes of data in small windows of time. In collaborative analysis with
multiple spreadsheets, it is nearly impossible to get a complete picture of the
business by querying multiple, inconsistent spreadsheets each using its own
layout and placement of data. There is no standard metadata or a unified ab-
straction interrelating them akin to RDBMS dictionary tables and RDBMS
relations.
   This article proposes spreadsheet-like computations in RDBMS through ex-
tensions to SQL, leaving the user interface aspects to be handled by OLAP tools.
Here is a glimpse of our proposal:

— Relations can be viewed as n-dimensional arrays, and formulas can be defined
  over the cells of these arrays. Cell addressing is symbolic, using dimensional
  columns.
— The formulas can automatically be ordered based on the dependencies be-
  tween cells.
— Recursive references and convergence conditions are supported, providing
  for a recursive execution model.
— Densification (filling gaps in sparse data) can be easily performed.
— Formulas are encapsulated in a new SQL query clause. Their result is a
  relation and can be further used in joins, subqueries, etc.

— The new clause supports logical partitioning of the data providing a natural
  mechanism of parallel execution.
— Formulas support INSERT and UPDATE semantics as well as correlation
  between their left and right sides. This allows us to simulate the effect of
  multiple joins and UNIONs using a single access structure.

    Furthermore, our article addresses the lack of parameterization models in ANSI
SQL. The issue is critical for model building as this ANSI SQL shortcoming
prevents us from constructing parameterized libraries of SQL Spreadsheets. We
propose two new parameterization methods for SQL. One parameterizes ANSI
SQL views with subqueries and scalars, allowing data to be passed to inner query
blocks and hence to SQL Spreadsheets. The second model is a parameterization
of the SQL Spreadsheet formulas. We can declare a named set of formulas, called
an SQL Spreadsheet Procedure, operating on an N-dimensional array that can be
invoked from an SQL Spreadsheet. The array is passed by reference to the SQL
Spreadsheet Procedure. We support merging of formulas from an SQL Spreadsheet
Procedure into the main body of an SQL Spreadsheet. This allows for global
formula optimizations, like removal of unused formulas. SQL Spreadsheet
Procedures are useful for building standalone SQL Spreadsheet libraries.
    This article is organized as follows. Section 2 provides SQL language ex-
tensions for spreadsheets. Section 3 provides motivating examples. Section 4
presents an overview of the evaluation of spreadsheets in SQL. Section 5 de-
scribes the analysis of the spreadsheet clause and query optimizations with
spreadsheets. Section 6 discusses our execution models. Section 7 describes our
parameterization models. Section 8 reports results from performance experi-
ments on spreadsheet queries, and Section 9 contains our conclusions. The elec-
tronic appendix explains parallel execution of SQL Spreadsheets and presents
our experimental results; it also discusses our future research in this area.


2. SQL EXTENSIONS FOR SPREADSHEETS

2.1 Notation
In the following examples, we will use a fact table f(t, r, p, s, c) representing a
data warehouse of consumer-electronic products with three dimensions: time
(t), region (r), and product (p), and two measures: sales (s) and cost (c).

2.2 Spreadsheet Clause
OLAP applications divide relational attributes into dimensions and measures.
To model that, we introduce a new SQL query clause, called the spreadsheet
clause, which identifies, within the query result, PARTITION, DIMENSION,
and MEASURES columns. The PARTITION (PBY) columns divide the relation
into disjoint subsets. The DIMENSION (DBY) columns identify a unique row
within each partition, and this row is called a cell. The MEASURES (MEA)
columns identify expressions computed by the spreadsheet and are referenced
by DBY columns. Following this, there is a sequence of formulas, each describing

a computation on the cells. Thus the structure of the spreadsheet clause is
     <existing parts of a query block>
     SPREADSHEET PBY (cols) DBY (cols) MEA (cols)
     <processing options>
     (<formula>, <formula>,.., <formula>)
   It is evaluated after joins, aggregations, window functions, and final projec-
tion, but before the ORDER BY clause.
   Cells are referenced using an array notation in which a measure is followed
by square brackets holding dimension values. Thus s[‘vcr’, 2002] is a reference
to the cell containing sales of the ‘vcr’ product in 2002. If the dimensions are
uniquely qualified, the cell reference is called a single cell reference, for example,
s[p=‘dvd’, t=2002]. If the dimensions are qualified by general predicates, the
cell reference refers to a set of cells and is called a range reference, for example,
s[p=‘dvd’, t<2002].
   Each formula represents an assignment and contains a left side that desig-
nates the target cells and a right side that contains the expressions involving
cells or ranges of cells within the partition. For example:
     SELECT r, p, t, s
     FROM f
     SPREADSHEET PBY(r) DBY (p, t) MEA (s)
     (
       s[p=‘dvd’,t=2002] =s[p=‘dvd’,t=2001]*1.6,
       s[p=‘vcr’,t=2002] =s[p=‘vcr’,t=2000]+s[p=‘vcr’,t=2001],
       s[p=‘tv’, t=2002] =avg(s)[p=‘tv’,1992<t<2002]
     )
   This query partitions table f by region r and defines that, within each re-
gion, sales of ‘dvd’ in 2002 will be 60% higher than ‘dvd’ sales in 2001, sales
of ‘vcr’ in 2002 will be the sum of ‘vcr’ sales in 2000 and 2001, and sales of
‘tv’ will be the average of ‘tv’ sales in the years between 1992 and 2002. As
a shorthand, a positional notation exists, for example: s[‘dvd’,2002] instead of
s[p=‘dvd’,t=2002].
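   For illustration only, the cell addressing described above can be pictured with a
small Python sketch (the data values are invented and the representation is a
simplification, not the engine's): a partition is a mapping from DBY values to measure
values, a single cell reference is a direct lookup, and a range reference is a predicate
applied during a scan.

  # Hypothetical cells of one partition, keyed by the DBY columns (p, t).
  cells = {
      ("dvd", 2001): 100.0,
      ("dvd", 2002): 150.0,
      ("vcr", 2000): 80.0,
      ("vcr", 2001): 70.0,
  }

  # Single cell reference s[p='dvd', t=2002]: one direct lookup.
  single = cells[("dvd", 2002)]

  # Range reference s[p='dvd', t<2002]: a scan restricted by the predicate.
  rng = [v for (p, t), v in cells.items() if p == "dvd" and t < 2002]

  print(single, rng)   # 150.0 [100.0]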
   The left side of a formula defines calculations that can span a range of cells.
A new function cv() (abbreviation for current value) carries the current value of
a dimension from the left side to the right side, thus effectively serving as a join
between right and left side. The * operator denotes all values in the dimension.
   The following spreadsheet clause states that sales of every product in the
‘west’ region for year >2001 will be 20% higher than sales of the same product
in the preceding year. Observe that region and product dimensions on the right
side reference function cv() to carry dimension values from left to the right
side.

     SPREADSHEET DBY (r, p, t) MEA (s)
     (
       s[‘west’,*,t>2001] = 1.2*s[cv(r),cv(p),t=cv(t)-1]
     )

   Formulas may specify a range of cells to be updated. A formula referring to
multiple cells on the left side is called an existential formula. For existential
formulas, the result may be order dependent. For example, the intention of the
following query is to compute the sales of ‘vcr’ for all years before 2002 as an
average of sales of 2 preceding years:
  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  (
    s[‘vcr’,t<2002]= avg(s)[‘vcr’,cv(t)-2<=t<cv(t)]
  )
  But processing rows in ascending or descending order with regard to di-
mension t produces different results as we are both updating and referencing
measure s. To avoid ambiguity, the user can specify an order in which the rule
should be evaluated:
  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  (
    s[‘vcr’, t<2002] ORDER BY t ASC =
                   avg(s)[cv(p), cv(t)-2<=t<cv(t)]
  )
   An innovative feature of SQL spreadsheet is the creation of new rows in
the result set. Any formula with a single cell reference on left side can operate
either in UPDATE or UPSERT (default) mode. The latter creates new cells
within a partition if they do not exist; otherwise it updates them. UPDATE
ignores nonexistent cells. For example,
  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  (
    UPSERT s[‘tv’, 2000] =
           s[‘black-tv’,2000] + s[‘white-tv’,2000]
  )
will create, for each region, a row with p=‘tv’ and t=2000 if this cell is not
present in the input stream.
   Semantics for the UPSERT operation is obvious when the left side qualifies
a single cell as this cell is then either updated or inserted. An interesting issue
is how to interpret UPSERT for an existential formula. For example,
  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  (
  UPSERT s[‘tv’,*] = s[‘black-tv’,cv()]+s[‘white-tv’,cv()]
  )
   This creates a new member of the product dimension, the ‘tv’ member, for
each of the values in the time dimension. In OLAP this is referred to as a calcu-
lated member. In SQL Spreadsheet, the UPSERT operation where one dimension d1
is qualified by a constant while the remaining ones d2, ..., dn are qualified by Boolean
conditions c2, ..., cn is defined to be a sequence of two operations: UPSERT and
UPDATE. We first determine the distinct values in the remaining dimensions:
  SELECT DISTINCT d2, ..., dn FROM input set WHERE c2, ..., cn

and perform upserts of these distinct values with the constant on dimension d1.
Then we execute the formula in UPDATE mode, updating the upserted
values. In the above example, we (logically) perform these two operations:
     SPREADSHEET PBY(r) DBY (p, t) MEA (s)
     (
       UPSERT s[‘tv’, FOR t IN
                (SELECT DISTINCT t FROM input set)] = NULL,
       UPDATE s[‘tv’, *] =s[‘black-tv’,cv()]
                         +s[‘white-tv’,cv()]
     )
   This easily generalizes to cases when there is more than one dimension qual-
ified with constants while others are qualified by Boolean conditions.

2.3 Reference Spreadsheets
OLAP applications frequently deal with objects of different dimensionality in
a single query. For example, the sales table may have region (r), product ( p),
and time (t) dimensions, while the budget allocation table has only a region
(r) dimension. To account for that, our query block can have, in addition to
the main spreadsheet, multiple read-only reference spreadsheets, which are
n-dimensional arrays defined over other query blocks. Reference spreadsheets,
akin to main spreadsheets, have DBY and MEA clauses, indicating their dimen-
sions and measures, respectively. For example, assume a budget table budget (r,
p) containing predictions p for a sales increase for each region r. The following
query predicts sales in 2002 in regions ‘east’ and ‘west’. For the ‘west’ region,
the prediction is based on the prediction factor p from the budget reference
table.
     SELECT r, t, s
     FROM f GROUP by r, t
     SPREADSHEET
       REFERENCE budget ON (SELECT r, p FROM budget)
                 DBY(r) MEA(p)
     DBY (r, t) MEA (sum(s) s)
     (
       s[‘west’,2002]= p[‘west’]*s[‘west’ ,2001],
       s[‘east’,2002]= s[‘east’,2001]+s[‘east’,2000]
     )
   The purpose of a reference spreadsheet is similar to a relational join, but it
allows us to perform, within a spreadsheet clause, multiple joins using the same
access structures (e.g., a hash table—see Section 6.1). Thus self-joins within a
spreadsheet can be cheaper compared to doing them outside.

2.4 Ordering the Evaluation of Formulas
By default, formulas are evaluated based on the order of their dependencies,
and we refer to it as the AUTOMATIC ORDER. For example in

  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  (
    s[‘dvd’,2002] = s[‘dvd’,2000] + s[‘dvd’,2001]
    s[‘dvd’,2001] = 1000
  )
the first formula depends on the second, and consequently we evaluate the
second one first. However, there are scenarios in which lexicographical ordering
of evaluation (i.e., the order in which the formulas are specified) is desired. For that,
we provide an explicit processing option, SEQUENTIAL ORDER, as in the
following:
  SPREADSHEET DBY(r,p,t) MEA(s) SEQUENTIAL ORDER
  (....<formulas>....)

2.5 ANSI Window Functions in SQL Spreadsheet
Many of the ANSI window functions can be emulated using aggregates on the
right side of the formulas or using an ORDER BY clause on their left side. How-
ever, for user convenience, we also allow the explicit use of window functions on
the right side of formulas. The window functions that are specified on the right
side of a formula are computed over the range of cells defined by the left side.
For example, the following formula computes the 3-year moving sum of sales
of each product for all times within a region (the window spans 3 years because
the window function specifies RANGE BETWEEN 1 PRECEDING AND 1
FOLLOWING, i.e., a total of 3 years):
  SPREADSHEET PBY(r) DBY (p, t) MEA (s, 0 mov_sum)
  (
    mov_sum[*, *] =
        sum(s) OVER (PARTITION BY p ORDER BY t
               RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING)
  )

2.6 Cycles and Recursive Models
Similar to existing spreadsheets, our computations may contain cycles, as in
the formula
   s[1] = s[1]/2.
   Consequently, we have processing options to specify the number of iterations
or the convergence criteria for cycles and recursion. The ITERATE (n) option
requests iteration of the formulas ‘n’ times. The optional UNTIL condition will
stop the iteration when the <condition> has been met, up to a maximum of
n iterations as specified by ITERATE (n). The <condition> can reference cells
before and after the iteration facilitating definition of convergence conditions.
A helper function previous(<cell>) returns the value of <cell> at the start of
each iteration. For example,
  SPREADSHEET DBY (x) MEA (s)
    ITERATE (10) UNTIL (PREVIOUS(s[1])-s[1] <= 1)
  (s[1] = s[1]/2)

will execute the formula s[1] = s[1]/2 until the convergence condition is met, up
to a maximum of 10 iterations (in this case if initially s[1] is between 1024 and
2047, evaluation of the formulas will stop after 10 iterations).
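   The contract of ITERATE and UNTIL can be sketched in a few lines of Python
(an illustration assuming a single measure cell, not the engine's code): the formulas
are applied up to n times, and after each pass the UNTIL condition is evaluated
against a snapshot taken at the start of that pass, which is what previous() exposes.

  def iterate_until(cells, body, until, n):
      """Apply body up to n times; stop once until(previous, current) holds.
      `previous` is a snapshot taken at the start of each iteration,
      mirroring the previous(<cell>) helper."""
      for _ in range(n):
          previous = dict(cells)        # values at the start of this pass
          body(cells)
          if until(previous, cells):
              break
      return cells

  # s[1] = s[1]/2, ITERATE (10) UNTIL (previous(s[1]) - s[1] <= 1)
  result = iterate_until(
      cells={1: 1024.0},
      body=lambda c: c.__setitem__(1, c[1] / 2),
      until=lambda prev, cur: prev[1] - cur[1] <= 1,
      n=10,
  )
  print(result[1])   # 1.0 when starting from 1024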

2.7 Spreadsheet Processing Options and Miscellaneous Functions
There are other processing options for the SQL spreadsheet in addition to the
ones for ordering of formulas and termination of cycles. For example, we can
specify UPDATE/UPSERT options as the default for the entire spreadsheet.
The IGNORE NAV (where NAV refers to nonavailable values) option allows us
to treat NULL values in numeric operations as 0, which is convenient for newly
inserted cells with the UPSERT option.
   The new predicate <cell> IS PRESENT indicates if the row indicated by the
<cell> existed before the execution of the spreadsheet clause and is convenient
for determining upserted values.

2.8 Semantics of Updates in SQL Spreadsheets
We note two important update properties of SQL Spreadsheet. First, SQL
Spreadsheet is part of a query block and hence doesn’t cause any modifica-
tion to the stored relations. Users can explicitly use an UPDATE or MERGE
statement to propagate changes made by the formulas to the target relations.
This involves an explicit join of the query with the spreadsheet to the target
relation. For example, to propagate the calculated member ‘tv’ from the UPSERT
query of Section 2.2 to the fact relation f, we could use an ANSI SQL MERGE statement
(note that UPDATE will not work as it does not support insertion into the target
table of nonjoining rows):
     MERGE INTO f USING
     ( SELECT r, p, t, s
       FROM f
       SPREADSHEET PBY(r) DBY (p, t) MEA (s)
       (
         UPSERT s[‘tv’, *] = s[‘black-tv’,cv()]
                           + s[‘white-tv’,cv()]
       )
     ) v
     ON f.r = v.r AND f.p = v.p AND f.t = v.t
     WHEN MATCHED THEN UPDATE SET f.s = v.s
     WHEN NOT MATCHED THEN INSERT
                           VALUES (v.r, v.p, v.t, v.s)
   Second, SQL Spreadsheet formulas compute measures which can later be
used by other formulas, that is, formulas can operate on data produced by
other formulas, and hence the order of their execution is important. This is in
contrast to the semantics of the ANSI SQL UPDATE ... WHERE ... statement.
In ANSI SQL, the WHERE condition is always applied to the (logical) copy of
the target relation rather than to its updated values. This allows for cleaner
but also less powerful semantics. In our case, changes made by prior formulas

are visible to the following ones to simulate classical spreadsheet and MOLAP
tools. This makes ordering of formulas important and imposes restrictions on
their optimizations like reordering or pruning of formulas. We elaborate on this
in Section 5.1.
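   The contrast can be made concrete with a short Python sketch (cell values and
formulas are invented; only the evaluation order matters): SQL Spreadsheet evaluates
each right side against the working copy, so later formulas see earlier results, while
an ANSI-style UPDATE evaluates every right side against a snapshot of the input.

  cells = {"a": 1, "b": 10}
  formulas = [
      ("a", lambda c: c["b"] + 1),   # a := b + 1
      ("b", lambda c: c["a"] * 2),   # b := a * 2
  ]

  # SQL Spreadsheet style: each formula sees the changes of the previous ones.
  seq = dict(cells)
  for target, f in formulas:
      seq[target] = f(seq)           # seq becomes {"a": 11, "b": 22}

  # ANSI UPDATE style: every right side reads the original snapshot.
  snapshot, ansi = dict(cells), dict(cells)
  for target, f in formulas:
      ansi[target] = f(snapshot)     # ansi becomes {"a": 11, "b": 2}

  print(seq, ansi)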

3. MOTIVATING EXAMPLE OF SPREADSHEET USAGE
Here is an example demonstrating the expressive power of SQL Spreadsheet
and its potential for efficient computation as compared to the alternative avail-
able in ANSI SQL.
   An analyst predicts sales for the year 2002. Based on business trends, sales
of ‘tv’ in 2002 is the sales in 2001 scaled by the average increase between 1992
and 2001. Sales of ‘vcr’ is the sum of sales in 2000 and 2001. Sales of ‘dvd’ is
the average of the three previous years. Finally, the analyst wants to introduce,
in every region, a new dimension member ‘video’ for the year 2002, defined as
sales of ‘tv’ plus sales of ‘vcr’. Assuming that rows for ‘tv’, ‘dvd’, and ‘vcr’ for
year 2002 already exist, we express this as
   SELECT r, p, t, s FROM f
   SPREADSHEET PBY(r) DBY (p, t) MEA (s)
   (
   F1: UPDATE s[‘tv’,2002] = s[‘tv’,2001] +
        slope(s,t)[‘tv’,1992<=t<=2001]*s[‘tv’,2001],
   F2: UPDATE s[‘vcr’, 2002] = s[‘vcr’,2000]+s[‘vcr’,2001],
   F3: UPDATE s[‘dvd’,2002] =
   (s[‘dvd’,1999]+s[‘dvd’,2000]+s[‘dvd’,2001])/3,
   F4: UPSERT s[‘video’, 2002] = s[‘tv’,2002]+s[‘vcr’,2002]
   )
   To express the above query in ANSI SQL, formula F1 would require an ag-
gregate subquery plus a join to the fact table f; formula F2, a double self-join of
the fact table; formula F3, a triple self join of the fact table; and formula F4, a
union operation. Such a query would not only be difficult to generate but would
also execute inefficiently. For the equivalent query using the SQL Spreadsheet
clause as shown above, we need to scan the data to generate a point-addressable
access structure like a hash table or an index for all formulas only once. The
slope function as expressed above requires a scan of the data to find rows sat-
isfying the predicate ‘1992<=t<=2001’. But if we can deduce from database
constraints that t is an integer, then formula F1 is first transformed into
   F1: UPDATE s[‘tv’,2002] = s[‘tv’,2001]+
       slope(s,t)[‘tv’,t in (1992,...,2001)]* s[‘tv’,2001]
   This way, the access structure can be used for random, multiple accesses
along the time dimension as opposed to a scan to find the rows satisfying the
predicate. Formulas F2, F3, and F4 can use the structure directly. The structure

Note: the aggregate function slope() is a recent addition to ANSI SQL [Zemke et al. 1999] and
denotes linear regression slope (the ANSI name of this function is regr_slope(), but we use the
shortened name slope() in this document).





                              Fig. 1. Cycles in the spreadsheet graph.


is then used multiple times, giving a performance advantage over the multiple
joins required by the equivalent ANSI SQL alternative. In real applications, we
expect hundreds of formulas, and consequently a single point-access structure
in place of hundreds of joins provides a significant performance advantage.
    As another example, consider a common financial calculation: determining
the maximum allowable mortgage payment for an individual. Assume that the
person’s income is from two sources, salary and capital gains. Salary minus
mortgage interest is taxed at 38%, and capital gains is taxed at 28%. Net income
is salary plus capital gains minus interest expense minus tax. The maximum al-
lowable mortgage interest expense (tax deductible) is 30% of net income. Given
the person’s salary and capital gains and the rules above, we want to find the
individual’s net income, total taxes, and maximum allowable interest expense.
To calculate this, we must solve three simultaneous equations.
    Assume a table ledger with two columns, account and balance, where each
row in the table holds the balance for one account. Using this table, the calcu-
lations described above can be performed in a single query:
     SELECT account, b
     FROM ledger
     SPREADSHEET DBY (account) MEA (balance b)
     RULES IGNORE NAV ITERATE (100)
     UNTIL (ABS(b[‘net’] - PREVIOUS(b[‘net’])) < 0.01)
     (
       F1: b[‘interest’] = b[‘net’] * 0.30,
       F2: b[‘net’] = b[‘salary’] + b[‘capital gains’]
                    - b[‘interest’] - b[‘tax’],
       F3: b[‘tax’] = (b[‘salary’]-b[‘interest’]) * 0.38
                    + b[‘capital gains’] * 0.28
     )
Note two cycles in the above formulas—see Figure 1. Formula F1 depends on
F2 and formula F2 depends on F1. Formula F2 also depends on F3 and F3
depends on F1. Since there are recursive references, the query is written using
the ITERATE option with a condition to terminate the iteration. In this case,
the query specifies that processing will be terminated after iterating over the
formulas 100 times or when the difference in the value of net income between
the previous iteration and the current iteration is less than 0.01.
   Although it may be possible to express this complex calculation using a single
ANSI SQL query, it is unlikely to perform well.
   Assume that the initial content of the ledger contains values for salary and
capital gains (see the Input balance column in Table I).

                                    Table I. Input Ledger
                       Account        Input Balance    Result Balance
                    Salary              100,000.00       100,000.00
                    Capital gains        15,000.00        15,000.00
                    Net                       0           61,382.80
                    Tax                       0           35,202.36
                    Interest                  0           18,414.83


   After 26 iterations, we satisfy the convergence condition and find values for
taxes, interest, and net income; the result is shown in the Result balance column
of Table I.
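   The fixpoint computation can be reproduced independently of the engine; the
following Python sketch iterates the three formulas in the order F1, F2, F3, with
zero-initialized accounts standing in for IGNORE NAV, and converges to the
Result balance values of Table I.

  # Iteratively solve the three interdependent ledger formulas.
  b = {"salary": 100_000.0, "capital_gains": 15_000.0,
       "net": 0.0, "tax": 0.0, "interest": 0.0}

  for i in range(1, 101):                      # ITERATE (100)
      prev_net = b["net"]                      # previous(b['net'])
      b["interest"] = b["net"] * 0.30                                           # F1
      b["net"] = b["salary"] + b["capital_gains"] - b["interest"] - b["tax"]    # F2
      b["tax"] = (b["salary"] - b["interest"]) * 0.38 + b["capital_gains"] * 0.28  # F3
      if abs(b["net"] - prev_net) < 0.01:      # UNTIL condition
          break

  print(i, round(b["net"], 2), round(b["tax"], 2), round(b["interest"], 2))
  # approaches net = 61,382.80, tax = 35,202.36, interest = 18,414.83 (cf. Table I)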

4. SQL SPREADSHEET EVALUATION OVERVIEW
We divide SQL Spreadsheet evaluation into three broad stages.
   The first stage is the spreadsheet analysis and optimization (see Section 5),
which analyzes the formulas to determine whether they are acyclic, a property that
determines the choice of execution method. This stage also performs a number of
formula optimizations like pruning of formulas, pushing predicates from the
outer query blocks into the spreadsheet block, etc. The analysis is done using
a graph representing dependencies between formulas, bounding rectangle the-
ory defining the scope of the outside filters, and known techniques for predicate
transformations like predicate push and pull. The result of the analysis is an
SQL Spreadsheet with transformed, more efficient formulas and a flag indicating
whether the formulas are cyclic or acyclic.
   The second stage involves building a random access structure on the data
coming to the spreadsheet. This structure is currently a hash table (see Section
6.1) but can be another structure like a B-Tree or Prefix Trees used for cube
compression [Lakshmanan et al. 2003; Sismanis et al. 2002], which supports
random cell access, partitioning of data, and data scans.
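   As a rough illustration of such an access structure (a simplified sketch with
invented rows, not the actual implementation), the incoming rows can be hashed on
their PBY and DBY values so that a single-cell reference becomes a constant-time
probe, while range and existential references fall back to a scan of the relevant
partition.

  from collections import defaultdict

  # Toy rows (r, p, t, s) with PBY = (r), DBY = (p, t), MEA = s.
  rows = [("west", "dvd", 2001, 10.0), ("west", "vcr", 2001, 7.0),
          ("east", "dvd", 2001, 4.0)]

  partitions = defaultdict(dict)          # PBY key -> {DBY key -> measures}
  for r, p, t, s in rows:
      partitions[(r,)][(p, t)] = {"s": s}

  # Single cell reference inside partition ('west',): one hash probe.
  cell = partitions[("west",)].get(("dvd", 2001))

  # Range reference s[*, 2001]: scan of that partition only.
  total = sum(m["s"] for (p, t), m in partitions[("west",)].items() if t == 2001)

  print(cell, total)   # {'s': 10.0} 17.0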
   The third stage (see Section 6.2) evaluates the formulas produced at the first
stage. We support three evaluation algorithms: one for spreadsheets with automatic
order and no cycles, one for spreadsheets with automatic order and runtime cycle
detection, and one for sequential spreadsheets. The algorithms use the hash
structure built in the second stage to execute formulas
in groups (called levels) so that the scans required for aggregate evaluation are
minimized.

5. SPREADSHEET ANALYSIS AND OPTIMIZATION
The spreadsheet analysis determines the order of evaluation of formulas,
prunes formulas whose results are fully filtered out by outer queries, restricts
the formulas whose results are partially filtered, migrates predicates from outer
queries into the inner WHERE clause to limit the data processed by the spread-
sheet, and generates a filter condition to identify the cells that are required
throughout the evaluation of the spreadsheet formulas.
  The analysis also determines one of two types of execution methods: one for
acyclic and one for (potentially) cyclic formulas. Because of complex predicates

in formulas, analysis cannot always ascertain acyclicity of formulas in the
spreadsheet. Hence, we sometimes use an expensive cyclic execution method
for an acyclic spreadsheet.

5.1 Formula Dependencies and Execution Order
The order of evaluation of formulas is determined from their dependency graph.
Formula F1 depends on F2 (written F2 → F1) if a cell evaluated by F2 is used
by F1. For example, in
     F1: s[‘video’,2000] = s[‘tv’,2000] + s[‘vcr’,2000]
     F2: s[‘vcr’,2000] = s[‘vcr’,1998] + s[‘vcr’,1999]
F2 → F1 as F1 requires a cell s[‘vcr’,2000] computed by F2. To form the →
relation, for each formula F, we determine cells that are referenced on its right
side, R(F), and cells that are modified on its left side, L(F). Obviously, F2 → F1
if and only if R(F1) intersects L(F2). In the presence of complex cell references,
like s[‘tv’, t2+t3+t4 < t5], it is hard to determine the intersection of predicates. In
this case, we assume that the formula references all cells. This may result in the
overestimation of the → relation, leading to spurious cycles in the dependency
graph.
    The → relation results in a graph with formulas as nodes and their depen-
dency relationships as directed edges. The graph is then analyzed for (partial)
ordering.
   A spreadsheet formula can access a range of cells (e.g., an aggregate such as
avg(s)[‘tv’,*] or the left side of an existential formula such as s[*, *] = 10) and thus
requires a scan of the data. If two formulas are independent, that is, unrelated in the
partial order derived from the graph, they can be evaluated concurrently using
a single scan. For concurrent evaluation, formulas are grouped into enumerated
levels such that each level contains independent formulas, and no formula in
the level depends on a formula in a higher level.
    The path through the partial order with the maximum number of scans rep-
resents the minimum number of total scans possible, since they are all related
by the partial order. If we have an acyclic graph, then we can minimize the
number of levels containing scans to this value. The following algorithm gener-
ates the levels such that the number of scans is minimized (proof of minimality
is available from the authors).
   Let G(F, E) be the graph of the → relation, where F is the set of formulas and E
is the set of → edges. We will call a formula with no incoming edges a source and
a formula with only single-cell references a single ref:
     GenLevels(G)
     {
         LEVEL <- 1
         WHILE (F is not empty)
         {
            Find the set FS of all the SOURCES in F
            IF (cycle is detected)
               break the cycle /* see below */
            ELSE IF (FS contains any single refs)
            {
               assign single refs in FS to level LEVEL;
               F = F - {single refs in FS}
            }
            ELSE IF (FS contains only scans)
            {
               assign formulas in FS to level LEVEL;
               F = F - FS
            }
            LEVEL <- LEVEL + 1;
         }
     }
   Consider the following query. Here, the spreadsheet graph has one edge:
F3 → F2. The algorithm will assign the point reference F3 to level 1 and the
scan F2 to level 2, but will delay assigning the scan F1 until level 2 so that F1
and F2 can share a single scan.
  SELECT * FROM f
  GROUP BY p, t
  SPREADSHEET DBY(p,t) MEA(sum(s) s)
  (
    F1: s[‘tv’, 2000] = sum(s)[‘tv’, 1990<t<2000],
    F2: s[‘vcr’,2000] = sum(s)[‘vcr’, 1995<t<2000],
    F3: s[‘vcr’,1999]=s[‘vcr’,1997]+s[‘vcr’,1998]
  )
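   The level assignment for this query can be sketched in Python under a
simplification: each formula is summarized by the cells it writes, the cells it reads,
and whether it needs a scan; dependency edges run from writers to readers, and
levels are peeled off the sources, taking single-reference formulas first so that scans
are delayed and shared, in the spirit of GenLevels. This is an illustration only, not
the optimizer's code.

  # Each formula: writes, reads, needs_scan; cells are (p, t) pairs (F1-F3 above).
  formulas = {
      "F1": ({("tv", 2000)},  {("tv", y) for y in range(1991, 2000)},  True),
      "F2": ({("vcr", 2000)}, {("vcr", y) for y in range(1996, 2000)}, True),
      "F3": ({("vcr", 1999)}, {("vcr", 1997), ("vcr", 1998)},          False),
  }

  # Fa -> Fb when Fb reads a cell that Fa writes.
  deps = {b: {a for a, (w, _, _) in formulas.items()
              if a != b and w & formulas[b][1]} for b in formulas}

  levels, remaining, level = {}, set(formulas), 1
  while remaining:
      sources = {f for f in remaining if not (deps[f] & remaining)}
      assert sources, "cycle detected"          # acyclic case only
      point_refs = {f for f in sources if not formulas[f][2]}
      chosen = point_refs or sources            # prefer single refs; delay scans
      for f in chosen:
          levels[f] = level
      remaining -= chosen
      level += 1

  print(levels)   # e.g. {'F3': 1, 'F1': 2, 'F2': 2}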
   The GenLevels algorithm presented above simplifies the cyclic case. Before
generating the levels, the graph is analyzed for strongly connected components
using algorithms from Tarjan [1972]. We can then isolate cyclic subgraphs from
acyclic parts of the graph and from other cyclic subgraphs. This is important
because the computational complexity of cyclic evaluation is proportional to
the total number of rows updated or upserted in a cycle (see the autocyclic
algorithm in Section 6.2). After levels have been assigned to the formulas that a
cyclic subgraph depends on, the cycle is broken by removing formulas from the
subgraph one at a time and assigning each to its own level, in the same order,
until the subgraph is exhausted.
   For spreadsheets with a sequential order of evaluation, the dependency edges
created always point from an earlier formula to a later one. A sequential-
order spreadsheet graph can therefore never be cyclic. We still generate levels
in order to group the independent formulas together and, hence, minimize the
number of scans that are required for the computation of aggregates and exis-
tential rules in the spreadsheet.

5.2 Pruning Formulas
We expect that, to encapsulate common computations, applications will gener-
ate views containing spreadsheets with hundreds of formulas. Users querying

these views will likely require only a subset of the result and, hence, put pred-
icates over the views. This gives us an opportunity to prune formulas that
compute cells discarded by these predicates. For example:
     SELECT * FROM
     ( SELECT r, p, t, s FROM f
       SPREADSHEET PBY(r) DBY (p, t) MEA (s) UPDATE
       (
         F1: s[‘dvd’,2000]=s[‘dvd’, 1999]*1.2,
         F2: s[‘vcr’,2000]=s[‘vcr’,1998]+s[‘vcr’,1999],
         F3: s[‘tv’, 2000]=avg(s)[‘tv’, 1990<t<2000]
       )
     )
     WHERE p in (‘dvd’, ‘vcr’, ‘video’)
  The evaluation of formula F3 is unnecessary as the outer query filters out
the cell that F3 evaluates. The above formulas are independent, and this makes
the pruning process simple. Now, let’s say, we had a formula F4 that depends
on F3, such as
     F4: s[‘video’,2000]=s[‘vcr’,2000]+s[‘tv’,2000]
   Then F3 cannot be pruned as it is referenced by F4.
   The evaluation of a formula becomes unnecessary when the following condi-
tions are satisfied:

— The cells it updates are not used in evaluation of any other formula.
— The cells updated by the formula are filtered out in the outer query block or
  the measure updated by the formula is never referenced in the outer query
  block.

   Identification of formulas that can be pruned is done by the following algo-
rithm based on the dependency graph G. Let a sink be a formula with no outgoing
edges, that is, one that no other formula depends on.
     PruneFormulas(G)
     {
         Find a set FS of all SINKS
         WHILE (FS is not empty)
         {
             Pick a formula Fi from FS;
             FS = FS - {Fi} /* remove Fi from FS */

             IF ( all the cells referenced on the left side of Fi
                  are filtered out in the outer query block
                       OR
                  the measure updated by the left side of Fi is
                  not referenced in the outer query block)
             {
               F = F - {Fi} /* delete Fi from the list F */
               E = E - {all incoming edges into Fi};

               IF deletion of Fi generates new ‘sink’ nodes
                 insert them into the set FS
             }
         }
     }
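   A compact Python rendering of the same idea (a simplification, not the optimizer's
code) starts from the sinks of the dependency graph and drops a formula when the
outer query block cannot observe anything it writes; removing a formula may expose
new sinks.

  def prune(deps, observable):
      """deps[f] = formulas that f depends on; observable(f) is True when the
      outer query block can still see a cell or measure written by f."""
      dependents = {f: {g for g in deps if f in deps[g]} for f in deps}
      sinks = [f for f in deps if not dependents[f]]
      while sinks:
          f = sinks.pop()
          if not observable(f):
              for g in deps.pop(f):             # drop f and its incoming edges
                  dependents[g].discard(f)
                  if not dependents[g]:         # g may have become a sink
                      sinks.append(g)
      return set(deps)

  # F1-F3 from the example; the outer filter keeps only 'dvd', 'vcr', 'video',
  # so F3 (which writes the 'tv' cell) is a prunable sink.  Adding an observable
  # F4 with deps = {"F4": {"F3"}, ...} would keep F3, as discussed above.
  deps = {"F1": set(), "F2": set(), "F3": set()}
  print(prune(deps, observable=lambda f: f != "F3"))   # F3 pruned; F1, F2 remain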

5.3 Rewriting Formulas
Pruning formulas alone is not sufficient to avoid unnecessary computations dur-
ing spreadsheet evaluation. In some cases, the results computed by a formula
may be partially filtered out in the outer query block. Consider the following
query which predicts the sale of all products in 2002 to be twice the cost of the
same product in 2002, and then selects the sale and cost values for ‘dvd’ and
‘vcr’ for years ≥ 2000:
  SELECT * FROM
  ( SELECT r, p, t, s FROM f
    SPREADSHEET PBY(r) DBY (p, t) MEA (s,c) UPDATE
    (
      F1: s[*,2002]=c[cv(p), 2002]*2,
    )
  )
  WHERE p in (‘dvd’,‘vcr’) and t ≥ 2000;
   The formula F1 cannot be pruned away as part of its result is needed in
the outer query block. Still, we do not need to compute the “s” values for all
products in 2002 as the outer query filters out all the rows except for products
‘dvd’ and ‘vcr’. Hence we rewrite the left side of formula F1 as follows to avoid
unnecessary computation:
  F1’: s[p in (‘dvd’,‘vcr’),2002]= c[cv(p), 2002]*2
   The rewriting of formulas is done with a small extension of the algorithm
PruneFormulas. In the new PruneFormulas, we try to rewrite the formulas
in all sink nodes that we cannot prune. Note that, similar to the pruning of
a formula, the rewrite of a formula may also change the dependency graph
(some incoming edges of the formula might be deleted), possibly leading to the
generation of new sink nodes, so it is only natural that both rewrite and pruning
of formulas are handled in the same procedure.

5.4 Pushing Predicates Through Spreadsheet Clauses
Pushing predicates into an inner query block [Srivastava and Ramakrishnan
1992] and its generalization “predicate move-around” [Levy et al. 1994] is an
important optimization and has been incorporated into queries with spread-
sheets. We perform three types of pushing optimization: pushing on PBY and
independent DBY dimensions, pushing based on bounding rectangle analysis,
and pushing through reference spreadsheets.

   Pushing predicates through the PBY expressions in or out of the query block
is always correct as they filter entire partitions. For example, in
     SELECT * FROM
     ( SELECT r, p, t, s FROM f
       SPREADSHEET PBY(r) DBY (p, t) MEA (s) UPDATE
       (
         F1:s[‘dvd’,2000]=s[‘dvd’,1999]+s[‘dvd’,1997],
         F2:s[‘vcr’,2000]=s[‘vcr’,1998]+s[‘vcr’,1999]
       )
     )
     WHERE r = ‘east’ and t = 2000 and p = ‘dvd’;
we push the predicate r = ‘east’ through the spreadsheet clause into the WHERE
clause of the inner query.
   Pushing can be extended to independent dimensions. A dimension d is called
an independent dimension if the value of d referenced on the right side is the
same as the value of d on the left side for every formula. For example, in the
above spreadsheet, the left side of F1 refers to the same values of p on the right
side. This is true for formula F2 as well, thereby making p an independent
dimension. t, however is not an independent dimension. Observe that in the
absence of UPSERT rules, independent dimensions are functionally equivalent
to the partitioning dimensions and can be moved from the DBY to the PBY
clause. For example, in the above spreadsheet, we could replace the PBY/DBY
clauses with
     SPREADSHEET PBY(r, p) DBY (t) MEA (s) UPDATE
   Consequently, we can push predicate p = ‘dvd’ into the inner query.
   We also pull predicates on the PBY and independent DBY columns out of the
query to effect the predicate move-around described in Levy et al. [1994].
   The outer predicates on the other (not independent) DBY columns can also
be pushed in, but we need to extend the predicates so they do not filter out
the cells referenced by the right sides of the formulas. For each formula, we
construct a predicate defining the rectangle bounding the cells referenced on
the right side. For example, for F2 these predicates are p = ‘vcr’ and t in (1998,
1999), and for F1 they are p = ‘dvd’ and t in (1997, 1999). Then a bounding rectangle
for the entire spreadsheet, which is a union of the bounding rectangles of the
individual formulas, is obtained using methods described in Guttman
[1984] and Beckmann et al. [1990]. In our case this is p in (‘vcr’, ‘dvd’) and t in (1997, 1998, 1999). Then
the predicates on the DBY columns from the outer query are extended with the
corresponding predicates from the spreadsheet bounding rectangle, and these
are pushed into the query. In our example, we extend the outer predicate t =
2000 with t in (1997, 1998, 1999), which results in pushing t in (1997, 1998,
1999, 2000). The predicates on the DBY expressions in the outer query block
are kept in place unless the pushdown filter is the same as the outer filter and
there are no upsert formulas in the spreadsheet.
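   The rectangle computation itself is straightforward; the sketch below (using a
simplified representation in which each formula contributes a set of admissible
values per DBY column) unions the per-formula rectangles and extends the outer
predicate before it is pushed.

  # Cells referenced on the right side of each formula, per DBY column.
  rhs_refs = {
      "F1": {"p": {"dvd"}, "t": {1997, 1999}},
      "F2": {"p": {"vcr"}, "t": {1998, 1999}},
  }

  # Bounding rectangle of the whole spreadsheet: per-column union.
  rect = {}
  for refs in rhs_refs.values():
      for col, vals in refs.items():
          rect.setdefault(col, set()).update(vals)

  # Outer predicate t = 2000, extended with the rectangle before being pushed.
  pushed_t = sorted({2000} | rect["t"])
  print(sorted(rect["p"]), pushed_t)   # ['dvd', 'vcr'] [1997, 1998, 1999, 2000]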
   We apply the above optimization if all formulas operate in the UPDATE
mode or if the spreadsheet has no PBY clause. With a PBY clause, the pushed

predicate could eliminate an entire partition and upsert of new cells would
never take place for it, resulting in missing rows in the output. For example,
consider
  SELECT * FROM
  ( SELECT r, p, t, s FROM f
    SPREADSHEET PBY(r) DBY (p, t) MEA (s) UPSERT
    ( s[‘dvd’,2003] = s[‘tv’,2003]* 0.5 )
  )
  WHERE p IN (‘dvd’, ‘vcr’)
Based on the bounding rectangle analysis, the unioned predicate p IN (‘dvd’,
‘vcr’, ‘tv’) is a candidate for pushing down. If, however, there is a re-
gion, say ‘west’, with no ‘dvd’ or ‘vcr’ or ‘tv’ sales and the predicate is
pushed down, the entire region is eliminated, and the new row (r=‘west’,
p=‘dvd’, t=2003, s=null) will not be upserted, violating spreadsheet
semantics.
   However, even in the presence of PBY and UPSERT formulas, the predicate
can be pushed in many situations. If the upserted cells for the empty partition
will be filtered out by the outer query, then it doesn’t matter whether the rows
for a partition are filtered out before or after spreadsheet computation. For
example, assume that the outside filter was
  p IN (‘dvd’,‘vcr’) AND s IS NOT NULL
   Since region ‘west’ by assumption has no ‘tv’ sales, the spreadsheet upserts
the row (r=‘west’, p=‘dvd’, t=2003, s=null), which is subsequently elimi-
nated by the outer filter s IS NOT NULL. Our analysis determines whether upserted
measures can assume null values and whether an outside filter removes the null
values of these measures. If so, we push predicates derived from the
bounding rectangle analysis. In practical scenarios, applications operate in
upsert mode and are not interested in NULL measures, making this option
useful.
   A challenging scenario arises when the bounding rectangle for a formula
cannot be determined at optimization time since it may depend on a sub-
query S whose bounds are known only after S’s execution. This is common
in OLAP queries, which frequently inquire about the relationship of a mea-
sure at a child level to that of its parent (e.g., sales of a state as a per-
centage of sales of a country), or inquire about a prior value of a measure
(e.g., sales in March 2002 vs. sales the same month a year ago or a quar-
ter ago). These relationships are obtained by querying dimension tables. For
example, assume that the primary key of time dimension time_dt is month m
and the table time_dt stores the corresponding month a year ago as m_yago,
and the corresponding month a quarter ago as m_qago. Note that “quarter ago”
means three months ago, so quarter ago of 1999-01 is 1998-10 (see
Table II).
   An analyst wants to compute for a product ‘dvd’ and months (1999-01,
1999-03) the ratio of each month’s sales to the sales in the corresponding
months a year and quarter ago, respectively (r_yago and r_qago). Using SQL

                        Table II. Mapping Between m and m_yago/m_qago
                                     m           m_yago      m_qago
                                     1999-01     1998-01     1998-10
                                     1999-02     1998-02     1998-11
                                     1999-03     1998-03     1998-12


Spreadsheet, this query, which we will call Q1, is
   Q1:
   SELECT p, m, s, r_yago, r_qago FROM
   ( SELECT p, m, s FROM f GROUP BY p, m
     SPREADSHEET
       REFERENCE prior ON
        (SELECT m, m_yago, m_qago FROM time_dt)
         DBY(m) MEA(m_yago, m_qago)
     PBY(p) DBY (m) MEA (sum(s) s, r_yago, r_qago)
     (
       F1: r_yago[*] = s[cv(m)] / s[m_yago[cv(m)]],
       F2: r_qago[*] = s[cv(m)] / s[m_qago[cv(m)]]
     )
   )
   WHERE p = ‘dvd’ and m IN (1999-01, 1999-03)
   A reference spreadsheet serves as a one-dimensional lookup table mapping
month m to the corresponding month a year ago (m_yago) and a quarter ago
(m_qago). An alternative formulation of the query using ANSI SQL requires the
joins f ⋈ time_dt ⋈ f ⋈ f, where the first join gives the month values a
year and a quarter ago for each row in the fact table and the other two joins give
the sales values in the month a quarter ago and a year ago, respectively.
The number of joins is reduced to one using a reference spreadsheet.
   The predicate p = ‘dvd’ on the PBY column can be pushed into the inner
block. However, m is not an independent dimension, nor can bounding rect-
angles be determined for it as the values m_yago and m_qago are unknown.
Consequently, a restriction on m cannot be pushed in, resulting in all time pe-
riods being pumped to the spreadsheet, out of which all except 1999-01 and
1999-03 are subsequently discarded in the outer query. Let’s call a dimension
d a functionally independent dimension if, for every formula, the value of d
referenced on the right side is either the same as the value of d on the left
side or a function of the value of d on the left side via a reference spreadsheet.
In query Q1 given above, m is a functionally independent dimension, as the
right side uses m directly or uses a function of the value of m on the left side:
m_yago[cv(m)] and m_qago[cv(m)].
   We experimented with three transformations to push predicates through
functionally independent dimensions. In the first, called ref-sub-query push-
ing, we add into the inner block a subquery predicate, which selects all values
needed by the spreadsheet and the outer query. The transform is similar to the
magic set transformation [Mumick et al. 1990] which pushes a query derived

from outer predicates into the inner block. In the above case, the outer query
needs m IN (1999-01, 1999-03), and the spreadsheet needs these values plus
their corresponding m_yago and m_qago values from the reference spreadsheet.
These values can be obtained by constructing a subquery over the reference
spreadsheet as shown in Q2:
Q2:
  WITH ref_sub_query AS
    (SELECT m, m_yago, m_qago FROM time_dt
     WHERE m IN (1999-01, 1999-03))
  SELECT m AS m_value FROM ref_sub_query
  UNION
  SELECT m_yago AS m_value FROM ref_sub_query
  UNION
  SELECT m_qago AS m_value FROM ref_sub_query
and then pushing it into the inner block of the query:
  SELECT p, m, s, r_yago, r_qago FROM
  ( SELECT p, m, s FROM f
    WHERE m IN (SELECT m_value FROM Q2 above)
    GROUP BY p, m
    SPREADSHEET
    <.. as above in query Q1 .. >
  )
  WHERE p = ‘dvd’ and m IN (1999-01, 1999-03)
   In the second transformation, called extended pushing, we construct the
pushed-in predicates by executing the reference spreadsheet query, obtaining
the referenced values and building predicates on the dimension, and finally
disjuncting them with the outer predicates. In the above case we execute
  SELECT DISTINCT m_yago, m_qago FROM time_dt
  WHERE m IN (1999-01, 1999-03)
to obtain the values for m_yago and m_qago corresponding to m IN (1999-01,
1999-03). Let’s assume that the corresponding m_yago is (1998-01, 1998-03) and
m_qago is (1998-10, 1998-12), that is, the first and third month of the previous
quarter. Finally, we push this predicate into the inner query:
  SELECT p, m, s, r_yago, r_qago FROM
  ( SELECT p, m, s FROM f
    WHERE m IN (1999-01, 1999-03, /* outer preds */
                1998-01, 1998-03, /* previous year */
                1998-10, 1998-12) /* previous quart */
    GROUP BY p, m
    SPREADSHEET
    <.. as above in query Q1 .. >
  )
  WHERE p = ‘dvd’ and m IN (1999-01, 1999-03)

   In the third transformation, called formula unfolding, we transform the for-
mulas by replacing the reference spreadsheet with its values. Similarly to the
second transformation, we execute the reference spreadsheet and obtain its mea-
sures for each of the dimension values requested by the outer query. These
values are then used to unfold the formulas. For example, for m = 1999-01
the value of m_yago is 1998-01 and that of m_qago is 1998-10, and for m = 1999-03
the value of m_yago is 1998-03 and that of m_qago is 1998-12. Thus the formulas
are unfolded as
   SELECT p, m, s, r_yago, r_qago FROM
   ( SELECT p, m, s FROM f GROUP BY p, m
     SPREADSHEET
       REFERENCE prior ON
        (SELECT m, m_yago, m_qago FROM time_dt)
         DBY(m) MEA(m_yago, m_qago)
     PBY(p) DBY (m) MEA (sum(s) s, r_yago, r_qago)
     (
       F1:  r_yago[1999-01] = s[1999-01] / s[1998-01],
       F1': r_yago[1999-03] = s[1999-03] / s[1998-03],
       F2:  r_qago[1999-01] = s[1999-01] / s[1998-10],
       F2': r_qago[1999-03] = s[1999-03] / s[1998-12]
     )
   )
   WHERE p = ‘dvd’ and m IN (1999-01, 1999-03)
   Following formula unfolding, we perform analysis of the bounding rectangles
described above and push the resulting bounding predicate into the inner query.
   In our experiments (see Section 8), the extended pushing and formula un-
folding transformations resulted in similar performance as in most cases they
push in the same predicates. In comparison, the ref-sub-query push transform
had inferior performance. The use of ref-sub-query gives the optimizer a choice
of join method between the subquery and the main query block. The optimizer
sometimes selects a more expensive join method, thereby slowing down the
query (see experimental results in Section 8).

5.5 Optimizations of Aggregates
SQL Spreadsheet allows us to express complex business models within a single
query block. Frequently the models will include multiple aggregates on subsets
of the data relative to the current row; hence their optimization is critical.
Consider this query:
   SELECT r,p,t,s, ps FROM t
   SPREADSHEET PBY(r) DBY(p,t) MEA(s, t, 0 ps)
   UPDATE
   (
     ps[*,*] = s[cv(p), cv(t)-1] *
               (1+slope(s,t)[cv(p), cv(t)-5 <= t <= cv(t)-1])
   )

   It computes the projected sales, ps, of each product for every year. ps is
computed by multiplying the actual sales s from the previous year by one plus
the rate of increase of sales (expressed with the slope aggregate function) over the
last 5 years. The aggregate is relative to the current row: for each row on the left
side we take its product and compute the slope over its 5 previous years. With
a naive execution, this query is expensive. The right side of the formula is
computed for each row coming into the spreadsheet. The right side has an aggre-
gate function, which requires a full table scan of table t. Hence there are as
many full table scans as there are rows in table t, a prohibitively expensive execution
plan.
   Spreadsheet evaluation can be optimized by reducing the number of table
scans. For each cell on the left side, that is, for each product and each year,
we have to access sales for the previous 5 years of the product to compute
the requested slope aggregate. Suppose that before evaluating the formula we
partition data by product and order each partition by year. Then within each
product partition we will consider a sliding window of 5 past years. Thus, for
year 2000 we will look at years 1995–1999, for year 2001 at years 1996–2000,
etc. As we slide the window we can compute, with a single scan of sorted data,
the slope aggregate for each window frame. The slope can be expressed as sum
and count aggregates and thus belongs to the family of algebraic aggregates
[Gray et al. 1996] and hence can be maintained incrementally during sliding
window operation. ANSI SQL provides window functions [Zemke et al. 1999]
for that operation and many database systems (Oracle, DB2) already provide a
native implementation for them. The slope aggregate
  slope(s,t)[cv(p), cv(t)-5 <= t <= cv(t)-1]
can be rewritten using ANSI SQL window formulation as
  slope(s,t) OVER (PARTITION BY p ORDER BY t RANGE
                BETWEEN 5 PRECEDING AND 1 PRECEDING)
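   Because the slope is algebraic, it can be maintained from a handful of running
sums as the window slides, which is what makes the rewrite cheap. The Python
sketch below illustrates the incremental bookkeeping with the textbook
regression-slope formula; the exact frame bounds used by the query are orthogonal
to the technique and are simplified here to the last few points.

  from collections import deque

  def sliding_slope(points, width):
      """Regression slope of (t, s) pairs over a sliding window of the most
      recent `width` points, maintained incrementally (t sorted ascending)."""
      win, out = deque(), []
      n = sx = sy = sxx = sxy = 0.0
      for t, s in points:
          win.append((t, s))
          n, sx, sy, sxx, sxy = n + 1, sx + t, sy + s, sxx + t * t, sxy + t * s
          if len(win) > width:                  # retire the oldest point
              ot, os_ = win.popleft()
              n, sx, sy = n - 1, sx - ot, sy - os_
              sxx, sxy = sxx - ot * ot, sxy - ot * os_
          denom = n * sxx - sx * sx
          out.append((n * sxy - sx * sy) / denom if denom else None)
      return out

  # Sales growing by 10 per year: the slope is 10 once the window has 2 points.
  data = [(1995, 50.0), (1996, 60.0), (1997, 70.0), (1998, 80.0), (1999, 90.0)]
  print(sliding_slope(data, width=5))   # [None, 10.0, 10.0, 10.0, 10.0]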
  We can rewrite an aggregate with a window function when (1) the for-
mula is not self-cyclic, and (2) one dimension of the aggregate defines a
window relative to the current row using cv() and all other dimensions
are qualified by the values from the current row, that is, by cv(). Let’s de-
note the other dimensions as Dcv. The Dcv dimensions are used to partition
the data (see the product dimension above) and the dimension defining the
window is used for sorting within the partitions (see the time dimension
above).
  The algorithm GenLevels assigning formulas to execution levels places for-
mulas with window functions at the same level if they can share a sort. For
example, the two formulas in
  SPREADSHEET DBY(r,p,t) MEA(s, t, 0 r, 0 w) UPDATE
  (
    w[*,*,*]= AVG(s)[cv(), cv(), cv(t)-5 <= t <= cv(t)],
    r[*,*,*] =SUM(s)[cv(), p =‘vcr’, 1999 < t <= 2000]
  )

will be rewritten with two window functions:
   AVG(s) OVER (PARTITION BY r, p ORDER BY t RANGE
                BETWEEN 5 PRECEDING AND CURRENT ROW),
   SUM(CASE WHEN p=‘vcr’ AND 1999<t<=2000 THEN s ELSE NULL END) OVER
                     (PARTITION BY r)
and a single sort on (r, p, t) will satisfy both formulas.

5.6 Optimization of Qualified Aggregates
It is common for aggregates to apply to a predetermined set of cells that is much
smaller than the total number of cells in the model. Consider aggregates applied
to a window of cells around the current cell, as in this example of forecasting
sales for DVDs in the next 2 years based on the 3-year moving average over the
model:
   SPREADSHEET DBY (p, t) MEA (s, 0 mavg)
   (
     mavg[‘dvd’, FOR t FROM 2000 TO 2001] =
        1.05 * AVG(s)[cv(), cv()-3 <= t <= cv()-1]
   )
   This aggregate would normally require a scan of all the rows in the access
structure to determine the rows satisfying the predicate in the aggregate cell
reference. If the aggregate set is significantly smaller than the partition cur-
rently being processed, it may be more efficient to explicitly enumerate and look
up each value that falls within this set rather than perform a scan. To allow
this functionality, we provide the qualified loop operator that allows a user to
specify an enumerated set on which to compute aggregates in the assignment
expression.
   An equivalent expression for the forecast above (when years are positive
integers) is
   SPREADSHEET DBY (p, t) MEA (s, 0 mavg)
   (
     mavg[‘dvd’, FOR t FROM 2000 TO 2001] =
       1.05 * AVG(s)[cv(), FOR t FROM cv()-3 TO
                               cv()-1 INCREMENT 1]
   )
   In this case, for each unfolded formula on the left side, there will be three
direct cell lookups generated by the formula on the right side rather than a
condition applied across a scan of the entire dataset. Such an aggregate is
called a qualified aggregate.
   Each dimension in a qualified aggregate must be fully qualified as either a
FOR loop or as a single-value equality expression. Individual cell values for
lookup are generated by incrementing the qualified expressions from left to
right through the cell index.
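   For instance, for the unfolded cell mavg[‘dvd’, 2001] in the example above,
incrementing the FOR expression on the right side generates exactly three point
lookups (listed here only to illustrate the enumeration, not as engine output):
     s[‘dvd’, 1998], s[‘dvd’, 1999], s[‘dvd’, 2000]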
   The performance of qualified and existential aggregates is discussed in
Section 8.

                    Table III. Summary of Formula Transformations
 Transformation                          Major Technique
 Pruning of formulas                     Determining formula sinks and rows filtered by the outer query
 Rewriting of formulas                   Changing the scope of formulas based on the outer filters
 Pushing of predicates                   Pushing through PBY, pushing through independent dimensions,
                                           bounding rectangles on outside filters
 Data-dependent pushing of predicates    Ref-sub-query pushing, formula unfolding, extended pushing
 Optimization of aggregates              Conversion of aggregates to window functions
 Optimization of qualified aggregates    Using point access instead of scans for aggregates


  Table III summarizes the transformation strategies used for SQL Spread-
sheet optimizations.

6. SQL SPREADSHEET EXECUTION
Spreadsheet evaluation is handled just like any other operation in the RDBMS
query evaluation engine. The spreadsheet evaluation operator takes a set of
rows as input, maps these rows into a multidimensional access structure (a
hash table) based on PBY and DBY specifications in the spreadsheet clause,
then evaluates the formulas of the spreadsheet to upsert new cells or to modify
the measures in the cells, and finally returns these cells as a set of output
rows. If there are reference spreadsheets specified in the spreadsheet clause,
the spreadsheet operator takes input rows for each reference spreadsheet from
their respective query blocks and builds a hash table on each of them so that
they can be “referenced” during formula evaluation. The hash tables created
for reference spreadsheets are created as read-only and they are discarded at
the end of spreadsheet computation.
   We elaborate on the evaluation steps of the spreadsheet operator below.

6.1 Access Structure
For efficient access to single cells (like s[p = ‘dvd’, t = 2000]), we build a two-
level hash access structure. In the first level, called the hash partition, data is
hash-partitioned on the PBY columns. Please note that data for more than one
spreadsheet partition may end up in the same hash partition. In the second
level, a hash table is built on the PBY and DBY columns within each first-level
partition; hence multiple spreadsheet partitions exist within a hash partition.
We use the term partitioning phase to describe splitting the data into hash
partitions (the first phase).
   This two-level scheme enables us to evaluate spreadsheets efficiently and to
reduce the memory requirement as well. Memory required at any point during
spreadsheet execution is equal to the size of the hash partition being operated
on at that time. To minimize the size and build time of the hash access structure,
we build the access structure only on rows required by the formulas as defined
by the spreadsheet bounding rectangle (see Section 4). The number of hash
partitions is chosen based on the estimated data size and the amount of memory
available. The goal is to have the largest hash partition fit in memory.
   After the partitioning phase, we go to the formula execution phase. In this
phase, spreadsheet formulas are evaluated one spreadsheet partition at a time.
We pin a hash partition in memory. As it can contain more than one spread-
sheet partition, we consider one spreadsheet partition at a time and evaluate
formulas within it. We repeat this for all the spreadsheet partitions of the hash
partition and then move on to the next hash partition. In some cases, as explained
later, we are able to evaluate all formulas within a hash partition at once for
better performance.
   In most cases, a hash partition fits in memory, resulting in efficient evalua-
tion of the formulas. There are situations, such as data skew or shortage of run-
time memory, when memory is not sufficient to hold some hash partitions. When
this happens, we build a disk-based hash table. It employs techniques such as
a weighted LRU scheme for block replacement, pointer swizzling to make ref-
erences lightweight, and write-back of only those disk blocks that are dirty.
   To overlap computation and I/O, we use asynchronous reads and writes when-
ever possible during the construction and use of the hash access structure.
During the partitioning phase, we issue asynchronous writes of full blocks to
free them for new data. Similarly, asynchronous reads are issued during scan
operations.
   The hash access structure supports operations such as probe, update, upsert,
insert, and scans. A scan operation can return all records matching a given
DBY key or return records within a hash or spreadsheet partition. As a part of
these scan operations, the hash access structure also allows the current record
to be updated. Update of the current row being scanned is very useful while
evaluating existential formulas. By doing so, we avoid the additional lookup
needed for performing the update.
   Collisions occur when records with different PBY and DBY keys get mapped
to the same hash bucket. We handle collisions by chaining the colliding records
in the hash bucket. Collisions degrade performance of the lookup operation.
We try to reduce collisions by sizing the hash table in spreadsheet partitions
to have N times (N = 2 by default) as many buckets as the number of records
in the hash partition. We count the number of records within a hash partition
during the initial partitioning step and use that number to size the hash table
within the hash partition. Records within a hash bucket are clustered based on
key values, thereby making scans of records with the same key value efficient.
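   A single-cell probe against this structure can be sketched in the same style as
the algorithms below; the helper names (hash, matches) are illustrative and not
the engine's actual routines:
   Probe(pby_values, dby_values)
   {
      hp = hash_partitions[ hash(pby_values) mod num_hash_partitions ];
      b  = hp.buckets[ hash(pby_values, dby_values) mod hp.num_buckets ];
      FOR each record rec chained in bucket b
         IF rec matches (pby_values, dby_values)
            RETURN rec;        -- cell found
      RETURN null;             -- cell absent; an UPSERT formula may create it
   }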
   We now describe the execution algorithms used in evaluating the spreadsheet
queries.

6.2 Execution
Formulas in SQL Spreadsheet operate in automatic order or sequential order.
Figure 2 classifies the spreadsheet based on the evaluation order and depen-
dency analysis and identifies the execution algorithm. There are three algo-
rithms: Auto-Acyclic, Auto-Cyclic, and Sequential.
  6.2.1 Automatic Order. The order of evaluation of formulas in an auto-
matic order spreadsheet is given by their dependencies (see Section 5.1). We
have two methods for its execution.
                  Fig. 2. Classification of spreadsheet evaluation.


  6.2.1.1 Auto-Acyclic Algorithm. The Auto-Acyclic algorithm is used when
there are no cycles detected in the formula dependency graph:
   Auto-Acyclic()
   {
      FOR each spreadsheet partition P
      {
         FOR level Li from L1 to Ln
         {
            /* LSi = set of formulas in Li with single cell refs
                     on the left side
               LEi = set of formulas in Li with existential
                     conditions on the left side
               First, evaluate all aggregates in set LSi, then all
               formulas in that set
            */
            FOR each record r in P               -- (Scan I)
               FOR each aggregate A in LSi
                  apply r to A;
            FOR each formula F in LSi
               evaluate F;

            /* Evaluate all formulas in LEi */
            FOR each record r in P               -- (Scan II)
            {
               find formulas EF in LEi to be evaluated for r
               FOR each record r2 in P           -- (Scan III)
                  FOR each aggregate A in EF
                     apply r2 to A;
               FOR each formula f in EF
                  evaluate f;
            }
         }
      }
   }
   Notice that all the aggregates at any level are computed before evaluation
of formulas at that level so they are available for the formulas. This requires a
scan of records in the partition for each level. In the absence of existential for-
mulas, and the presence of only those aggregate functions for which an inverse
is defined (for example, SUM, COUNT, etc.), the aggregates for all the levels
are computed in a single scan. With each formula we store a list of aggregates
dependent on the cell being upserted (or updated) by it. It is possible to deter-
mine such a list because there are only single cell references on the left side
and these values can be substituted in the aggregate cell reference predicate to
find the dependent formulas. So, if a formula changes the value of its measure,
the corresponding dependent aggregates are updated by applying the current
value and inverse of the old value of the measure. In the above algorithm, we
can also combine Scan I with Scan II or Scan III.
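   The single-scan variant for invertible aggregates can be sketched as follows;
the names are illustrative only and simply restate the description above:
   OnCellUpdate(cell c, old_value, new_value)
   {
      FOR each aggregate A in the list of aggregates dependent on c
      {
         apply the inverse of old_value to A;   -- e.g., for SUM: A := A - old_value
         apply new_value to A;                  -- e.g., for SUM: A := A + new_value
      }
   }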
   An example of an acyclic spreadsheet:
   SELECT r, p, t, s
   FROM f
   SPREADSHEET PBY(r) DBY (p, t) MEA (s)
   (
     s[‘tv’, 2002] =s[‘tv’, 2001] * 1.1,
     s[‘vcr’,2002] =s[‘vcr’, 1998] + s[‘vcr’, 1999],
     s[‘dvd’,2002] =(s[‘dvd’,1997]+s[‘dvd’,1998])/2,
     s[*, 2003] =s[cv(p), 2002] * 1.2
   )
   The above query makes sales forecasts for years 2002 and 2003. The formulas
are split into two levels. The first level consists of the first three formulas, pro-
jecting sales for 2002, and the second level, dependent on the first level, consists
of the last formula, projecting sales for 2003. The Auto-Acyclic algorithm eval-
uates formulas in the first level before evaluating formulas in the second level.


   6.2.1.2 Auto-Cyclic Algorithm. There are also automatic order spread-
sheets which are either cyclic, or have complex predicates that make the exis-
tence of cycles indeterminate. In such cases (see Section 5.1), the dependency
analysis approximately groups the formulas into levels by finding sets of for-
mulas comprising strongly connected components (SCCs), and assigning the
formulas in an SCC to consecutive levels. The Auto-Cyclic algorithm evalu-
ates formulas that are not contained in SCCs as in the acyclic case, but when
formulas in SCCs are encountered, it iterates over the consecutive SCC formu-
las until a fixed point is reached, but only up to a maximum of N iterations
where N = number of cells upserted (or updated) in the first iteration. If the
spreadsheet was actually acyclic, the formulas will converge after at most N
iterations. In the worst case, if the formulas were evaluated in exactly the oppo-
site order of (real) dependency, each iteration will propagate one correct value
to another formula, hence requiring N iterations. Therefore, to evaluate all
acyclic spreadsheets which could not be classified as acyclic and limit the num-
ber of iterations for cyclic spreadsheets, the maximum number of iterations for
evaluation of formulas is fixed at N . If the spreadsheet does not converge in
N iterations, an error is returned to the user. To determine if the spreadsheet
has converged after an iteration, a flag is stored with the measure that is set
whenever the measure is referenced while evaluating a formula. Later, an up-
date of a measure, which has the flag set, to a different value indicates that
additional iterations are required to reach a fixed point. Similarly, the inser-
tion of a new cell (by an UPSERT formula) signals additional iterations. This
technique requires resetting flags for each measure after each iteration—an
expensive proposition. Hence, instead of a single flag, two flags are stored, each
one being used in alternate iterations—as one of the flags is set, the other one
can be cleared.
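   The convergence test with alternating flags can be sketched as follows; this is
assumption-level pseudocode that restates the description above, not the actual
implementation:
   Auto-Cyclic-SCC(F, N)            -- F: formulas of one SCC, N: iteration bound
   {
      cur = 0;                      -- flag used to mark references this iteration
      FOR iteration FROM 1 TO N
      {
         more = false;
         FOR each formula f in F
         {
            set flag[cur] on every measure referenced by f's right side;
            IF f upserts a new cell, OR the target measure has its flag set
                  AND the new value differs from the old one
               more = true;
            apply f's assignment;
         }
         IF NOT more RETURN;        -- fixed point reached
         cur = 1 - cur;             -- switch flags; the other flag is cleared lazily
      }
      ERROR;                        -- no convergence within N iterations
   }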

   6.2.2 Sequential Order. In a sequential order spreadsheet, formulas are
evaluated in the order they appear in the spreadsheet clause. The dependency
analysis still groups the formulas into levels consisting of independent formu-
las so that the number of scans required for the computation of aggregate func-
tions is minimized. The algorithm is similar to Auto-Acyclic, but there may
be multiple iterations as specified in the ITERATE spreadsheet processing
option.

6.3 Parallel Execution of SQL Spreadsheet
To improve the scalability of spreadsheet evaluation, formulas can be evalu-
ated in parallel for different partitions. The technique for parallel evaluation
of spreadsheet queries is covered in Witkowski et al. [2003] in its entirety and
we omit it here for lack of space.

7. PARAMETERIZING SQL SPREADSHEET

7.1 Parameterization of the SQL Query Block
Oracle users have two ways of abstracting and persisting complex
computations—ANSI SQL views and functions. A disadvantage of views is that
data cannot be passed to them. Thus, computation is always on a fixed set of
objects specified in the FROM clause of the view query. Allowing the view to be
parameterized by making it possible to pass subqueries and scalars to it would
significantly expand its capabilities as a computational object.
   Functions, which can return row-sets and hence participate in further SQL
processing, are the other way of expressing complex computations. Users have
multiple implementation languages to choose from (C, Java, PLSQL), with
Oracle PL/SQL being the most common. Functions implemented in PL/SQL
can use a mix of imperative and declarative SQL styles of programming, but
this flexibility comes at the expense of suboptimal plans. An SQL query Q
invoked from a procedural PL/SQL function F is optimized in isolation and
does not participate in interquery optimization. For example, predicates out-
side of F cannot be pushed into Q and Q is not merged with queries invoking
F , etc.
   To alleviate these disadvantages, we propose to express functions declara-
tively with SQL. An SQL-language function is a function whose body is an SQL
query. Its parameters can either be scalars or subqueries producing row-sets.
We support two types of SQL-language functions: strongly typed, where the
type checking is done at the function creation time, and weakly typed, where
type checking is deferred to invocation time. For example,
   CREATE FUNCTION region_sales_2002
     (f TABLE OF ROW (r VARCHAR, p VARCHAR, t INT, s NUMBER),
     region VARCHAR)
   RETURN MULTISET LANGUAGE SQL AS
   SELECT r, p, t, s
   FROM f param f
   WHERE r = region
   SPREADSHEET PBY(r) DBY (p, t) MEA (s)
   (
     s[‘vcr’,2002] =s[‘vcr’, 1998] + s[‘vcr’, 1999],
     s[‘dvd’,2002] =avg(s)[‘dvd’, 1990 < t < 2001],
     s[*, 2003] =s[cv(p), 2002] * 1.2
   )
defines a strongly typed SQL-language function with two parameters: a sub-
query f and a scalar region. The subquery parameter is defined using the TABLE
OF ROW clause describing f ’s schema. The resulting type of the function is de-
rived from the SELECT list of the query and in this case is the same as input
parameter f. This type can also be specified using the TABLE OF ROW clause in
the RETURN subclause.
   The subclause RETURN MULTISET LANGUAGE SQL indicates that the function
produces a row-set and is implemented in SQL.
   The function designer may not know in advance the data types of the pa-
rameters, and for this case we provide weakly typed functions where a re-
served type ANYTYPE delays type checking till the invocation time. For example,
region_sales_2002 can be weakly defined as
   CREATE FUNCTION region_sales_2002
     (f TABLE OF ROW (r ANYTYPE,p ANYTYPE,t ANYTYPE,s ANYTYPE),
      region ANYTYPE)
   RETURN MULTISET LANGUAGE SQL AS <...>
  SQL-language functions are invoked by placing them in the FROM clauses of
queries. For example, the following query
  SELECT r, p, t, s
  FROM region_sales_2002
       (
         (SELECT reg, prod, time, sale FROM t), ‘west’
       )
  WHERE p = ‘tv’;
invokes the region_sales_2002 function and passes it a subquery and a scalar. The
entire query is expanded by our view expansion to
  SELECT r, p, t, s
  FROM
  (
    SELECT r, p, t, s
    FROM (SELECT reg r, prod p, time t, sale s FROM t)
    WHERE r = ‘west’
    SPREADSHEET PBY(r) DBY (p, t) MEA (s)
    (
      s[‘vcr’,2002] = s[‘vcr’, 1998] + s[‘vcr’, 1999],
      s[‘dvd’,2002] = avg(s)[‘dvd’, 1990 < t < 2001],
      s[*, 2003] = s[cv(p), 2002] * 1.2
    )
  )
  WHERE p = ‘tv’;
   Following that, the dynamic optimizations described in Section 5 are applied,
resulting in pruning the first two rules, rewriting the third one, pushing pred-
icate p=‘tv’ inside, and pushing the predicate t IN(2003, 2002) derived from
the bounding rectangle analysis into the query block. This results in
  SELECT r, p, t, s
  FROM f
  WHERE r = ‘west’ AND p = ‘tv’
    AND t IN (2003, 2002)
  SPREADSHEET PBY(r) DBY (p, t) MEA (s)
  (
    s[‘tv’,2003] =s[cv(p), 2002] * 1.2
  )
   Observe that the resulting query benefited greatly from interquery optimiza-
tion, a feature not available in functions implemented procedurally, such as in
C or in PL/SQL.
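   For completeness, the weakly typed variant of region_sales_2002 is invoked in
exactly the same way, with the deferred type check happening at this point. The
sketch below is ours and assumes a hypothetical table t2(region_cd, prod_cd, yr,
amt) whose column names and types differ from those of t; its columns are matched
to (r, p, t, s) positionally, as in the expansion shown above:
   SELECT r, p, t, s
   FROM region_sales_2002
        (
          (SELECT region_cd, prod_cd, yr, amt FROM t2), ‘east’
        );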

7.2 Parameterization of the SQL Spreadsheet Clause
Parameterization of SQL-language functions allows us to build SQL mod-
els, which preserve spreadsheet optimizations without knowing object names
and schemas of user applications. However, it does not provide a framework
for building user-defined functions using SQL Spreadsheet’s most potent con-
structs: representing a relation as an array and defining formulas on it.
   For this, we extend the concept of functions whose bodies are SQL queries to
procedures whose bodies contain SQL Spreadsheet clauses. This is useful for
implementing functions present in classic spreadsheets, like net present value
(NPV or npv), that are not present in ANSI SQL.
   The SQL Spreadsheet procedure is a function whose body contains SQL
Spreadsheet formulas and whose parameters are scalars and multidimen-
sional, multimeasure arrays. The arrays can be declared as input, output, or
input/output parameters denoted by IN, OUT, or INOUT following the Oracle
PL/SQL convention. The subscript of the array is always an IN parameter. Like
SQL-language functions, the declaration of arrays allows for strong and weak
types.
   For example, the SQL Spreadsheet procedure
   CREATE PROCEDURE net_present_value
     (ARRAY DBY (i IN INTEGER)
            MEA (amount IN NUMBER, npv OUT NUMBER),
      rate NUMBER)
   LANGUAGE SQL SPREADSHEET AS RULES IGNORE NAV
   (
     npv[1] = amount[1],
     npv[i > 1] ORDER BY i
         = amount[CV(i)]/POWER(1+rate, CV(i)) + npv[CV(i) - 1]
   )
calculates the net present value npv of amount for sequential time periods i
based on

                            amount_i / (1 + rate)^i .
   Observe that, in the SQL formulation, the summation operator is replaced
by looping over the values of the array in order (npv[i > 1] ORDER BY i) and
adding the previously calculated NPV value (npv[CV(i) - 1]) to the one currently
computed.
   The function accepts two parameters: an array dimensioned by an integer
i with two measures, amount (an IN parameter) and npv (an OUT parameter),
and a scalar parameter rate. The body of the function is the RULES subclause
of SQL Spreadsheet, which implements the net present value given above.
   The SQL Spreadsheet procedure is invoked from the SQL Spreadsheet
clause. The invoker maps the rectangular regions of the main or reference
spreadsheet to the actual array parameters of the function. These regions are
defined using predicates on the DBY columns of the spreadsheet and then
mapped to arrays indicating which columns of the regions form indexes and
measures of the array.
   We explain this with an example. Consider a relational table cash_flow(year,
period, prod, amount) expressing a cash flow for electronic products in years
1999–2002. Years are assigned sequential time periods 1–4—see Table IV. This
analysis is made from the time perspective of the first day of 1999. For each
product, there is an initial negative cash flow at the end of 1999 representing the

                             Table IV. Cash Flow Table
                   Year   Period i   Prod    Amount      Npv Result
                   1999      1        vcr    −100.00      −100.00
                   2000      2        vcr      12.00       −90.70
                   2001      3        vcr      10.00       −84.01
                   2002      4        vcr      20.00       −72.17
                   1999      1       dvd     −200.00      −200.00
                   2000      2       dvd       22.00      −183.07
                   2001      3       dvd       12.00      −174.97
                   2002      4       dvd       14.00      −166.68


investment in products. The later years have positive cash flows representing
the sales of products.
   Assume that i and prod form the DBY clause of this SQL Spreadsheet:
  SPREADSHEET DBY (prod,i) MEA (year,amount,0 npv) ()
   The (amount, npv) [‘vcr’, *] designates a rectangular region with two
measures amount and npv within that spreadsheet. The first dimension in this
rectangle is qualified by a constant. Hence, we can map it to a one-dimensional
array dimensioned by i with two measures using the SQL CAST operator:
  CAST ((amount, npv)[‘vcr’, *] AS
         ARRAY DBY (i IN INTEGER)
               MEA (amount IN NUMBER,npv OUT NUMBER))
   A default casting is also provided. If the region fits the shape of the array,
the CAST operator is not needed. In the (amount, npv) [‘vcr’, *] region, prod
dimension is qualified to be a constant while the i dimension is unqualified.
This can, by default, be mapped to a one-dimensional array.
   Casting operations may be expensive if the array shape is not compatible
with the spreadsheet frame. In this case, we build another random access hash
structure for the array during runtime. If the array shape is compatible with the
spreadsheet frame, we reuse the hash access structure of the spreadsheet. In
our example, the (amount, npv) [‘vcr’, *] region can reuse the spreadsheet
access structure, increasing the efficiency of the computation.
   To calculate the net present value of ‘vcr’ and ‘dvd’ products, one would
then write
  SELECT year, i, prod, amount, npv
  FROM cash_flow
  SPREADSHEET DBY (prod, i) MEA (year, amount, NULL npv)
  (
    net_present_value((amount,npv)[‘vcr’,*], 0.14),
    net_present_value((amount,npv)[‘dvd’,*], 0.14)
  )
This is then expanded to the equivalent form of
  SELECT year, i, prod, amount, npv
  FROM cash_flow
   SPREADSHEET DBY(prod,i) MEA(year,amount,null npv)
   IGNORE NAV
   (
     npv[‘vcr’, 1] = amount[‘vcr’, 1],
     npv[‘vcr’, i > 1] ORDER BY i
      = amount[CV(prod), CV(i)]/POWER(1+rate, CV(i))
      + npv[CV(prod), CV(i) - 1],
     npv[‘dvd’, 1] = amount[‘dvd’, 1],
     npv[‘dvd’, i > 1] ORDER BY i
      = amount[CV(prod), CV(i)]/POWER(1+rate, CV(i))
      + npv[CV(prod), CV(i) - 1]
   )
   The equivalent form is then subject to all the optimizations described in
Section 5. In the above case, the bounding rectangle analysis will push the predicate
prod IN (‘vcr’, ‘dvd’) into the WHERE clause of the query block. This would
not be possible (or would be too hard) if the net_present_value function were
implemented using a procedural language.
   The output of the query, using the amounts in Table IV with an annual interest
rate of 14%, is shown in the “Npv Result” column of that table.

8. EXPERIMENTAL RESULTS
We conducted experiments on the APB benchmark database2 populated with
0.1 density data. The APB schema has a fact table with 4 hierarchical dimen-
sions: channel with two levels, time with three levels, customer with three
levels, and product with seven levels. We constructed a cube over the fact table
and materialized it in the apb_cube table. Like the fact table, the cube has four
dimensions—time (t), product (p), customer (c), and channel (h)—each represented
as a single column with all hierarchical levels encoded into a single value. The cube
had bitmap indexes on the dimension columns and had 22,721,998 rows. The
experiments were conducted on a 12 CPU, 336-MHz, shared memory machine
with a total of 12 GB of memory. The experiments report units of time rather
than absolute time measures like seconds, as they were done on a commercial
prototype still undergoing tuning.

8.1 Pushing Predicates Experiment
We used a spreadsheet query calculating the ratio of sales for every product
level to its first, second, and third parents in the product hierarchy. The APB
product hierarchy has seven levels: prod, class, group, family, line, division, and
top. Thus, for a product in the prod level, we calculated the share of its sales
relative to its corresponding class, group, and family levels. Assuming that the
parent information of a product was stored in a dimension table product_dt
with columns p, parent1, parent2, parent3 (product, its parent, grandparent,

2 APB benchmark specifications. Go online to http://www.olapcouncil.org/research/APB1R2
spec.pdf.

                            Fig. 3. Pushing predicates.


and great-grandparent, respectively), the query had the form
  Q3:
  SELECT
     s, share_1, share_2, share_3, p, c, h, t
  FROM apb_cube
  SPREADSHEET
     REFERENCE ON
       (SELECT p, parent1, parent2, parent3 FROM product_dt)
        DBY (p) MEA (parent1, parent2, parent3)
     PBY (c,h,t) DBY (p)
     MEA (s, 0 share_1, 0 share_2, 0 share_3) RULES UPDATE
     (
       F1: share_1[*] = s[cv(p)] / s[parent1[cv(p)]],
       F2: share_2[*] = s[cv(p)] / s[parent2[cv(p)]],
       F3: share_3[*] = s[cv(p)] / s[parent3[cv(p)]]
     )
   A hypothetical user indicates products of interest via a predicate on p in the
outer query. We studied three algorithms (sub-query, extended-pushing, and
formula-unfolding) for pushing predicates by changing the selectivity (fraction
of rows selected) of the predicate.
   As shown in Figure 3, we observed a 5 to 20 times improvement in the
query response time (serial execution) by pushing predicates as compared
to not pushing them at all. In general, the improvement can be arbitrarily
large. The extended-pushing and formula-unfolding algorithms performed al-
most identically, as expected, and their response times were predictable. The
sub-query pushing algorithm offered a surprise, as the response time curve
was not smooth. For low selectivity of the predicates (up to 0.006), the opti-
mizer chose a nested-loop join between the subquery and apb_cube (see the
sub-query-nested loop curve). This was not the optimal choice and caused lin-
ear degradation in performance up to three times over the extended-pushing
                                 Fig. 4. Optimization of aggregates.

method. Beyond the 0.006 selectivity, the optimizer switched to the better hash
join. Even when we forced the optimizer to always choose a hash join between the
subquery and apb_cube (see the sub-query-forced hash curve), the response time of
the subquery method remained about 20% worse than that of extended-pushing over
the entire range of investigated selectivity values.

8.2 Optimization of Aggregates Experiment
We evaluated the performance of transforming relative aggregates into their
corresponding window aggregates using an example query that computes a moving
average over the past 100 months:
   SPREADSHEET PBY(h, c, p) DBY(t) MEA(s, 0 r)
   (
     r[*]= avg(s)[cv() - 100 <= t <= cv() - 1]
   )
   The aggregate above can be transformed to
   AVG(s) OVER (ORDER BY t RANGE
                BETWEEN 100 PRECEDING AND 1 PRECEDING)
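Inside the spreadsheet, the PBY columns already delimit the partition, which is
why they do not appear in the OVER clause above. Written as a stand-alone ANSI
SQL query over the cube table (a sketch of ours, assuming apb_cube exposes the
columns h, c, p, t, and s used in Section 8), the rewrite would carry them
explicitly:
   SELECT h, c, p, t, s,
          AVG(s) OVER (PARTITION BY h, c, p ORDER BY t
                       RANGE BETWEEN 100 PRECEDING AND 1 PRECEDING) AS r
   FROM apb_cube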
   The average aggregate operated within a partition based on channel, cus-
tomer, and product. We kept the size of the input data constant, but varied the
number of months per partition. We compared the performance of the trans-
formed formula—see the solid line in Figure 4—to the untransformed formula
that used naive execution. The naive execution evaluated the aggregate as
many times as the cardinality of the partition—see the dashed line in Figure 4.
As expected, the performance of untransformed aggregation degraded linearly
with the increasing cardinality of partitions.

8.3 Qualified Aggregates Experiment
We support two ways of computing aggregates for discrete dimensions. The
set to be aggregated can be explicitly enumerated or it can be expressed as a
condition on the dimension. The first formulation, called qualified aggregates,
involves direct access to the spreadsheet cells, while the second involves a scan
of the partition. Here we show the tradeoffs between the two formulations.
                Fig. 5. Aggregate using scan versus qualified aggregates.

   Consider an average of N time periods. Using the qualified aggregate formu-
lation, the computation can be expressed as
  SPREADSHEET PBY(h, c, p) DBY(t) MEA(s, 0 r)
  (
     r[1]= avg(s)[FOR t FROM CV() TO CV() + N]
  )
In the second (which involves a scan), it can be expressed as
  SPREADSHEET PBY(h, c, p) DBY(t) MEA(s, 0 r)
  (
     r[1]= avg(s)[CV() <= t <= CV() + N]
  )
   The first formulation performed better when the number of cells accessed was
a small fraction of the partition, as shown in Figure 5. We kept the size of the
partition constant and varied N , which in the figure is expressed as a percentage
of a partition. Observe that there was a significant range, up to 18% of the
partition size, where qualified aggregates outperformed aggregates computed
with a scan.
   This shows a need for an optimization where, for discrete dimensions, we
automatically choose which form of the aggregate computation is most efficient.
We plan to include this in a future project.
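   Such a choice can be driven by a simple threshold test at optimization time.
The sketch below merely restates the observation from Figure 5; the threshold is
specific to this experiment, not a tuned constant:
   -- choosing the aggregate formulation for a discrete dimension
   IF N <= f * partition_cardinality        -- f was roughly 0.18 in Figure 5
      use the qualified (FOR-loop, point-lookup) formulation
   ELSE
      use the scan-based (condition) formulation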

8.4 Hash-Join Versus SQL Spreadsheet Experiment
Many SQL Spreadsheet operations can be expressed with standard ANSI SQL
using joins and UNIONs. For example, query Q3 discussed earlier can be ex-
pressed using three self-joins of apb_cube and a join to product_dt:
  SELECT
      s, a1.s/a2.s AS share_1, a1.s/a3.s AS share_2,
      a1.s/a4.s AS share_3, p, c, h, t
  FROM
      apb_cube a1, apb_cube a2 (+), apb_cube a3 (+),
      apb_cube a4 (+), product_dt p (+)
  WHERE
    a1.p=p.p &
    a2.p=p.parent1 & a2.c=a1.c & a2.h=a1.h & a2.t=a1.t &
    a3.p=p.parent2 & a3.c=a1.c & a3.h=a1.h & a3.t=a1.t &
    a4.p=p.parent3 & a4.c=a1.c & a4.h=a1.h & a4.t=a1.t

                Fig. 6. Hash join versus SQL Spreadsheet as a function of #rules.

                         Fig. 7. Scalability with number of formulas.

   The number of self-joins is equal to the number of formulas (say N ), and all joins
to the original apb cube (a1) are right outer joins (the right side of outer joins
is marked with (+) in the FROM list). For hash joins, this requires construc-
tion of N hash tables, while our SQL Spreadsheet needs only one hash access
structure per spreadsheet. Consequently there is a breakeven point Ni, when
the cost of the spreadsheet access structure is amortized, and SQL Spread-
sheet outperforms the ANSI hash-join formulation, as shown in Figure 6. In
the above query, Ni is 3 (i.e., three rules). Above 14 rules, spreadsheet execution
is twice as fast as that using joins. In the experiment, joins and spreadsheet
were processed serially and the access structures for both fit in memory.

8.5 Access Method—Hash Table
We tested the scalability of our execution methods as a function of the number
of formulas and memory available for the hash structure.
   Figure 7 shows almost linear scalability between the response time of a
spreadsheet and the number of formulas. Each formula came from query Q3,
discussed earlier, and simulated a double join apb_cube >< product_dt ><
apb_cube. In the experiment, the physical memory was large enough to accom-
modate every individual partition of apb_cube, which in our case was a
maximum of 15 MB—about 20% of the cube.
   Figure 8 shows the performance of our access structure as a function of
available memory. The memory size is expressed as a percentage of the size
required to fit the largest partition of data in the hash access structure in
                    Fig. 8. Scalability with size of physical memory.

physical memory. Recall from Section 6 that we first partition the data on the
PBY columns, and process one partition at a time to execute the formulas. In
the experiment, we executed a single formula, F1, from query Q3:
  F1: share_1[*] = s[cv(p)] / s[parent1[cv(p)]]
   The formula accesses, within each PBY (c,h,t) partition, sales for a product
and its parent. If a partition does not fit in memory we incur an I/O if a refer-
enced cell is not cached. In a severe case of memory shortage, each reference
may be a cache miss, reducing our access method to an uncached, nested-loop
join. In the case of formula F1, which references a product and its parent, this
occurs when the available memory is less than 30% of the largest partition—
see Figure 8. Thus our method works very well and outperforms equivalent
simulations of formulas with joins (for hash, sort, and nested-loop join meth-
ods) when the PBY partitions fit in memory, as in those cases we reduce the
number of required joins. Note that the equivalent simulations must perform
apb_cube >< product_dt >< apb_cube, while with Spreadsheet we effectively
build an access structure for only one join, apb_cube >< product_dt. For ex-
treme cases of memory shortage, we degrade to the equivalent performance of
simulation with nested-loop joins. Observe that in these cases, hash join simu-
lations would not perform better as they would need to spill all of their data to
disk.

9. CONCLUSIONS
This article extends SQL with a computational clause that allows us to treat a
relation as a multidimensional array and specify a set of formulas over it. The
formulas replace multiple joins and UNION operations that must be performed
for equivalent computations with current ANSI SQL. This not only allows for
ease of programming, but also offers the RDBMSs an opportunity to perform
better optimizations, as there are fewer complex query blocks to optimize—an
Achilles heel of many RDBMSs. We also create a single runtime access structure
which replaces the multiple hash or sort structures needed for equivalent joins
and UNIONs. Our intent is an eventual migration of certain classes of com-
putations from classical spreadsheets into the RDBMS. Such migration would
offer an unprecedented integration of business models, which are currently dis-
tributed among thousands of incompatible and incomparable spreadsheets. In
our model, the result of an SQL Spreadsheet is a relation with well-defined
semantics and can easily be compared to other SQL Spreadsheets via joins,
unions, and other relational operations. The SQL Spreadsheet can be stored
in a relational view and, hence, become known to tools through the RDBMS
catalog, thereby enhancing their cooperation.


ELECTRONIC APPENDIX
An electronic appendix with an explanation of parallel execution of SQL Spread-
sheets and experimental results is available in the ACM Digital Library.


REFERENCES

BALMIN, A., PAPADIMITRIOU, T., AND PAPAKONSTANTINOU, Y. 2000. Hypothetical queries in an OLAP
  environment. In Proceedings of the 26th International Conference on Very Large Data Bases
  (Cairo, Egypt). 220–231.
BECKMANN, N., KRIEGEL, H. P., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and
  robust access method for points and rectangles. In Proceedings of the ACM SIGMOD International
  Conference on Management of Data (Atlantic City, NJ). 322–331.
BELLO, R. G., ET AL. 1998. Materialized views in Oracle. In Proceedings of the 24th International
  Conference on Very Large Data Bases (New York, NY). 659–664.
BLAKELEY, J. A., LARSON, P., AND TOMPA, F. W. 1986. Efficiently updating materialized views. In
  Proceedings of the ACM SIGMOD International Conference on Management of Data (Washington,
  DC). 61–71.
BLATTNER, P. 1999. Microsoft Excel Functions in Practice. Que Publishing, Indianapolis, IN.
GRAY, J., BOSWORTH, A., LAYMAN, A., AND PIRAHESH, H. 1996. Data cube: A relational operator
  generalizing group-by, cross tab and sub-totals. In Proceedings of the International Conference
  on Data Engineering (New Orleans, LA). 152–159.
GUPTA, A., MUMICK, I. S., AND SUBRAHMANIAN, V. S. 1993. Maintaining views incrementally. In
  Proceedings of the ACM SIGMOD International Conference on Management of Data (Washington,
  DC). 157–166.
GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of
  the ACM SIGMOD International Conference on Management of Data (Boston, MA). 47–57.
HOWSON, C. 2002. Business Objects: The Complete Reference. McGraw-Hill/Osborne, New York,
  NY.
LAKSHMANAN, L., PEI, J., AND ZHAO, Y. 2003. QC-trees: An efficient summary structure for semantic
  OLAP. In Proceedings of the ACM SIGMOD International Conference on Management of Data
  (San Diego, CA). 64–75.
LEVY, A. Y., MUMICK, I. S., AND SAGIV, Y. 1994. Query optimization by predicate move-around. In
  Proceedings of the 20th International Conference on Very Large Data Bases (Santiago, Chile).
  96–107.
MUMICK, I. S., FINKELSTEIN, S., PIRAHESH, H., AND RAMAKRISHNAN, R. 1990. Magic is relevant. In
  Proceedings of the ACM SIGMOD International Conference on Management of Data (Atlantic
  City, NJ). 247–258.
OLAP Application Developer’s Guide. 2004. Oracle Database 10g Release 1 (10.1) Documentation.
  Oracle, Redwood Shores, CA.
PETERSON, T. AND PINKELMAN, J. 2000. Microsoft OLAP Unleashed. SAMS Publishing, Indianapolis,
  IN.
SIMON, J. 2000. Excel 2000 in a Nutshell. O’Reilly & Associates, Sebastopol, CA.
SISMANIS, Y., ROUSSOPOULOS, N., DELIGIANNAKIS, A., AND KOTIDIS, Y. 2002. Dwarf: Shrinking the
  petacube. In Proceedings of the ACM SIGMOD International Conference on Management of Data
  (Madison, WI). 464–475.
SRIVASTAVA, D. AND RAMAKRISHNAN, R. 1992. Pushing constraint selections. In Proceedings of the
  Eleventh Symposium on Principles of Database Systems (PODS) (San Diego, CA). 301–315.
TARJAN, R. 1972. Depth-first search and linear graph algorithms. SIAM J. Comput. 1, 2, 146–160.

THOMSEN, E., SPOFFORD, G., AND CHASE, D. 1999. Microsoft OLAP Solutions. John Wiley & Sons, New
  York, NY.
WITKOWSKI, A., BELLAMKONDA, B., BOZKAYA, T., DORMAN, G., FOLKERT, N., GUPTA, A., SHENG, L., AND
  SUBRAMANIAN, S. 2003. Spreadsheets in RDBMS for OLAP. In Proceedings of the ACM SIGMOD
  International Conference on Management of Data (San Diego, CA). 52–63.
ZEMKE, F., KULKARNI, K., WITKOWSKI, A., AND LYLE, B. 1999. Introduction to OLAP functions. Change
  proposal. ANS-NCTS H2-99-14 (April).

Received November 2003; revised May 2004; accepted August 2004




TinyDB: An Acquisitional Query Processing
System for Sensor Networks
SAMUEL R. MADDEN
Massachusetts Institute of Technology
MICHAEL J. FRANKLIN and JOSEPH M. HELLERSTEIN
University of California, Berkeley
and
WEI HONG
Intel Research


We discuss the design of an acquisitional query processor for data collection in sensor networks. Ac-
quisitional issues are those that pertain to where, when, and how often data is physically acquired
(sampled) and delivered to query processing operators. By focusing on the locations and costs of
acquiring data, we are able to significantly reduce power consumption over traditional passive sys-
tems that assume the a priori existence of data. We discuss simple extensions to SQL for controlling
data acquisition, and show how acquisitional issues influence query optimization, dissemination,
and execution. We evaluate these issues in the context of TinyDB, a distributed query processor for
smart sensor devices, and show how acquisitional techniques can provide significant reductions in
power consumption on our sensor devices.
Categories and Subject Descriptors: H.2.3 [Database Management]: Languages—Query lan-
guages; H.2.4 [Database Management]: Systems—Distributed databases; query processing
General Terms: Experimentation, Performance
Additional Key Words and Phrases: Query processing, sensor networks, data acquisition




1. INTRODUCTION
In the past few years, smart sensor devices have matured to the point that it is
now feasible to deploy large, distributed networks of such devices [Pottie and
Kaiser 2000; Hill et al. 2000; Mainwaring et al. 2002; Cerpa et al. 2001]. Sensor
networks are differentiated from other wireless, battery-powered environments
in that they consist of tens or hundreds of autonomous nodes that operate
without human interaction (e.g., configuration of network routes, recharging

Authors’ addresses: S. R. Madden, Computer Science and Artificial Intelligence Lab, Massachusetts
Institute of Technology, Room 32-G938, 32 Vassar Street, Cambridge, MA 02139; email: maddn@
csail.mit.edu; M. J. Franklin and J. M. Hellerstein, Soda Hall, University of California, Berkeley,
Berkeley, CA 94720; email: {franklin,jmh}@cs.berkeley.edu; W. Hong, Intel Research, 2150 Shattuck
Avenue, Penthouse Suite, Berkeley, CA 94704; email: wei.hong@intel.com.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is
granted without fee provided that the copies are not made or distributed for profit or commercial
advantage, the copyright notice, the title of the publication, and its date appear, and notice is given
that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to
redistribute to lists requires prior specific permission and/or a fee.
C 2005 ACM 0362-5915/05/0300-0122 $5.00


ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 122–173.
of batteries, or tuning of parameters) for weeks or months at a time. Further-
more, sensor networks are often embedded into some (possibly remote) physical
environment from which they must monitor and collect data. The long-term,
low-power nature of sensor networks, coupled with their proximity to physical
phenomena, leads to a significantly altered view of software systems compared
to more traditional mobile or distributed environments.
    In this article, we are concerned with query processing in sensor networks.
Researchers have noted the benefits of a query processor-like interface to sensor
networks and the need for sensitivity to limited power and computational re-
sources [Intanagonwiwat et al. 2000; Madden and Franklin 2002; Bonnet et al.
2001; Yao and Gehrke 2002; Madden et al. 2002a]. Prior systems, however, tend
to view query processing in sensor networks simply as a power-constrained ver-
sion of traditional query processing: given some set of data, they strive to process
that data as energy-efficiently as possible. Typical strategies include minimiz-
ing expensive communication by applying aggregation and filtering operations
inside the sensor network—strategies that are similar to push-down techniques
from distributed query processing that emphasize moving queries to data.
    In contrast, we advocate acquisitional query processing (ACQP), where we
focus not only on traditional techniques but also on the significant new query
processing opportunity that arises in sensor networks: the fact that smart sen-
sors have control over where, when, and how often data is physically acquired
(i.e., sampled) and delivered to query processing operators. By focusing on the
locations and costs of acquiring data, we are able to significantly reduce power
consumption compared to traditional passive systems that assume the a pri-
ori existence of data. Acquisitional issues arise at all levels of query process-
ing: in query optimization, due to the significant costs of sampling sensors; in
query dissemination, due to the physical colocation of sampling and process-
ing; and, most importantly, in query execution, where choices of when to sample
and which samples to process are made. We will see how techniques proposed
in other research on sensor and power-constrained query processing, such as
pushing down predicates and minimizing communication, are also important
alongside ACQP and fit comfortably within its model.
    We have designed and implemented a query processor for sensor networks
that incorporates acquisitional techniques called TinyDB (for more informa-
tion on TinyDB, see the TinyDB Home Page [Madden et al. 2003]). TinyDB
is a distributed query processor that runs on each of the nodes in a sensor
network. TinyDB runs on the Berkeley mote platform, on top of the TinyOS
[Hill et al. 2000] operating system. We chose this platform because the hard-
ware is readily available from commercial sources1 and the operating system
is relatively mature. TinyDB has many of the features of a traditional query
processor (e.g., the ability to select, join, project, and aggregate data), but, as
we will discuss in this article, also incorporates a number of other features
designed to minimize power consumption via acquisitional techniques. These
techniques, taken in aggregate, can lead to orders of magnitude improvements

1 Crossbow, Inc. Wireless sensor networks (Mica Motes). Go online to http://www.xbow.com/
Products/Wireless_Sensor_Networks.htm.

                  Fig. 1. A query and results propagating through the network.

in power consumption and increased accuracy of query results over nonacqui-
sitional systems that do not actively control when and where data is collected.
   We address a number of questions related to query processing on sensor
networks, focusing in particular on ACQP issues such as the following:
(1) When should samples for a particular query be taken?
(2) What sensor nodes have data relevant to a particular query?
(3) In what order should samples for this query be taken, and how should sam-
    pling be interleaved with other operations?
(4) Is it worth expending computational power or bandwidth to process and
    relay a particular sample?
   Of these issues, question (1) is uniquely acquisitional. We show how the re-
maining questions can be answered by adapting techniques that are similar to
those found in traditional query processing. Notions of indexing and optimiza-
tion, in particular, can be applied to answer questions (2) and (3), and question
(4) bears some similarity to issues that arise in stream processing and approx-
imate query answering. We will address each of these questions, noting the
unusual kinds of indices, optimizations, and approximations that are required
under the specific constraints posed by sensor networks.
   Figure 1 illustrates the basic architecture that we follow throughout this
article—queries are submitted at a powered PC (the basestation), parsed, op-
timized, and sent into the sensor network, where they are disseminated and
processed, with results flowing back up the routing tree that was formed as the
queries propagated. After a brief introduction to sensor networks in Section 2,
the remainder of the article discusses each of these phases of ACQP: Section 3
covers our query language, Section 4 highlights optimization issues in power-
sensitive environments, Section 5 discusses query dissemination, and, finally,
Section 6 discusses our adaptive, power-sensitive model for query execution
and result collection.

2. SENSOR NETWORK OVERVIEW
We begin with an overview of some recent sensor network deployments, and
then discuss properties of sensor nodes and sensor networks in general, provid-
ing specific numbers from our experience with TinyOS motes when possible.
   A number of recent deployments of sensors have been undertaken by the
sensor network research community for environmental monitoring purposes:
on Great Duck Island [Mainwaring et al. 2002], off the coast of Maine, at James
Reserve [Cerpa et al. 2001], in Southern California, at a vineyard in British
Columbia [Brooke and Burrell 2003], and in the Coastal Redwood Forests of
California [Madden 2003]. In these scenarios, motes collect light, temperature,
humidity, and other environmental properties. On Great Duck Island, during
the Summer of 2003, about 200 motes were placed in and around the burrows of
Storm Petrels, a kind of endangered sea bird. Scientists used them to monitor
burrow occupancy and the conditions surrounding burrows that are correlated
with birds coming or going. Other notable deployments that are underway in-
clude a network for earthquake monitoring [UC Berkeley 2001] and a network
for building infrastructure monitoring and control [Lin et al. 2002].2
   Each of these scenarios involves a large number of devices that need to last
as long as possible with little or no human intervention. Placing new devices,
or replacing or recharging batteries of devices in bird nests, earthquake test
sites, and heating and cooling ducts is time consuming and expensive. Aside
from the obvious advantages that a simple, declarative language provides over
hand-coded, embedded C, researchers are particularly interested in TinyDB’s
ability to acquire and deliver desired data while conserving as much power as
possible and satisfying desired lifetime goals.
   We have deployed TinyDB in the redwood monitoring project [Madden 2003]
described above, and are in the process of deploying it in Intel fabrication plants
to collect vibration signals that can be used for early detection of equipment
failures. Early deployments have been quite successful, producing months of
lifetime from tiny batteries with about one-half the capacity of a single AA cell.

2.1 Properties of Sensor Devices
A sensor node is a battery-powered, wireless computer. Typically, these nodes
are physically small (a few cubic centimeters) and extremely low power (a few
tens of milliwatts versus tens of watts for a typical laptop computer).3 Power is
of utmost importance. If used naively, individual nodes will deplete their energy

2 Even  in indoor infrastructure monitoring settings, there is great interest in battery powered
devices, as running power wire can cost many dollars per device.
3 Recall that 1 W (a unit of power) corresponds to power consumption of 1 J (a unit of energy) per

second. We sometimes refer to the current load of a device, because current is easy to measure
directly; note that power (in watts) = current (in amps) * voltage (in volts), and that motes run at
3 V.

                                      Fig. 2. Annotated motes.

supplies in only a few days.4 In contrast, if sensor nodes are very spartan about
power consumption, months or years of lifetime are possible. Mica motes, for
example, when operating at 2% duty cycle (between active and sleep modes)
can achieve lifetimes in the 6-month range on a pair of AA batteries. This duty
cycle limits the active time to 1.2 s/min.
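   A rough check, ignoring sleep-mode draw and using the roughly 15-mA active
current cited in footnote 4:
   average draw = 0.02 * 15 mA = 0.3 mA
   lifetime     = 2200 mAh / 0.3 mA = about 7300 h, or roughly 10 months
Sleep-mode current and periodic radio listening presumably consume the remaining
budget, which brings the practical figure down to the 6-month range quoted above.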
   There have been several generations of motes produced. Older Mica motes
have a 4-MHz, 8-bit Atmel microprocessor. Their RFM TR10005 radios run at
40 kbits/s over a single shared CSMA/CA (carrier-sense multiple-access, colli-
sion avoidance) channel. Newer Mica2 nodes use a 7 MHz processor and a radio
from ChipCon6 which runs at 38.4 kbits/s. Radio messages are variable size.
Typically about twenty 50-byte messages (the default size in TinyDB) can be
delivered per second. Like all wireless radios (but unlike a shared EtherNet,
which uses the collision detection (CD) variant of CSMA), both the RFM and
ChipCon radios are half-duplex, which means that they cannot detect collisions
because they cannot listen to their own traffic. Instead, they try to avoid col-
lisions by listening to the channel before transmitting and backing off for a
random time period when the channel is in use. A third mote, called the Mica2Dot,
has similar hardware to the Mica2 mote, but uses a slower, 4-MHz, processor. A
picture of a Mica mote and a Mica2Dot mote is shown in Figure 2. Mica motes are
visually very similar to Mica2 motes and have exactly the same form factor.
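   As a rough consistency check on the radio figures above (payload only, ignoring
packet headers, media-access backoff, and half-duplex turnaround): twenty 50-byte
messages per second is 20 * 50 * 8 = 8 kbits/s of payload, well below the
38.4-kbits/s raw channel rate.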
   Motes have an external 32-kHz clock that the TinyOS operating system can
synchronize with neighboring motes to approximately +/− 1 ms. Time syn-
chronization is important in a variety of contexts, for example: to ensure that
readings can be correlated, to schedule communication, or to coordinate the
waking and sleeping of devices.
4 At full power, a Berkeley Mica mote (see Figure 2) draws about 15 mA of current. A pair of AA
batteries provides approximately 2200 mAh of energy. Thus, the lifetime of a Mica2 mote will be
approximately 2200/15 = 146 h, or 6 days.
5 RFM Corporation. RFM TR1000 data sheet. Go online to http://www.rfm.com/products/data/tr1000.pdf.
6 ChipCon Corporation. CC1000 single chip very low power RF transceiver data sheet. Go online to
http://www.chipcon.com.

                         Fig. 3. Phases of power consumption in TinyDB.

   2.1.1 Power Consumption in Sensor Networks. Power consumption in sen-
sor nodes can be roughly decomposed into phases, which we illustrate in
Figure 3 via an annotated capture of an oscilloscope display showing current
draw (which is proportional to power consumption) on a Mica mote running
TinyDB. In “Snoozing” mode, where the node spends most of its time, the pro-
cessor and radio are idle, waiting for a timer to expire or external event to
wake the device. When the device wakes it enters the “Processing” mode, which
consumes an order of magnitude more power than snooze mode, and where
query results are generated locally. The mote then switches to a “Processing
and Receiving” mode, where results are collected from neighbors over the ra-
dio. Finally, in the “Transmitting” mode, results for the query are delivered
by the local mote—the noisy signal during this period reflects switching as
the receiver goes off and the transmitter comes on and then cycles back to a
receiver-on, transmitter-off state.
   These oscilloscope measurements do not distinguish how power is used dur-
ing the active phase of processing. To explore this breakdown, we conducted an
analytical study of the power utilization of major elements of sensor network
query processing; the results of this study are given in Appendix A. In short, we
found that in a typical data collection scenario, with relatively power-hungry
sensing hardware, about 41% of energy goes to communicating or running the
CPU while communicating, with another 58% going to the sensors or to the
CPU while sensing. The remaining 1% goes to idle-time energy consumption.

2.2 TinyOS
TinyOS consists of a set of components for managing and accessing the mote
hardware, and a “C-like” programming language called nesC. TinyOS has been
ported to a variety of hardware platforms, including UC Berkeley’s Rene, Dot,
Mica, Mica2, and Mica2Dot motes, the Blue Mote from Dust Inc.,7 and the MIT
Cricket [Priyantha et al. 2000].

7 Dust Inc. Go online to the company’s Web site. http://www.dust-inc.com.


   The major features of TinyOS are the following:
(1) a suite of software designed to simplify access to the lowest levels of hard-
    ware in an energy-efficient and contention-free way, and
(2) a programming model and the nesC language designed to promote exten-
    sibility and composition of software while maintaining a high degree of
    concurrency and energy efficiency; interested readers should refer to Gay
    et al. [2003].
   It is interesting to note that TinyOS does not provide the traditional op-
erating system features of process isolation or scheduling (there is only one
application running at a time), and does not have a kernel, protection domains,
memory manager, or multithreading. Indeed, in many ways, TinyOS is simply
a library that provides a number of convenient software abstractions, including
components to modulate packets over the radio, read sensor values for different
sensor hardware, synchronize clocks between a sender and receiver, and put the
hardware into a low-power state.
   Thus, TinyOS and nesC provide a useful set of abstractions on top of the bare
hardware. Unfortunately, they do not make it particularly easy to author soft-
ware for the kinds of data collection applications considered in the beginning
of Section 2. For example, the initial deployment of the Great Duck Island soft-
ware, where the only behavior was to periodically broadcast readings from the
same set of sensors over a single radio hop, consisted of more than 1000 lines
of embedded C code, excluding any of the custom software components written
to integrate the new kinds of sensing hardware used in the deployment. Fea-
tures such as reconfigurability, in-network processing, and multihop routing,
which are needed for long-term, energy-efficient deployments, would require
thousands of lines of additional code.
   Sensor networks will never be widely adopted if every application requires
this level of engineering effort. The declarative model we advocate reduces these
applications to a few short statements in a simple language; the acquisitional
techniques discussed allow these queries to be executed efficiently.8

2.3 Communication in Sensor Networks
Typical communication distances for low power wireless radios such as those
used in motes and Bluetooth devices range from a few feet to around 100 ft,
depending on transmission power and environmental conditions. Such short
ranges mean that almost all real deployments must make use of multihop com-
munication, where intermediate nodes relay information for their peers. On
Mica motes, all communication is broadcast. The operating system provides a
software filter so that messages can be addressed to a particular node, though if
neighbors are awake, they can still snoop on such messages (at no additional en-
ergy cost since they have already transferred the decoded message from the air).

8 The implementation of TinyDB consists of about 20,000 lines of C code, approximately 10,000 of
which are for the low-level drivers to acquire and condition readings from sensors—none of which
the end-user is expected to have to modify or even look at. Compiled, this uses 58K of the 128K
of available code space on current-generation motes.


Nodes receive per-message, link-level acknowledgments indicating whether a
message was received by the intended neighbor node. No end-to-end acknowl-
edgments are provided.
    The requirement that sensor networks be low maintenance and easy to de-
ploy means that communication topologies must be automatically discovered
(i.e., ad hoc) by the devices rather than fixed at the time of network deployment.
Typically, devices keep a short list of neighbors who they have heard transmit
recently, as well as some routing information about the connectivity of those
neighbors to the rest of the network. To assist in making intelligent routing
decisions, nodes associate a link quality with each of their neighbors.
    We describe the process of disseminating queries and collecting results in
Section 5 below. As a basic primitive in these protocols, we use a routing tree
that allows a basestation at the root of the network to disseminate a query
and collect query results. This routing tree is formed by forwarding a routing
request (a query in TinyDB) from every node in the network: the root sends a
request, all child nodes that hear this request process it and forward it on to
their children, and so on, until the entire network has heard the request.
    Each request contains a hop-count, or level indicating the distance from the
broadcaster to the root. To determine their own level, nodes pick a parent node
that is (by definition) one level closer to the root than they are. This parent will
be responsible for forwarding the node’s (and its children’s) query results to the
basestation. We note that it is possible to have several routing trees if nodes
keep track of multiple parents. This can be used to support several simultaneous
queries with different roots. This type of communication topology is common
within the sensor network community [Woo and Culler 2001].
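
   For concreteness, the following Python sketch simulates this flooding-based
tree construction over an arbitrary connectivity graph. The adjacency-list
representation, function name, and example topology are purely illustrative
and are not part of TinyDB or TinyOS.

    from collections import deque

    def build_routing_tree(neighbors, root):
        """Simulate flooding a routing request from the root.

        neighbors maps a node id to the list of node ids within radio range.
        A node adopts the first node it hears the request from as its parent,
        which by construction is one level closer to the root. Returns
        dictionaries mapping node id -> level and node id -> parent."""
        level = {root: 0}
        parent = {root: None}
        frontier = deque([root])
        while frontier:
            sender = frontier.popleft()
            for n in neighbors[sender]:        # nodes that hear this broadcast
                if n not in level:             # first time n hears the request
                    level[n] = level[sender] + 1
                    parent[n] = sender         # n forwards results through sender
                    frontier.append(n)
        return level, parent

    # Node 0 is the basestation; links model symmetric radio connectivity.
    topology = {0: [1, 2], 1: [0, 3], 2: [0, 3, 4], 3: [1, 2], 4: [2]}
    levels, parents = build_routing_tree(topology, 0)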

3. ACQUISITIONAL QUERY LANGUAGE
In this section, we introduce our query language for ACQP focusing on issues
related to when and how often samples are acquired. Appendix B gives a com-
plete syntactic specification of the language; here, we rely primarily on example
queries to illustrate the different language features.

3.1 Data Model
In TinyDB, sensor tuples belong to a table sensors which, logically, has one row
per node per instant in time, with one column per attribute (e.g., light, temper-
ature, etc.) that the device can produce. In the spirit of acquisitional processing,
records in this table are materialized (i.e., acquired) only as needed to satisfy
the query, and are usually stored only for a short period of time or delivered
directly out of the network. Projections and/or transformations of tuples from
the sensors table may be stored in materialization points (discussed below).
   Although we impose the same schema on the data produced by every device
in the network, we allow for the possibility of certain devices lacking certain
physical sensors by allowing nodes to insert NULLs for attributes correspond-
ing to missing sensors. Thus, devices missing sensors requested in a query will
produce data for that query anyway, unless NULLs are explicitly filtered out
in the WHERE clause.

   Physically, the sensors table is partitioned across all of the devices in the
network, with each device producing and storing its own readings. Thus, in
TinyDB, to compare readings from different sensors, those readings must be
collected at some common node, for example, the root of the network.

3.2 Basic Language Features
Queries in TinyDB, as in SQL, consist of a SELECT-FROM-WHERE-GROUPBY clause
supporting selection, join, projection, and aggregation.
   The semantics of SELECT, FROM, WHERE, and GROUP BY clauses are as in
SQL. The FROM clause may refer to both the sensors table as well as stored
tables, which we call materialization points. Materialization points are created
through special logging queries, which we describe below. They provide basic
support for subqueries and windowed stream operations.
   Tuples are produced at well-defined sample intervals that are a parame-
ter of the query. The period of time between the start of consecutive sample
periods is known as an epoch. Epochs provide a convenient mechanism for struc-
turing computation to minimize power consumption. Consider the following
query:
              SELECT nodeid, light, temp
                FROM sensors
                SAMPLE PERIOD 1s FOR 10s

   This query specifies that each device should report its own identifier (id),
light, and temperature readings (contained in the virtual table sensors) once
per second for 10 s. Results of this query stream to the root of the network in an
online fashion, via the multihop topology, where they may be logged or output
to the user. The output consists of a stream of tuples, clustered into 1-s time
intervals. Each tuple includes a time stamp corresponding to the time it was
produced.
   Nodes initiate data collection at the beginning of each epoch, as specified in
the SAMPLE PERIOD clause. Nodes in TinyDB run a simple time synchronization
protocol to agree on a global time base that allows them to start and end each
epoch at the same time.9
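
   To illustrate the epoch structure, the following Python sketch shows a highly
simplified per-epoch collection loop for a single node; the acquire and deliver
callbacks stand in for sensor access and radio delivery and are our own
abstraction, not TinyDB interfaces.

    import time

    def run_query(sample_period_s, duration_s, acquire, deliver):
        """Sample at the start of each epoch and deliver the resulting tuple.
        The node is assumed to sleep between epochs; real nodes align epochs
        to a shared time base via time synchronization."""
        start = time.time()
        epoch = 0
        while epoch * sample_period_s < duration_s:
            next_epoch = start + epoch * sample_period_s
            time.sleep(max(0.0, next_epoch - time.time()))
            tup = {"epoch": epoch, "timestamp": time.time()}
            tup.update(acquire())              # e.g., read nodeid, light, temp
            deliver(tup)                       # e.g., send toward the root
            epoch += 1

    # run_query(1.0, 10.0, lambda: {"nodeid": 1, "light": 0}, print)
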
   When a query is issued in TinyDB, it is assigned an id that is returned to the
issuer. This identifier can be used to explicitly stop a query via a “STOP QUERY
id” command. Alternatively, queries can be limited to run for a specific time
period via a FOR clause (shown above), or can include a stopping condition as
an event (see below).
   Note that because the sensors table is an unbounded, continuous data
stream of values, certain blocking operations (such as sort and symmetric join)
are not allowed over such streams unless a bounded subset of the stream, or
window, is specified. Windows in TinyDB are defined via materialization points
over the sensor streams. Such materialization points accumulate a small buffer


9 We use a time-synchronization protocol that is quite similar to the one described in work by
Ganeriwal et al. [2003]; typical time-synchronization error in TinyDB is about 10 ms.


of data that may be used in other queries. Consider, as an example:
         CREATE
           STORAGE POINT recentlight SIZE 8
           AS (SELECT nodeid, light FROM sensors
           SAMPLE PERIOD 10s)

   This statement provides a local (i.e., single-node) location to store a stream-
ing view of recent data similar to materialization points in other stream-
ing systems like Aurora, TelegraphCQ, or STREAM [Carney et al. 2002;
Chandrasekaran et al. 2003; Motwani et al. 2003], or materialized views in
conventional databases. Multiple queries may read a materialization point.
   Joins are allowed between two storage points on the same node, or between
a storage point and the sensors relation, in which case sensors is used as the
outer relation in a nested-loops join. That is, when a sensors tuple arrives, it is
joined with tuples in the storage point at its time of arrival. This is effectively a
landmark query [Gehrke et al. 2001] common in streaming systems. Consider,
as an example:
         SELECT COUNT(*)
           FROM sensors AS s, recentLight AS rl
           WHERE rl.nodeid = s.nodeid
           AND s.light < rl.light
           SAMPLE PERIOD 10s

   This query outputs a stream of counts indicating the number of recent light
readings (from zero to eight samples in the past) that were brighter than the
current reading. In the event that a storage point and an outer query deliver
data at different rates, a simple rate matching construct is provided that al-
lows interpolation between successive samples (if the outer query is faster),
via the LINEAR INTERPOLATE clause shown in Appendix B. Alternatively, if the
inner query is faster, the user may specify an aggregation function to combine
multiple rows via the COMBINE clause shown in Appendix B.
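
   A minimal sketch of this join behavior is shown below in Python; the
StoragePoint class and predicate encoding are our own illustration of the
semantics, not TinyDB's implementation.

    from collections import deque

    class StoragePoint:
        """Fixed-size buffer of recent tuples (a materialization point)."""
        def __init__(self, size):
            self.buf = deque(maxlen=size)      # oldest tuples are evicted first
        def insert(self, tup):
            self.buf.append(tup)

    def landmark_join(sensor_tuple, storage_point, predicate):
        """Join an arriving sensors tuple against the buffered tuples and
        return the number of matches, as in the COUNT(*) query above."""
        return sum(1 for old in storage_point.buf if predicate(sensor_tuple, old))

    # Count buffered light readings brighter than the current reading.
    recent_light = StoragePoint(size=8)
    current = {"nodeid": 1, "light": 100}
    count = landmark_join(current, recent_light,
                          lambda s, rl: rl["nodeid"] == s["nodeid"]
                                        and s["light"] < rl["light"])
    recent_light.insert(current)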

3.3 Aggregation Queries
TinyDB also includes support for grouped aggregation queries. Aggregation
has the attractive property that it reduces the quantity of data that must
be transmitted through the network; other sensor network research has
noted that aggregation is perhaps the most common operation in the domain
[Intanagonwiwat et al. 2000; Yao and Gehrke 2002]. TinyDB includes a mech-
anism for user-defined aggregates and a metadata management system that
supports optimizations over them, which we discuss in Section 4.1.
   The basic approach of aggregate query processing in TinyDB is as follows: as
data from an aggregation query flows up the tree, it is aggregated in-network
according to the aggregation function and value-based partitioning specified in
the query.

   3.3.1 Aggregate Query Syntax and Semantics. Consider a user who wishes
to monitor the occupancy of the conference rooms on a particular floor of a build-
ing. She chooses to do this by using microphone sensors attached to motes, and
looking for rooms where the average volume is over some threshold (assuming
that rooms can have multiple sensors). Her query could be expressed as:
              SELECT AVG(volume),room FROM sensors
                WHERE floor = 6
                GROUP BY room
                HAVING AVG(volume) > threshold
                SAMPLE PERIOD 30s

This query partitions motes on the sixth floor according to the room where
they are located (which may be a hard-coded constant in each device, or may be
determined via some localization component available to the devices.) The query
then reports all rooms where the average volume is over a specified threshold.
Updates are delivered every 30 s. The query runs until the user deregisters
it from the system. As in our earlier discussion of TinyDB’s query language,
except for the SAMPLE PERIOD clause, the semantics of this statement are similar
to SQL aggregate queries.
   Recall that the primary semantic difference between TinyDB queries and
SQL queries is that the output of a TinyDB query is a stream of values, rather
than a single aggregate value (or batched result). For these streaming queries,
each aggregate record consists of one <group id,aggregate value> pair per
group. Each group is time-stamped with an epoch number and the readings
used to compute an aggregate record all belong to the same epoch.
   3.3.2 Structure of Aggregates. TinyDB structures aggregates similarly to
shared-nothing parallel database engines (e.g., Bancilhon et al. [1987]; Dewitt
et al. [1990]; Shatdal and Naughton [1995]). The approach used in such systems
(and followed in TinyDB) is to implement agg via three functions: a merging
function f , an initializer i, and an evaluator, e. In general, f has the following
structure:
                                   < z > = f (< x >, < y >),
where < x > and < y > are multivalued partial state records, computed over
one or more sensor values, representing the intermediate state over those val-
ues that will be required to compute an aggregate. < z > is the partial-state
record resulting from the application of function f to < x > and < y >. For
example, if f is the merging function for AVERAGE, each partial state record will
consist of a pair of values: SUM and COUNT, and f is specified as follows, given
two state records < S1 , C1 > and < S2 , C2 >:
                   f (< S1 , C1 >, < S2 , C2 >) = < S1 + S2 , C1 + C2 > .
The initializer i is needed to specify how to instantiate a state record for a
single sensor value; for an AVERAGE over a sensor value of x, the initializer i(x)
returns the tuple < x, 1 >. Finally, the evaluator e takes a partial state record
and computes the actual value of the aggregate. For AVERAGE, the evaluator
e(< S, C >) simply returns S/C.
   These three functions can easily be derived for the basic SQL aggregates; in
general, the only constraint is that the merging function be commutative and
associative.

  TinyDB includes a simple facility for allowing programmers to extend the
system with new aggregates by authoring software modules that implement
these three functions.
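
   As a concrete (if simplified) rendering of this triplet, the Python functions
below implement the initializer, merging function, and evaluator for AVERAGE;
in TinyDB itself such aggregates are authored as nesC modules.

    def avg_init(x):
        """Initializer i: build the partial state record <SUM, COUNT>."""
        return (x, 1)

    def avg_merge(a, b):
        """Merging function f: combine two partial state records."""
        return (a[0] + b[0], a[1] + b[1])

    def avg_evaluate(state):
        """Evaluator e: turn a partial state record into the final value."""
        s, c = state
        return s / c

    # In-network use: a node merges its own record with those of its children.
    readings = [10, 20, 30, 40]
    state = avg_init(readings[0])
    for r in readings[1:]:
        state = avg_merge(state, avg_init(r))
    assert avg_evaluate(state) == 25.0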

3.4 Temporal Aggregates
In addition to aggregates over values produced during the same sample interval
(for example, as in the COUNT query above), users want to be able to perform
temporal operations. For example, in a building monitoring system for confer-
ence rooms, users may detect occupancy by measuring maximum sound volume
over time and reporting that volume periodically; for example, the query
            SELECT WINAVG(volume, 30s, 5s)
              FROM sensors
              SAMPLE PERIOD 1s

will report the average volume over the last 30 s once every 5 s, sampling
once per second. This is an example of a sliding-window query common in
many streaming systems [Motwani et al. 2003; Chandrasekaran et al. 2003;
Gehrke et al. 2001]. We note that the same semantics are available by running
an aggregate query with SAMPLE PERIOD 5 s over a 30-s materialization point;
temporal aggregates simply provide a more concise way of expressing these
common operations.
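
   The following Python sketch mirrors the semantics of such a sliding-window
aggregate for WINAVG; the generator interface and timestamp handling are our
own simplification, assuming integer timestamps spaced one sample period apart.

    from collections import deque

    def winavg(samples, window_size, slide):
        """Sliding-window average over (time, value) samples: emit the mean of
        the last window_size seconds once every slide seconds (partial windows
        at the start of the stream are averaged too)."""
        window = deque()
        for t, v in samples:
            window.append((t, v))
            while window and window[0][0] <= t - window_size:
                window.popleft()               # drop samples outside the window
            if t % slide == 0 and window:
                yield t, sum(val for _, val in window) / len(window)

    # One sample per second for a minute; mirrors WINAVG(volume, 30s, 5s).
    samples = [(t, t % 7) for t in range(60)]  # synthetic readings
    results = list(winavg(samples, 30, 5))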

3.5 Event-Based Queries
As a variation on the continuous, polling-based mechanisms for data acquisition,
TinyDB supports events as a mechanism for initiating data collection. Events
in TinyDB are generated explicitly, either by another query or by a lower-level
part of the operating system (in which case the code that generates the event
must have been compiled into the sensor node10 ). For example, the query:
            ON EVENT bird-detect(loc):
              SELECT AVG(light), AVG(temp), event.loc
              FROM sensors AS s
              WHERE dist(s.loc, event.loc) < 10m
              SAMPLE PERIOD 2 s FOR 30 s

could be used to report the average light and temperature level at sensors near
a bird nest where a bird has just been detected. Every time a bird-detect event
occurs, the query is issued from the detecting node and the average light and
temperature are collected from nearby nodes once every 2 s for 30 s. In this
case, we expect that bird-detection is done via some low-level operating system
facility—e.g., a switch that is triggered when a bird enters its nest.
   Such events are central in ACQP, as they allow the system to be dormant
until some external condition occurs, instead of continually polling or blocking
10 TinyDB  provides a special API for generating events; it is described in the TinyOS/TinyDB
distribution as a part of the TinySchema package. As far as TinyDB is concerned, this API allows
TinyDB to treat OS-defined events as black-boxes that occur at any time; for example, events may
periodically sample sensors using low-level OS APIs (instead of TinyDB) to determine if some
condition is true.

Fig. 4. External interrupt driven event-based query (top) versus polling driven event-based query
(bottom).



on an iterator waiting for some data to arrive. Since most microprocessors in-
clude external interrupt lines that can wake a sleeping device to begin process-
ing, events can provide significant reductions in power consumption, as shown in
Figure 4.
   This figure shows an oscilloscope plot of current draw from a device running
an event-based query triggered by toggling a switch connected to an external
interrupt line that causes the device to wake from sleep. Compare this to the plot at
the bottom of Figure 4, which shows an event-based query triggered by a second
query that polls for some condition to be true. Obviously, the situation in the top
plot is vastly preferable, as much less energy is spent polling. TinyDB supports
such externally triggered queries via events, and such support is integral to its
ability to provide low power processing.
   Events can also serve as stopping conditions for queries. Appending a clause
of the form STOP ON EVENT(param) WHERE cond(param) will stop a continuous
query when the specified event arrives and the condition holds.
   Besides the low-level API which can be used to allow software compo-
nents to signal events (such as the bird-detect event above), queries may
also signal events. For example, suppose we wanted to signal an event when-
ever the temperature went above some threshold; we can write the following
query:
              SELECT nodeid,temp
                WHERE temp > thresh
                OUTPUT ACTION SIGNAL hot(nodeid,temp)
                SAMPLE PERIOD 10s

   Clearly, we lose the power-saving advantages of having an event fired di-
rectly in response to a low-level interrupt, but we still retain the programmatic
advantages of linking queries to the signaling of events. We describe the OUTPUT
ACTION clause in more detail in Section 3.7 below.

   In the current implementation of TinyDB, events are only signaled on the
local node—we do not currently provide a fully distributed event propagation
system. Note, however, that queries started in response to a local event may be
disseminated to other nodes (as in the example above).

3.6 Lifetime-Based Queries
In lieu of an explicit SAMPLE PERIOD clause, users may request a specific query
lifetime via a QUERY LIFETIME <x> clause, where <x> is a duration in days,
weeks, or months. Specifying lifetime is a much more intuitive way for users to
reason about power consumption. Especially in environmental monitoring sce-
narios, scientific users are not particularly concerned with small adjustments
to the sample rate, nor do they understand how such adjustments influence
power consumption. Such users, however, are very concerned with the lifetime
of the network executing the queries. Consider the query
           SELECT nodeid, accel
             FROM sensors
             LIFETIME 30 days

This query specifies that the network should run for at least 30 days, sampling
the acceleration sensor at a rate that is as quick as possible and still
satisfies this goal.
   To satisfy a lifetime clause, TinyDB performs lifetime estimation. The goal
of lifetime estimation is to compute a sampling and transmission rate given a
number of Joules of energy remaining. We begin by considering how a single
node at the root of the sensor network can compute these rates, and then discuss
how other nodes coordinate with the root to compute their delivery rates. For
now, we also assume that sampling and delivery rates are the same. On a single
node, these rates can be computed via a simple cost-based formula, taking
into account the costs of accessing sensors, selectivities of operators, expected
communication rates and current battery voltage.11 We show below a lifetime
computation for simple queries of the form:
           SELECT a1 , ... , anumSensors
             FROM sensors
             WHERE p
             LIFETIME l hours

   To simplify the equations in this example, we present a query with a single
selection predicate that is applied after attributes have been acquired. The
ordering of multiple predicates and interleaving of sampling and selection are
discussed in detail in Section 4. Table I shows the parameters we use in this
computation (we do not show processor costs since they will be negligible for
the simple selection predicates we support, and have been subsumed into costs
of sampling and delivering results).
   The first step is to determine the available power ph per hour, ph = crem / l .

11 Throughout  this section, we will use battery voltage as a proxy for remaining battery capacity,
as voltage is an easy quantity to measure.


                         Table I. Parameters Used in Lifetime Estimation
                    Parameter                   Description               Units
                    l              Query lifetime goal                   hours
                    crem           Remaining battery capacity            Joules
                    En             Energy to sample sensor n             Joules
                    Etrans         Energy to transmit a single sample    Joules
                    Ercv           Energy to receive a message           Joules
                    σ              Selectivity of selection predicate
                    C              # of children routing through node




      Fig. 5. Predicted versus actual lifetime for a requested lifetime of 24 weeks (168 days).

   We then need to compute the energy to collect and transmit one sample, es ,
including the costs to forward data for its children:
                  es = Σ_{s=0..numSensors} Es + (Ercv + Etrans) × C + Etrans × σ.

   The energy for a sample is the cost to read all of the sensors at the node, plus
the cost to receive results from children, plus the cost to transmit satisfying
local and child results. Finally, we can compute the maximum transmission
rate, T (in samples per hour), as
                                             T = ph /es .
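
   The per-node computation above is simple enough to state directly in code.
The Python sketch below follows these equations and the parameter names of
Table I; the numeric arguments in the commented example are hypothetical and
only indicate units (Joules and hours).

    def lifetime_sample_rate(c_rem, lifetime_h, sensor_energies,
                             e_trans, e_rcv, selectivity, num_children):
        """Return the sampling/transmission rate T (samples per hour) that
        meets a LIFETIME goal, per the equations above. selectivity plays the
        role of sigma, the selectivity of the selection predicate."""
        p_h = c_rem / lifetime_h                    # energy budget per hour
        e_s = (sum(sensor_energies)                 # read all local sensors
               + (e_rcv + e_trans) * num_children   # receive/forward child results
               + e_trans * selectivity)             # transmit satisfying local results
        return p_h / e_s

    # Hypothetical values: ~23,760 J battery, 4032-h (24-week) lifetime goal.
    # T = lifetime_sample_rate(23760.0, 4032.0, [1e-5, 5e-4],
    #                          e_trans=3e-3, e_rcv=2e-3,
    #                          selectivity=0.5, num_children=2)
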
   To illustrate the effectiveness of this simple estimation, we inserted a
lifetime-based query (SELECT voltage, light FROM sensors LIFETIME x) into
a sensor (with a fresh pair of AA batteries) and asked it to run for 24 weeks,
which resulted in a sample period of 15.2 s. We measured the voltage
on the device nine times over 12 days. The first two readings were outside the
range of the voltage detector on the mote (e.g., they read “1024”—the maximum
value) so are not shown. Based on experiments with our test mote connected to
a power supply, we expect it to stop functioning when its voltage reaches 350.
Figure 5 shows the measured lifetime at each point in time, with a linear fit
of the data, versus the “expected voltage” which was computed using the cost
model above. The resulting linear fit of voltage is quite close to the expected
voltage. The linear fit reaches V = 350 about 5 days after the expected voltage
line.
   Given that it is possible to estimate lifetime on a single node, we now dis-
cuss coordinating the transmission rate across all nodes in the routing tree.
Since sensors need to sleep between relaying of samples, it is important that
senders and receivers synchronize their wake cycles. To do this, we allow nodes
to transmit only when their parents in the routing tree are awake and listen-
ing (which is usually the same time they are transmitting). By transitivity, this
limits the maximum rate of the entire network to the transmission rate of the
root of the routing tree. If a node must transmit slower than the root to meet
the lifetime clause, it may transmit at an integral divisor of the root’s rate.12 To
propagate this rate through the network, each parent node (including the root)
includes its transmission rate in queries that it forwards to its children.
   The previous analysis left the user with no control over the sample rate,
which could be a problem because some applications require the ability to mon-
itor physical phenomena at a particular granularity. To remedy this, we allow
an optional MIN SAMPLE RATE r clause to be supplied. If the computed sample
rate for the specified lifetime is greater than this rate, sampling proceeds at the
computed rate (since the alternative is expressible by replacing the LIFETIME
clause with a SAMPLE PERIOD clause). Otherwise, sampling is fixed at a rate of
r and the prior computation for transmission rate is done assuming a different
rate for sampling and transmission. To provide the requested lifetime and sam-
pling rate, the system may not be able to actually transmit all of the readings—it
may be forced to combine (aggregate) or discard some samples; we discuss this
situation (as well as other contexts where it may arise) in Section 6.3.
   Finally, we note that since estimation of power consumption was done us-
ing simple selectivity estimation as well as cost-constants that can vary from
node-to-node (see Section 4.1) and parameters that vary over time (such as
number of children, C), we need to periodically reestimate power consumption.
Section 6.4.1 discusses this runtime reestimation in more detail.

3.7 Types of Queries in Sensor Networks
We conclude this section with a brief overview of some of the other types of
queries supported by TinyDB.
— Monitoring queries: Queries that request the value of one or more attributes
  continuously and periodically—for example, reporting the temperature in
  bird nests every 30 s; these are similar to the queries shown above.
— Network health queries: Metaqueries over the network itself. Examples in-
  clude selecting parents and neighbors in the network topology or nodes with
  battery life less than some threshold. These queries are particularly impor-
  tant in sensor networks due to their dynamic and volatile nature. For exam-
  ple, the following query reports all sensors whose current battery voltage is

12 One possible optimization, which we do not explore, would involve selecting or reassigning the
root to maximize transmission rate.

   less than k:
              SELECT nodeid,voltage
                WHERE voltage < k
                FROM sensors
                SAMPLE PERIOD 10 minutes

— Exploratory queries: One-shot queries examining the status of a particular
  node or set of nodes at a point in time. In lieu of the SAMPLE PERIOD clause,
  users may specify the keyword ONCE. For example:
              SELECT light,temp,volume
                WHERE nodeid = 5
                FROM sensors
                ONCE

— Nested queries: Both events and materialization points provide a form of
  nested queries. The TinyDB language does not currently support SQL-style
  nested queries, because the semantics of such queries are somewhat ill-
  defined in a streaming environment: it is not clear when the outer query
  should be evaluated given that the inner query may be a streaming query
  that continuously accumulates results. Queries over materialization points
  allow users to choose when the query is evaluated. Using the FOR clause,
  users can build a materialization point that contains a single buffer’s worth
  of data, and can then run a query over that buffer, emulating the same ef-
  fect as a nested query over a static inner relation. Of course, this approach
  eliminates the possibility of query rewrite based optimizations for nested
  queries [Pirahesh et al. 1992], potentially limiting query performance.
— Actuation queries: Users want to be able to take some physical action in response
  to a query. We include a special OUTPUT ACTION clause for this purpose. For
  example, users in building monitoring scenarios might want to turn on a fan
  in response to temperature rising above some level:
              SELECT nodeid,temp
                FROM sensors
                WHERE temp > threshold
                OUTPUT ACTION power-on(nodeid)
                SAMPLE PERIOD 10s

  The OUTPUT ACTION clause specifies an external command that should be in-
  voked in response to a tuple satisfying the query. In this case, the power-on
  command is a low-level piece of code that pulls an output pin on the micro-
  processor high, closing a relay circuit and giving power to some externally
  connected device. Note that a separate query could be issued to power-off
  the fan when the temperature fell below some other threshold. The OUTPUT
  ACTION suppresses the delivery of messages to the basestation.
— Offline delivery: There are times when users want to log some phenomenon
  that happens faster than the data can be transmitted over the radio. TinyDB
  supports the logging of results to EEPROM for offline, non-real time delivery.
  This is implemented through the materialization point mechanism described
  above.

   Together, these query types provide users of TinyDB with the mecha-
nisms they need to build data collection applications on top of sensor networks.

4. POWER-BASED QUERY OPTIMIZATION
Given our query language for ACQP environments, with special features for
event-based processing and lifetime queries, we now turn to query processing
issues. We begin with a discussion of optimization, and then cover query dis-
semination and execution. We note that, based on the applications deployed
so far, single table queries with aggregations seem to be the most pressing
workload for sensor networks, and hence we focus primarily in this section on
optimizations for acquisition, selection, and aggregation.
   Queries in TinyDB are parsed at the basestation and disseminated in a sim-
ple binary format into the sensor network, where they are instantiated and
executed. Before queries are disseminated, the basestation performs a simple
query optimization phase to choose the correct ordering of sampling, selections,
and joins.
   We use a simple cost-based optimizer to choose a query plan that will yield
the lowest overall power consumption. Optimizing for power allows us to sub-
sume issues of processing cost and radio communication, which both contribute
to power consumption and so will be taken into account. One of the most inter-
esting aspects of power-based optimization, and a key theme of acquisitional
query processing, is that the cost of a particular plan is often dominated by the
cost of sampling the physical sensors and transmitting query results, rather
than the cost of applying individual operators. For this reason, we focus in this
section on optimizations that reduce the number and costs of data acquisition.
   We begin by looking at the types of metadata stored by the optimizer. Our
optimizer focuses on ordering joins, selections, and sampling operations that
run on individual nodes.

4.1 Metadata Management
Each node in TinyDB maintains a catalog of metadata that describes its local
attributes, events, and user-defined functions. This metadata is periodically
copied to the root of the network for use by the optimizer. Metadata is registered
with the system via static linking done at compile time using the TinyOS C-like
programming language. Events and attributes pertaining to various operating
system and TinyDB components are made available to queries by declaring
them in an interface file and providing a small handler function. For example,
in order to expose network topology to the query processor, the TinyOS Network
component defines the attribute parent of type integer and registers a handler
that returns the id of the node’s parent in the current routing tree.
   Event metadata consists of a name, a signature, and a frequency estimate
that is used in query optimization (see Section 4.3 below.) User-defined predi-
cates also have a name and a signature, along with a selectivity estimate which
is provided by the author of the function.
   Table II summarizes the metadata associated with each attribute, along
with a brief description. Attribute metadata is used primarily in two contexts:

                         Table II. Metadata Fields Kept with Each Attribute
                 Metadata                                Description
                 Power               Cost to sample this attribute (in J)
                 Sample time         Time to sample this attribute (in s)
                 Constant?           Is this attribute constant-valued (e.g., id)?
                 Rate of change      How fast the attribute changes (units/s)
                 Range               Dynamic range of attribute values (pair of units)

         Table III. Summary of Power Requirements of Various Sensors Available for Motes
                                            Time per      Startup     Current    Energy per
Sensor                                      Sample (ms)   Time (ms)   (mA)       Sample (mJ)
                                   Weather board sensors
Solar radiation [TAOS, Inc. 2002]           500           800         0.350      0.525
Barometric pressure [Intersema 2002]        35            35          0.025      0.003
Humidity [Sensirion 2002]                   333           11          0.500      0.5
Surface temp. [Melexis, Inc. 2002]          0.333         2           5.6        0.0056
Ambient temp. [Melexis, Inc. 2002]          0.333         2           5.6        0.0056
                                   Standard Mica mote sensors
Accelerometer^a                             0.9           17          0.6        0.0048
(Passive) Thermistor^b                      0.9           0           0.033      0.00009
Magnetometer^c [Honeywell, Inc.]            0.9           17          5          0.2595
                                   Other sensors
Organic byproducts                          0.9           >1000       5          >5
^a Analog Devices, Inc. Adxl202e: Low-cost 2 g dual-axis accelerometer. Tech. rep. Go online to
http://products.analog.com/products/info.asp?product=ADXL202.
^b Atmel Corporation. Atmel ATMega 128 Microcontroller datasheet. Go online to
http://www.atmel.com/atmel/acrobat/doc2467.pdf.
^c Honeywell, Inc. Magnetic Sensor Specs HMC1002. Tech. rep. Go online to
http://www.ssec.honeywell.com/magnetic/spec_sheets/specs_1002.html.


information about the cost, time to fetch, and range of an attribute is used
in query optimization, while information about the semantic properties of at-
tributes is used in query dissemination and result processing. Table III gives
examples of power and sample time values for some actual sensors—notice that
the power consumption and time to sample can differ across sensors by several
orders of magnitude.
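
   A rough rendering of such a catalog entry is shown below as a Python
dataclass with the fields of Table II; the real catalog is a compile-time nesC
structure, and the example values for rate of change and range are invented
for illustration.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class AttributeMetadata:
        """Per-attribute catalog entry (cf. Table II)."""
        name: str
        power_j: float                     # cost to sample this attribute (J)
        sample_time_s: float               # time to sample this attribute (s)
        is_constant: bool                  # constant-valued, e.g., nodeid
        rate_of_change: float              # units per second
        value_range: Tuple[float, float]   # dynamic range, used for selectivity

    # Accelerometer costs from Table III (0.0048 mJ, 0.9 ms); range invented.
    accel = AttributeMetadata("accel", 4.8e-6, 9e-4, False, 10.0, (-2.0, 2.0))
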
   The catalog also contains metadata about TinyDB’s extensible aggregate
system. As with other extensible database systems [Stonebraker and Kemnitz
1991], the catalog includes names of aggregates and pointers to their code. Each
aggregate consists of a triplet of functions, that initialize, merge, and update
the final value of partial aggregate records as they flow through the system.
As in the TAG [Madden et al. 2002a] article, aggregate authors must provide
information about functional properties. In TinyDB, we currently require two:
whether the aggregate is monotonic and whether it is exemplary or summary.
COUNT is a monotonic aggregate as its value can only get larger as more values
are aggregated. MIN is an exemplary aggregate, as it returns a single value from
the set of aggregate values, while AVERAGE is a summary aggregate because it
computes some property over the entire set of values.
   TinyDB also stores metadata information about the costs of processing and
delivering data, which is used in query-lifetime estimation. The costs of these
phases in TinyDB were shown in Figure 3—they range from 2 mA while sleep-
ing, to over 20 mA while transmitting and processing. Note that actual costs
vary from mote to mote—for example, with a small sample of five motes (using
the same batteries), we found that the current with the processor active
varied from 13.9 to 17.6 mA (with an average of 15.66 mA).

4.2 Technique 1: Ordering of Sampling and Predicates
Having described the metadata maintained by TinyDB, we now describe how
it is used in query optimization.
    As shown in Section 2, sampling is often an expensive operation in terms
of power. However, a sample from a sensor s must be taken to evaluate any
predicate over the attribute sensors.s. If a predicate discards a tuple of the
sensors table, then subsequent predicates need not examine the tuple—and
hence the expense of sampling any attributes referenced in those subsequent
predicates can be avoided. Thus these predicates are “expensive,” and need to
be ordered carefully. The predicate ordering problem here is somewhat different
than in the earlier literature (e.g., Hellerstein [1998]) because (a) an attribute
may be referenced in multiple predicates, and (b) expensive predicates are only
on a single table, sensors. The first point introduces some subtlety, as it is not
clear which predicate should be “charged” the cost of the sample.
    To model this issue, we treat the sampling of a sensor t as a separate
“job” τ to be scheduled along with the predicates. Hence a set of predicates
P = { p1 , . . . , pm } is rewritten as a set of operations S = {s1 , . . . , sn }, where
P ⊂ S, and S − P = {τ1 , . . . , τn−m } contains one sampling operator for each
distinct attribute referenced in P . The selectivity of sampling operators is al-
ways 1. The selectivity of selection operators is derived by assuming that at-
tributes have a uniform distribution over their range (which is available in
the catalog).13 Relaxing this assumption by, for example, storing histograms or
time-dependent functions per attribute remains an area of future work. The
cost of an operator (predicate or sample) can be determined by consulting the
metadata, as described in the previous section. In the cases we discuss here,
selections and joins are essentially “free” compared to sampling, but this is not
a requirement of our technique.
    We also introduce a partial order on S, where τi must precede p j if p j ref-
erences the attribute sampled by τi . The combination of sampling operators
and the dependency of predicates on samples captures the costs of sampling
operators and the sharing of operators across predicates.
    The partial order induced on S forms a graph with edges from sampling oper-
ators to predicates. This is a simple series-parallel graph. An optimal ordering
of jobs with series-parallel constraints is a topic treated in the Operations Re-
search literature that inspired earlier optimization work [Ibaraki and Kameda
1984; Krishnamurthy et al. 1986; Hellerstein 1998]; Monma and Sidney [1979]

13 Scientists are particularly interested in monitoring the micro-climates created by plants and
their biological processes. See Delin and Jackson [2000] and Cerpa et al. [2001]. An example of
such a sensor is Figaro Inc.'s H2S sensor (Figaro, Inc. Tgs-825—special sensor for hydrogen sulfide.
Tech. rep. Go online to www.figarosensor.com).

presented the series-parallel algorithm using parallel chains, which gives an
optimal ordering of the jobs in O(|S| log |S|) time.
   Besides predicates in the WHERE clause, expensive sampling operators must
also be ordered appropriately with respect to the SELECT, GROUP BY, and HAVING
clauses. As with selection predicates, we enhance the partial order such that τi
must precede any aggregation, GROUP BY, or HAVING operator that uses i. Note
that projections do not require access to the value of i, and thus do not need to
be included in the partial order. Thus, the complete partial order is as follows:
(1)   acquisition of attribute a ≺ any operator that references a,
(2)   selection ≺ aggregation, GROUP BY, and HAVING,
(3)   GROUP BY ≺ aggregation and HAVING,
(4)   aggregation ≺ HAVING.
Of course, the last three rules are also present in standard SQL. We also need
to add the operators representing these clauses to S with the appropriate costs
and selectivities; the process of estimating these values has been well studied
in the database query optimization and cost estimation literature.
   As an example of this process, consider the query
              SELECT accel,mag
                FROM sensors
                WHERE accel > c1
                AND mag > c2
                SAMPLE PERIOD .1s

   The order of magnitude difference in per-sample costs shown in Table III for
the accelerometer and magnetometer suggests that the power costs of plans
for this query having different sampling and selection orders will vary sub-
stantially. We consider three possible plans: in the first, the magnetometer and
accelerometer are sampled before either selection is applied. In the second, the
magnetometer is sampled and the selection over its reading (which we call Smag )
is applied before the accelerometer is sampled or filtered. In the third plan, the
accelerometer is sampled first and its selection (Saccel ) is applied before the
magnetometer is sampled.
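
   The expected costs of the two single-predicate-first plans can be compared
directly; the short Python sketch below does so using the per-sample energies
of Table III, and the helper names and example selectivities are ours.

    # Per-sample acquisition energies (mJ) from Table III.
    C_ACCEL, C_MAG = 0.0048, 0.2595

    def plan_cost(first_cost, first_selectivity, second_cost):
        """Expected acquisition energy when the first attribute is always
        sampled and the second only if the first predicate passes."""
        return first_cost + first_selectivity * second_cost

    def cost_ratio(sel_accel, sel_mag):
        """Ratio of the mag-first plan cost to the accel-first plan cost,
        as plotted in Figure 6."""
        mag_first = plan_cost(C_MAG, sel_mag, C_ACCEL)
        accel_first = plan_cost(C_ACCEL, sel_accel, C_MAG)
        return mag_first / accel_first

    # With a selective accel predicate, sampling the accelerometer first
    # is roughly an order of magnitude cheaper.
    ratio = cost_ratio(sel_accel=0.1, sel_mag=0.9)   # ~8.6
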
   This interleaving of sampling and processing introduces an additional is-
sue with temporal semantics: in this case, for example, the magnetometer and
accelerometer samples are not acquired at the same time. This may be prob-
lematic for some queries, for example, if one is trying to temporally correlate
high-frequency portions of signals from these two sensors. To address this con-
cern, we include in our language specification a NO INTERLEAVE clause, which
forces all sensors to be turned on and sampled simultaneously at the beginning
of each epoch (obviating the benefit of the acquisitional techniques discussed in
this section). We note that this clause may not lead to perfect synchronization of
sampling, as different sensors take different amounts of time to power up and
acquire readings, but will substantially improve temporal synchronization.
   Figure 6 shows the relative power costs of the latter two approaches, in terms
of power costs to sample the sensors (we assume the CPU cost is the same for
the two plans, so do not include it in our cost ratios) for different selectivity
         Fig. 6. Ratio of costs of two acquisitional plans over differing-cost sensors.


factors of the two selection predicates Saccel and Smag . The selectivities of these
two predicates are shown on the x and y axes, respectively. Regions of the
graph are shaded corresponding to the ratio of costs between the plan where
the magnetometer is sampled first (mag-first) versus the plan where the ac-
celerometer is sampled first (accel-first). As expected, these results show that
the mag-first plan is almost always more expensive than accel-first. In fact, it
can be an order of magnitude more expensive, when Saccel is much more selec-
tive than Smag . When Smag is highly selective, however, it can be cheaper to
sample the magnetometer first, although only by a small factor.
   The maximum difference in relative costs represents an absolute difference
of 255 µJ per sample, or 2.5 mW at a sample rate of 10 samples per second—
putting the additional power consumption from sampling in the incorrect order
on par with the power costs of running the radio or CPU for an entire second.

  4.2.1 Exemplary Aggregate Pushdown. There are certain kinds of aggre-
gate functions where the same kind of interleaving of sampling and processing
can also lead to a performance savings. Consider the query
         SELECT WINMAX(light,8s,8s)
           FROM sensors
           WHERE mag > x
           SAMPLE PERIOD 1s

   In this query, the maximum of 8 s worth of light readings will be computed,
but only light readings from sensors whose magnetometers read greater than x
will be considered. Interestingly, it turns out that, unless the mag > x predicate
is very selective, it will be cheaper to evaluate this query by checking to see if
each new light reading is greater than the previous reading and then applying
the selection predicate over mag, rather than first sampling mag. This sort of
reordering, which we call exemplary aggregate pushdown can be applied to any
exemplary aggregate (e.g., MIN, MAX). Similar ideas have been explored in the
deductive database community by Sudarshan and Ramakrishnan [1991].
   The same technique can be used with nonwindowed aggregates when per-
forming in-network aggregation. Suppose we are applying an exemplary ag-
gregate at an intermediate node in the routing tree; if there is an expensive
acquisition required to evaluate a predicate (as in the query above), then it
may make sense to see if the local value affects the value of the aggregate
before acquiring the attribute used in the predicate.
   To add support for exemplary aggregate pushdown, we need a way to eval-
uate the selectivity of exemplary aggregates. In the absence of statistics that
reflect how a predicate changes over time, we simply assume that the attributes
involved in an exemplary aggregate (such as light in the query above) are sam-
pled from the same distribution. Thus, for MIN and MAX aggregates, the likelihood
that the second of two samples is less than (or greater than) the first is 0.5. For
n samples, the likelihood that the nth is the value reported by the aggregate is
thus 0.5^(n−1). By the same reasoning, for bottom (or top)-k aggregates, assuming
k < n, the nth sample will be reported with probability 0.5^(n−k−1).
   Given this selectivity estimate for an exemplary aggregate, S(a), over at-
tribute a with acquisition cost C(a), we can compute the benefit of exemplary
aggregate pushdown. We assume the query contains some set of conjunctive
predicates with aggregate selectivity P over several expensive acquisitional
attributes with aggregate acquisition cost K . We assume the values of S(a),
C(a), K , and P are available in the catalog. Then, the cost of evaluating the
query without exemplary aggregate pushdown is
                                           K + P ∗ C(a)                        (1)
and with pushdown it becomes
                                         C(a) + S(a) ∗ K .                     (2)
When (2) is less than (1), there will be an expected benefit to exemplary aggre-
gate pushdown, and it should be applied.
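
   The decision rule is easy to state in code; in the Python sketch below, the
selectivity estimate follows the 0.5^(n−1) assumption for MIN and MAX above,
and the numbers in the usage example are hypothetical.

    def minmax_selectivity(n):
        """Estimated probability that the nth of n i.i.d. samples changes a
        MIN or MAX exemplary aggregate: 0.5 ** (n - 1)."""
        return 0.5 ** (n - 1)

    def pushdown_beneficial(c_a, s_a, k, p):
        """Apply exemplary aggregate pushdown when Eq. (2) < Eq. (1),
        i.e., when C(a) + S(a)*K < K + P*C(a)."""
        return c_a + s_a * k < k + p * c_a

    # Hypothetical costs (mJ): cheap light sensor, expensive magnetometer
    # predicate, predicate selectivity 0.5, eighth sample of the window.
    apply_pushdown = pushdown_beneficial(c_a=0.005,
                                         s_a=minmax_selectivity(8),
                                         k=0.2595, p=0.5)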

4.3 Technique 2: Event Query Batching to Conserve Power
As a second example of the benefit of power-aware optimization, we consider
the optimization of the query
              ON EVENT e(nodeid)
                SELECT a1
                FROM sensors AS s
                WHERE s.nodeid = e.nodeid
                SAMPLE PERIOD d FOR k

   This query will cause an instance of the internal query (SELECT ...) to be
started every time the event e occurs. The internal query samples results every
d seconds for a duration of k seconds, at which point it stops running.
    Fig. 7. The cost of processing event-based queries as asynchronous events versus joins.

   Note that, according to this specification of how an ON EVENT query is pro-
cessed, it is possible for multiple instances of the internal query to be running at
the same time. If enough such queries are running simultaneously, the benefit of
event-based queries (e.g., not having to poll for results) will be outweighed by the
fact that each instance of the query consumes significant energy sampling and
delivering (independent) results. To alleviate the burden of running multiple
copies of the same identical query, we employ a multiquery optimization tech-
nique based on rewriting. To do this, we convert external events (of type e) into
a stream of events, and rewrite the entire set of independent internal queries
as a sliding window join between events and sensors, with a window size of k
seconds on the event stream, and no window on the sensor stream. For example:
          SELECT s.a1
            FROM sensors AS s, events AS e
            WHERE s.nodeid = e.nodeid
            AND e.type = e
            AND s.time - e.time <= k AND s.time > e.time
            SAMPLE PERIOD d

   We execute this query by treating it as a join between a materialization point
of size k on events and the sensors stream. When an event tuple arrives, it is
added to the buffer of events. When a sensor tuple s arrives, events older than k
seconds are dropped from the buffer and s is joined with the remaining events.
   The advantage of this approach is that only one query runs at a time no
matter how frequently the events of type e are triggered. This offers a large
potential savings in sampling and transmission cost. At first it might seem as
though requiring the sensors to be sampled every d seconds irrespective of the
contents of the event buffer would be prohibitively expensive. However, the
check to see if the event buffer is empty can be pushed before the sampling
of the sensors, and can be done relatively quickly.
   Figure 7 shows the power tradeoff for event-based queries that have and have
not been rewritten. Rewritten queries are labeled as stream join and nonrewrit-
ten queries as async events. We measure the cost in mW of the two approaches

         Table IV. Parameters Used in Asynchronous Events Versus Stream-Join Study
Parameter                                Description                                   Value
tsample         Length of sample period                                        1/8 s
nevents         Number of events per second                                    0−5 (X axis)
durevent        Time for which events are active (FOR clause)                  1, 3, or 5 s
mWproc          Processor power consumption                                    12 mW
mssample        Time to acquire a sample, including processing and ADC time    0.35 ms
mWsample        Power used while sampling, including processor                 13 mW
mJsample        Energy per sample                                              Derived
mWidle          Milliwatts used while idling                                   Derived
tidle           Time spent idling per sample period (in seconds)               Derived
mJidle          Energy spent idling                                            Derived
mscheck         Time to check for enqueued event                               0.02 ms (80 instrs)
mJcheck         Energy to check if an event has been enqueued                  Derived
mWevents        Total power used in asynchronous event mode                    Derived
mWstream Join   Total power used in stream-join mode                           Derived

using a numerical model of power costs for idling, sampling and processing (in-
cluding the cost to check if the event queue is nonempty in the event-join case),
but excluding transmission costs to avoid complications of modeling differences
in cardinalities between the two approaches. The expectation was that the
asynchronous approach would generally transmit many more results. We var-
ied the sample rate and duration of the inner query, and the frequency of events.
We chose the specific parameters in this plot to demonstrate query optimization
tradeoffs; for much faster or slower event rates, one approach tends to always
be preferable. In this case, the stream-join rewrite is beneficial when events
occur frequently; this might be the case if, for example, an event is triggered
whenever a signal that is sampled tens or hundreds of times per second crosses
a threshold; vibration monitoring applications tend to have this kind of
behavior. Table IV summarizes the parameters used in this experiment;
“derived” values are computed by the model below. Power consumption numbers
and sensor timings are drawn from Table III and the Atmel 128 data sheet (see
the Atmel Corporation reference cited in the footnotes to Table III).
   The cost in milliwatts of the asynchronous events approach, mWevents , is mod-
eled via the following equations:
                  tidle =     tsample − nevents × durevent × mssample /1000,
               mJidle =       mWidle × tidle ,
              mJsample =      mWsample × mssample /1000,
              mWevents =      (nevents × durevent × mJsample + mJidle )/tsample .
   The cost in milliwatts of the stream-join approach, mWstreamJoin , is then

                          tidle =      tsample − (mscheck + mssample )/1000,
                        mJidle =       mWidle × tidle ,
                       mJcheck =       mWproc × mscheck /1000,
                      mJsample =       mWsample × mssample /1000,
                 mWstreamJoin = (mJcheck + mJsample + mJidle )/tsample .
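   These two models can be evaluated directly; the sketch below (ours) takes the
“derived” quantities of Table IV, such as mWidle, as inputs rather than deriving them,
and the function name is illustrative.

    def model_power(tsample, nevents, durevent, mWproc, mssample, mWsample,
                    mWidle, mscheck):
        # Times given in milliseconds are divided by 1000 to convert to seconds.
        mJ_sample = mWsample * mssample / 1000.0

        # Asynchronous events: one independent sample per active event per period.
        t_idle_ev = tsample - nevents * durevent * mssample / 1000.0
        mW_events = (nevents * durevent * mJ_sample + mWidle * t_idle_ev) / tsample

        # Stream join: one queue check plus one sample per period, regardless of
        # how many events are active.
        t_idle_sj = tsample - (mscheck + mssample) / 1000.0
        mJ_check = mWproc * mscheck / 1000.0
        mW_stream_join = (mJ_check + mJ_sample + mWidle * t_idle_sj) / tsample

        return mW_events, mW_stream_join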

   For very low event rates (fewer than one per second), the asynchronous events
approach is sometimes preferable due to the extra overhead of empty-checks
on the event queue in the stream-join case. However, for faster event rates,
the power cost of this approach increases rapidly as independent samples are
acquired for each event every few seconds. Increasing the duration of the inner
query increases the cost of the asynchronous approach as more queries will be
running simultaneously. The maximum absolute difference (of about 0.8 mW)
is roughly comparable to one-quarter the power cost of the CPU or radio.
   Finally, we note that there is a subtle semantic change introduced by this
rewriting. The initial formulation of the query caused samples in each of the
internal queries to be produced relative to the time that the event fired: for
example, if event e1 fired at time t, samples would appear at time t + d , t +
2d , . . . . If a later event e2 fired at time t + i, it would produce a different set of
samples at time t + i + d , t + i + 2d , . . . . Thus, unless i were a multiple of d (i.e., the
events were in phase), samples for the two queries would be offset from each
other by up to d seconds. In the rewritten version of the query, there is only one
stream of sensor tuples which is shared by all events.
   In many cases, users may not care that tuples are out of phase with events.
In some situations, however, phase may be very important. In such situa-
tions, one way the system could improve the phase accuracy of samples while
still rewriting multiple event queries into a single join is via oversampling,
or acquiring some number of (additional) samples every d seconds. The in-
creased phase accuracy of oversampling comes at an increased cost of ac-
quiring additional samples (which may still be less than running multiple
queries simultaneously). For now, we simply allow the user to specify that
a query must be phase-aligned by specifying ON ALIGNED EVENT in the event
clause.
   Thus, we have shown that there are several interesting optimization issues in
ACQP systems: first, the system must properly order sampling, selection, and
aggregation to be truly low power. Second, for frequent event-based queries,
rewriting them as a join between an event stream and the sensors stream can
significantly reduce the rate at which a sensor must acquire samples.


5. POWER-SENSITIVE DISSEMINATION AND ROUTING
After the query has been optimized, it is disseminated into the network; dis-
semination begins with a broadcast of the query from the root of the network.
As each node hears the query, it must decide if the query applies locally and/or
needs to be broadcast to its children in the routing tree. We say a query q ap-
plies to a node n if there is a nonzero probability that n will produce results for
q. Deciding where a particular query should run is an important ACQP-related
decision. Although such decisions occur in other distributed query processing
environments, the costs of incorrectly initiating queries in ACQP environments
like TinyDB can be unusually high, as we will show.
   If a query does not apply at a particular node, and the node does not have
any children for which the query applies, then the entire subtree rooted at
that node can be excluded from the query, saving the costs of disseminating,

executing, and forwarding results for the query across several nodes, signifi-
cantly extending the node’s lifetime.
   Given the potential benefits of limiting the scope of queries, the challenge
is to determine when a node or its children need not participate in a particu-
lar query. One situation arises with constant-valued attributes (e.g., nodeid or
location in a fixed-location network) with a selection predicate that indicates
the node need not participate. We expect that such queries will be very common,
especially in interactive workloads where users are exploring different parts of
the network to see how it is behaving. Similarly, if a node knows that none of
its children currently satisfy the value of some selection predicate, perhaps, be-
cause they have constant (and known) attribute values outside the predicate’s
range, it need not forward the query down the routing tree. To maintain infor-
mation about child attribute values (both constant and changing), we propose
a data structure called a semantic routing tree (SRT). We describe the proper-
ties of SRTs in the next section, and briefly outline how they are created and
maintained.


5.1 Semantic Routing Trees
An SRT is a routing tree (similar to the tree discussed in Section 2.3 above)
designed to allow each node to efficiently determine if any of the nodes below it
will need to participate in a given query over some constant attribute A. Tradi-
tionally, in sensor networks, routing tree construction is done by having nodes
pick a parent with the most reliable connection to the root (highest link quality).
With SRTs, we argue that the choice of parent should include some considera-
tion of semantic properties as well. In general, SRTs are most applicable when
there are several parents of comparable link quality. A link-quality-based par-
ent selection algorithm, such as the one described in Woo and Culler [2001],
should be used in conjunction with the SRT to prefilter parents made available
to the SRT.
   Conceptually, an SRT is an index over A that can be used to locate nodes that
have data relevant to the query. Unlike traditional indices, however, the SRT is
an overlay on the network. Each node stores a single unidimensional interval
representing the range of A values beneath each of its children. When a query
q with a predicate over A arrives at a node n, n checks to see if any child’s value
of A overlaps the query range of A in q. If so, it prepares to receive results and
forwards the query. If no child overlaps, the query is not forwarded. If
the query also applies locally (whether or not it applies to any children), n
begins executing the query itself. If the query does not apply at n or at any of
its children, it is simply forgotten.
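   A sketch of this per-node decision is shown below; the paper does not give
pseudocode, so the field and method names are ours.

    def on_query_arrival(node, query_lo, query_hi):
        # child_intervals maps each child id to the [lo, hi] range of A beneath it.
        overlapping = [c for c, (lo, hi) in node.child_intervals.items()
                       if lo <= query_hi and query_lo <= hi]
        applies_locally = query_lo <= node.value_of_A <= query_hi

        if overlapping:
            node.prepare_to_receive_results()
            node.forward_query(overlapping)      # only matching subtrees see the query
        if applies_locally:
            node.start_query_locally()
        # If neither condition holds, the query is simply forgotten at this node.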
   Building an SRT is a two-phase process: first the SRT build request is flooded
(retransmitted by every mote until all motes have heard the request) down
the network. This request includes the name of the attribute A over which
the tree should be built. As a request floods down the network, a node n may
have several possible choices of parent, since, in general, many nodes in radio
range may be closer to the root. If n has children, it forwards the request on
to them and waits until they reply. If n has no children, it chooses a node p




Fig. 8. A semantic routing tree in use for a query. Gray arrows indicate flow of the query down
the tree; gray nodes must produce or forward results in the query.


from available parents to be its parent, and then reports the value of A to p in
a parent selection message. If n does have children, it records each child's value
of A along with that child's id. When it has heard from all of its children, it chooses a
parent and sends a selection message indicating the range of values of A which
it and its descendents cover. The parent records this interval with the id of the
child node and proceeds to choose its own parent in the same manner, until the
root has heard from all of its children. Because children can fail or move away,
nodes also have a timeout which is the maximum time they will wait to hear
from a child; after this period has elapsed, the child is removed from the child
list. If the child reports after this timeout, it is incorporated into the SRT as if
it were a new node (see Section 5.2 below).
   Figure 8 shows an SRT over the X coordinate of each node on a Cartesian
grid. The query arrives at the root, is forwarded down the tree, and then only
the gray nodes are required to participate in the query (note that node 3 must
forward results for node 4, despite the fact that its own location precludes it
from participation).
   SRTs are analogous to indices in traditional database systems; to create one
in TinyDB, the CREATE SRT command can be used—its syntax is similar to
that of the CREATE INDEX command in SQL:

          CREATE SRT loc ON sensors (xloc,yloc) ROOT 0,

where the ROOT annotation indicates the nodeid where the SRT should be rooted
from—by default, the value will be 0, but users may wish to create SRTs rooted
at other nodes to facilitate event-based queries that frequently radiate from a
particular node.

5.2 Maintaining SRTs
Even though SRTs are limited to constant attributes, some SRT maintenance
must occur. In particular, new nodes can appear, link qualities can change, and
existing nodes can fail.

   Both node appearances and changes in link quality can require a node to
switch parents. To do this, the node sends a parent selection message to its
new parent, n. If this message changes the range of n’s interval, it notifies its
parent; in this way, updates can propagate to the root of the tree.
   To handle the disappearance of a child node, parents associate an active query
id and last epoch with every child in the SRT (recall that an epoch is the period
of time between successive samples). When a parent p forwards a query q to
a child c, it sets c’s active query id to the id of q and sets its last epoch entry
to 0. Every time p forwards or aggregates a result for q from c, it updates c’s
last epoch with the epoch on which the result was received. If p does not hear
c for some number of epochs t, it assumes c has moved away, and removes its
SRT entry. Then, p sends a request asking its remaining children to retransmit
their ranges. It uses this information to construct a new interval. If this new
interval differs in size from the previous interval, p sends a parent selection
message up the routing tree to reflect this change. We study the costs of SRT
maintenance in Section 5.4 below.
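   A compact sketch of this bookkeeping (with illustrative names, not TinyDB source)
follows.

    def record_result(parent, child, query_id, epoch):
        entry = parent.srt[child]
        entry.active_query_id = query_id
        entry.last_epoch = epoch                  # child was heard on this epoch

    def expire_children(parent, current_epoch, t):
        for child, entry in list(parent.srt.items()):
            if current_epoch - entry.last_epoch > t:
                del parent.srt[child]             # assume the child has moved away
                parent.request_range_retransmission()
                new_interval = parent.recompute_interval()
                if new_interval != parent.interval:
                    parent.interval = new_interval
                    parent.send_parent_selection_message(new_interval)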
   Finally, we note that, by using these maintenance rules, it is possible to
support SRTs over nonconstant attributes, although if those attributes change
quickly, the cost of propagating interval-range changes could be prohibitive.


5.3 Evaluation of Benefit of SRTs
The benefit that an SRT provides is dependent on the quality of the clustering
of children beneath parents. If the descendents of some node n are clustered
around the value of the index attribute at n, then a query that applies to n will
likely also apply to its descendents. This can be expected for location attributes,
for example, since network topology is correlated with geography.
   We simulate the benefits of an SRT because large networks of the type where
we expect these data structures to be useful are just beginning to come online, so
only a small number of fixed real-world topologies are available. We include in
our simulation experiments a connectivity data file collected from one
such real-world deployment. We evaluate the benefit of SRTs in terms of number
of active nodes; inactive nodes incur no cost for a given query, expending energy
only to keep their processors in an idle state and to listen to their radios for the
arrival of new queries.
   We study three policies for SRT parent selection. In the first, random ap-
proach, each node picks a random parent from the nodes with which it can
communicate reliably. In the second, closest-parent approach, each parent re-
ports the value of its index attribute with the SRT-build request, and children
pick the parent whose attribute value is closest to their own. In the clustered
approach, nodes select a parent as in the closest-parent approach, except, if a
node hears a sibling node send a parent selection message, it snoops on the
message to determine its sibling's parent and value. It then picks its own par-
ent (which could be the same as one of its siblings) to minimize the spread of
attribute values underneath all of its available parents.
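   A minimal sketch of the clustered heuristic follows (ours; it measures spread as the
max-minus-min of the values known under each candidate parent, one plausible reading
of the description above).

    def choose_clustered_parent(known_values, my_value):
        # known_values[p] holds the attribute values this node has snooped from
        # siblings that already chose parent p, plus p's own reported value.
        def spread_if_chosen(parent):
            total = 0.0
            for p, vals in known_values.items():
                vals = list(vals) + ([my_value] if p == parent else [])
                total += (max(vals) - min(vals)) if vals else 0.0
            return total
        # Pick the candidate that minimizes the spread of values under all parents.
        return min(known_values, key=spread_if_chosen)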
   We studied these policies in a simple simulation environment—nodes were
arranged on an n × n grid and were asked to choose a constant attribute value

from some distribution (which we varied between experiments). We used a per-
fect (lossless) connectivity model where each node could talk to its immediate
neighbors in the grid (so routing trees were n nodes deep), and each node had
eight neighbors (with three choices of parent, on average). We compared the
total number of nodes involved in range queries of different sizes for the three
SRT parent selection policies to the best-case approach and the no SRT ap-
proach. The best-case approach would only result if exactly those nodes that
overlapped the range predicate were activated, which is not possible in our
topologies but provides a convenient lower bound. In the no SRT approach, all
nodes participate in each query.
   We experimented with several sensor-value distributions. In the ran-
dom distribution, each constant attribute value was randomly and uniformly
selected from the interval [0, 1000]. In the geographic distribution, (one-
dimensional) sensor values were computed based on a function of a node’s x
and y position in the grid, such that a node’s value tended to be highly corre-
lated to the values of its neighbors.
   Finally, for the real distribution, we used a network topology based on data
collected from a network of 54 motes deployed throughout the Intel-Research,
Berkeley lab. The SRT was built over the node’s location in the lab, and the
network connectivity was derived by identifying pairs of motes with a high
probability of being able to successfully communicate with each other.14
   Figure 9 shows the number of nodes that participate in queries over variably-
sized query intervals (where the interval size is shown on the x axis) of the at-
tribute space in a 20 × 20 grid. The interval for queries was randomly selected
from the uniform distribution. Each point in the graph was obtained by averag-
ing over five trials for each of the three parent selection policies in each of the
sensor value distributions (for a total of 30 experiments). For each interval size
s, 100 queries were randomly constructed, and the average number of nodes
involved in each query was measured.
   For all three distributions, the clustered approach was superior to other SRT
algorithms, beating the random approach by about 25% and the closest parent
approach by about 10% on average. With the geographic and real distributions,
the performance of the clustered approach is close to optimal: for most ranges,
all of the nodes in the range tend to be colocated, so few intermediate nodes
are required to relay information for queries in which they themselves are
not participating. The fact that the results from the real topology closely match
those from the geographic distribution, where sensors' values and topology are perfectly
correlated, is encouraging and suggests that SRTs will work well in practice.
   Figure 10 shows several visualizations of the topologies which are generated
by the clustered (Figure 10(a)) and random (Figure 10(b)) SRT generation ap-
proaches for an 8×8 network. Each node represents a sensor, labeled with its ID
and the distribution of the SRT subtree rooted underneath it. Edges represent
the routing tree. The gray nodes represent the nodes that would participate in

14 The probability threshold in this case was 25%, which is the same as the probability the
TinyOS/TinyDB routing layer uses to determine if a neighboring node is of sufficiently high quality
to be considered as a candidate parent.





Fig. 9. Number of nodes participating in range queries of different sizes for different parent selec-
tion policies in a semantic routing tree (20 × 20 grid, 400 nodes, each point average of 500 queries
of the appropriate size). The three graphs represent three different sensor-value distributions; see
the text for a description of each of these distribution types.


the query 400 < A < 500. On this small grid, the two approaches perform sim-
ilarly, but the variation in structure which results is quite evident—the random
approach tends to be of more uniform depth, whereas the clustered approach
leads to longer sequences of nodes with nearby values. Note that the labels in
this figure are not intended to be readable—the important point is the overall
pattern of nodes that are explored by the two approaches.

5.4 Maintenance Costs of SRTs
As the previous results show, the benefit of using an SRT can be substantial.
There are, however, maintenance and construction costs associated with SRTs,
as discussed above. Construction costs are comparable to those in conventional
sensor networks (which also have a routing tree), but slightly higher due to




Fig. 10. Visualizations of the (a) clustered and (b) random topologies, with a query region overlaid
on top of them. Node 0, the root in Figures 10(a) and 10(b), is at the center of the graph.


the fact that parent selection messages are explicitly sent, whereas parents do
not always require confirmation from their children in other sensor network
environments.
   We conducted an experiment to measure the cost of selecting a new parent,
which requires a node to notify its old parent of its decision to move and send
its attribute value to its new parent. Both the new and old parent must then
update their attribute interval information and propagate any changes up the
tree to the root of the network. In this experiment, we varied the probability
with which any node switches parents on any given epoch from 0.001 to 0.2. We
did not constrain the extent of the query in this case—all nodes were assumed
to participate. Nodes were allowed to move from their current parent to an
arbitrary new parent, and multiple nodes could move on a given epoch. The
experimental parameters were the same as above. We measured the average
number of maintenance messages generated by movement across the whole
network. The results are shown in Figure 11. Each point represents the average
of five trials, and each trial consists of 100 epochs. The three lines represent
the three policies; the amount of movement varies along the x axis, and the
number of maintenance messages per epoch is shown on the y axis.
   Without maintenance, each active node (within the query range) sends one
message per epoch, instead of every node being required to transmit. Figure 11
suggests that for low movement rates, the maintenance costs of the SRT ap-
proach are small enough that it remains attractive—if 1% of the nodes move on
a given epoch, the cost is about 30 messages, which is substantially less than




Fig. 11. Maintenance costs (in measured network messages) for different SRT parent selection
policies with varying probabilities of node movement. Probabilities and costs are per epoch. Each
point is the average of five runs, where each run is 100 epochs long.


the number of messages saved by using an SRT for most query ranges. If 10% of
the nodes move, the maintenance cost grows to about 300, making the benefit
of SRTs less clear.
   To measure the amount of movement expected in practice, we measured
movement rates in traces collected from two real-world monitoring deploy-
ments; in both cases, the nodes were stationary but employed a routing al-
gorithm that attempted to select the best parent over time. In the 3-month,
200-node Great Duck Island deployment, nodes switched parents between suc-
cessive result reports with a 0.9% (σ = 0.9%) chance, on average. In the 54-node
Intel-Berkeley lab dataset, nodes switched with a 4.3% (σ = 3.0%) chance. Thus,
the amount of parent switching varies markedly from deployment to deploy-
ment. One reason for the variation is that the two deployments use different
routing algorithms. In the case of the Intel-Berkeley deployment, the algorithm
was apparently not optimized to minimize the likelihood of switching.
   Figure 11 also shows that the different schemes for building SRTs result
in different maintenance costs. This is because the average depth of nodes in
the topologies varies from one approach to the other (7.67 in Random, 10.47 in
Closest, and 9.2 in Clustered) and because the spread of values underneath a
particular subtree varies depending on the approach used to build the tree. A
deeper tree generally results in more messages being sent up the tree as path
lengths increase. The closest parent scheme results in deep topologies because
no preference is given towards parents with a wide spread of values, unlike the
clustered approach which tends to favor selecting a parent that is a member of
a pre-existing, wide interval. The random approach is shallower still because
nodes simply select the first parent that broadcasts, resulting in minimally
deep trees.
   Finally, we note that the cost of joining the network is strictly dominated by
the cost of moving parents, as there is no old parent to notify. Similarly, a node

disappearing is dominated by this movement cost, as there is no new parent to
notify.

5.5 SRT Observations
SRTs provide an efficient mechanism for disseminating queries and collect-
ing query results for queries over constant attributes. For attributes that are
highly correlated amongst neighbors in the routing tree (e.g., location), SRTs
can reduce the number of nodes that must disseminate queries and forward the
continuous stream of results from children by nearly an order of magnitude.
SRTs have a substantial advantage over a centralized index structure in that
they do not require complete topology and sensor-value information to be
collected at the root of the network; such information would be expensive to
collect and difficult to keep consistent as connectivity and sensor values change.
   SRT maintenance costs appear to be reasonable for at least some real-world
deployments. Interestingly, unlike traditional routing trees in sensor networks,
there is a substantial cost (in terms of network messages) for switching parents
in an SRT. This suggests that one metric by which routing layer designers might
evaluate their implementations is the rate of parent switching.
   For real-world deployments, we expect that SRTs will offer substantial ben-
efits. Although there are no benchmarks or definitive workloads for sensor net-
work databases, we anticipate that many queries will be over narrow geographic
areas—looking, for example, at single rooms or floors in a building, or nests,
trees, or regions, in outdoor environments as on Great Duck Island; other re-
searchers have noted the same need for constrained querying [Yao and Gehrke
2002; Mainwaring et al. 2002]. In a deployment like the Intel-Berkeley lab, if
queries are over individual rooms or regions of the lab, Figure 9 shows that
substantial performance gains can be had. For example, 2 of the 54 motes are
in the main conference room; 7 of the 54 are in the seminar area; both of these
queries can be evaluated using less than 30% of the network.
   We note two promising future extensions to SRTs. First, rather than storing
just a single interval at every subtree, a variable number of intervals could be
kept. This would allow nodes to more accurately summarize the range of values
beneath them, and increase the benefit of the approach. Second, when selecting
a parent, even in the clustered approach, nodes do not currently have access
to complete information about the subtree underneath a potential parent, par-
ticularly as nodes move in the network or come and go. It would be interesting
to explore a continuous SRT construction process, where parents periodically
broadcast out updated intervals, giving current and potential children an option
to move to a better subtree and improve the quality of the SRT.

6. PROCESSING QUERIES
Once queries have been disseminated and optimized, the query processor be-
gins executing them. Query execution is straightforward, so we describe it only
briefly. The remainder of the section is devoted to the ACQP-related issues of
prioritizing results and adapting sampling and delivery rates. We present sim-
ple schemes for prioritizing data in selection queries, briefly discuss prioritizing

data in aggregation queries, and then turn to adaptation. We discuss two situ-
ations in which adaptation is necessary: when the radio is highly contended and
when power consumption is more rapid than expected.

6.1 Query Execution
Query execution consists of a simple sequence of operations at each node during
every epoch: first, nodes sleep for most of an epoch; then they wake, sample
sensors, apply operators to data generated locally and received from neighbors,
and then deliver results to their parent. We (briefly) describe ACQP-relevant
issues in each of these phases.
   Nodes sleep for as much of each epoch as possible to minimize power con-
sumption. They wake up only to sample sensors and relay and deliver results.
Because nodes are time synchronized, parents can ensure that they awake to
receive results when a child tries to propagate a message.15 The amount of
time, tawake , that a sensor node must be awake to successfully accomplish the
latter three steps above is largely dependent on the number of other nodes
transmitting in the same radio cell, since only a small number of messages per
second can be transmitted over the single shared radio channel. We discuss the
communication scheduling approach in more detail in the next section.
   TinyDB uses a simple algorithm to scale tawake based on the neighborhood
size, which is measured by snooping on traffic from neighboring nodes. Note,
however, that there are situations in which a node will be forced to drop or com-
bine results as a result of either tawake or the sample interval being too short
to perform all needed computation and communication. We discuss policies for
choosing how to aggregate data and which results to drop in Section 6.3.
   Once a node is awake, it begins sampling and filtering results according to the
plan provided by the optimizer. Samples are taken at the appropriate (current)
sample rate for the query, based on lifetime computations and information about
radio contention and power consumption (see Section 6.4 for more information
on how TinyDB adapts sampling in response to variations during execution).
Filters are applied and results are routed to join and aggregation operators
further up the query plan.
   Finally, we note that in event-based queries, the ON EVENT clause must be
handled specially. When an event fires on a node, that node disseminates the
query, specifying itself as the query root. This node collects query results, and
delivers them to the basestation or a local materialization point.

   6.1.1 Communication Scheduling and Aggregate Queries. When process-
ing aggregate queries, some care must be taken to coordinate the times when
parents and children are awake, so that parent nodes have access to their chil-
dren’s readings before aggregating. The basic idea is to subdivide the epoch into
a number of intervals, and assign nodes to intervals based on their position in
the routing tree. Because this mechanism makes relatively efficient use of the

15 Of course, there is some imprecision in time synchronization between devices. In general, we can
tolerate a fair amount of imprecision by introducing a buffer period, such that parents wake up
several milliseconds before and stay awake several milliseconds longer than their children.





Fig. 12. Partial state records flowing up the tree during an epoch using interval-based
communication.

radio channel and has good power consumption characteristics, TinyDB uses
this scheduling approach for all queries (not just aggregates).
   In this slotted approach, each epoch is divided into a number of fixed-length
time intervals. These intervals are numbered in reverse order such that interval
1 is the last interval in the epoch. Then, each node is assigned to the interval
equal to its level, or number of hops from the root, in the routing tree. In the
interval preceding their own, nodes listen to their radios, collecting results
from any child nodes (which are one level below them in the tree, and thus
communicating in this interval). During a node’s interval, if it is aggregating,
it computes the partial state record consisting of the combination of any child
values it heard with its own local readings. After this computation, it transmits
either its partial state record or raw sensor readings up the network. In this
way, information travels up the tree in a staggered fashion, eventually reaching
the root of the network during interval 1.
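   The interval arithmetic is simple; the sketch below (ours) converts a node's depth
into its listen and transmit windows within an epoch, under the reverse numbering
just described.

    def schedule_for_node(depth, num_intervals, epoch_len):
        slot_len = epoch_len / num_intervals
        tx_interval = depth          # transmit during the interval equal to our level
        rx_interval = depth + 1      # listen while our children transmit
        # Intervals are numbered in reverse, so interval i starts at
        # (num_intervals - i) * slot_len seconds into the epoch.
        tx_start = (num_intervals - tx_interval) * slot_len
        rx_start = (num_intervals - rx_interval) * slot_len
        return rx_start, tx_start

    # Example: in a 4-level tree with a 4 s epoch, a level-3 node listens during
    # [0 s, 1 s) and transmits its partial state record during [1 s, 2 s).
    print(schedule_for_node(depth=3, num_intervals=4, epoch_len=4.0))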
   Figure 12 illustrates this in-network aggregation scheme for a simple COUNT
query that reports the number of nodes in the network. In the figure, time
advances from left to right, and different nodes in the communication topology
are shown along the y axis. Nodes transmit during the interval corresponding
to their depth in the tree, so H, I, and J transmit first, during interval 4,
because they are at level 4. Transmissions are indicated by arrows from sender
to receiver, and the numbers in circles on the arrows represent COUNTs contained
within each partial state record. Readings from these three nodes are combined,
via the COUNT merging function, at nodes G and F, both of which transmit new
partial state records during interval 3. Readings flow up the tree in this manner
until they reach node A, which then computes the final count of 10. Notice that
motes are idle for a significant portion of each epoch so they can enter a low
power sleeping state. A detailed analysis of the accuracy and benefit of this
approach in TinyDB can be found in Madden [2003].

6.2 Multiple Queries
We note that, although TinyDB supports multiple queries running simulta-
neously, we have not focused on multiquery optimization. This means that, for
example, SRTs are shared between queries, but sample acquisition is not: if two
queries need a reading within a few milliseconds of each other, this will cause
both to acquire that reading. Similarly, there is no effort to optimize communi-
cation scheduling between queries: transmissions of one query are scheduled
independently from any other query. We hope to explore these issues as a part
of our long-term sensor network research agenda.

6.3 Prioritizing Data Delivery
Once results have been sampled and all local operators have been applied,
they are enqueued onto a radio queue for delivery to the node’s parent. This
queue contains both tuples from the local node as well as tuples that are being
forwarded on behalf of other nodes in the network. When network contention
and data rates are low, this queue can be drained faster than results arrive.
However, because the number of messages produced during a single epoch can
vary dramatically, depending on the number of queries running, the cardinality
of joins, and the number of groups and aggregates, there are situations when
the queue will overflow. In these situations, the system must decide if it should
discard the overflow tuple, discard some other tuple already in the queue, or
combine two tuples via some aggregation policy.
   The ability to make runtime decisions about the value of an individual data
item is central to ACQP systems, because the cost of acquiring and delivering
data is high, and because of these situations where the rate of data items ar-
riving at a node will exceed the maximum delivery rate. A simple conceptual
approach for making such runtime decisions is as follows: whenever the system
is ready to deliver a tuple, send the result that will most improve the “qual-
ity” of the answer that the user sees. Clearly, the proper metric for quality will
depend on the application: for a raw signal, root-mean-square (RMS) error is
a typical metric. For aggregation queries, minimizing the confidence intervals
of the values of group records could be the goal [Raman et al. 2002]. In other
applications, users may be concerned with preserving frequencies, receiving
statistical summaries (average, variance, or histograms), or maintaining more
tenuous qualities such as signal “shape.”
   Our goal is not to fully explore the spectrum of techniques available in this
space. Instead, we have implemented several policies in TinyDB to illustrate
that substantial quality improvements are possible given a particular workload
and quality metric. Generalizing concepts of quality and implementing and
exploring more sophisticated prioritization schemes remains an area of future
work.
   There is a large body of related work on approximation and compression
schemes for streams in the database literature (e.g., Garofalakis and Gibbons
[2001]; Chakrabarti et al. [2001]), although these approaches typically focus on
the problem of building histograms or summary structures over the streams
rather than trying to preserve the (in-order) signal as well as possible, which

is the goal we tackle first. Algorithms from signal processing, such as Fourier
analysis and wavelets, are likely applicable, although the extreme memory
and processor limitations of our devices and the online nature of our problem
(e.g., choosing which tuple in an overflowing queue to evict) make them tricky
to apply. We have begun to explore the use of wavelets in this context; see
Hellerstein et al. [2003] for more information on our initial efforts.

   6.3.1 Policies for Selection Queries. We begin with a comparison of three
simple prioritization schemes, naive, winavg, and delta, for simple selection
queries, turning our attention to aggregate queries in the next section. In
the naive scheme no tuple is considered more valuable than any other, so the
queue is drained in a FIFO manner and tuples are dropped if they do not fit in
the queue.
   The winavg scheme works similarly, except that instead of dropping results
when the queue fills, the two results at the head of the queue are averaged to
make room for new results. Since the head of the queue is now an average of
multiple records, we associate a count with it.
   In the delta scheme, a tuple is assigned an initial score relative to its differ-
ence from the most recent (in time) value successfully transmitted from this
node, and at each point in time, the tuple with the highest score is delivered.
The tuple with the lowest score is evicted when the queue overflows. Out of
order delivery (in time) is allowed. This scheme relies on the intuition that the
largest changes are probably interesting. It works as follows: when a tuple t
with timestamp T is initially enqueued and scored, we mark it with the times-
tamp R of the most recently delivered tuple r. Since tuples can be delivered
out of order, it is possible that a tuple with a timestamp between R and T could
be delivered next (indicating that r was delivered out of order), in which case
the score we computed for t as well as its R timestamp are now incorrect. Thus,
in general, we must rescore some enqueued tuples after every delivery. The
delta scheme is similar to the value-deviation metric used in Garofalakis and
Gibbons [2001] for minimizing deviation between a source and a cache although
value-deviation does not include the possibility of out of order delivery.
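   One possible realization of the delta queue is sketched below (ours; the paper does
not specify the exact data structure). Scores are recomputed on demand, which also
handles the rescoring needed after out-of-order deliveries.

    class DeltaQueue:
        def __init__(self, capacity=5):
            self.capacity = capacity
            self.queue = []          # pending (timestamp, value) tuples
            self.delivered = []      # already-delivered tuples, kept sorted by time

        def _score(self, ts, value):
            # Difference from the most recent (in time) delivered value before ts.
            prior = [v for (t, v) in self.delivered if t < ts]
            return abs(value - prior[-1]) if prior else float("inf")

        def enqueue(self, ts, value):
            self.queue.append((ts, value))
            if len(self.queue) > self.capacity:
                # Evict the lowest-scoring tuple when the queue overflows.
                self.queue.remove(min(self.queue, key=lambda e: self._score(*e)))

        def deliver(self):
            if not self.queue:
                return None
            # Deliver the highest-scoring tuple; out-of-order delivery is allowed.
            best = max(self.queue, key=lambda e: self._score(*e))
            self.queue.remove(best)
            self.delivered.append(best)
            self.delivered.sort()
            return best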
   We compared these three approaches on a single mote running TinyDB. To
measure their effect in a controlled setting, we set the sample rate to be a
fixed factor K faster than the maximum delivery rate (such that 1 of every
K tuples was delivered, on average) and compared their performance against
several predefined sets of sensor readings (stored in the EEPROM of the device).
In this case, delta had a buffer of 5 tuples; we performed reordering of out of
order tuples at the basestation. To illustrate the effect of winavg and delta,
Figure 13 shows how delta and winavg approximate a high-periodicity trace
of sensor readings generated by a shaking accelerometer. Notice that delta is
considerably closer in shape to the original signal in this case, as it tends to
emphasize extremes, whereas average tends to dampen them.
   We also measured RMS error for this signal as well as two others: a square
wave-like signal from a light sensor being covered and uncovered, and a slow
sinusoidal signal generated by moving a magnet around a magnetometer.
The error for each of these signals and techniques is shown in Table V.




Fig. 13. An acceleration signal (top) approximated by a delta (middle) and an average (bottom),
K = 4.
                     Table V. RMS Error for Different Prioritization Schemes
                      and Signals (1000 Samples, Sample Interval = 64 ms)
                               Accel.    Light (Step)    Magnetometer (Sinusoid)
                    Winavg       64          129                    54
                    Delta        63           81                    48
                    Naive        77          143                    63


Although delta appears to match the shape of the acceleration signal better,
its RMS value is about the same as average’s (due to the few peaks that delta
incorrectly merges together). Delta outperforms both other approaches for the
fast-changing step functions in the light signal because it does not smooth
edges as much as average.
   We now turn our attention to result prioritization for aggregate queries.
   6.3.2 Policies for Aggregate Queries. The previous section focused on pri-
oritizing result collection in simple selection queries. In this section, we look
instead at aggregate queries, illustrating a class of snooping based techniques
first described in the TAG system [Madden et al. 2002a] that we have imple-
mented for TinyDB. We consider aggregate queries of the form
              SELECT f agg (a1 )
                FROM sensors
                GROUP BY a2
                SAMPLE PERIOD x

   Recall that this query computes the value of f agg applied to the value of a1
produced by each device every x seconds.
   Interestingly, for queries with few or no groups, there is a simple technique
that can be used to prioritize results for several types of aggregates. This tech-
nique, called snooping, allows nodes to locally suppress their own aggregate values
by listening to the answers that neighboring nodes report and exploiting the se-
mantics of aggregate functions; it is also used in Madden et al. [2002a]. Note
that this snooping can be done for free due to the broadcast nature of the radio
channel. Consider, for example, a MAX query over some attribute a—if a node n




Fig. 14. Snooping reduces the data nodes must send in aggregate queries. Here node 2’s value can
be suppressed if it is less than the maximum value snooped from nodes 3, 4, and 5.

hears a value of a greater than its own locally computed partial MAX, it knows
that its local record is low priority, and assigns it a low score or suppresses it
altogether. Conversely, if n hears many neighboring partial MAXs over a that
are less than its own partial aggregate value, it knows that its local record is
more likely to be a maximum, and assigns it a higher score.
   Figure 14 shows a simple example of snooping for a MAX query—node 2 can
score its own MAX value very low when it hears a MAX from node 3 that is larger
than its own.
   This basic technique applies to all monotonic, exemplary aggregates: MIN,
MAX, TOP-N, etc., since it is possible to deterministically decide whether a particu-
lar local result could appear in the final answer output at the top of the network.
For dense network topologies where there is ample opportunity for snooping,
this technique produces a dramatic reduction in communication, since at every
intermediate point in the routing tree, only a small number of nodes' values
will actually need to be transmitted.
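   The scoring rule for a MAX query can be sketched as follows (ours; the exact scores
TinyDB assigns are not given here):

    def score_partial_max(local_max, snooped_values):
        if any(v > local_max for v in snooped_values):
            return 0.0               # cannot be the global MAX: suppress or deprioritize
        beaten = sum(1 for v in snooped_values if v <= local_max)
        # The more neighbors this node beats, the more likely it holds the maximum.
        return 1.0 + beaten

    # Example in the spirit of Figure 14: a node that hears a larger MAX from a
    # neighbor scores its own value 0, so it need not transmit.
    print(score_partial_max(local_max=40, snooped_values=[55, 30, 22]))   # -> 0.0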
   It is also possible to glean some information from snooping in other aggre-
gates as well—for example, in an AVERAGE query, nodes may rank their own
results lower if they hear many siblings with similar sensor readings. For this
approach to work, parents must cache a count of recently heard children and
assume children who do not send a value for an average have the same value as
the average of their siblings’ values, since otherwise outliers will be weighted
disproportionately. This technique of assuming that missing values are the
same as the average of other reported values can be used for many summary
statistics: variance, sum, and so on. Exploring more sophisticated prioritization
schemes for aggregate queries is an important area of future work.
   In the previous sections, we demonstrated how prioritization of results can
be used to improve the overall quality of the data that are transmitted to the root
when some results must be dropped or aggregated. Choosing the proper policies
to apply in general, and understanding how various existing approximation and
prioritization schemes map into ACQP is an important future direction.

6.4 Adapting Rates and Power Consumption
We saw in the previous sections how TinyDB can exploit query semantics
to transmit the most relevant results when limited bandwidth or power is




                  Fig. 15. Per-mote sample rate versus aggregate delivery rate.

available. In this section, we discuss selecting and adjusting sampling and
transmission rates to limit the frequency of network-related losses and fill rates
of queues. This adaptation is the other half of the runtime techniques in ACQP:
because the system can adjust rates, significant reductions can be made in the
frequency with which data prioritization decisions must be made. These tech-
niques are simply not available in non-acquisitional query processing systems.
   When initially optimizing a query, TinyDB’s optimizer chooses a trans-
mission and sample rate based on current network load conditions, and re-
quested sample rates and lifetimes. However, static decisions made at the
start of query processing may not be valid after many days running the
same continuous query. Just as adaptive query processing techniques like ed-
dies [Avnur and Hellerstein 2000], Tukwila [Ives et al. 1999], and Query Scram-
bling [Urhan et al. 1998] dynamically reorder operators as the execution envi-
ronment changes, TinyDB must react to changing conditions—however, unlike
in previous adaptive query processing systems, failure to adapt in TinyDB can
cripple the system, reducing data flow to a trickle or causing the system to
severely miss power budget goals.
   We study the need for adaptivity in two contexts: network contention and
power consumption. We first examine network contention. Rather than simply
assuming that a specific transmission rate will result in a relatively uncontested
network channel, TinyDB monitors channel contention and adaptively reduces
the number of packets transmitted as contention rises. This backoff is very
important: as the four motes line of Figure 15 shows, if several nodes try to
transmit at high rates, the total number of packets delivered is substantially
less than if each of those nodes tries to transmit at a lower rate. Compare this
line with the performance of a single node: because there is no contention, a
single node does not exhibit the same falling off (although the percentage of
successfully delivered packets does decline). Finally, the four motes adaptive
line does not show the same precipitous drop in performance because it is able
to monitor the network channel and adapt to contention.
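   TinyDB's backoff policy is not reproduced here; the following illustrative sketch,
with made-up contention thresholds and scaling factors, shows the general shape of
such an adaptation loop.

    def adapt_delivery_rate(current_rate, contention, min_rate=0.25, max_rate=16.0):
        # contention: observed fraction of failed or deferred transmission attempts.
        if contention > 0.5:
            current_rate *= 0.5      # heavy contention: back off aggressively
        elif contention > 0.25:
            current_rate *= 0.8      # moderate contention: back off gently
        else:
            current_rate *= 1.1      # quiet channel: probe upward again
        return max(min_rate, min(max_rate, current_rate))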
   Note that the performance of the adaptive approach is slightly worse than that
of the nonadaptive approach at four and eight samples per second, as backoff begins




Fig. 16. Comparison of delivered values (top and middle) versus actual readings (bottom) from two motes (left
and right) sampling at 16 packets per second and sending simultaneously. Four motes were com-
municating simultaneously when this data was collected.

to throttle communication in this regime. However, when we compared the
percentage of successful transmission attempts at eight packets per second,
the adaptive scheme achieves twice the success rate of the nonadaptive scheme,
suggesting the adaptation is still effective in reducing wasted communication
effort, despite the lower utilization.
   The problem with reducing the transmission rate is that it will rapidly cause
the network queue to fill, forcing TinyDB to discard tuples using the semantic
techniques for victim selection presented in Section 6.3 above. We note, however,
that had TinyDB not chosen to slow its transmission rate, fewer total packets
would have been delivered. Furthermore, by choosing which packets to drop
using semantic information derived from the queries (rather than losing some
random sample of them), TinyDB is able to substantially improve the quality of
results delivered to the end user. To illustrate this in practice, we ran a selection
query over four motes running TinyDB, asking them each to sample data at 16
samples per second, and compared the quality of the delivered results using an
adaptive-backoff version of our delta approach to results over the same dataset
without adaptation or result prioritization. We show here traces from two of the
nodes on the left and right of Figure 16. The top plots show the performance of
the adaptive delta, the middle plots show the nonadaptive case, and the bottom
plots show the original signals (which were stored in EEPROM to allow
repeatable trials). Notice that the delta scheme does substantially better in
both cases.

   6.4.1 Measuring Power Consumption. We now turn to the problem of
adapting tuple delivery rates to meet specific lifetime requirements in response
to incorrect sample rates computed at query optimization time (see Section 3.6).

We first note that, using the computations shown in Section 3.6, it is possible to
compute a predicted battery voltage for a time t seconds into processing a query.
   The system can then compare its current voltage to this predicted voltage. By
assuming that voltage decays linearly we can reestimate the power consump-
tion characteristics of the device (e.g., the costs of sampling, transmitting, and
receiving) and then rerun our lifetime calculation. By reestimating these pa-
rameters, the system can ensure that this new lifetime calculation tracks the
actual lifetime more closely.
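   The exact correction formula is not given; one plausible sketch, assuming linear
voltage decay, rescales the per-operation cost estimates by the ratio of observed to
predicted voltage drop before rerunning the lifetime computation of Section 3.6.

    def reestimate_costs(v_initial, v_predicted, v_measured, cost_estimates):
        predicted_drop = v_initial - v_predicted
        actual_drop = v_initial - v_measured
        if predicted_drop <= 0:
            return cost_estimates                 # nothing to correct yet
        correction = actual_drop / predicted_drop # > 1: draining faster than expected
        # Scale the per-operation costs (sampling, transmitting, receiving, ...).
        return {op: cost * correction for op, cost in cost_estimates.items()}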
   Although this calculation and reoptimization are straightforward, they serve
an important role by allowing TinyDB motes to satisfy occasional ad hoc queries
and relay results for other nodes without compromising lifetime goals of long-
running monitoring queries.
   Finally, we note that incorrect measurements of power consumption may also
be due to incorrect estimates of the cost of various phases of query processing,
or may be as a result of incorrect selectivity estimation. We cover both by tuning
sample rate. As future work, we intend to explore adaptation of optimizer esti-
mates and ordering decisions (in the spirit of other adaptive work [Hellerstein
et al. 2000]) and the effect of frequency of reestimation on lifetime.

7. SUMMARY OF ACQP TECHNIQUES
This completes our discussion of the novel issues and techniques that arise
when taking an acquisitional perspective on query processing. In summary, we
first discussed important aspects of an acquisitional query language, introduc-
ing event and lifetime clauses for controlling when and how often sampling
occurs. We then discussed query optimization with the associated issues of
modeling sampling costs and ordering of sampling operators. We showed how
event-based queries can be rewritten as joins between streams of events and
sensor samples. Once queries have been optimized, we demonstrated the use
of semantic routing trees as a mechanism for efficiently disseminating queries
and collecting results. Finally, we showed the importance of prioritizing data
according to quality and discussed the need for techniques to adapt the trans-
mission and sampling rates of an ACQP system. Table VI lists the key new
techniques we introduced, summarizing what queries they apply to and when
they are most useful.

8. RELATED WORK
There has been some recent work in the database and systems commu-
nities on query processing in sensor networks [Intanagonwiwat et al. 2000;
Madden et al. 2002a; Bonnet et al. 2001; Madden and Franklin 2002; Yao and
Gehrke 2002]. These articles noted the importance of power sensitivity. Their
predominant focus to date has been on in-network processing—that is, the push-
ing of operations, particularly selections and aggregations, into the network to
reduce communication. We too endorse in-network processing, but believe that,
for a sensor network system to be truly power sensitive, acquisitional issues of
when, where, and in what order to sample and which samples to process must
be considered. To our knowledge, no prior work addresses these issues.

         Table VI. Summary of Acquisitional Query Processing Techniques in TinyDB
Technique (Section)                                               Summary
Event-based queries (3.5)                   Avoid polling overhead
Lifetime queries (3.6)                      Satisfy user-specified longevity constraints
Interleaving acquisition/predicates (4.2)   Avoid unnecessary sampling costs in selection
                                              queries
Exemplary aggregate pushdown (4.2.1)        Avoid unnecessary sampling costs in aggregate
                                              queries
Event batching (4.3)                        Avoid execution costs when a number of event
                                              queries fire
SRT (5.1)                                   Avoid query dissemination costs or the inclusion of
                                              unneeded nodes in queries with predicates over
                                              constant attributes
Communication scheduling (6.1.1)            Disable node’s processors and radios during times of
                                              inactivity
Data prioritization (6.3)                   Choose most important samples to deliver according
                                              to a user-specified prioritization function
Snooping (6.3.2)                            Avoid unnecessary transmissions during aggregate
                                              queries
Rate adaptation (6.4)                       Intentionally drop tuples to avoid saturating the
                                              radio channel, allowing most important tuples to
                                              be delivered



   There is a small body of work related to query processing in mobile environ-
ments [Imielinski and Badrinath 1992; Alonso and Korth 1993]. This work has
been concerned with laptop-like devices that are carried with the user, can be
readily recharged every few hours, and, with the exception of a wireless network
interface, basically have the capabilities of a wired, powered PC. Lifetime-based
queries, notions of sampling and its associated costs, and runtime issues regarding
rates and contention were not considered. Many of the proposed techniques, as
well as more recent work on moving object databases (such as Wolfson et al.
[1999]), focus on the highly mobile nature of devices, a situation we are not (yet)
dealing with, but which could certainly arise in sensor networks.
   Power-sensitive query optimization was proposed in Alonso and Ganguly
[1993], although, as with the previous work, the focus was on optimizing costs
in traditional mobile devices (e.g., laptops and palmtops), so concerns about the
cost and ordering of sampling did not appear. Furthermore, laptop-style devices
typically do not offer the same degree of rapid power-cycling that is available
on embedded platforms like motes. Even if they did, their interactive, user-
oriented nature makes it undesirable to turn off displays, network interfaces,
etc., because they are doing more than simply collecting and processing data,
so there are many fewer power optimizations that can be applied.
   Building an SRT is analogous to building an index in a conventional database
system. Due to the resource limitations of sensor networks, the actual index-
ing implementations are quite different. See Kossmann [2000] for a survey of
relevant research on distributed indexing in conventional database systems.
There is also some similarity to indexing in peer-to-peer systems [Crespo and
Garcia-Molina 2002]. However, peer-to-peer systems differ in that they are
inexact and not subject to the same paucity of communications or storage
infrastructure as sensor networks, so algorithms tend to be storage and com-
munication heavy. Similar indexing issues also appear in highly mobile envi-
ronments (like Wolfson et al. [1999] or Imielinski and Badrinath [1992]), but
this work relies on centralized location servers for tracking recent positions
of objects.
   The observation that it can be beneficial to interleave the fetching of attributes
with the application of operators also arises in the context of compressed
databases [Chen et al. 2001], as decompression effectively imposes a penalty for
fetching an individual attribute, so it is beneficial to apply selections and joins
on attributes that are already decompressed or easy to decompress.
   The ON EVENT and OUTPUT ACTION clauses in our query language are similar
to constructs present in event-condition-action/active databases [Chakravarthy
et al. 1994]. There is a long tradition of such work in the database community,
and our techniques are much simpler in comparison, as we have not focused
on any of the difficult issues associated with the semantics of event composition
or with building a complete language for expressing and efficiently evaluating
the triggering of composite events. Work on systems for efficiently determining
when an event has fired, such as Hanson [1996], could be useful in TinyDB.
More recent work on continuous query systems [Liu et al. 1999; Chen et al.
2000] has described languages that provide for query processing in response
to events or at regular intervals over time. This earlier work, as well as our
own work on continuous query processing [Madden et al. 2002b], inspired the
periodic and event-driven features of TinyDB.
   Approximate and best effort caches [Olston and Widom 2002], as well as sys-
tems for online aggregation [Raman et al. 2002] and stream query processing
[Motwani et al. 2003; Carney et al. 2002], include some notion of data quality.
Most of this other work has been focused on quality with respect to summaries,
aggregates, or staleness of individual objects, whereas we focus on quality as a
measure of fidelity to the underlying continuous signal. Aurora [Carney et al.
2002] mentioned a need for this kind of metric, but proposed no specific ap-
proaches. Work on approximate query processing [Garofalakis and Gibbons
2001] has included a scheme similar to our delta approach, as well as a sub-
stantially more thorough evaluation of its merits, but did not consider out of
order delivery.


9. CONCLUSIONS AND FUTURE WORK
Acquisitional query processing provides a framework for addressing issues of
when, where, and how often data is sampled and which data is delivered in dis-
tributed, embedded sensing environments. Although other research has iden-
tified the opportunities for query processing in sensor networks, this work is
the first to discuss these fundamental issues in an acquisitional framework.
   We identified several opportunities for future research. We are currently ac-
tively pursuing two of these: first, we are exploring how query optimizer statis-
tics change in acquisitional environments and studying the role of online reop-
timization in sample rate and operator orderings in response to bursts of data
or unexpected power consumption. Second, we are pursuing more sophisticated
prioritization schemes, like wavelet analysis, that can capture salient proper-
ties of signals other than large changes (as our delta mechanism does) as well
as mechanisms to allow users to express their prioritization preferences.
   We believe that ACQP notions are of critical importance for preserving the
longevity and usefulness of any deployment of battery powered sensing devices,
such as those that are now appearing in biological preserves, roads, businesses,
and homes. Without appropriate query languages, optimization models, and
query dissemination and data delivery schemes that are cognizant of semantics
and the costs and capabilities of the underlying hardware the success of such
deployments will be limited.

APPENDIX A. POWER CONSUMPTION STUDY
This appendix details an analytical study of power consumption on a mote
running a typical data collection query.
   In this study, we assume that each mote runs a very simple query that trans-
mits one sample of (light, humidity) readings every minute. We assume each
mote also listens to its radio for 2 s per 1-min period to receive results from
neighboring devices and obtain access to the radio channel. We assume the
following hardware characteristics: a supply voltage of 3 V, an Atmega128 pro-
cessor (see the footnote to Table III for data on the processor) that can be set into
power-down mode and runs off the internal oscillator at 4 MHz, the use of the
Taos Photosynthetically Active Light Sensor [TAOS, Inc. 2002] and Sensirion
Humidity Sensor [Sensirion 2002], and a ChipCon CC1000 Radio (see text foot-
note 6 for data on this radio) transmitting at 433 MHz with 0-dBm output power
and −110-dBm receive sensitivity. We further assume the radio can make use of
its low-power sampling16 mode to reduce reception power when no other radios
are communicating, and that, on average, each node has 10 neighbors (other
motes within radio range), with one of those neighbors being a child in
the routing tree. Radio packets are 50 bytes each, with a 20-byte preamble for
synchronization. This hardware configuration represents real-world settings of
motes similar to values used in deployments of TinyDB in various environmen-
tal monitoring applications.
   The percentage of total energy used by various components is shown in
Table VII. These results show that the processor and radio together con-
sume the majority of energy for this particular data collection task. Obviously,
these numbers change as the number of messages transmitted per period in-
creases; doubling the number of messages sent increases the total power uti-
lization by about 19% as a result of the radio spending less time sampling the
channel and more time actively receiving. Similarly, if a node must send five
packets per sample period instead of one, its total power utilization rises by
about 10%.

16 This mode works by sampling the radio at a low frequency—say, once every k bit-times, where k
is on the order of 100—and extending the synchronization header, or preamble, on radio packets
to be at least k bits, such that a radio using this low-power listening approach will still detect
every packet. Once a packet is detected, the receiver begins packet reception at the normal rate.
The cost of this technique is that it increases transmission costs significantly.


         Table VII. Expected Power Consumption for Major Hardware Components, a Query
                  Reporting Light and Humidity Readings Once Every Minute
   Hardware                                   Current (mA)      Active Time (s)   % Total Energy
   Sensing, humidity                              0.50                0.34             1.43
   Sensing, light                                 0.35                1.30             3.67
   Communication, sending                        10.40                0.03             2.43
      (70 bytes @ 38.4 kbps × 2 packets)
   Communication, receive packets                  9.30               0.15            11.00
      (70 bytes @ 38.4 kbps × 10 packets)
   Communication, sampling channel                 0.07              0.86              0.31
   Processor, active                               5.00              2.00             80.68
   Processor, idle                                 0.001            58.00              0.47
                    Average current draw per second: 0.21 mA
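
   The arithmetic behind Table VII can be reproduced directly from its current and
active-time columns. The following Python sketch (ours, for illustration only) computes
each component's energy per one-minute period as supply voltage times current times
active time, using the 3-V supply stated above; the percentages and the average current
draw it prints are close to the published figures, with small differences attributable to
rounding of the component parameters.

# Recompute the Table VII energy breakdown from current draw and active time,
# assuming a 3-V supply and a 60-s (one-minute) reporting period.
SUPPLY_V = 3.0    # supply voltage (V)
PERIOD_S = 60.0   # one sample period (s)

# (component, current in mA, active time in s), taken from Table VII
components = [
    ("Sensing, humidity",               0.50,   0.34),
    ("Sensing, light",                  0.35,   1.30),
    ("Communication, sending",         10.40,   0.03),
    ("Communication, receive packets",  9.30,   0.15),
    ("Communication, sampling channel", 0.07,   0.86),
    ("Processor, active",               5.00,   2.00),
    ("Processor, idle",                 0.001, 58.00),
]

# Energy per period: E = V * I * t  (mA * s = mC; mC * V = mJ).
energies = [(name, SUPPLY_V * current * t) for (name, current, t) in components]
total_mj = sum(e for (_, e) in energies)

for name, e in energies:
    print(f"{name:34s} {100.0 * e / total_mj:6.2f} % of total energy")

# Average current draw: total charge (mA * s) spread over the 60-s period.
avg_ma = sum(current * t for (_, current, t) in components) / PERIOD_S
print(f"Average current draw per second: {avg_ma:.2f} mA")   # about 0.21 mA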


   This table does not tell the entire story, however, because the processor must
be active during sensing and communication, even though it has very little
computation to perform.17 For example, in Table VII, 1.3 s are spent waiting for
the light sensor to start and produce a sample,18 and another 0.029 s are spent
transmitting. Furthermore, the media access control (MAC) layer on the radio
introduces a delay proportional to the number of devices transmitting. To mea-
sure this delay, we examined the average delay between 1700 packet arrivals
on a network of 10 time-synchronized motes attempting to send at the same
time. The minimum interpacket arrival time was about 0.06 s; subtracting the
expected transmit time of a packet (0.007 s) suggests that, with 10 nodes, the
average MAC delay will be at least (0.06 − 0.007) × 5 = 0.265 s. Thus, of
the 2 s each mote is awake, about 1.6 s of that time is spent waiting for the
sensors or radio. The total 2-s waking period is selected to allow for variation
in MAC delays on individual sensors.
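
   The waiting-time arithmetic in this paragraph can be rechecked in a few lines; the
sketch below (ours) simply replays the constants quoted above and confirms that roughly
1.6 s of the 2-s waking period goes to waiting for the sensors and the radio.

# Rough waking-time budget for one 2-s wake period, using the measured
# constants quoted in the text (nothing is re-measured here).
min_interarrival_s = 0.06   # minimum observed interpacket arrival time
packet_tx_s = 0.007         # expected transmit time of one packet
contending_nodes = 10

# Average MAC delay estimate used in the text: (0.06 - 0.007) * 5 = 0.265 s.
mac_delay_s = (min_interarrival_s - packet_tx_s) * (contending_nodes / 2)

sensor_wait_s = 1.3         # waiting for the light sensor (Table VII)
transmit_s = 0.029          # time spent transmitting

waiting_s = sensor_wait_s + transmit_s + mac_delay_s
print(f"estimated MAC delay : {mac_delay_s:.3f} s")       # 0.265 s
print(f"time spent waiting  : {waiting_s:.2f} s of 2 s")  # about 1.6 s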
   Application computation is almost negligible for basic data collection sce-
narios: we measured application processing time by running a simple TinyDB
query that collects three data fields from the RAM of the processor (incurring
no sensing delay) and transmits them over an uncontested radio channel (in-
curring little MAC delay). We inserted into the query result a measure of the
elapsed time from the start of processing until the moment the result begins to
be transmitted. The average delay was less than 1/32 (0.03125) s, which is the
minimum resolution we could measure.
   Thus, of the 81% of energy spent on the processor, no more than 1% of its
cycles are spent in application processing. For the example given here, at least
65% of this 81% is spent waiting for sensors, and another 8% waiting for the
radio to send or receive. The remaining 26% of processing time is time to allow
for multihop forwarding of messages and as slop in the event that MAC de-
lays exceed the measured minimums given above. Summing the processor time

17 The requirement that the processor be active during these times is an artifact of the mote hard-
ware. Bluetooth radios, for example, can negotiate channel access independently of the proces-
sor. These radios, however, have significantly higher power consumption than the mote radio; see
Leopold et al. [2003] for a discussion of Bluetooth as a radio for sensor networks.
18 On motes, it is possible to start and sample several sensors simultaneously, so the delays for the
light and humidity sensors are not additive.

spent waiting to send or sending with the percent energy used by the radio
itself, we get
               (0.26 + 0.08) × 0.80 + 0.02 + 0.11 + 0.003 = 0.41.
This indicates that about 41% of power consumption in this simple data collec-
tion task is due to communication. Similarly, in this example, the percentage
of energy devoted to sensing can be computed by summing the energy spent
waiting for samples with the energy costs of sampling:
                        0.65 × 0.81 + 0.01 + 0.04 = 0.58.
Thus, about 58% of the energy in this case is spent sensing. Obviously, the total
percentage of time spent in sensing could be less if sensors that powered up
more rapidly were used. When we discussed query optimization in TinyDB in
Section 4, we saw a range of sensors with varying costs that would alter the
percentages shown here.
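
   A minimal Python sketch (ours) reproduces the two sums above from the Table VII
shares and the processor-time split just described; the 0.80 appearing in the displayed
sum versus the 0.81 processor share used here is a rounding difference.

# Attribute the Table VII energy shares to communication and sensing,
# following the two sums in the text.
processor_share = 0.81            # processor active + idle (Table VII)
radio_tx, radio_rx, radio_sample = 0.02, 0.11, 0.003
sense_humidity, sense_light = 0.01, 0.04

waiting_for_sensors = 0.65        # fraction of processor time (measured above)
waiting_for_radio = 0.26 + 0.08   # forwarding/slop plus send/receive wait

communication = waiting_for_radio * processor_share \
                + radio_tx + radio_rx + radio_sample
sensing = waiting_for_sensors * processor_share + sense_humidity + sense_light

print(f"communication share: {communication:.2f}")   # about 0.41
print(f"sensing share:       {sensing:.2f}")          # about 0.58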

APPENDIX B. QUERY LANGUAGE
This appendix provides a complete specification of the syntax of the TinyDB
query language as well as pointers to the parts of the text where these constructs
are defined. We will use {} to denote a set, [] to denote optional clauses, <> to
denote an expression, and italicized text to denote user-specified tokens such
as aggregate names, commands, and arithmetic operators. The separator “|”
indicates that one or the other of the surrounding tokens may appear, but not
both. Ellipses (“. . . ”) indicate a repeating set of tokens, such as fields in the
SELECT clause or tables in the FROM clause.

B.1 Query Syntax
The syntax of queries in the TinyDB query language is as follows:
[ON [ALIGNED] EVENT event-type[{paramlist}]
                     [ boolop event-type{paramlist} ... ]]
  SELECT [NO INTERLEAVE] <expr>| agg(<expr>) |
                           temporal agg(<expr>), ...
   FROM [sensors | storage-point], ...
   [WHERE {<pred>}]
   [GROUP BY {<expr>}]
   [HAVING {<pred>}]
  [OUTPUT ACTION [ command |
                   SIGNAL event({paramlist}) |
                   (SELECT ... ) ] |
  [INTO STORAGE POINT bufname]]
  [SAMPLE PERIOD seconds
         [[FOR n rounds] |
          [STOP ON event-type [WHERE <pred>]]]
         [COMBINE { agg(<expr>)}]
         [INTERPOLATE LINEAR]] |
   [ONCE] |
   [LIFETIME seconds [MIN SAMPLE RATE seconds]]
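
As a concrete instance of this grammar, the following event-driven query is a plausible
example (the event name bird-detect and the attribute names nodeid and light are
illustrative assumptions, not fixed by the grammar); it samples the light level every 2 s
for 30 rounds each time the event fires:

ON EVENT bird-detect
  SELECT nodeid, light
   FROM sensors
   WHERE light > 100
  SAMPLE PERIOD 2 FOR 30

The clauses used here (ON EVENT, the SELECT-FROM-WHERE core, and SAMPLE
PERIOD ... FOR) are among the constructs listed in Table VIII below.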

                                Table VIII. References to Sections in
                               the Main Text Where Query Language
                                     Constructs are Introduced
                              Language Construct             Section
                              ON EVENT                      Section 3.5
                              SELECT-FROM-WHERE             Section 3
                              GROUP BY, HAVING              Section 3.3.1
                              OUTPUT ACTION                 Section 3.7
                              SIGNAL <event>                Section 3.5
                              INTO STORAGE POINT            Section 3.2
                              SAMPLE PERIOD                 Section 3
                              FOR                           Section 3.2
                              STOP ON                       Section 3.5
                              COMBINE                       Section 3.2
                              ONCE                          Section 3.7
                              LIFETIME                      Section 3.6



Each of these constructs is described in more detail in the sections shown in
Table VIII.

B.2 Storage Point Creation and Deletion Syntax
The syntax for storage point creation is
CREATE [CIRCULAR] STORAGE POINT name
SIZE [ ntuples | nseconds]
[( fieldname type [, ... , fieldname type])] |
[AS SELECT ... ]
[SAMPLE PERIOD nseconds]
and for deletion is
   DROP STORAGE POINT name
Both of these constructs are described in Section 3.2.
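For instance, under the same illustrative assumptions about attribute names (the
storage point name recentlight is hypothetical, and SIZE is read here as a number of
seconds, as discussed in Section 3.2), a circular buffer of recent light readings could be
created, and later dropped, as follows:

CREATE CIRCULAR STORAGE POINT recentlight
   SIZE 8
   AS SELECT nodeid, light FROM sensors
   SAMPLE PERIOD 1

DROP STORAGE POINT recentlight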

REFERENCES

ALONSO, R. AND GANGULY, S. 1993. Query optimization in mobile environments. In Proceedings of
  the Workshop on Foundations of Models and Languages for Data and Objects. 1–17.
ALONSO, R. AND KORTH, H. F. 1993. Database system issues in nomadic computing. In Proceedings
  of the ACM SIGMOD (Washington, DC).
AVNUR, R. AND HELLERSTEIN, J. M. 2000. Eddies: Continuously adaptive query processing. In Pro-
  ceedings of ACM SIGMOD (Dallas, TX). 261–272.
BANCILHON, F., BRIGGS, T., KHOSHAFIAN, S., AND VALDURIEZ, P. 1987. FAD, a powerful and simple
  database language. In Proceedings of VLDB.
BONNET, P., GEHRKE, J., AND SESHADRI, P. 2001. Towards sensor database systems. In Proceedings
  of the Conference on Mobile Data Management.
BROOKE, T. AND BURRELL, J. 2003. From ethnography to design in a vineyard. In Proceedings of
  the Design User Experiences (DUX) Conference. Case study.
CARNEY, D., CETINTEMEL, U., CHERNIACK, M., CONVEY, C., LEE, S., SEIDMAN, G., STONEBRAKER, M., TATBUL,
  N., AND ZDONIK, S. 2002. Monitoring streams—a new class of data management applications.
  In Proceedings of VLDB.


CERPA, A., ELSON, J., ESTRIN, D., GIROD, L., HAMILTON, M., AND ZHAO, J. 2001. Habitat monitoring:
  Application driver for wireless communications technology. In Proceedings of ACM SIGCOMM
  Workshop on Data Communications in Latin America and the Caribbean.
CHAKRABARTI, K., GAROFALAKIS, M., RASTOGI, R., AND SHIM, K. 2001. Approximate query processing
  using wavelets. VLDB J. 10, 2-3 (Sep.), 199–223.
CHAKRAVARTHY, S., KRISHNAPRASAD, V., ANWAR, E., AND KIM, S. K. 1994. Composite events for active
  databases: Semantics, contexts and detection. In Proceedings of VLDB.
CHANDRASEKARAN, S., COOPER, O., DESHPANDE, A., FRANKLIN, M. J., HELLERSTEIN, J. M., HONG, W.,
  KRISHNAMURTHY, S., MADDEN, S. R., RAMAN, V., REISS, F., AND SHAH, M. A. 2003. TelegraphCQ:
  Continuous dataflow processing for an uncertain world. In Proceedings of the First Annual Con-
  ference on Innovative Database Research (CIDR).
CHEN, J., DEWITT, D., TIAN, F., AND WANG, Y. 2000. NiagaraCQ: A scalable continuous query system
  for internet databases. In Proceedings of ACM SIGMOD.
CHEN, Z., GEHRKE, J., AND KORN, F. 2001. Query optimization in compressed database systems. In
  Proceedings of ACM SIGMOD.
CRESPO, A. AND GARCIA-MOLINA, H. 2002. Routing indices for peer-to-peer systems. In Proceedings
  of ICDCS.
DELIN, K. A. AND JACKSON, S. P. 2000. Sensor web for in situ exploration of gaseous biosignatures.
  In Proceedings of the IEEE Aerospace Conference.
DEWITT, D. J., GHANDEHARIZADEH, S., SCHNEIDER, D. A., BRICKER, A., HSIAO, H. I., AND RASMUSSEN,
  R. 1990. The gamma database machine project. IEEE Trans. Knowl. Data Eng. 2, 1, 44–
  62.
GANERIWAL, S., KUMAR, R., ADLAKHA, S., AND SRIVASTAVA, M. 2003. Timing-sync protocol for sensor
  networks. In Proceedings of ACM SenSys.
GAROFALAKIS, M. AND GIBBONS, P. 2001. Approximate query processing: Taming the terabytes!
  (tutorial). In Proceedings of VLDB.
GAY, D., LEVIS, P., VON BEHREN, R., WELSH, M., BREWER, E., AND CULLER, D. 2003. The nesC language:
  A holistic approach to network embedded systems. In Proceedings of the ACM SIGPLAN 2003
  Conference on Programming Language Design and Implementation (PLDI).
GEHRKE, J., KORN, F., AND SRIVASTAVA, D. 2001. On computing correlated aggregates over contin-
  ual data streams. In Proceedings of ACM SIGMOD Conference on Management of Data (Santa
  Barbara, CA).
HANSON, E. N. 1996. The design and implementation of the ariel active database rule system.
  IEEE Trans. Knowl. Data Eng. 8, 1 (Feb.), 157–172.
HELLERSTEIN, J., HONG, W., MADDEN, S., AND STANEK, K. 2003. Beyond average: Towards sophisti-
  cated sensing with queries. In Proceedings of the First Workshop on Information Processing in
  Sensor Networks (IPSN).
HELLERSTEIN, J. M. 1998. Optimization techniques for queries with expensive methods. ACM
  Trans. Database Syst. 23, 2, 113–157.
HELLERSTEIN, J. M., FRANKLIN, M. J., CHANDRASEKARAN, S., DESHPANDE, A., HILDRUM, K., MADDEN, S.,
  RAMAN, V., AND SHAH, M. 2000. Adaptive query processing: Technology in evolution. IEEE Data
  Eng. Bull. 23, 2, 7–18.
HILL, J., SZEWCZYK, R., WOO, A., HOLLAR, S., CULLER, D., AND PISTER, K. 2000. System architecture direc-
  tions for networked sensors. In Proceedings of ASPLOS.
IBARAKI, T. AND KAMEDA, T. 1984. On the optimal nesting order for computing n-relational joins.
  ACM Trans. Database Syst. 9, 3, 482–502.
IMIELINSKI, T. AND BADRINATH, B. 1992. Querying in highly mobile distributed environments. In
  Proceedings of VLDB (Vancouver, B.C., Canada).
INTANAGONWIWAT, C., GOVINDAN, R., AND ESTRIN, D. 2000. Directed diffusion: A scalable and robust
  communication paradigm for sensor networks. In Proceedings of MobiCOM (Boston, MA).
INTERSEMA. 2002. MS5534A barometer module. Tech. rep. (Oct.). Go online to http://www.
  intersema.com/pro/module/file/da5534.pdf.
IVES, Z. G., FLORESCU, D., FRIEDMAN, M., LEVY, A., AND WELD, D. S. 1999. An adaptive query execution
  system for data integration. In Proceedings of ACM SIGMOD.
KOSSMANN, D. 2000. The state of the art in distributed query processing. ACM Comput. Surv. 32,
  4 (Dec.), 422–469.


KRISHNAMURTHY, R., BORAL, H., AND ZANIOLO, C. 1986. Optimization of nonrecursive queries. In
  Proceedings of VLDB. 128–137.
LEOPOLD, M., DYDENSBORG, M., AND BONNET, P. 2003. Bluetooth and sensor networks: A reality
  check. In Proceedings of ACM Conference on Sensor Networks (SenSys).
LIN, C., FEDERSPIEL, C., AND AUSLANDER, D. 2002. Multi-sensor single actuator control of HVAC sys-
  tems. In Proceedings of the International Conference for Enhanced Building Operations (Austin,
  TX, Oct. 14–18).
LIU, L., PU, C., AND TANG, W. 1999. Continual queries for internet-scale event-driven information
  delivery. IEEE Trans. Knowl. Data Eng. (special Issue on Web technology) 11, 4 (July), 610–628.
MADDEN, S. 2003. The design and evaluation of a query processing architecture for sensor net-
  works. Ph.D. dissertation. University of California, Berkeley, Berkeley, CA.
MADDEN, S. AND FRANKLIN, M. J. 2002. Fjording the stream: An architechture for queries over
  streaming sensor data. In Proceedings of ICDE.
MADDEN, S., FRANKLIN, M. J., HELLERSTEIN, J. M., AND HONG, W. 2002a. TAG: A Tiny AGgregation
  service for ad-hoc sensor networks. In Proceedings of OSDI.
MADDEN, S., HONG, W., FRANKLIN, M., AND HELLERSTEIN, J. M. 2003. TinyDB Web page. Go online
  to http://telegraph.cs.berkeley.edu/tinydb.
MADDEN, S., SHAH, M. A., HELLERSTEIN, J. M., AND RAMAN, V. 2002b. Continuously adaptive contin-
  uous queries over data streams. In Proceedings of ACM SIGMOD (Madison, WI).
MAINWARING, A., POLASTRE, J., SZEWCZYK, R., AND CULLER, D. 2002. Wireless sensor networks for
  habitat monitoring. In Proceedings of ACM Workshop on Sensor Networks and Applications.
MELEXIS, INC. 2002. MLX90601 infrared thermopile module. Tech. rep. (Aug.). Go online to http:
  //www.melexis.com/prodfiles/mlx90601.pdf.
MONMA, C. L. AND SIDNEY, J. 1979. Sequencing with series parallel precedence constraints. Math.
  Oper. Res. 4, 215–224.
MOTWANI, R., WIDOM, J., ARASU, A., BABCOCK, B., BABU, S., DATAR, M., OLSTON, C., ROSENSTEIN, J., AND
  VARMA, R. 2003. Query processing, approximation and resource management in a data stream
  management system. In Proceedings of the First Annual Conference on Innovative Database
  Research (CIDR).
OLSTON, C. AND WIDOM, J. 2002. Best-effort cache synchronization with source cooperation. In
  Proceedings of SIGMOD.
PIRAHESH, H., HELLERSTEIN, J. M., AND HASAN, W. 1992. Extensible/rule based query rewrite opti-
  mization in starburst. In Proceedings of ACM SIGMOD. 39–48.
POTTIE, G. AND KAISER, W. 2000. Wireless integrated network sensors. Commun. ACM 43, 5 (May),
  51–58.
PRIYANTHA, N. B., CHAKRABORTY, A., AND BALAKRISHNAN, H. 2000. The cricket location-support sys-
  tem. In Proceedings of MOBICOM.
RAMAN, V., RAMAN, B., AND HELLERSTEIN, J. M. 2002. Online dynamic reordering. VLDB J. 9, 3.
SENSIRION. 2002. SHT11/15 relative humidity sensor. Tech. rep. (June). Go online to http://www.
  sensirion.com/en/pdf/Datasheet_SHT1x_SHT7x_0206.pdf.
SHATDAL, A. AND NAUGHTON, J. 1995. Adaptive parallel aggregation algorithms. In Proceedings of
  ACM SIGMOD.
STONEBRAKER, M. AND KEMNITZ, G. 1991. The POSTGRES next-generation database management
  system. Commun. ACM 34, 10, 78–92.
SUDARSHAN, S. AND RAMAKRISHNAN, R. 1991. Aggregation and relevance in deductive databases. In
  Proceedings of VLDB. 501–511.
TAOS, INC. 2002. TSL2550 ambient light sensor. Tech. rep. (Sep.). Go online to http://www.
  taosinc.com/images/product/document/tsl2550.pdf.
UC BERKELEY. 2001. Smart buildings admit their faults. Web page. Lab notes: Research from the
  College of Engineering, UC Berkeley. Go online to http://coe.berkeley.edu/labnotes/1101.
  smartbuildings.html.
URHAN, T., FRANKLIN, M. J., AND AMSALEG, L. 1998. Cost-based query scrambling for initial delays.
  In Proceedings of ACM SIGMOD.
WOLFSON, O., SISTLA, A. P., XU, B., ZHOU, J., AND CHAMBERLAIN, S. 1999. DOMINO: Databases fOr
  MovINg Objects tracking. In Proceedings of ACM SIGMOD (Philadelphia, PA).



WOO, A. AND CULLER, D. 2001. A transmission control scheme for media access in sensor networks.
  In Proceedings of ACM Mobicom.
YAO, Y. AND GEHRKE, J. 2002. The cougar approach to in-network query processing in sensor
  networks. SIGMOD Rec. 31, 3 (Sept.), 9–18.

Received October 2003; revised June 2004; accepted September 2004




Data Exchange: Getting to the Core
RONALD FAGIN, PHOKION G. KOLAITIS, and LUCIAN POPA
IBM Almaden Research Center


Data exchange is the problem of taking data structured under a source schema and creating an
instance of a target schema that reflects the source data as accurately as possible. Given a source
instance, there may be many solutions to the data exchange problem, that is, many target in-
stances that satisfy the constraints of the data exchange problem. In an earlier article, we iden-
tified a special class of solutions that we call universal. A universal solution has homomorphisms
into every possible solution, and hence is a “most general possible” solution. Nonetheless, given
a source instance, there may be many universal solutions. This naturally raises the question of
whether there is a “best” universal solution, and hence a best solution for data exchange. We an-
swer this question by considering the well-known notion of the core of a structure, a notion that
was first studied in graph theory, and has also played a role in conjunctive-query processing. The
core of a structure is the smallest substructure that is also a homomorphic image of the struc-
ture. All universal solutions have the same core (up to isomorphism); we show that this core is
also a universal solution, and hence the smallest universal solution. The uniqueness of the core
of a universal solution together with its minimality make the core an ideal solution for data ex-
change. We investigate the computational complexity of producing the core. Well-known results by
Chandra and Merlin imply that, unless P = NP, there is no polynomial-time algorithm that, given
a structure as input, returns the core of that structure as output. In contrast, in the context of
data exchange, we identify natural and fairly broad conditions under which there are polynomial-
time algorithms for computing the core of a universal solution. We also analyze the computational
complexity of the following decision problem that underlies the computation of cores: given two
graphs G and H, is H the core of G? Earlier results imply that this problem is both NP-hard
and coNP-hard. Here, we pinpoint its exact complexity by establishing that it is a DP-complete
problem. Finally, we show that the core is the best among all universal solutions for answering ex-
istential queries, and we propose an alternative semantics for answering queries in data exchange
settings.
Categories and Subject Descriptors: H.2.5 [Heterogeneous Databases]: Data Translation; H.2.4
[Systems]: Relational Databases; H.2.4 [Systems]: Query Processing
General Terms: Algorithms, Theory
Additional Key Words and Phrases: Certain answers, conjunctive queries, core, universal so-
lutions, dependencies, chase, data exchange, data integration, computational complexity, query
answering




P. G. Kolaitis is on leave from the University of California, Santa Cruz, Santa Cruz, CA; he is
partially supported by NSF Grant IIS-9907419.
A preliminary version of this article appeared on pages 90–101 of Proceedings of the ACM Sympo-
sium on Principles of Database Systems (San Diego, CA).
Authors’ addresses: Foundation of Computer Science, IBM Almaden Research Center, Department
K53/B2, 650 Harry Road, San Jose, CA 95120; email: {fagin,kolaitis,lucian}@almaden.ibm.com.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is
granted without fee provided that the copies are not made or distributed for profit or commercial
advantage, the copyright notice, the title of the publication, and its date appear, and notice is given
that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to
redistribute to lists requires prior specific permission and/or a fee.
C 2005 ACM 0362-5915/05/0300-0174 $5.00


ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 174–210.

1. INTRODUCTION AND SUMMARY OF RESULTS

1.1 The Data Exchange Problem

Data exchange is the problem of materializing an instance that adheres to a
target schema, given an instance of a source schema and a specification of the
relationship between the source schema and the target schema. This problem
arises in many tasks requiring data to be transferred between independent ap-
plications that do not necessarily adhere to the same data format (or schema).
The importance of data exchange was recognized a long time ago; in fact, an
early data exchange system was EXPRESS [Shu et al. 1977] from the 1970s,
whose main functionality was to convert data between hierarchical schemas.
The need for data exchange has steadily increased over the years and, actually,
has become more pronounced in recent years, with the proliferation of Web data
in various formats and with the emergence of e-business applications that need
to communicate data yet remain autonomous. The data exchange problem is
related to the data integration problem in the sense that both problems are
concerned with management of data stored in heterogeneous formats. The two
problems, however, are different for the following reasons. In data exchange, the
main focus is on actually materializing a target instance that reflects the source
data as accurately as possible; this can be a serious challenge, due to the inher-
ent underspecification of the relationship between the source and the target.
In contrast, a target instance need not be materialized in data integration; the
main focus there is on answering queries posed over the target schema using
views that express the relationship between the target and source schemas.
   In a previous paper [Fagin et al. 2003], we formalized the data exchange
problem and embarked on an in-depth investigation of the foundational and
algorithmic issues that surround it. Our work has been motivated by practi-
cal considerations arising in the development of Clio [Miller et al. 2000; Popa
et al. 2002] at the IBM Almaden Research Center. Clio is a prototype system for
schema mapping and data exchange between autonomous applications. A data
exchange setting is a quadruple (S, T, Σst, Σt), where S is the source schema, T
is the target schema, Σst is a set of source-to-target dependencies that express
the relationship between S and T, and Σt is a set of dependencies that express
constraints on T. Such a setting gives rise to the following data exchange prob-
lem: given an instance I over the source schema S, find an instance J over
the target schema T such that I together with J satisfy the source-to-target
dependencies Σst, and J satisfies the target dependencies Σt. Such an instance
J is called a solution for I in the data exchange setting. In general, many differ-
ent solutions for an instance I may exist. Thus, the question is: which solution
should one choose to materialize, so that it reflects the source data as accurately
as possible? Moreover, can such a solution be efficiently computed?
   In Fagin et al. [2003], we investigated these issues for data exchange settings
in which S and T are relational schemas, Σst is a set of tuple-generating depen-
dencies (tgds) between S and T, and Σt is a set of tgds and equality-generating
dependencies (egds) on T. We isolated a class of solutions, called universal so-
lutions, possessing good properties that justify selecting them as the semantics

of the data exchange problem. Specifically, universal solutions have homomor-
phisms into every possible solution; in particular, they have homomorphisms
into each other, and thus are homomorphically equivalent. Universal solutions
are the most general among all solutions and, in a precise sense, they represent
the entire space of solutions. Moreover, as we shall explain shortly, universal
solutions can be used to compute the “certain answers” of queries q that are
unions of conjunctive queries over the target schema. The set certain(q, I ) of
certain answers of a query q over the target schema, with respect to a source
instance I , consists of all tuples that are in the intersection of all q(J )’s, as
J varies over all solutions for I (here, q(J ) denotes the result of evaluating q
on J ). The notion of the certain answers originated in the context of incomplete
databases (see van der Meyden [1998] for a survey). Moreover, the certain an-
swers have been used for query answering in data integration [Lenzerini 2002].
In the same data integration context, Abiteboul and Duschka [1998] studied
the complexity of computing the certain answers.
   We showed [Fagin et al. 2003] that the certain answers of unions of con-
junctive queries can be obtained by simply evaluating these queries on some
arbitrarily chosen universal solution. We also showed that, under fairly gen-
eral, yet practical, conditions, a universal solution exists whenever a solution
exists. Furthermore, we showed that when these conditions are satisfied, there
is a polynomial-time algorithm for computing a canonical universal solution;
this algorithm is based on the classical chase procedure [Beeri and Vardi 1984;
Maier et al. 1979].


1.2 Data Exchange with Cores
Even though they are homomorphically equivalent to each other, universal solu-
tions need not be unique. In other words, in a data exchange setting, there may
be many universal solutions for a given source instance I . Thus, it is natural to
ask: what makes a universal solution “better” than another universal solution?
Is there a “best” universal solution and, of course, what does “best” really mean?
If there is a “best” universal solution, can it be efficiently computed?
   The present article addresses these questions and offers answers that are
based on using minimality as a key criterion for what constitutes the “best”
universal solution. Although universal solutions come in different sizes, they
all share a unique (up to isomorphism) common “part,” which is nothing else
but the core of each of them, when they are viewed as relational structures.
By definition, the core of a structure is the smallest substructure that is also a
homomorphic image of the structure. The concept of the core originated in graph
theory, where a number of results about its properties have been established
(see, for instance, Hell and Nešetřil [1992]). Moreover, in the early days of
database theory, Chandra and Merlin [1977] realized that the core of a structure
is useful in conjunctive-query processing. Indeed, since evaluating joins is the
most expensive among the basic relational algebra operations, one of the most
fundamental problems in query processing is the join-minimization problem:
given a conjunctive query q, find an equivalent conjunctive query involving the
smallest possible number of joins. In turn, this problem amounts to computing

the core of the relational instance Dq that is obtained from q by putting a fact
into Dq for each conjunct of q (see Abiteboul et al. [1995]; Chandra and Merlin
[1977]; Kanellakis [1990]).
   Consider a data exchange setting (S, T, Σst, Σt) in which Σst is a set of source-
to-target tgds and Σt is a set of target tgds and target egds. Since all universal
solutions for a source instance I are homomorphically equivalent, it is easy to
see that their cores are isomorphic. Moreover, we show in this article that the
core of a universal solution for I is itself a solution for I . It follows that the
core of the universal solutions for I is the smallest universal solution for I , and
thus an ideal candidate for the “best” universal solution, at least in terms of
the space required to materialize it.
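   Although, as explained below, computing the core of an arbitrary structure is
intractable in general, for very small inputs it can be found by exhaustive search. The
following Python sketch (ours, purely illustrative and exponential in the worst case; it
is not one of the polynomial-time algorithms developed later in this article) returns the
smallest induced subgraph into which the whole graph has a homomorphism, which is
isomorphic to the core:

from itertools import combinations, product

def has_hom(nodes_g, edges_g, nodes_h, edges_h):
    # Brute-force test: is there a homomorphism from G to H, i.e. a vertex map
    # under which every edge of G lands on an edge of H?
    for image in product(nodes_h, repeat=len(nodes_g)):
        h = dict(zip(nodes_g, image))
        if all((h[u], h[v]) in edges_h for (u, v) in edges_g):
            return True
    return False

def core(nodes, edges):
    # Smallest induced subgraph H such that G -> H; any minimum such H is
    # (isomorphic to) the core of G.
    nodes = sorted(nodes)
    for k in range(1, len(nodes) + 1):
        for sub in combinations(nodes, k):
            sub_edges = {(u, v) for (u, v) in edges if u in sub and v in sub}
            if has_hom(nodes, edges, sub, sub_edges):
                return set(sub), sub_edges
    return set(nodes), set(edges)

# An undirected 4-cycle, written with both orientations of each edge;
# its core is a single (bidirected) edge, i.e. K2.
cycle4 = {(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3), (4, 1), (1, 4)}
print(core({1, 2, 3, 4}, cycle4))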
   After this, we address the issue of how hard it is to compute the core of a
universal solution. Chandra and Merlin [1977] showed that join minimization
is an NP-hard problem by pointing out that a graph G is 3-colorable if and
only if the 3-element clique K3 is the core of the disjoint sum G ⊕ K3 of G with
K3 . From this, it follows that, unless P = NP, there is no polynomial-time al-
gorithm that, given a structure as input, outputs its core. At first sight, this
result casts doubts on the tractability of computing the core of a universal solu-
tion. For data exchange, however, we give natural and fairly broad conditions
under which there are polynomial-time algorithms for computing the cores of
universal solutions. Specifically, we show that there are polynomial-time algo-
rithms for computing the core of universal solutions in data exchange settings
in which Σst is a set of source-to-target tgds and Σt is a set of target egds. It
remains an open problem to determine whether this result can be extended to
data exchange settings in which the target constraints Σt consist of both egds
and tgds. We also analyze the computational complexity of the following deci-
sion problem, called CORE IDENTIFICATION, which underlies the computation of
cores: given two graphs G and H, is H the core of G? As seen above, the results
by Chandra and Merlin [1977] imply that this problem is NP-hard. Later on,
Hell and Nešetřil [1992] showed that deciding whether a graph G is its own
core is a coNP-complete problem; in turn, this implies that CORE IDENTIFICATION
is a coNP-hard problem. Here, we pinpoint the exact computational complexity
of CORE IDENTIFICATION by showing that it is a DP-complete problem, where DP
is the class of decision problems that can be written as the intersection of an
NP-problem and a coNP-problem.
   In the last part of the article, we further justify the selection of the core
as the “best” universal solution by establishing its usefulness in answering
queries over the target schema T. An existential query q(x) is a formula of
the form ∃yφ(x, y), where φ(x, y) is a quantifier-free formula.1 Perhaps the
most important examples of existential queries are the conjunctive queries
with inequalities ≠. Another useful example of existential queries is the set-
difference query, which asks whether there is a member of the set difference
A − B.
   Let J0 be the core of all universal solutions for a source instance I . As dis-
cussed earlier, since J0 is itself a universal solution for I , the certain answers

1 We   shall also give a safety condition on φ.


of conjunctive queries over T can be obtained by simply evaluating them on J0 .
In Fagin et al. [2003], however, it was shown that there are simple conjunctive
queries with inequalities ≠ such that evaluating them on a universal solution
always produces a proper superset of the set of certain answers for I . Nonethe-
less, here we show that evaluating existential queries on the core J0 of the uni-
versal solutions yields the best approximation (that is, the smallest superset)
of the set of the certain answers, among all universal solutions. Analogous to
the definition of certain answers, let us define the certain answers on universal
solutions of a query q over the target schema, with respect to a source instance
I , to be the set of all tuples that are in the intersection of all q(J )’s, as J varies
over all universal solutions for I ; we write u-certain(q, I ) to denote the certain
answers of q on universal solutions for I . Since we consider universal solutions
to be the preferred solutions to the data exchange problem, this suggests the
naturalness of this notion of certain answers on universal solutions as an alter-
native semantics for query answering in data exchange settings. We show that
if q is an existential query and J0 is the core of the universal solutions for I ,
then the set of those tuples in q(J0 ) whose entries are elements from the source
instance I is equal to the set u-certain(q, I ) of the certain answers of q on uni-
versal solutions. We also show that in the LAV setting (an important scenario
in data integration) there is an interesting contrast between the complexity
of computing certain answers and of computing certain answers on universal
solutions. Specifically, Abiteboul and Duschka [1998] showed that there is a
data exchange setting with Σt = ∅ and a conjunctive query with inequalities ≠
such that computing the certain answers of this query is a coNP-complete prob-
lem. In contrast to this, we establish here that in an even more general data
exchange setting (S, T, Σst, Σt) in which Σst is an arbitrary set of tgds and
Σt is an arbitrary set of egds, for every existential query q (and in particular,
for every conjunctive query q with inequalities ≠), there is a polynomial-time
algorithm for computing the set u-certain(q, I ) of the certain answers of q on
universal solutions.

2. PRELIMINARIES
This section contains the main definitions related to data exchange and a min-
imum amount of background material. The presentation follows closely our
earlier paper [Fagin et al. 2003].

2.1 The Data Exchange Problem
A schema is a finite sequence R = R1, . . . , Rk of relation symbols, each of a
fixed arity. An instance I (over the schema R) is a sequence R1^I, . . . , Rk^I that
associates each relation symbol Ri with a relation Ri^I of the same arity as Ri.
We shall often abuse the notation and use Ri to denote both the relation symbol
and the relation Ri^I that interprets it. We may refer to Ri^I as the Ri relation of
I. Given a tuple t occurring in a relation R, we denote by R(t) the association
between t and R, and call it a fact. An instance I can be identified with the set of
all facts arising from the relations Ri^I of I. If R is a schema, then a dependency
over R is a sentence in some logical formalism over R.

   Let S = S1 , . . . , Sn and T = T1 , . . . , Tm be two schemas with no relation
symbols in common. We refer to S as the source schema and to the Si ’s as the
source relation symbols. We refer to T as the target schema and to the T j ’s as the
target relation symbols. We denote by ⟨S, T⟩ the schema S1, . . . , Sn, T1, . . . , Tm.
Instances over S will be called source instances, while instances over T will be
called target instances. If I is a source instance and J is a target instance, then
we write ⟨I, J⟩ for the instance K over the schema ⟨S, T⟩ such that Si^K = Si^I
and Tj^K = Tj^J, when 1 ≤ i ≤ n and 1 ≤ j ≤ m.
   A source-to-target dependency is, in general, a dependency over S, T of the
form ∀x(φS (x) → χT (x)), where φS (x) is a formula, with free variables x, of
some logical formalism over S, and χT (x) is a formula, with free variables x, of
some logical formalism over T (these two logical formalisms may be different).
We use the notation x for a vector of variables x1 , . . . , xk . We assume that all
the variables in x appear free in φS (x). A target dependency is, in general, a
dependency over the target schema T (the formalism used to express a target
dependency may be different from those used for the source-to-target depen-
dencies). The source schema may also have dependencies that we assume are
satisfied by every source instance. While the source dependencies may play
an important role in deriving source-to-target dependencies [Popa et al. 2002],
they do not play any direct role in data exchange, because we take the source
instance to be given.
   Definition 2.1. A data exchange setting (S, T, Σst, Σt) consists of a source
schema S, a target schema T, a set Σst of source-to-target dependencies, and
a set Σt of target dependencies. The data exchange problem associated with
this setting is the following: given a finite source instance I, find a finite target
instance J such that ⟨I, J⟩ satisfies Σst and J satisfies Σt. Such a J is called a
solution for I or, simply, a solution if the source instance I is understood from
the context.
   For most practical purposes, and for most of the results of this article (all
results except for Proposition 2.7), each source-to-target dependency in Σst is a
tuple-generating dependency (tgd) [Beeri and Vardi 1984] of the form
                              ∀x(φS (x) → ∃yψT (x, y)),
where φS (x) is a conjunction of atomic formulas over S and ψT (x, y) is a conjunc-
tion of atomic formulas over T. We assume that all the variables in x appear in
φS (x). Moreover, each target dependency in Σt is either a tgd, of the form
                              ∀x(φT (x) → ∃yψT (x, y)),
or an equality-generating dependency (egd) [Beeri and Vardi 1984], of the form
                                ∀x(φT (x) → (x1 = x2 )).
In these dependencies, φT (x) and ψT (x, y) are conjunctions of atomic formulas
over T, where all the variables in x appear in φT (x), and x1 , x2 are among
the variables in x. The tgds and egds together comprise Fagin’s (embedded)
implicational dependencies [Fagin 1982]. As in Fagin et al. [2003], we will drop

the universal quantifiers in front of a dependency, and implicitly assume such
quantification. However, we will write down all the existential quantifiers.
   Source-to-target tgds are a natural and powerful language for expressing the
relationship between a source schema and a target schema. Such dependencies
are automatically derived and used as representation of a schema mapping in
the Clio system [Popa et al. 2002]. Furthermore, data exchange settings with
tgds as source-to-target dependencies include as special cases both local-as-
view (LAV) and global-as-view (GAV) data integration systems in which the
views are sound and defined by conjunctive queries (see Lenzerini’s tutorial
[Lenzerini 2002] for a detailed discussion of LAV and GAV data integration
systems and sound views).
   A LAV data integration system with sound views defined by conjunctive
queries is a special case of a data exchange setting (S, T, Σst, Σt), in which
S is the source schema (consisting of the views, in LAV terminology), T is the
target schema (or global schema, in LAV terminology), the set Σt of target de-
pendencies is empty, and each source-to-target tgd in Σst is of the form S(x) →
∃y ψT (x, y), where S is a single relation symbol of the source schema S (a view,
in LAV terminology) and ψT is a conjunction of atomic formulas over the target
schema T. A GAV setting is similar, but the tgds in Σst are of the form φS (x) →
T (x), where T is a single relation symbol over the target schema T (a view, in
GAV terminology), and φS is a conjunction of atomic formulas over the source
schema S. Since, in general, a source-to-target tgd relates a conjunctive query
over the source schema to a conjunctive query over the target schema, a data
exchange setting is strictly more expressive than LAV or GAV, and in fact it can
be thought of as a GLAV (global-and-local-as-view) system [Friedman et al.
1999; Lenzerini 2002]. These similarities between data integration and data
exchange notwithstanding, the main difference between the two is that in data
exchange we have to actually materialize a finite target instance that best re-
flects the given source instance. In data integration no such exchange of data
is required; the target can remain virtual.
   In general there may be multiple solutions for a given data exchange problem.
The following example illustrates this issue and raises the question of which
solution to choose to materialize.

   Example 2.2. Consider a data exchange problem in which the source
schema consists of two binary relation symbols as follows: EmpCity, associating
employees with cities they work in, and LivesIn, associating employees with
cities they live in. Assume that the target schema consists of three binary re-
lation symbols as follows: Home, associating employees with their home cities,
EmpDept, associating employees with departments, and DeptCity, associating
departments with their cities. We assume that Σt = ∅. The source-to-target
tgds and the source instance are as follows, where (d1), (d2), (d3), and (d4) are
labels for convenient reference later:

          Σst : (d1) EmpCity(e, c) → ∃H Home(e, H),
                (d2) EmpCity(e, c) → ∃D(EmpDept(e, D) ∧ DeptCity(D, c)),
                (d3) LivesIn(e, h) → Home(e, h),

                (d4) LivesIn(e, h) → ∃D∃C(EmpDept(e, D) ∧ DeptCity(D, C)),

                  I = {EmpCity(Alice, SJ), EmpCity(Bob, SD),
                       LivesIn(Alice, SF), LivesIn(Bob, LA)}.
We shall use this example as a running example throughout this article. Since
the tgds in Σst do not completely specify the target instance, there are multiple
solutions that are consistent with the specification. One solution is
                  J0 = {Home(Alice, SF), Home(Bob, SD),
                        EmpDept(Alice, D1), EmpDept(Bob, D2),
                        DeptCity(D1, SJ), DeptCity(D2, SD)},
where D1 and D2 represent “unknown” values, that is, values that do not occur
in the source instance. Such values are called labeled nulls and are to be dis-
tinguished from the values occurring in the source instance, which are called
constants. Instances with constants and labeled nulls are not specific to data
exchange. They have long been considered, in various forms, in the context of
incomplete or indefinite databases (see van der Meyden [1998]) as well as in
the context of data integration (see Halevy [2001]; Lenzerini [2002]).
   Intuitively, in the above instance, D1 and D2 are used to “give values” for
the existentially quantified variable D of (d2), in order to satisfy (d2) for the
two source tuples EmpCity(Alice, SJ) and EmpCity(Bob, SD). In contrast, two
constants (SF and SD) are used to “give values” for the existentially quantified
variable H of (d1), in order to satisfy (d1) for the same two source tuples.
   The following instances are solutions as well:
                  J = {Home(Alice, SF), Home(Bob, SD),
                       Home(Alice, H1), Home(Bob, H2),
                       EmpDept(Alice, D1), EmpDept(Bob, D2),
                       DeptCity(D1, SJ), DeptCity(D2, SD)},
                  J0′ = {Home(Alice, SF), Home(Bob, SD),
                         EmpDept(Alice, D), EmpDept(Bob, D),
                         DeptCity(D, SJ), DeptCity(D, SD)}.
The instance J differs from J0 by having two extra Home tuples where the
home cities of Alice and Bob are two nulls, H1 and H2, respectively. The second
instance J0′ differs from J0 by using the same null (namely D) to denote the
“unknown” department of both Alice and Bob.
   Next, we review the notion of universal solutions, proposed in Fagin et al.
[2003] as the most general solutions.

2.2 Universal Solutions
We denote by Const the set (possibly infinite) of all values that occur in source
instances, and as before we call them constants. We also assume an infinite
set Var of values, called labeled nulls, such that Var ∩ Const = ∅. We reserve

the symbols I, I , I1 , I2 , . . . for instances over the source schema S and with
values in Const. We reserve the symbols J, J , J1 , J2 , . . . for instances over the
target schema T and with values in Const ∪ Var. Moreover, we require that
solutions of a data exchange problem have their values drawn from Const ∪ Var.
If R = R1 , . . . , Rk is a schema and K is an instance over R with values in
Const∪Var, then Const(K ) denotes the set of all constants occurring in relations
in K , and Var(K ) denotes the set of labeled nulls occurring in relations in K .
  Definition 2.3.          Let K 1 and K 2 be two instances over R with values in
Const ∪ Var.
1. A homomorphism h: K 1 → K 2 is a mapping from Const(K 1 ) ∪ Var(K 1 ) to
   Const(K 2 ) ∪ Var(K 2 ) such that (1) h(c) = c, for every c ∈ Const(K 1 ); (2) for
   every fact Ri (t) of K 1 , we have that Ri (h(t)) is a fact of K 2 (where, if t =
   (a1 , . . . , as ), then h(t) = (h(a1 ), . . ., h(as ))).
2. K 1 is homomorphically equivalent to K 2 if there are homomorphisms h:
   K 1 → K 2 and h′ : K 2 → K 1 .
  Definition 2.4 (Universal Solution). Consider a data exchange setting (S,
T, Σst, Σt). If I is a source instance, then a universal solution for I is a solution
J for I such that for every solution J′ for I, there exists a homomorphism
h: J → J′.
   Example 2.5. The instance J0′ in Example 2.2 is not universal. In particu-
lar, there is no homomorphism from J0′ to J0. Hence, the solution J0′ contains
“extra” information that was not required by the specification; in particular, J0′
“assumes” that the departments of Alice and Bob are the same. In contrast, it
can easily be shown that J0 and J have homomorphisms to every solution (and
to each other). Thus, J0 and J are universal solutions.
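   To make Definitions 2.3 and 2.4 concrete, here is a small Python sketch (ours; the
tuple encoding of facts and the underscore convention for labeled nulls are assumptions
made only for this illustration) that searches exhaustively for a homomorphism between
two target instances, forcing constants to map to themselves. Run on J0 and J0′ of
Example 2.2, it confirms the claim above: J0 maps into J0′, while J0′ does not map
into J0.

from itertools import product

def labeled_nulls(instance):
    # by convention here, labeled nulls are values starting with "_"
    return sorted({v for (_, *args) in instance for v in args if v.startswith("_")})

def all_values(instance):
    return sorted({v for (_, *args) in instance for v in args})

def has_homomorphism(k1, k2):
    # Search for h: k1 -> k2 that is the identity on constants and maps every
    # fact R(t) of k1 to a fact R(h(t)) of k2 (Definition 2.3).
    nulls = labeled_nulls(k1)
    for image in product(all_values(k2), repeat=len(nulls)):
        h = dict(zip(nulls, image))
        mapped = {(r, *[h.get(v, v) for v in args]) for (r, *args) in k1}
        if mapped <= k2:
            return True
    return False

# The instances J0 and J0' of Example 2.2; nulls are prefixed with "_".
J0 = {("Home", "Alice", "SF"), ("Home", "Bob", "SD"),
      ("EmpDept", "Alice", "_D1"), ("EmpDept", "Bob", "_D2"),
      ("DeptCity", "_D1", "SJ"), ("DeptCity", "_D2", "SD")}
J0p = {("Home", "Alice", "SF"), ("Home", "Bob", "SD"),
       ("EmpDept", "Alice", "_D"), ("EmpDept", "Bob", "_D"),
       ("DeptCity", "_D", "SJ"), ("DeptCity", "_D", "SD")}

print(has_homomorphism(J0, J0p))   # True:  J0 maps into J0'
print(has_homomorphism(J0p, J0))   # False: J0' does not map into J0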
   Universal solutions possess good properties that justify selecting them (as
opposed to arbitrary solutions) for the semantics of the data exchange problem.
A universal solution is more general than an arbitrary solution because, by
definition, it can be homomorphically mapped into that solution. Universal
solutions have, also by their definition, homomorphisms to each other and,
thus, are homomorphically equivalent.

   2.2.1 Computing Universal Solutions. In Fagin et al. [2003], we addressed
the question of how to check the existence of a universal solution and how
to compute one, if one exists. In particular, we identified fairly general, yet
practical, conditions that guarantee that universal solutions exist whenever
solutions exist. Moreover, we showed that there is a polynomial-time algorithm
for computing a canonical universal solution, if a solution exists; this algorithm
is based on the classical chase procedure. The following result summarizes these
findings.
   THEOREM 2.6 [FAGIN ET AL. 2003]. Assume a data exchange setting where Σst
is a set of tgds, and Σt is the union of a weakly acyclic set of tgds with a set of egds.
(1) The existence of a solution can be checked in polynomial time.
(2) A universal solution exists if and only if a solution exists.
(3) If a solution exists, then a universal solution can be produced in polynomial
    time using the chase.

The notion of a weakly acyclic set of tgds first arose in a conversation between the
third author and A. Deutsch in 2001. It was then independently used in Deutsch
and Tannen [2003] and in Fagin et al. [2003] (in the former article, under the
term constraints with stratified-witness). This class guarantees the termina-
tion of the chase and is quite broad, as it includes both sets of full tgds [Beeri
and Vardi 1984] and sets of acyclic inclusion dependencies [Cosmadakis and
Kanellakis 1986]. We note that, when the set Σt of target constraints is empty,
a universal solution always exists and a canonical one is constructible in poly-
nomial time by chasing ⟨I, ∅⟩ with Σst. In Example 2.2, the instance J is
such a canonical universal solution. If the set Σt of target constraints contains
egds, then it is possible that no universal solution exists (and hence no solution
exists, either, by the above theorem). This occurs (see Fagin et al. [2003]) when
the chase fails by attempting to identify two constants while trying to apply
some egd of Σt. If the chase does not fail, then the result of chasing ⟨I, ∅⟩ with
Σst ∪ Σt is a canonical universal solution.
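   As an illustration of how such an instance is produced, here is a small Python sketch of a chase with source-to-target tgds only. It implements an oblivious variant that introduces fresh labeled nulls for every tgd match, without first checking whether the conclusion is already satisfied, so it need not coincide with the canonical instance described above, although its result is still a universal solution when Σt = ∅. The encoding of tgds and the sample schema are illustrative assumptions, loosely modeled on the running example rather than taken from it.

from itertools import count

fresh = count()

def all_matches(atoms, inst):
    """All assignments of the variables in `atoms` that map every atom to a fact of `inst`."""
    assignments = [{}]
    for rel, vars_ in atoms:
        new = []
        for a in assignments:
            for t in inst.get(rel, set()):
                b = dict(a)
                if len(t) == len(vars_) and all(b.setdefault(v, x) == x for v, x in zip(vars_, t)):
                    new.append(b)
        assignments = new
    return assignments

def chase_st(I, tgds):
    """Oblivious chase of <I, {}> with source-to-target tgds (body over the source, head over the target)."""
    J = {}
    for body, head in tgds:
        for a in all_matches(body, I):
            b = dict(a)
            for rel, vars_ in head:
                for v in vars_:
                    if v not in b:                      # existential variable: fresh labeled null
                        b[v] = "_N%d" % next(fresh)
                J.setdefault(rel, set()).add(tuple(b[v] for v in vars_))
    return J

# Hypothetical tgds loosely modeled on (d1) and (d2) of the running example.
I = {"EmpCity": {("Alice", "SJ"), ("Bob", "SD")}}
d1 = ([("EmpCity", ("e", "c"))], [("Home", ("e", "H"))])
d2 = ([("EmpCity", ("e", "c"))], [("EmpDept", ("e", "D")), ("DeptCity", ("D", "c"))])
print(chase_st(I, [d1, d2]))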

   2.2.2 Certain Answers. In a data exchange setting, there may be many
different solutions for a given source instance. Hence, given a source instance,
the question arises as to what the result of answering queries over the target
schema is. Following earlier work on information integration, in Fagin et al.
[2003] we adopted the notion of the certain answers as the semantics of query
answering in data exchange settings. As stated in Section 1, the set certain(q, I )
of the certain answers of q with respect to a source instance I is the set of tuples
that appear in q(J ) for every solution J ; in symbols,
                   certain(q, I) = ⋂ {q(J) : J is a solution for I}.
   Before stating the connection between the certain answers and universal
solutions, let us recall the definitions of conjunctive queries (with inequalities)
and unions of conjunctive queries (with inequalities). A conjunctive query q(x)
over a schema R is a formula of the form ∃yφ(x, y) where φ(x, y) is a conjunction
of atomic formulas over R. If, in addition to atomic formulas, the conjunction
φ(x, y) is allowed to contain inequalities of the form zi ≠ zj, where zi, zj are
variables among x and y, we call q(x) a conjunctive query with inequalities. We
also impose a safety condition, that every variable in x and y must appear in an
atomic formula, not just in an inequality. A union of conjunctive queries (with
inequalities) is a disjunction q(x) = q1 (x) ∨ · · · ∨ qn (x) where q1 (x), . . . , qn (x) are
conjunctive queries (with inequalities).
   If J is an arbitrary solution, let us denote by q(J )↓ the set of all “null-free”
tuples in q(J ), that is the set of all tuples in q(J ) that are formed entirely of
constants. The next proposition from Fagin et al. [2003] asserts that null-free
evaluation of conjunctive queries on an arbitrarily chosen universal solution
gives precisely the set of certain answers. Moreover, universal solutions are the
only solutions that have this property.
   PROPOSITION 2.7 [FAGIN ET AL. 2003]. Consider a data exchange setting with
S as the source schema, T as the target schema, and such that the dependencies
in the sets Σst and Σt are arbitrary.
(1) Let q be a union of conjunctive queries over the target schema T. If I is a
    source instance and J is a universal solution, then certain(q, I ) = q(J )↓ .
(2) Let I be a source instance and J be a solution such that, for every conjunctive
    query q over T, we have that certain(q, I ) = q(J )↓ . Then J is a universal
    solution.
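   Proposition 2.7(1) suggests a direct evaluation strategy, sketched below in Python under the same illustrative encoding as before: evaluate the conjunctive query naively on a materialized universal solution and discard every answer tuple containing a labeled null, which yields q(J)↓ and hence the certain answers. The query and instance in the example are our own illustrative choices.

def is_null(v):
    return isinstance(v, str) and v.startswith("_")

def eval_cq(head_vars, body, inst):
    """Naive evaluation of q(x) = ∃y φ(x, y), with φ given as a list of (relation, variables) atoms."""
    assignments = [{}]
    for rel, vars_ in body:
        new = []
        for a in assignments:
            for t in inst.get(rel, set()):
                b = dict(a)
                if all(b.setdefault(v, x) == x for v, x in zip(vars_, t)):
                    new.append(b)
        assignments = new
    return {tuple(a[v] for v in head_vars) for a in assignments}

def certain_answers(head_vars, body, universal_solution):
    """Proposition 2.7(1): certain(q, I) equals the null-free part q(J)↓ of q evaluated on a universal solution J."""
    return {t for t in eval_cq(head_vars, body, universal_solution)
            if not any(is_null(v) for v in t)}

# q(e, c) : ∃D (EmpDept(e, D) ∧ DeptCity(D, c)), evaluated on a universal solution.
J = {"EmpDept":  {("Alice", "_D1"), ("Bob", "_D2")},
     "DeptCity": {("_D1", "SJ"), ("_D2", "SD")},
     "Home":     {("Alice", "SF"), ("Bob", "SD"), ("Alice", "_H1"), ("Bob", "_H2")}}
q_body = [("EmpDept", ("e", "D")), ("DeptCity", ("D", "c"))]
print(certain_answers(("e", "c"), q_body, J))   # {('Alice', 'SJ'), ('Bob', 'SD')}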

3. DATA EXCHANGE WITH CORES

3.1 Multiple Universal Solutions
Even if we restrict attention to universal solutions instead of arbitrary solu-
tions, there may still exist multiple, nonisomorphic universal solutions for a
given instance of a data exchange problem. Moreover, although these universal
solutions are homomorphically equivalent to each other, they may have dif-
ferent sizes (where the size is the number of tuples). The following example
illustrates this state of affairs.
   Example 3.1. We again revisit our running example from Example 2.2. As
we noted earlier, of the three target instances given there, two of them (namely,
J0 and J ) are universal solutions for I . These are nonisomorphic universal
solutions (since they have different sizes). We now give an infinite family of
nonisomorphic universal solutions, that we shall make use of later.
   For every m ≥ 0, let Jm be the target instance
                     Jm = {Home(Alice, SF), Home(Bob, SD),
                           EmpDept(Alice, X 0 ), EmpDept(Bob, Y 0 ),
                               DeptCity(X 0 , SJ), DeptCity(Y 0 , SD),
                                                     ...
                               EmpDept(Alice, X m ), EmpDept(Bob, Y m ),
                               DeptCity(X m , SJ), DeptCity(Y m , SD)},
where X 0 , Y 0 , . . . , X m , Y m are distinct labeled nulls. (In the case of m = 0,
the resulting instance J0 is the same, modulo renaming of nulls, as the ear-
lier J0 from Example 2.2. We take the liberty of using the same name, since
the choice of nulls really does not matter.) It is easy to verify that each tar-
get instance Jm , for m ≥ 0, is a universal solution for I ; thus, there are in-
finitely many nonisomorphic universal solutions for I . It is also easy to see that
every universal solution must contain at least four tuples EmpDept(Alice, X ),
EmpDept(Bob, Y ), DeptCity(X , SJ), and DeptCity(Y, SD), for some labeled nulls
X and Y , as well as the tuples Home(Alice, SF) and Home(Bob, SD). Consequently,
the instance J0 has the smallest size among all universal solutions for I and
actually is the unique (up to isomorphism) universal solution of smallest size.
Thus, J0 is a rather special universal solution and, from a size point of view, a
preferred candidate to materialize in data exchange.
   Motivated by the preceding example, in the sequel we introduce and study the
concept of the core of a universal solution. We show that the core of a universal
solution is the unique (up to isomorphism) smallest universal solution. We then
address the problem of computing the core and also investigate the use of cores
in answering queries over the target schemas. The results that we will establish
make a compelling case that cores are the preferred solutions to materialize in
data exchange.

3.2 Cores and Universal Solutions
In addition to the notion of an instance over a schema (which we defined earlier),
we find it convenient to define the closely related notion of a structure over a
schema. The difference is that a structure is defined with a universe, whereas
the universe of an instance is implicitly taken to be the “active domain,” that is,
the set of elements that appear in tuples of the instance. Furthermore, unlike
target instances in data exchange settings, structures do not necessarily have
distinguished elements (“constants”) that have to be mapped onto themselves
by homomorphisms.
   More formally, a structure A (over the schema R = ⟨R1, . . . , Rk⟩) is a sequence
⟨A, R1^A, . . . , Rk^A⟩, where A is a nonempty set, called the universe, and each Ri^A is
a relation on A of the same arity as the relation symbol Ri. As with instances, we
shall often abuse the notation and use Ri to denote both the relation symbol and
the relation Ri^A that interprets it. We may refer to Ri^A as the Ri relation of A. If A
is finite, then we say that the structure is finite. A structure B = (B, R1^B, . . . , Rk^B)
is a substructure of A if B ⊆ A and Ri^B ⊆ Ri^A, for 1 ≤ i ≤ k. We say that
B is a proper substructure of A if it is a substructure of A and at least one
of the containments Ri^B ⊆ Ri^A, for 1 ≤ i ≤ k, is a proper one. A structure
B = (B, R1^B, . . . , Rk^B) is an induced substructure of A if B ⊆ A and, for every 1 ≤
i ≤ k, we have that Ri^B = {(x1, . . . , xn) | Ri^A(x1, . . . , xn) and x1, . . . , xn are in B}.

   Definition 3.2. A substructure C of structure A is called a core of A if there
is a homomorphism from A to C, but there is no homomorphism from A to a
proper substructure of C. A structure C is called a core if it is a core of itself,
that is, if there is no homomorphism from C to a proper substructure of C.

   Note that C is a core of A if and only if C is a core, C is a substructure of A,
and there is a homomorphism from A to C. The concept of the core of a graph
has been studied extensively in graph theory (see Hell and Nešetřil [1992]).
The next proposition summarizes some basic facts about cores; a proof can be
found in Hell and Nešetřil [1992].

   PROPOSITION 3.3.        The following statements hold:

—Every finite structure has a core; moreover, all cores of the same finite structure
 are isomorphic.
—Every finite structure is homomorphically equivalent to its core. Consequently,
 two finite structures are homomorphically equivalent if and only if their cores
 are isomorphic.
— If C is the core of a finite structure A, then there is a homomorphism h: A → C
  such that h(v) = v for every member v of the universe of C.
— If C is the core of a finite structure A, then C is an induced substructure
  of A.
   In view of Proposition 3.3, if A is a finite structure, there is a unique (up to
isomorphism) core of A, which we denote by core(A).
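   To make Definition 3.2 and Proposition 3.3 concrete, the following Python sketch computes a core of a small instance by brute force: it repeatedly applies an endomorphism whose image is a proper subinstance (constants stay fixed) until no such endomorphism exists. The search is exponential in the number of nulls and is meant only as an illustration of the definitions, not as the algorithmic content of Section 5; the encoding is the same illustrative one used earlier.

from itertools import product

def is_null(v):
    return isinstance(v, str) and v.startswith("_")

def domain(K):
    return sorted({v for ts in K.values() for t in ts for v in t})

def apply_map(h, K):
    return {R: {tuple(h.get(v, v) for v in t) for t in ts} for R, ts in K.items()}

def is_subinstance(A, B):
    return all(A.get(R, set()) <= B.get(R, set()) for R in A)

def brute_force_core(K):
    """Repeatedly apply a 'useful' endomorphism (one whose image is a proper
    subinstance of K) until none exists; the result is a core of K."""
    while True:
        dom = domain(K)
        nulls = [v for v in dom if is_null(v)]
        for images in product(dom, repeat=len(nulls)):
            h = dict(zip(nulls, images))               # constants stay fixed
            image = apply_map(h, K)
            if is_subinstance(image, K) and image != K:
                K = image
                break
        else:
            return K

J = {"EmpDept":  {("Alice", "_D1"), ("Bob", "_D2")},
     "DeptCity": {("_D1", "SJ"), ("_D2", "SD")},
     "Home":     {("Alice", "SF"), ("Bob", "SD"), ("Alice", "_H1"), ("Bob", "_H2")}}
print(brute_force_core(J))   # only the six tuples of J0 (with constants for the homes) survive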
   We can similarly define the notions of a subinstance of an instance and of a
core of an instance. We identify the instance with the corresponding structure,
where the universe of the structure is taken to be the active domain of the
instance, and where we distinguish the constants. That is, we require that if
h is a homomorphism and c is a constant, then h(c) = c (as already defined in
Section 2.2). The results about cores of structures will then carry over to cores
of instances.
   Universal solutions for I are unique up to homomorphic equivalence, but as
we saw in Example 3.1, they need not be unique up to isomorphism. Proposi-
tion 3.3, however, implies that their cores are isomorphic; in other words, all
universal solutions for I have the same core up to isomorphism. Moreover, if J
is a universal solution for I and core(J ) is a solution for I , then core(J ) is also
a universal solution for I , since J and core(J ) are homomorphically equiva-
lent. In general, if the dependencies Σst and Σt are arbitrary, then the core of
a solution to an instance of the data exchange problem need not be a solution.
The next result shows, however, that this cannot happen if Σst is a set of tgds
and Σt is a set of tgds and egds.
   PROPOSITION 3.4. Let (S, T, Σst, Σt) be a data exchange setting in which Σst
is a set of tgds and Σt is a set of tgds and egds. If I is a source instance and J is
a solution for I , then core(J ) is a solution for I . Consequently, if J is a universal
solution for I , then also core(J ) is a universal solution for I .
   PROOF. Let φS(x) → ∃y ψT(x, y) be a tgd in Σst and a = (a1, . . . , an) a tuple of
constants such that I |= φS(a). Since J is a solution for I, there is a tuple b =
(b1, . . . , bs) of elements of J such that ⟨I, J⟩ |= ψT(a, b). Let h be a homomor-
phism from J to core(J). Then h(ai) = ai, since each ai is a constant, for 1 ≤
i ≤ n. Consequently, ⟨I, core(J)⟩ |= ψT(a, h(b)), where h(b) = (h(b1), . . . , h(bs)).
Thus, ⟨I, core(J)⟩ satisfies the tgd.
   Next, let φT(x) → ∃y ψT(x, y) be a tgd in Σt and a = (a1, . . . , an) a tuple of
elements in core(J ) such that core(J ) |= φT (a). Since core(J ) is a subinstance
of J , it follows that J |= φT (a), and since J is a solution, it follows that there
is a tuple b = (b1 , . . . , bs ) of elements of J such that J |= ψT (a, b). According to
the last part of Proposition 3.3, there is a homomorphism h from J to core(J )
such that h(v) = v, for every v in core(J ). In particular, h(ai ) = ai , for 1 ≤ i ≤ n.
It follows that core(J ) |= ψT (a, h(b)), where h(b) = (h(b1 ), . . . , h(bs )). Thus,
core(J ) satisfies the tgd.
   Finally, let φT(x) → (x1 = x2) be an egd in Σt. If a = (a1, . . . , as) is a tuple of
elements in core(J) such that core(J) |= φT(a), then J |= φT(a), because core(J)
is a subinstance of J. Since J is a solution, it follows that a1 = a2. Thus, core(J)
satisfies every egd in Σt.
   COROLLARY 3.5. Let (S, T, Σst, Σt) be a data exchange setting in which Σst is
a set of tgds and Σt is a set of tgds and egds. If I is a source instance for which
a universal solution exists, then there is a unique (up to isomorphism) universal
solution J0 for I having the following properties:
— J0 is a core and is isomorphic to the core of every universal solution J for I .
— If J is a universal solution for I , there is a one-to-one homomorphism h from
  J0 to J . Hence, |J0 | ≤ |J |, where |J0 | and |J | are the sizes of J0 and J .
   We refer to J0 as the core of the universal solutions for I . As an illustration
of the concepts discussed in this subsection, recall the data exchange problem
of Example 3.1. Then J0 is indeed the core of the universal solutions for I .
   The core of the universal solutions is the preferred universal solution to
materialize in data exchange, since it is the unique most compact universal
solution. In turn, this raises the question of how to compute cores of universal
solutions. As mentioned earlier, universal solutions can be canonically com-
puted by using the chase. However, the result of such a chase, while a universal
solution, need not be the core. In general, an algorithm other than the chase
is needed for computing cores of universal solutions. In the next two sections,
we study what it takes to compute cores. We begin by analyzing the complexity
of computing cores of arbitrary instances and then focus on the computation of
cores of universal solutions in data exchange.

4. COMPLEXITY OF CORE IDENTIFICATION
Chandra and Merlin [1977] were the first to realize that computing the core
of a relational structure is an important problem in conjunctive query pro-
cessing and optimization. Unfortunately, in its full generality this problem is
intractable. Note that computing the core is a function problem, not a decision
problem. One way to gauge the difficulty of a function problem is to analyze the
computational complexity of its underlying decision problem.
   Definition 4.1. CORE IDENTIFICATION is the following decision problem: given
two structures A and B over some schema R such that B is a substructure of
A, is core(A) = B?
   It is easy to see that CORE IDENTIFICATION is an NP-hard problem. Indeed,
consider the following polynomial-time reduction from 3-COLORABILITY: a graph
G is 3-colorable if and only if core(G ⊕ K3 ) = K3 , where K3 is the complete
graph with 3 nodes and ⊕ is the disjoint sum operation on graphs. This re-
duction was already given by Chandra and Merlin [1977]. Later on, Hell and
Nešetřil [1992] studied the complexity of recognizing whether a graph is a core.
In precise terms, CORE RECOGNITION is the following decision problem: given a
structure A over some schema R, is A a core? Clearly, this problem is in coNP.
Hell and Nešetřil’s [1992] main result is that CORE RECOGNITION is a coNP-
complete problem, even if the inputs are undirected graphs. This is established
by exhibiting a rather sophisticated polynomial-time reduction from NON-3-
COLORABILITY on graphs of girth at least 7; the “gadgets” used in this reduction
are pairwise incomparable cores with certain additional properties. It follows
that CORE IDENTIFICATION is a coNP-hard problem. Nonetheless, it appears that
the exact complexity of CORE IDENTIFICATION has not been pinpointed in the lit-
erature until now. In the sequel, we will establish that CORE IDENTIFICATION is
a DP-complete problem. We present first some background material about the
complexity class DP.
   The class DP consists of all decision problems that can be written as the in-
tersection of an NP-problem and a coNP-problem; equivalently, DP consists of
all decision problems that can be written as the difference of two NP-problems.
This class was introduced by Papadimitriou and Yannakakis [1982], who dis-
covered several DP-complete problems. The prototypical DP-complete problem
is SAT/UNSAT: given two Boolean formulas φ and ψ, is φ satisfiable and ψ
unsatisfiable? Several problems that express some “critical” property turn out
to be DP-complete (see Papadimitriou [1994]). For instance, CRITICAL SAT is
DP-complete, where an instance of this problem is a CNF-formula φ and the
question is to determine whether φ is unsatisfiable, but if any one of its clauses
is removed, then the resulting formula is satisfiable. Moreover, Cosmadakis
[1983] showed that certain problems related to database query evaluation are
DP-complete. Note that DP contains both NP and coNP as subclasses; further-
more, each DP-complete problem is both NP-hard and coNP-hard. The pre-
vailing belief in computational complexity is that the above containments are
proper, but proving this remains an outstanding open problem. In any case,
establishing that a certain problem is DP-complete is interpreted as signify-
ing that this problem is intractable and, in fact, “more intractable” than an
NP-complete problem.
   Here, we establish that CORE IDENTIFICATION is a DP-complete problem by
exhibiting a reduction from 3-COLORABILITY/NON-3-COLORABILITY on graphs of
girth at least 7. This reduction is directly inspired by the reduction of NON-3-
COLORABILITY on graphs of girth at least 7 to CORE RECOGNITION, given in Hell
and Nešetřil [1992].

  THEOREM 4.2. CORE IDENTIFICATION is DP-complete, even if the inputs are
undirected graphs.

   In proving the above theorem, we make essential use of the following result,
which is a special case of Theorem 6 in Hell and Nešetřil [1992]. Recall that the
girth of a graph is the length of the shortest cycle in the graph.
   THEOREM 4.3 (HELL AND NEŠETŘIL 1992). For each positive integer N, there
is a sequence A1 , . . . A N of connected graphs such that

(1) each Ai is 3-colorable, has girth 5, and each edge of Ai is on a 5-cycle;
(2) each Ai is a core; moreover, for every i, j with i ≤ N, j ≤ N and i ≠ j, there
    is no homomorphism from Ai to Aj;
(3) each Ai has at most 15(N + 4) nodes; and
(4) there is a polynomial-time algorithm that, given N , constructs the sequence
    A1 , . . . A N .

   We now have the machinery needed to prove Theorem 4.2.
   PROOF OF THEOREM 4.2. CORE IDENTIFICATION is in DP, because, given two
structures A and B over some schema R such that B is a substructure of A,
to determine whether core(A) = B one has to check whether there is a homo-
morphism from A to B (which is in NP) and whether B is a core (which is in
coNP).
   We will show that CORE IDENTIFICATION is DP-hard, even if the inputs
are undirected graphs, via a polynomial-time reduction from 3-COLORABILITY/
NON-3-COLORABILITY. As a stepping stone in this reduction, we will define CORE
HOMOMORPHISM, which is the following variant of CORE IDENTIFICATION: given two
structures A and B, is there a homomorphism from A to B, and is B a core?
There is a simple polynomial-time reduction of CORE HOMOMORPHISM to CORE
IDENTIFICATION, where the instance (A, B) is mapped onto (A ⊕ B, B). This is a
reduction, since there is a homomorphism from A to B with B as a core if and
only if core(A⊕B) = B. Thus, it remains to show that there is a polynomial-time
reduction of 3-COLORABILITY/NON-3-COLORABILITY to CORE HOMOMORPHISM.
   Hell and Nešetřil [1992] showed that 3-COLORABILITY is NP-complete even if
the input graphs have girth at least 7 (this follows from Theorem 7 in Hell
and Nešetřil [1992] by taking A to be a self-loop and B to be K3). Hence, 3-
COLORABILITY/NON-3-COLORABILITY is DP-complete, even if the input graphs G
and H have girth at least 7. So, assume that we are given two graphs G and
H each having girth at least 7. Let v1 , . . . , vm be an enumeration of the nodes
of G, let w1 , . . . , wn be an enumeration of the nodes of H, and let N = m +
n. Let A1 , . . . , A N be a sequence of connected graphs having the properties
listed in Theorem 4.3. This sequence can be constructed in time polynomial in
N ; moreover, we can assume that these graphs have pairwise disjoint sets of
nodes. Let G∗ be the graph obtained by identifying each node vi of G with some
arbitrarily chosen node of Ai , for 1 ≤ i ≤ m (and keeping the edges between
nodes of G intact). Thus, the nodes of G∗ are the nodes that appear in the Ai ’s,
and the edges are the edges in the Ai ’s, along with the edges of G under our
identification. Similarly, let H∗ be the graph obtained by identifying each node
w j of H with some arbitrarily chosen node of A j , for m + 1 ≤ j ≤ N = m + n
(and keeping the edges between nodes of H intact). We now claim that G is 3-
colorable and H is not 3-colorable if and only if there is a homomorphism from
G∗ ⊕ K3 to H∗ ⊕ K3, and H∗ ⊕ K3 is a core. Hell and Nešetřil [1992] showed that
CORE RECOGNITION is coNP-complete by showing that a graph H of girth at least
7 is not 3-colorable if and only if the graph H∗ ⊕ K3 is a core. We will use this
property in order to establish the above claim.
   Assume first that G is 3-colorable and H is not 3-colorable. Since each Ai is
a 3-colorable graph, G∗ ⊕ K3 is 3-colorable, and so there is a homomorphism from
G∗ ⊕ K3 to H∗ ⊕ K3 (in fact, to K3). Moreover, as shown in Hell and Nešetřil
[1992], H∗ ⊕ K3 is a core, since H is not 3-colorable. For the other direction,
assume that there is a homomorphism from G∗ ⊕ K3 to H∗ ⊕ K3 , and H∗ ⊕ K3 is
a core. Using again the results in Hell and Nešetřil [1992], we infer that H is not
3-colorable. It remains to prove that G is 3-colorable. Let h be a homomorphism
from G∗ ⊕ K3 to H∗ ⊕ K3 . We claim that h actually maps G∗ to K3 ; hence, G is
3-colorable. Let us consider the image of each graph Ai , with 1 ≤ i ≤ m, under
the homomorphism h. Observe that Ai cannot be mapped to some A j , when
m + 1 ≤ j ≤ N = m + n, since, for every i and j such that 1 ≤ i ≤ m and
m + 1 ≤ j ≤ N = m + n, there is no homomorphism from Ai to A j . Observe
also that the image of a cycle C under a homomorphism is a cycle C′ of length
less than or equal to the length of C. Since H has girth at least 7 and since each
edge of Ai is on a 5-cycle, the image of Ai under h cannot be contained in H.
For the same reason, the image of Ai under h cannot contain nodes from H and
some A j , for m + 1 ≤ j ≤ N = m + n; moreover, it cannot contain nodes from
two different A j ’s, for m + 1 ≤ j ≤ N = m + n (here, we also use the fact that
each A j has girth 5). Consequently, the homomorphism h must map each Ai ,
1 ≤ i ≤ m, to K3 . Hence, h maps G∗ to K3 , and so G is 3-colorable.
   It should be noted that problems equivalent to CORE RECOGNITION and CORE
IDENTIFICATION have been investigated in logic programming and artificial intel-
ligence. Specifically, Gottlob and Fermüller [1993] studied the problem of re-
moving redundant literals from a clause, and analyzed the computational com-
plexity of two related decision problems: the problem of determining whether
a given clause is condensed and the problem of determining whether, given
two clauses, one is a condensation of the other. Gottlob and Fermüller showed
that the first problem is coNP-complete and the second is DP-complete. As it
turns out, determining whether a given clause is condensed is equivalent to
CORE RECOGNITION, while determining whether a clause is a condensation of an-
other clause is equivalent to CORE IDENTIFICATION. Thus, the complexity of CORE
RECOGNITION and CORE IDENTIFICATION for relational structures (but not for undi-
rected graphs) can also be derived from the results in Gottlob and Fermüller
[1993]. As a matter of fact, the reductions in Gottlob and Fermüller [1993] give
easier proofs for the coNP-hardness and DP-hardness of CORE RECOGNITION and
CORE IDENTIFICATION, respectively, for undirected graphs with constants, that
is, undirected graphs in which certain nodes are distinguished so that every
homomorphism maps each such constant to itself (alternatively, graphs with
constants can be viewed as relational structures with a binary relation for the
edges and unary relations each of which consists of one of the constants). For in-
stance, the coNP-hardness of CORE RECOGNITION for graphs with constants can
be established via the following reduction from the CLIQUE problem. Given an
undirected graph G and a positive integer k, consider the disjoint sum G ⊕ Kk ,
where Kk is the complete graph with k elements. If every node in G is viewed
as a constant, then G ⊕ Kk is a core if and only if G does not contain a clique
with k elements.
   We now consider the implications of the intractability of CORE RECOGNITION for
the problem of computing the core of a structure. As stated earlier, Chandra and
Merlin [1977] observed that a graph G is 3-colorable if and only if core(G⊕K3 ) =
K3 . It follows that, unless P = NP, there is no polynomial-time algorithm for
computing the core of a given structure. Indeed, if such an algorithm existed,
then we could determine in polynomial time whether a graph is 3-colorable by
first running the algorithm to compute the core of G ⊕ K3 and then checking if
the answer is equal to K3 .
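   A small Python sketch of the observation just used: deciding whether G is 3-colorable amounts to searching for a homomorphism from G into K3, and by the Chandra and Merlin reduction this is the same as asking whether core(G ⊕ K3) = K3. Graphs are encoded here as sets of symmetric directed edges; the helper names and sample graphs are our own illustrative choices.

from itertools import product

K3 = {(i, j) for i in range(3) for j in range(3) if i != j}   # the complete graph on 3 nodes

def nodes(G):
    return sorted({v for e in G for v in e}, key=str)

def three_colorable(G):
    """Brute-force search for a homomorphism G -> K3, i.e., a proper 3-coloring."""
    vs = nodes(G)
    for colors in product(range(3), repeat=len(vs)):
        h = dict(zip(vs, colors))
        if all((h[u], h[v]) in K3 for (u, v) in G):
            return True
    return False

def disjoint_sum_with_k3(G):
    """G ⊕ K3, with K3's nodes renamed so that the two node sets are disjoint."""
    return set(G) | {("k%d" % i, "k%d" % j) for (i, j) in K3}

C5 = {(i, (i + 1) % 5) for i in range(5)} | {((i + 1) % 5, i) for i in range(5)}
K4 = {(i, j) for i in range(4) for j in range(4) if i != j}
print(three_colorable(C5), three_colorable(K4))     # True False
print(len(disjoint_sum_with_k3(C5)))                # 16 edges: the 10 of C5 plus the 6 of K3
# G is 3-colorable exactly when core(G ⊕ K3) = K3, so core identification is at least as hard.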
   Note, however, that in data exchange we are interested in computing the
core of a universal solution, rather than the core of an arbitrary instance.

Consequently, we cannot assume a priori that the above intractability car-
ries over to the data exchange setting, since polynomial-time algorithms for
computing the core of universal solutions may exist. We address this next.

5. COMPUTING THE CORE IN DATA EXCHANGE
In contrast with the case of computing the core of an arbitrary instance, comput-
ing the core of a universal solution in data exchange does have polynomial-time
algorithms, in certain natural data exchange settings. Specifically, in this sec-
tion we show that the core of a universal solution can be computed in polynomial
time in data exchange settings in which Σst is an arbitrary set of tgds and Σt
is a set of egds.
    We give two rather different polynomial-time algorithms for the task of com-
puting the core in data exchange settings in which Σst is an arbitrary set of
tgds and Σt is a set of egds: a greedy algorithm and an algorithm we call the
blocks algorithm. Section 5.1 is devoted to the greedy algorithm. In Section 5.2
we present the blocks algorithm for data exchange settings with no target con-
straints (i.e., Σt = ∅). We then show in Section 5.3 that essentially the same
blocks algorithm works if we remove the emptiness condition on Σt and al-
low it to contain egds. Although the blocks algorithm is more complicated than
the greedy algorithm (and its proof of correctness much more involved), it has
certain advantages for data exchange that we will describe later on.
    In what follows, we assume that (S, T, Σst, Σt) is a data exchange setting
such that Σst is a set of tgds and Σt is a set of egds. Given a source instance
I, we let J be the target instance obtained by chasing ⟨I, ∅⟩ with Σst. We call
J a canonical preuniversal instance for I. Note that J is a canonical universal
solution for I with respect to the data exchange setting (S, T, Σst, ∅) (that is, no
target constraints).

5.1 Greedy Algorithm
Intuitively, given a source instance I , the greedy algorithm first determines
whether solutions for I exist, and then, if solutions exist, computes the core of
the universal solutions for I by successively removing tuples from a canonical
universal solution for I, as long as I and the instance resulting in each step
satisfy the tgds in Σst. Recall that a fact is an expression of the form R(t) indi-
cating that the tuple t belongs to the relation R; moreover, every instance can
be identified with the set of all facts arising from the relations of that instance.
  Algorithm 5.1 (Greedy Algorithm).
Input: source instance I.
Output: the core of the universal solutions for I, if solutions exist; “failure,” otherwise.
(1) Chase I with Σst to produce a canonical pre-universal instance J.
(2) Chase J with Σt; if the chase fails, then stop and return “failure”; otherwise, let J′
    be the canonical universal solution for I produced by the chase.
(3) Initialize J∗ to be J′.
(4) While there is a fact R(t) in J∗ such that ⟨I, J∗ − {R(t)}⟩ satisfies Σst, set J∗ to be
    J∗ − {R(t)}.
(5) Return J∗.
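   The following Python sketch illustrates Step (4) under the same illustrative encodings used earlier; it assumes a canonical universal solution has already been produced by Steps (1) through (3), and the tgd and instances in the usage example are hypothetical stand-ins rather than the paper's.

def matches(atoms, inst, start=None):
    """All assignments extending `start` that map every (relation, variables) atom into `inst`."""
    assignments = [dict(start or {})]
    for rel, vars_ in atoms:
        new = []
        for a in assignments:
            for t in inst.get(rel, set()):
                b = dict(a)
                if all(b.setdefault(v, x) == x for v, x in zip(vars_, t)):
                    new.append(b)
        assignments = new
    return assignments

def satisfies_st(I, J, tgds):
    """Does <I, J> satisfy every tgd body(source atoms) -> exists y. head(target atoms)?"""
    return all(matches(head, J, start=a)            # some extension satisfies the head
               for body, head in tgds
               for a in matches(body, I))

def greedy_core(I, J, tgds):
    """Step (4): drop target facts one at a time as long as <I, J*> still satisfies the tgds."""
    J_star = {R: set(ts) for R, ts in J.items()}
    changed = True
    while changed:
        changed = False
        for R in J_star:
            for t in sorted(J_star[R]):
                J_star[R].discard(t)                # tentatively remove the fact
                if satisfies_st(I, J_star, tgds):
                    changed = True
                    break                           # keep the smaller instance
                J_star[R].add(t)                    # otherwise put the fact back
            if changed:
                break
    return J_star

# Hypothetical example: a universal solution with a redundant department for Alice.
I = {"EmpCity": {("Alice", "SJ"), ("Bob", "SD")}}
d2 = ([("EmpCity", ("e", "c"))], [("EmpDept", ("e", "D")), ("DeptCity", ("D", "c"))])
J = {"EmpDept":  {("Alice", "_X0"), ("Alice", "_X1"), ("Bob", "_Y0")},
     "DeptCity": {("_X0", "SJ"), ("_X1", "SJ"), ("_Y0", "SD")}}
print(greedy_core(I, J, [d2]))   # the redundant copies for Alice are removed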

   THEOREM 5.2. Assume that (S, T, Σst, Σt) is a data exchange setting such
that Σst is a set of tgds and Σt is a set of egds. Then Algorithm 5.1 is a correct,
polynomial-time algorithm for computing the core of universal solutions.

    PROOF. As shown in Fagin et al. [2003] (see also Theorem 2.6), the chase is
a correct, polynomial-time algorithm for determining whether, given a source
instance I, a solution exists and, if so, producing the canonical universal
solution J′.
    Assume that for a given source instance I, a canonical universal solution J′
for I has been produced in Step (2) of the greedy algorithm. We claim that each
target instance J∗ produced during the iterations of the while loop in Step (4)
is a universal solution for I. To begin with, ⟨I, J∗⟩ satisfies the tgds in Σst by
construction. Furthermore, J∗ satisfies the egds in Σt, because J∗ is a subin-
stance of J′, and J′ satisfies the egds in Σt. Consequently, J∗ is a solution for
I; moreover, it is a universal solution, since it is a subinstance of the canonical
universal solution J′ for I and thus it can be mapped homomorphically into
every solution for I.
    Let C be the target instance returned by the algorithm. Then C is a universal
solution for I and hence it contains an isomorphic copy J0 of the core of the
universal solutions as a subinstance. We claim that C = J0. Indeed, if there is
a fact R(t) in C − J0, then ⟨I, C − {R(t)}⟩ satisfies the tgds in Σst, since ⟨I, J0⟩
satisfies the tgds in Σst and J0 is a subinstance of C − {R(t)}; thus, the algorithm
could not have returned C as output.
    In order to analyze the running time of the algorithm, we consider the
following parameters: m is the size of the source instance I (number of tuples
in I); a is the maximum number of universally quantified variables over all
tgds in Σst; b is the maximum number of existentially quantified variables over
all tgds in Σst; finally, a′ is the maximum number of universally quantified
variables over all egds in Σt. Since the data exchange setting is fixed, the
quantities a, b, and a′ are constants.
    Given a source instance I of size m, the size of the canonical preuniversal
instance J is O(m^a) and the time needed to produce it is O(m^{a+ab}). Indeed,
the canonical preuniversal instance is constructed by considering each tgd
(∀x)(φS(x) → (∃y)ψT(x, y)) in Σst, instantiating the universally quantified
variables x with elements from I in every possible way, and, for each such
instantiation, checking whether the existentially quantified variables y can
be instantiated by existing elements so that the formula ψT(x, y) is satisfied,
and, if not, adding null values and facts to satisfy it. Since Σst is fixed, at
most a constant number of facts are added at each step, which accounts for
the O(m^a) bound on the size of the canonical preuniversal instance. There
are O(m^a) possible instantiations of the universally quantified variables, and
for each such instantiation O((m^a)^b) steps are needed to check whether the
existentially quantified variables can be instantiated by existing elements,
hence the total time required to construct the canonical preuniversal instance
is O(m^{a+ab}).
    The size of the canonical universal solution J′ is also O(m^a) (since it is at
most the size of J) and the time needed to produce J′ from J is O(m^{aa′+2a}).
Indeed, chasing with the egds in Σt requires at most O((m^a)^2) = O(m^{2a}) chase
steps, since in the worst case every two values will be set equal to each other.
Moreover, each chase step takes time O((m^a)^{a′}), since at each step we need to
instantiate the universally quantified variables in the egds in every possible
way.
   The while loop in Step (4) requires at most O(m^a) iterations, each of which
takes O(m^{a+ab}) steps to verify that Σst is satisfied by ⟨I, J∗ − {R(t)}⟩. Thus,
Step (4) takes time O(m^{2a+ab}). It follows that the running time of the greedy
algorithm is O(m^{2a+ab} + m^{2a+aa′}).

   Several remarks are in order now. First, it should be noted that the cor-
rectness of the greedy algorithm depends crucially on the assumption that Σt
consists of egds only. The crucial property that holds for egds, but fails for tgds,
is that if an instance satisfies an egd, then every subinstance of it also satisfies
that egd. Thus, if the greedy algorithm is applied to data exchange settings in
which Σt contains at least one tgd, then the output of the algorithm may fail to
be a solution for the input instance. One can consider a variant of the greedy
algorithm in which the test in the while loop is that ⟨I, J∗ − {R(t)}⟩ satisfies
both Σst and Σt. This modified greedy algorithm outputs a universal solution
for I, but it is not too hard to construct examples in which the output is not the
core of the universal solutions for I.
   Note that Step (4) of the greedy algorithm can also be construed as a
polynomial-time algorithm for producing the core of the universal solutions,
given a source instance I and some arbitrary universal solution J for I . The
first two steps of the greedy algorithm produce a universal solution for I in time
polynomial in the size of the source instance I or determine that no solution
for I exists, so that the entire greedy algorithm runs in time polynomial in the
size of I .
   Although the greedy algorithm is conceptually simple and its proof of correct-
ness transparent, it requires that the source instance I be available throughout
the execution of the algorithm. There are situations, however, in which the orig-
inal source I becomes unavailable, after a canonical universal solution J for
I has been produced. In particular, the Clio system [Popa et al. 2002] uses a
specialized engine to produce a canonical universal solution, when there are no
target constraints, or a canonical preuniversal instance, when there are target
constraints. Any further processing, such as chasing with target egds or pro-
ducing the core, will have to be done by another engine or application that may
not have access to the original source instance.
   This state of affairs raises the question of whether the core of the universal
solutions can be produced in polynomial time using only a canonical univer-
sal solution or only a canonical pre-universal instance. In what follows, we
describe such an algorithm, called the blocks algorithm, which has the fea-
ture that it can start from either a canonical universal solution or a canonical
pre-universal instance, and has no further need for the source instance. We
present the blocks algorithms in two stages: first, for the case in which there
are no target constraints (Σt = ∅), and then for the case in which Σt is a set of
egds.
5.2 Blocks Algorithm: No Target Constraints
We first define some notions that are needed in order to state the algorithm as
well as to prove its correctness and polynomial-time bound. For the next two
definitions, we assume K to be an arbitrary instance whose elements consist
of constants from Const and nulls from Var. We say that two elements of K are
adjacent if there exists some tuple in some relation of K in which both elements
occur.

   Definition 5.3. The Gaifman graph of the nulls of K is an undirected graph
in which (1) the nodes are all the nulls of K , and (2) there exists an edge between
two nulls whenever the nulls are adjacent in K . A block of nulls is the set of
nulls in a connected component of the Gaifman graph of the nulls.
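   Definition 5.3 translates directly into code. The sketch below, in the same illustrative encoding as before, builds the Gaifman graph of the nulls and returns its connected components, that is, the blocks of nulls; the sample instance is a hypothetical canonical universal solution in which every block happens to be a singleton.

from collections import defaultdict

def is_null(v):
    return isinstance(v, str) and v.startswith("_")

def blocks(K):
    """Connected components of the Gaifman graph of the nulls of K (Definition 5.3)."""
    adj = defaultdict(set)
    for ts in K.values():
        for t in ts:
            nulls = [v for v in t if is_null(v)]
            for u in nulls:
                adj[u].update(nulls)        # nulls occurring in one fact are pairwise adjacent
    seen, result = set(), []
    for start in adj:                       # connected components by depth-first search
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        result.append(comp)
    return result

J = {"EmpDept":  {("Alice", "_X0"), ("Bob", "_Y0")},
     "DeptCity": {("_X0", "SJ"), ("_Y0", "SD")},
     "Home":     {("Alice", "_H1"), ("Bob", "_H2")}}
print(blocks(J))   # four singleton blocks: {_X0}, {_Y0}, {_H1}, {_H2}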

   If y is a null of K , then we may refer to the block of nulls that contains y as
the block of y. Note that, by the definition of blocks, the set Var(K ) of all nulls
of K is partitioned into disjoint blocks. Let K and K′ be two instances with
elements in Const ∪ Var. Recall that K′ is a subinstance of K if every tuple of a
relation of K′ is a tuple of the corresponding relation of K.

   Definition 5.4. Let h be a homomorphism of K . Denote the result of ap-
plying h to K by h(K ). If h(K ) is a subinstance of K , then we call h an endo-
morphism of K. An endomorphism h of K is useful if h(K) ≠ K (i.e., h(K) is a
proper subinstance of K).

  The following lemma is a simple characterization of useful endomorphisms
that we will make use of in proving the main results of this subsection and of
Section 5.3.

   LEMMA 5.5. Let K be an instance, and let h be an endomorphism of K . Then
h is useful if and only if h is not one-to-one.

   PROOF. Assume that h is not one-to-one. Then there is some x that is in the
domain of h but not in the range of h (here we use the fact that the instance is
finite.) So no tuple containing x is in h(K). Therefore, h(K) ≠ K, and so h is
useful.
   Now assume that h is one-to-one. So h is simply a renaming of the members
of K , and so an isomorphism of K . Thus, h(K ) has the same number of tuples
as K . Since h(K ) is a subinstance of K , it follows that h(K ) = K (here again
we use the fact that the instance K is finite). So h is not useful.

   For the rest of this subsection, we assume that we are given a data exchange
setting (S, T, Σst, ∅) and a source instance I. Moreover, we assume that J is a
canonical universal solution for this data exchange problem. That is, J is such
that ⟨I, J⟩ is the result of chasing ⟨I, ∅⟩ with Σst. Our goal is to compute core(J),
that is, a subinstance C of J such that (1) C = h(J ) for some endomorphism
h of J , and (2) there is no proper subinstance of C with the same property
(condition (2) is equivalent to there being no endomorphism of C onto a proper
subinstance of C). The central idea of the algorithm, as we shall see, is to show
that the above mentioned endomorphism h of J can be found as the composition
of a polynomial-length sequence of “local” (or “small”) endomorphisms, each of
which can be found in polynomial time. We next define what “local” means.

   Definition 5.6. Let K and K′ be two instances such that the nulls of K′
form a subset of the nulls of K, that is, Var(K′) ⊆ Var(K). Let h be some endo-
morphism of K′, and let B be a block of nulls of K. We say that h is K-local for
B if h(x) = x whenever x ∉ B. (Since all the nulls of K′ are among the nulls
of K, it makes sense to consider whether or not a null x of K′ belongs to the
block B of K.) We say that h is K-local if it is K-local for B, for some block
B of K.

   The next lemma is crucial for the existence of the polynomial-time algorithm
for computing the core of a universal solution.

   LEMMA 5.7. Assume a data exchange setting where Σst is a set of tgds and
Σt = ∅. Let J′ be a subinstance of the canonical universal solution J. If there
exists a useful endomorphism of J′, then there exists a useful J-local endomor-
phism of J′.

    PROOF. Let h be a useful endomorphism of J′. By Lemma 5.5, we know that
h is not one-to-one. So there is a null y that appears in J′ but does not appear
in h(J′). Let B be the block of y (in J). Define h′ on J′ by letting h′(x) = h(x) if
x ∈ B, and h′(x) = x otherwise.
    We show that h′ is an endomorphism of J′. Let (u1, . . . , us) be a tuple of
the R relation of J′; we must show that (h′(u1), . . . , h′(us)) is a tuple of the R
relation of J′. Since J′ is a subinstance of J, the tuple (u1, . . . , us) is also a tuple
of the R relation of J. Hence, by the definition of a block of J, all the nulls among
u1, . . . , us are in the same block B′. There are two cases, depending on whether
or not B′ = B. Assume first that B′ = B. Then, by definition of h′, for every ui
among u1, . . . , us, we have that h′(ui) = h(ui) if ui is a null, and h′(ui) = ui =
h(ui) if ui is a constant. Hence (h′(u1), . . . , h′(us)) = (h(u1), . . . , h(us)). Since h
is an endomorphism of J′, we know that (h(u1), . . . , h(us)) is a tuple of the R
relation of J′. Thus, (h′(u1), . . . , h′(us)) is a tuple of the R relation of J′. Now
assume that B′ ≠ B. So for every ui among u1, . . . , us, we have that h′(ui) = ui.
Hence (h′(u1), . . . , h′(us)) = (u1, . . . , us). Therefore, once again, (h′(u1), . . . , h′(us))
is a tuple of the R relation of J′, as desired. Hence, h′ is an endomorphism
of J′.
    Finally, h′ is J-local for B by construction, and it is useful: the null y appears
in J′, but it does not appear in h′(J′), since h′(x) = h(x) ≠ y for x ∈ B (y does not
appear in h(J′)) and h′(x) = x ≠ y for x ∉ B (because y ∈ B). Hence h′(J′) ≠ J′.

   We now present the blocks algorithm for computing the core of the universal
solutions, when Σt = ∅.
  Algorithm 5.8 (Blocks Algorithm: No Target Constraints).
Input: source instance I.
Output: the core of the universal solutions for I.

(1) Compute J, the canonical universal solution, from ⟨I, ∅⟩ by chasing with Σst.
(2) Compute the blocks of J, and initialize J′ to be J.
(3) Check whether there exists a useful J-local endomorphism h of J′. If not, then stop
    with result J′.
(4) Update J′ to be h(J′), and return to Step (3).
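   Step (3) is the only nontrivial step of the algorithm; the sketch below shows the exhaustive search it performs for a single block B, in the same illustrative encoding as before. Only the nulls of B are allowed to move, so there are at most n^|B| ≤ n^b candidate maps, which is the source of the polynomial bound established in Theorem 5.9. The instance and block in the usage example are hypothetical.

from itertools import product

def apply_map(h, K):
    return {R: {tuple(h.get(v, v) for v in t) for t in ts} for R, ts in K.items()}

def is_subinstance(A, B):
    return all(A.get(R, set()) <= B.get(R, set()) for R in A)

def useful_local_endomorphism(J_prime, block):
    """Search for a useful endomorphism of J_prime that is J-local for `block`
    (identity outside the block); return one as a dict, or None if none exists."""
    dom = sorted({v for ts in J_prime.values() for t in ts for v in t})
    movable = [x for x in block if x in dom]          # nulls of the block still present
    for images in product(dom, repeat=len(movable)):
        h = dict(zip(movable, images))                # identity outside the block
        image = apply_map(h, J_prime)
        if is_subinstance(image, J_prime) and image != J_prime:
            return h
    return None

# Hypothetical instance in which the block {_X1} is redundant.
Jp = {"EmpDept":  {("Alice", "_X0"), ("Alice", "_X1")},
      "DeptCity": {("_X0", "SJ"), ("_X1", "SJ")}}
print(useful_local_endomorphism(Jp, {"_X1"}))   # e.g. {'_X1': '_X0'}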

   THEOREM 5.9. Assume that (S, T, Σst, Σt) is a data exchange setting such
that Σst is a set of tgds and Σt = ∅. Then Algorithm 5.8 is a correct, polynomial-
time algorithm for computing the core of the universal solutions.

   PROOF. We first show that Algorithm 5.8 is correct, that is, that the final
instance C at the conclusion of the algorithm is the core of the given universal
solution. Every time we apply Step (4) of the algorithm, we are replacing the
instance by a homomorphic image. Therefore, the final instance C is the result
of applying a composition of homomorphisms to the input instance, and hence
is a homomorphic image of the canonical universal solution J . Also, since each
of the homomorphisms found in Step (3) is an endomorphism, we have that C
is a subinstance of J . Assume now that C is not the core; we shall derive a
contradiction. Since C is not the core, there is an endomorphism h such that
when h is applied to C, the resulting instance is a proper subinstance of C.
Hence, h is a useful endomorphism of C. Therefore, by Lemma 5.7, there must
exist a useful J -local endomorphism of C. But then Algorithm 5.8 should not
have stopped in Step 3 with C. This is the desired contradiction. Hence, C is
the core of J .
   We now show that Algorithm 5.8 runs in polynomial time. To do so, we need
to consider certain parameters. As in the analysis of the greedy algorithm, the
first parameter, denoted by b, is the maximum number of existentially quanti-
fied variables over all tgds in Σst. Since we are taking Σst to be fixed, the quantity
b is a constant. It follows easily from the construction of the canonical universal
solution J (by chasing with Σst) that b is an upper bound on the size of a block
in J. The second parameter, denoted by n, is the size of the canonical univer-
sal solution J (number of tuples in J); as seen in the analysis of the greedy
algorithm, n is O(m^a), where a is the maximum number of universally quan-
tified variables over all tgds in Σst and m is the size of I. Let J′ be the instance
in some execution of Step (3). For each block B, to check if there is a useful
endomorphism of J′ that is J-local for B, we can exhaustively check each of
the possible functions h on the domain of J′ such that h(x) = x whenever x ∉ B:
there are at most n^b such functions. To check that such a function is actually
a useful endomorphism requires time O(n). Since there are at most n blocks,
the time to determine if there is a block with a useful J-local endomorphism is
O(n^{b+2}). The updating time in Step (4) is O(n).
   By Lemma 5.5, after Step (4) is executed, there is at least one less null in J′
than there was before. Since there are initially at most n nulls in the instance,
it follows that the number of loops that Algorithm 5.8 performs is at most n.
Therefore, the running time of the algorithm (except for Step (1) and Step (2),
which are executed only once) is at most n (the number of loops) times O(n^{b+2}),
that is, O(n^{b+3}). Since Step (1) and Step (2) take polynomial time as well, it
follows that the entire algorithm executes in polynomial time.

   The crucial observation behind the polynomial-time bound is that the total
number of endomorphisms that the algorithm explores in Step (3) is at most
n^b for each block of J. This is in strong contrast with the case of minimizing
arbitrary instances with constants and nulls for which we may need to explore
a much larger number of endomorphisms (up to n^n, in general) in one mini-
mization step.

5.3 Blocks Algorithm: Target Egds
In this subsection, we extend Theorem 5.9 by showing that there is a
polynomial-time algorithm for finding the core even when Σt is a set of egds.
   Thus, we assume next that we are given a data exchange setting (S, T, Σst, Σt)
where Σt is a set of egds. We are also given a source instance I. As with the
greedy algorithm, let J be a canonical preuniversal instance, that is, J is the
result of chasing I with Σst. Let J′ be the canonical universal solution obtained
by chasing J with Σt. Our goal is to compute core(J′), that is, a subinstance
C of J′ such that C = h(J′) for some endomorphism h of J′, and such that
there is no proper subinstance of C with the same property. As in the case when
Σt = ∅, the central idea of the algorithm is to show that the above-mentioned
endomorphism h of J′ can be found as the composition of a polynomial-length
sequence of “small” endomorphisms, each findable in polynomial time. As in the
case when Σt = ∅, “small” will mean J-local. We make this precise in the next
lemma. This lemma, crucial for the existence of the polynomial-time algorithm
for computing core(J′), is a nontrivial generalization of Lemma 5.7.

    LEMMA 5.10. Assume a data exchange setting where Σst is a set of tgds and
Σt is a set of egds. Let J be the canonical preuniversal instance, and let J′′ be an
endomorphic image of the canonical universal solution J′. If there exists a useful
endomorphism of J′′, then there exists a useful J-local endomorphism of J′′.

    The proof of Lemma 5.10 requires additional definitions as well as two addi-
tional lemmas. We start with the required definitions.
    Let J be the canonical preuniversal instance, and let J′ be the canonical
universal solution produced from J by chasing with the set Σt of egds. We
define a directed graph whose nodes are the members of J, both nulls and
constants. If during the chase process a null u gets replaced by v (either a null
or a constant), then there is an edge from u to v in the graph. Let ≤ be the
reflexive, transitive closure of this graph. It is easy to see that ≤ is a reflexive
partial order. For each node u, define [u] to be the maximal (under ≤) node v such
that u ≤ v. Intuitively, u eventually gets replaced by [u] as a result of the chase.
It is clear that every member of J′ is of the form [u]. It is also clear that if u is a
constant, then u = [u]. Let us write u ∼ v if [u] = [v]. Intuitively, u ∼ v means
that u and v eventually collapse to the same element as a result of the chase.

  Definition 5.11. Let K be an instance whose elements are constants and
nulls. Let y be some element of K . We say that y is rigid if h( y) = y for every
homomorphism h of K . (In particular, all constants occurring in K are rigid.)

   A key step in the proof of Lemma 5.10 is the following surprising result,
which says that if two nulls in different blocks of J both collapse onto the same
element z of J′ as a result of the chase, then z is rigid, that is, h(z) = z for
every endomorphism h of J′.
   LEMMA 5.12 (RIGIDITY LEMMA). Assume a data exchange setting where Σst is
a set of tgds and Σt is a set of egds. Let J be the canonical preuniversal instance,
and let J′ be the result of chasing J with the set Σt of egds. Let x and y be nulls
of J such that x ∼ y, and such that [x] is a nonrigid null of J′. Then x and y
are in the same block of J.
    PROOF. Assume that x and y are nulls in different blocks of J with x ∼ y.
We must show that [x] is rigid in J . Let φ be the diagram of the instance J , that
is, the conjunction of all expressions S(u1 , . . . , us ) where (u1 , . . . , us ) is a tuple
of the S relation of J . (We are treating members of J , both constants and nulls,
as variables.) Let τ be the egd φ → (x = y). Since x ∼ y, it follows that t |= τ .
This is because the chase sets variables equal only when it is logically forced
to (the result appears in papers that characterize the implication problem for
dependencies; see, for instance, Beeri and Vardi [1984]; Maier et al. [1979]).
Since J satisfies t , it follows that J satisfies τ .
    We wish to show that [x] is rigid in J . Let h be a homomorphism of J ;
we must show that h([x]) = [x]. Let B be the block of x in J . Let V be the
assignment to the variables of τ obtained by letting V (u) = h([u]) if u ∈ B, and
V (u) = [u] otherwise. We now show that V is a valid assignment for φ in J ,
that is, that for each conjunct S(u1 , . . . , us ) of φ, necessarily (V (u1 ), . . . , V (us ))
is a tuple of the S relation of J . Let S(u1 , . . . , us ) be a conjunct of φ. By the
construction of the chase, we know that ([u1 ], . . . , [us ]) is a tuple of the S relation
of J , since (u1 , . . . , us ) is a tuple of the S relation of J . There are two cases,
depending on whether or not some ui (with 1 ≤ i ≤ s) is in B. If no ui is in
B, then V (ui ) = [ui ] for each i, and so (V (u1 ), . . . , V (us )) is a tuple of the S
relation of J , as desired. If some ui is in B, then every ui is either a null in
B or a constant (this is because (u1 , . . . , us ) is a tuple of the S relation of J ).
If ui is a null in B, then V (ui ) = h([ui ]). If ui is a constant, then ui = [ui ],
and so V (ui ) = [ui ] = ui = h(ui ) = h([ui ]), where the third equality holds
since h is a homomorphism and ui is a constant. Thus, in both cases, we have
V (ui ) = h([ui ]). Since ([u1 ], . . . , [us ]) is a tuple of the S relation of J and h is a
homomorphism of J , we know that (h[u1 ], . . . , h[us ]) is a tuple of the S relation
of J . So again, (V (u1 ), . . . , V (us )) is a tuple of the S relation of J , as desired.
    Hence, V is a valid assignment for φ in J . Therefore, since J satisfies τ ,
it follows that in J , we have V (x) = V ( y). Now V (x) = h([x]), since x ∈ B.
Further, V ( y) = [ y], since y ∈ B (because y is in a different block than x). So
h([x]) = [ y]. Since x ∼ y, that is, [x] = [ y], we have h([x]) = [ y] = [x], which
shows that h([x]) = [x], as desired.
   The contrapositive of Lemma 5.12 says that if x and y are nulls in different
blocks of J that are set equal (perhaps transitively) during the chase, then [x]
is rigid in J′.
  LEMMA 5.13. Let h be an endomorphism of J . Then every rigid element of
J is a rigid element of h(J ).
   PROOF. Let u be a rigid element of J. Then h(u) is an element of h(J), and
so u is an element of h(J), since h(u) = u by rigidity. Let ĥ be a homomorphism
of h(J); we must show that ĥ(u) = u. But ĥ(u) = ĥh(u), since h(u) = u. Now
ĥh is also a homomorphism of J, since the composition of homomorphisms is
a homomorphism. By rigidity of u in J, it follows that ĥh(u) = u. So ĥ(u) =
ĥh(u) = u, as desired.

  We are now ready to give the proof of Lemma 5.10, after which we will present
the blocks algorithm for the case of target egds.

   PROOF OF LEMMA 5.10. Let h be an endomorphism of J′ such that J∗ = h(J′),
and let h′ be a useful endomorphism of h(J′). By Lemma 5.5, there is a null
y that appears in h(J′) but does not appear in h′h(J′). Let B be the block in
J that contains y. Define h′′ on h(J′) by letting h′′(x) = h′(x) if x ∈ B, and
h′′(x) = x otherwise. We shall show that h′′ is a useful J-local endomorphism
of h(J′).
   We now show that h′′ is an endomorphism of h(J′). Let (u1, . . . , us) be a tuple
of the R relation of h(J′); we must show that (h′′(u1), . . . , h′′(us)) is a tuple of
the R relation of h(J′).
   We first show that every nonrigid null among u1, . . . , us is in the same block
of J. Let up and uq be nonrigid nulls among u1, . . . , us; we show that up and
uq are in the same block of J. Since (u1, . . . , us) is a tuple of the R relation of
h(J′), and h(J′) is a subinstance of J′, we know that (u1, . . . , us) is a tuple of
the R relation of J′. By construction of J′ from J using the chase, we know
that there are ui′ where ui ∼ ui′ for 1 ≤ i ≤ s, such that (u1′, . . . , us′) is a tuple of
the R relation of J. Since up and uq are nonrigid nulls of h(J′), it follows from
Lemma 5.13 that up and uq are nonrigid nulls of J′. Now up′ is not a constant,
since up ∼ up′ and up is a nonrigid null. Similarly, uq′ is not a constant. So up′
and uq′ are in the same block B′ of J. Now [up′] = up, since up is in J′. Since
up ∼ up′ and [up′] = up is nonrigid, it follows from Lemma 5.12 that up and up′
are in the same block of J, and so up ∈ B′. Similarly, uq ∈ B′. So up and uq are
in the same block B′ of J, as desired.
   There are now two cases, depending on whether or not B′ = B. Assume
first that B′ = B. For those ui's that are nonrigid, we showed that ui ∈ B′ =
B, and so h′′(ui) = h′(ui). For those uj's that are rigid (including nulls and
constants), we have h′′(uj) = uj = h′(uj). So for every uj among u1, . . . , us, we
have h′′(uj) = h′(uj). Since h′ is a homomorphism of h(J′), and since (u1, . . . , us)
is a tuple of the R relation of h(J′), we know that (h′(u1), . . . , h′(us)) is a tuple
of the R relation of h(J′). Hence (h′′(u1), . . . , h′′(us)) is a tuple of the R relation
of h(J′), as desired. Now assume that B′ ≠ B. For those ui's that are nonrigid,
we showed that ui ∈ B′, and so ui ∉ B. Hence, for those ui's that are nonrigid,
we have h′′(ui) = ui. But also h′′(ui) = ui for the rigid ui's. Thus, (h′′(u1), . . . ,
h′′(us)) = (u1, . . . , us). Hence, once again, (h′′(u1), . . . , h′′(us)) is a tuple of the R
relation of h(J′), as desired.
   So h′′ is an endomorphism of h(J′). By definition, h′′ is J-local. We now show
that h′′ is useful. Since y appears in h(J′), Lemma 5.5 tells us that we need only
show that the range of h′′ does not contain y. If x ∈ B, then h′′(x) = h′(x) ≠ y,
since the range of h′ does not include y. If x ∉ B, then h′′(x) = x ≠ y, since
y ∈ B. So the range of h′′ does not contain y, and hence h′′ is useful. Therefore,
h′′ is a useful J-local endomorphism of h(J′).

   We now present the blocks algorithm for computing the core when Σt is a
set of egds. (As mentioned earlier, when the target constraints include egds, it
may be possible that there are no solutions and hence no universal solutions.
This case is detected by our algorithm, and “failure” is returned.)
   Algorithm 5.14 (Blocks Algorithm: Target egds).
Input: source instance I .
Output: the core of the universal solutions for I , if solutions exist, and “failure”, other-
wise.
(1) Compute J, the canonical preuniversal instance, from ⟨I, ∅⟩ by chasing with Σst.
(2) Compute the blocks of J, and then chase J with Σt to produce the canonical universal
    solution J′. If the chase fails, then stop with “failure.” Otherwise, initialize J∗ to
    be J′.
(3) Check whether there exists a useful J-local endomorphism h of J∗. If not, then stop
    with result J∗.
(4) Update J∗ to be h(J∗), and return to Step (3).
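
To make Step (2) and the J-local search in Step (3) concrete, the following Python sketch (ours, not part of the formal algorithm) computes the blocks of an instance, taking a block to be a connected component of the Gaifman graph on the nulls, that is, nulls are linked exactly when they occur together in some tuple. The encoding of instances as dictionaries of tuple sets and the convention that nulls are strings starting with "N" are illustrative assumptions.

from collections import defaultdict

def blocks(instance, is_null):
    # Return the blocks of `instance`: the connected components of the Gaifman
    # graph whose vertices are the nulls and whose edges join nulls that occur
    # together in some tuple.  `instance` maps relation names to sets of tuples.
    parent = {}                                   # union-find over the nulls
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry
    for tuples in instance.values():
        for t in tuples:
            nulls = [v for v in t if is_null(v)]
            for v in nulls:
                parent.setdefault(v, v)
            for v in nulls[1:]:
                union(nulls[0], v)
    comps = defaultdict(set)
    for v in parent:
        comps[find(v)].add(v)
    return list(comps.values())

# The preuniversal instance of Example 5.17 below, with nulls written as strings.
J = {"S": {("N5", 1, "N1", "N2", 1), ("N5", 2, "N3", "N4", 1), (3, 2, "N3", "N4", 1),
           (3, 1, 1, "N1'", 1), ("N5'", 1, 1, "N1'", 1), ("N5'", 2, "N2'", "N3'", "N4'")}}
print(blocks(J, lambda v: isinstance(v, str)))    # two blocks, one per tgd application

On this input the function returns two blocks, one per application of a source-to-target tgd, which illustrates why the blocks of J are bounded by the number of existentially quantified variables of a single tgd in Σst.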


  THEOREM 5.15. Assume that (S, T, Σst, Σt) is a data exchange setting such
that Σst is a set of tgds and Σt is a set of egds. Then Algorithm 5.14 is a correct,
polynomial-time algorithm for computing the core of the universal solutions.
   PROOF. The proof is essentially the same as that of Theorem 5.9, except
that we make use of Lemma 5.10 instead of Lemma 5.7. For the correctness of
the algorithm, we use the fact that each h(J∗) is both a homomorphic image
and a subinstance of the canonical universal solution J′; hence it satisfies both
the tgds in Σst and the egds in Σt. For the running time of the algorithm, we
also use the fact that chasing with egds (used in Step (2)) is a polynomial-time
procedure.
   We note that it is essential for the polynomial-time upper bound that the
endomorphisms explored by Algorithm 5.14 are J-local and not merely J′-local.
While, as argued earlier in the case Σt = ∅, the blocks of J are bounded in size
by the constant b (the maximal number of existentially quantified variables
over all tgds in Σst), the same is not true, in general, for the blocks of J′. The
chase with egds, used to obtain J′, may generate blocks of unbounded size.
Intuitively, if an egd equates the nulls x and y that are in different blocks
of J, then this creates a new, larger block out of the union of the blocks of x
and y.

5.4 Can We Obtain the Core Via the Chase?
A universal solution can be obtained via the chase [Fagin et al. 2003]. What
about the core? In this section, we show by example that the core may not be
obtainable via the chase. We begin with a preliminary example.
    Example 5.16. We again consider our running example from Example 2.2.
If we chase the source instance I of Example 2.2 by first chasing with the
dependencies (d 2 ) and (d 3 ), and then by the dependencies (d 1 ) and (d 4 ), neither
of which add any tuples, then the result is the core J0 , as given in Example 2.2.
If, however, we chase first with the dependency (d 1 ), then with the dependencies
(d 2 ) and (d 3 ), and finally with the dependency (d 4 ), which does not add any
tuples, then the result is the target instance J , as given in Example 2.2, rather
than the core J0 .
   In Example 5.16, the result of the chase may or may not be the core, depend-
ing on the order of the chase steps. We now give an example where there is no
chase (that is, no order of doing the chase steps) that produces the core.
   Example 5.17. Assume that the source schema consists of one 4-ary rela-
tion symbol R and the target schema consists of one 5-ary relation symbol S.
There are two source-to-target tgds d 1 and d 2 , where d 1 is
               R(a, b, c, d ) → ∃x1 ∃x2 ∃x3 ∃x4 ∃x5 (S(x5 , b, x1 , x2 , a)
                                                        ∧S(x5 , c, x3 , x4 , a)
                                                        ∧S(d , c, x3 , x4 , b))
and where d 2 is
              R(a, b, c, d ) → ∃x1 ∃x2 ∃x3 ∃x4 ∃x5 (S(d , a, a, x1 , b)
                                                      ∧S(x5 , a, a, x1 , a)
                                                      ∧S(x5 , c, x2 , x3 , x4 )).
The source instance I is {R(1, 1, 2, 3)}.
  The result of chasing I with d 1 only is
                                 {S(N5 , 1, N1 , N2 , 1),
                                 S(N5 , 2, N3 , N4 , 1),
                                  S(3, 2, N3 , N4 , 1)},                                       (1)
where N1 , N2 , N3 , N4 , N5 are nulls.
   The result of chasing I with d2 only is
                                  {S(3, 1, 1, N1′, 1),
                                  S(N5′, 1, 1, N1′, 1),
                               S(N5′, 2, N2′, N3′, N4′)},                                     (2)
where N1′, N2′, N3′, N4′, N5′ are nulls.
    Let J be the universal solution that is the union of (1) and (2). We now show
that the core of J is given by the following instance J0, which consists of the
third tuple of (1) and the first tuple of (2):
                                  {S(3, 2, N3, N4, 1),
                                   S(3, 1, 1, N1′, 1)}.
   First, it is straightforward to verify that J0 is the image of the universal so-
lution J under the following endomorphism h: h(N1) = 1; h(N2) = N1′; h(N3) =
N3; h(N4) = N4; h(N5) = 3; h(N1′) = N1′; h(N2′) = N3; h(N3′) = N4; h(N4′) = 1;
and h(N5′) = 3. Second, it is easy to see that there is no endomorphism of J0
into a proper substructure of J0. From these two facts, it follows immediately
that J0 is the core.
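
Both facts can be checked mechanically. The following Python sketch (ours; the flat tuple encoding of the S relation and the spelling of nulls as quoted names are assumptions made for illustration) applies the endomorphism h to J and confirms that its image is J0, and then verifies by exhaustive search that no endomorphism maps J0 into a proper substructure of itself.

from itertools import product

# J is the union of (1) and (2); primed nulls carry a trailing apostrophe.
J = {("N5", 1, "N1", "N2", 1), ("N5", 2, "N3", "N4", 1), (3, 2, "N3", "N4", 1),
     (3, 1, 1, "N1'", 1), ("N5'", 1, 1, "N1'", 1), ("N5'", 2, "N2'", "N3'", "N4'")}
J0 = {(3, 2, "N3", "N4", 1), (3, 1, 1, "N1'", 1)}

h = {"N1": 1, "N2": "N1'", "N3": "N3", "N4": "N4", "N5": 3,
     "N1'": "N1'", "N2'": "N3", "N3'": "N4", "N4'": 1, "N5'": 3}

def apply(hom, inst):
    # Constants map to themselves; nulls map according to `hom`.
    return {tuple(hom.get(v, v) for v in t) for t in inst}

assert apply(h, J) == J0                  # h maps J onto J0

# Exhaustive check: no endomorphism of J0 has a proper substructure as image.
nulls = sorted({v for t in J0 for v in t if isinstance(v, str)})
adom = sorted({v for t in J0 for v in t}, key=str)
for images in product(adom, repeat=len(nulls)):
    g = dict(zip(nulls, images))
    img = apply(g, J0)
    assert not (img <= J0 and img != J0)  # no homomorphism into a proper substructure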
   Since the result of chasing first with d 1 has three tuples, and since the core
has only two tuples, it follows that the result of chasing first with d 1 and then
d 2 does not give the core. Similarly, the result of chasing first with d 2 and
then d 1 does not give the core. Thus, no chase gives the core, which was to be
shown.
   This example has several other features built into it. First, it is not possible
to remove a conjunct from the right-hand side of d 1 and still maintain a depen-
dency equivalent to d 1 . A similar comment applies to d 2 . Therefore, the fact that
no chase gives the core is not caused by the right-hand side of a source-to-target
tgd having a redundant conjunct.
   Second, the Gaifman graph of the nulls as determined by (1) is connected. In-
tuitively, this tells us that the tgd d 1 cannot be “decomposed” into multiple tgds
with the same left-hand side. A similar comment applies to d 2 . Therefore, the
fact that no chase gives the core is not caused by the tgds being “decomposable.”
   Third, not only does the set (1) of tuples not appear in the core, but even the
core of (1), which consists of the first and third tuples of (1), does not appear in
the core. A similar comment applies to (2), whose core consists of the first and
third tuples of (2). So even if we were to modify the chase by inserting, at each
chase step, only the core of the set of tuples generated by applying a given tgd,
we still would not obtain the core as the result of a chase.
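
For readers who wish to reproduce the chase results (1) and (2), here is a small Python sketch (ours) of a single chase step with a source-to-target tgd whose left-hand side is a single atom; the encoding of a tgd as a pair of variable lists, and the convention that existential variables are the ones named x1, x2, . . ., are assumptions chosen to mirror d1 and d2.

from itertools import count

_fresh = count(1)

def chase_with_tgd(source_tuples, tgd):
    # Apply one source-to-target tgd to every source tuple, inventing fresh
    # labeled nulls for the existentially quantified variables of each application.
    lhs_vars, rhs_atoms = tgd
    target = set()
    for t in source_tuples:
        binding = dict(zip(lhs_vars, t))
        for atom in rhs_atoms:
            for v in atom:
                if v.startswith("x") and v not in binding:
                    binding[v] = "N%d" % next(_fresh)
        for atom in rhs_atoms:
            target.add(tuple(binding[v] for v in atom))
    return target

# d1: R(a,b,c,d) -> exists x1..x5 (S(x5,b,x1,x2,a) & S(x5,c,x3,x4,a) & S(d,c,x3,x4,b))
d1 = (["a", "b", "c", "d"],
      [["x5", "b", "x1", "x2", "a"], ["x5", "c", "x3", "x4", "a"], ["d", "c", "x3", "x4", "b"]])
I = {(1, 1, 2, 3)}
print(chase_with_tgd(I, d1))   # equals (1) up to a renaming of the nulls

Calling the same function with an analogous encoding of d2 yields (2), again up to a renaming of the nulls.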

6. QUERY ANSWERING WITH CORES
Up to this point, we have shown that there are two reasons for using cores
in data exchange: first, they are the smallest universal solutions, and second,
they are polynomial-time computable in many natural data exchange settings.
In this section, we provide further justification for using cores in data exchange
by establishing that they have clear advantages over other universal solutions
in answering target queries.
   Assume that (S, T, Σst, Σt) is a data exchange setting, I is a source instance,
and J0 is the core of the universal solutions for I . If q is a union of conjunctive
queries over the target schema T, then, by Proposition 2.7, for every universal
solution J for I , we have that certain(q, I ) = q(J )↓ . In particular, certain(q, I ) =
q(J0 )↓ , since J0 is a universal solution. Suppose now that q is a conjunctive
query with inequalities ≠ over the target schema. In general, if J is a universal
solution, then q(J )↓ may properly contain certain(q, I ). We illustrate this point
with the following example.
  Example 6.1. Let us revisit our running example from Example 2.2. We
saw earlier in Example 3.1 that, for every m ≥ 0, the target instance
                     Jm = {Home(Alice, SF), Home(Bob, SD),
                           EmpDept(Alice, X 0 ), EmpDept(Bob, Y 0 ),
                           DeptCity(X 0 , SJ), DeptCity(Y 0 , SD),
                                                     ...
                               EmpDept(Alice, X m ), EmpDept(Bob, Y m ),
                               DeptCity(X m , SJ), DeptCity(Y m , SD)}
is a universal solution for I ; moreover, J0 is the core of the universal solutions
for I . Consider now the following conjunctive query q with one inequality:
                 ∃D1∃D2 (EmpDept(e, D1) ∧ EmpDept(e, D2) ∧ (D1 ≠ D2)).
Clearly, q(J0 ) = ∅, while if m ≥ 1, then q(Jm ) = {Alice, Bob}. This implies
that certain(q, I ) = ∅, and thus evaluating the above query q on the universal
solution Jm , for arbitrary m ≥ 1, produces a strict superset of the set of the
certain answers. In contrast, evaluating q on the core J0 coincides with the set
of the certain answers, since q(J0 ) = ∅ = certain(q, I ).
   This example can also be used to illustrate another difference between con-
junctive queries and conjunctive queries with inequalities. Specifically, if J
and J′ are universal solutions for I, and q∗ is a conjunctive query over the
target schema, then q∗(J)↓ = q∗(J′)↓. In contrast, this does not hold for
the above conjunctive query q with one inequality. Indeed, q(J0 ) = ∅ while
q(Jm ) = {Alice, Bob}, for every m ≥ 1.
   In the preceding example, the certain answers of a particular conjunctive
query with inequalities could be obtained by evaluating the query on the core
of the universal solutions. As shown in the next example, however, this does
not hold true for arbitrary conjunctive queries with inequalities.
   Example 6.2. Referring to our running example, consider again the univer-
sal solutions Jm , for m ≥ 0, from Example 6.1. In particular, recall the instance
J0 , which is the core of the universal solutions for I , and which has two distinct
labeled nulls X 0 and Y 0 , denoting unknown departments. Besides their role
as placeholders for department values, the role of such nulls is also to “link”
employees to the cities they work in, as specified by the tgd (d2) in Σst. For
data exchange, it is important that such nulls be different from constants and
different from each other. Universal solutions such as J0 naturally satisfy this
requirement. In contrast, the target instance
                       J0′ = {Home(Alice, SF), Home(Bob, SD),
                              EmpDept(Alice, X0), EmpDept(Bob, X0),
                              DeptCity(X0, SJ), DeptCity(X0, SD)}
is a solution2 for I, but not a universal solution for I, because it uses the same
null for both source tuples (Alice, SJ) and (Bob, SD) and, hence, there is no
homomorphism from J0 to J0′. In this solution, the association between Alice
and SJ as well as the association between Bob and SD have been lost.
   Let q be the following conjunctive query with one inequality:
                   ∃D∃D′ (EmpDept(e, D) ∧ DeptCity(D′, c) ∧ (D ≠ D′)).
It is easy to see that q(J0) = {(Alice, SD), (Bob, SJ)}. In contrast, q(J0′) = ∅,
since in J0′ both Alice and Bob are linked with both SJ and SD. Consequently,
certain(q, I) = ∅, and thus certain(q, I) is properly contained in q(J0)↓.

2 This   is the same instance, modulo renaming of nulls, as the earlier instance J0 of Example 2.2.

  Let J be a universal solution for I . Since J0 is (up to a renaming of the nulls)
the core of J , it follows that
                                          q(J0 ) ⊆ q(J )↓ .
(We are using the fact that q(J0 ) = q(J0 )↓ here.) Since also we have the strict
inclusion certain(q, I ) ⊂ q(J0 ), we have that certain(q, I ) ⊂ q(J )↓ , for every
universal solution J . This also means that there is no universal solution J for
I such that certain(q, I ) = q(J )↓ .
   Finally, consider the target instance
                      J′ = {Home(Alice, SF), Home(Bob, SD),
                            EmpDept(Alice, X0), EmpDept(Bob, Y0),
                            DeptCity(X0, SJ), DeptCity(Y0, SD),
                            DeptCity(X′, SJ)},
where X′ is a new labeled null. It is easy to verify that J′ is a universal solution
and that q(J′) = {(Alice, SJ), (Alice, SD), (Bob, SJ)}. Thus, the following strict
inclusions hold: certain(q, I) ⊂ q(J0)↓ ⊂ q(J′)↓. This shows that a strict
inclusion hierarchy can exist among
the set of the certain answers, the result of the null-free query evaluation on the
core and the result of the null-free query evaluation on some other universal
solution.
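
The evaluations of q used in this example are easy to reproduce. The Python sketch below (ours; representing instances as dictionaries of tuple sets and the nulls as the strings "X0" and "Y0" are illustrative assumptions) computes q on J0 and on J0′; since neither answer contains a null, taking ↓ changes nothing here.

def q(instance):
    # All pairs (e, c) such that EmpDept(e, D), DeptCity(D', c) and D != D'.
    emp, dept = instance["EmpDept"], instance["DeptCity"]
    return {(e, c) for (e, d) in emp for (d2, c) in dept if d != d2}

J0 = {"Home": {("Alice", "SF"), ("Bob", "SD")},
      "EmpDept": {("Alice", "X0"), ("Bob", "Y0")},
      "DeptCity": {("X0", "SJ"), ("Y0", "SD")}}

J0_prime = {"Home": {("Alice", "SF"), ("Bob", "SD")},
            "EmpDept": {("Alice", "X0"), ("Bob", "X0")},
            "DeptCity": {("X0", "SJ"), ("X0", "SD")}}

print(q(J0))        # {('Alice', 'SD'), ('Bob', 'SJ')}
print(q(J0_prime))  # set(): the shared null has lost both associations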
  We will argue in the next section that instead of computing certain(q, I ) a
better answer to the query may be given by taking q(J0 )↓ itself!

6.1 Certain Answers on Universal Solutions
Although the certain answers of conjunctive queries with inequalities cannot
always be obtained by evaluating these queries on the core of the universal so-
lutions, it turns out that this evaluation produces a “best approximation” to the
certain answers among all evaluations on universal solutions. Moreover, as we
shall show, this property characterizes the core, and also extends to existential
queries.
    We now define existential queries, including a safety condition. An existential
query q(x) is a formula of the form ∃yφ(x, y), where φ(x, y) is a quantifier-free
formula in disjunctive normal form. Let φ be ∨i ∧ j γij , where each γij is an atomic
formula, the negation of an atomic formula, an equality, or the negation of an
equality. As a safety condition, we assume that for each conjunction ∧ j γij and
each variable z (in x or y) that appears in this conjunction, one of the conjuncts
γij is an atomic formula that contains z. The safety condition guarantees that
φ is domain independent [Fagin 1982] (so that its truth does not depend on any
underlying domain, but only on the “active domain” of elements that appear in
tuples in the instance).
    We now introduce the following concept, which we shall argue is
fundamental.
   Definition 6.3. Let (S, T, Σst, Σt) be a data exchange setting and let I be
a source instance. For every query q over the target schema T, the set of the
certain answers of q on universal solutions with respect to the source instance I ,
denoted by u-certain(q, I ), is the set of all tuples that appear in q(J ) for every
universal solution J for I ; in symbols,
             u-certain(q, I) = ⋂ {q(J) : J is a universal solution for I}.
   Clearly, certain(q, I ) ⊆ u-certain(q, I ). Moreover, if q is a union of conjunc-
tive queries, then Proposition 2.7 implies that certain(q, I ) = u-certain(q, I ).
In contrast, if q is a conjunctive query with inequalities, it is possible that
certain(q, I ) is properly contained in u-certain(q, I ). Concretely, this holds true
for the query q and the source instance I in Example 6.2, since certain(q, I ) = ∅,
while u-certain(q, I ) = {(Alice, SD), (Bob, SJ)}. In such cases, there is no uni-
versal solution J for I such that certain(q, I ) = q(J )↓ . Nonetheless, the next
result asserts that if J0 is the core of the universal solutions for I , then
u-certain(q, I ) = q(J0 )↓ . Therefore, q(J0 )↓ is the best approximation (that is,
the least superset) of the certain answers for I among all choices of q(J )↓ where
J is a universal solution for I .
   Before we prove the next result, we need to recall some definitions from
Fagin et al. [2003]. Let q be a Boolean (that is, 0-ary) query over the target
schema T and I a source instance. If we let true denote the set with one 0-ary
tuple and false denote the empty set, then each of the statements q(J ) = true
and q(J ) = false has its usual meaning for Boolean queries q. It follows from
the definitions that certain(q, I ) = true means that for every solution J of this
instance of the data exchange problem, we have that q(J ) = true; moreover,
certain(q, I ) = false means that there is a solution J such that q(J ) = false.
   PROPOSITION 6.4. Let (S, T, Σst, Σt) be a data exchange setting in which Σst is
a set of tgds and Σt is a set of tgds and egds. Let I be a source instance such that
a universal solution for I exists, and let J0 be the core of the universal solutions
for I .
(1) If q is an existential query over the target schema T, then
                                 u-certain(q, I ) = q(J0 )↓ .
(2) If J∗ is a universal solution for I such that for every existential query q over
    the target schema T we have that
                                 u-certain(q, I) = q(J∗)↓,
    then J∗ is isomorphic to the core J0 of the universal solutions for I. In fact,
    it is enough for the above property to hold for every conjunctive query q with
    inequalities ≠.
   PROOF. Let J be a universal solution, and let J0 be the core of J . By
Proposition 3.3, we know that J0 is an induced substructure of J . Let q be an ex-
istential query over the target schema T. Since q is an existential query and J0
is an induced substructure of J , it is straightforward to verify that q(J0 ) ⊆ q(J )
(this is a well-known preservation property of existential first-order formulas).
Since J0 is the core of every universal solution for I up to a renaming of the
nulls, it follows that q(J0)↓ ⊆ ⋂{q(J) : J universal for I}. We now show the re-
verse inclusion. Define J0′ by renaming each null of J0 in such a way that J0 and
J0′ have no nulls in common. Then ⋂{q(J) : J universal for I} ⊆ q(J0) ∩ q(J0′).
But it is easy to see that q(J0) ∩ q(J0′) = q(J0)↓. This proves the reverse inclusion
and so
               u-certain(q, I) = ⋂ {q(J) : J universal for I} = q(J0)↓.
   For the second part, assume that J ∗ is a universal solution for I such that
for every conjunctive query q with inequalities ≠ over the target schema,

                  q(J∗)↓ = ⋂ {q(J) : J is a universal solution for I}.            (3)

Let q ∗ be the canonical conjunctive query with inequalities associated with J ∗ ,
that is, q ∗ is a Boolean conjunctive query with inequalities that asserts that
there exist at least n∗ distinct elements, where n∗ is the number of elements of
J ∗ , and describes which tuples from J ∗ occur in which relations in the target
schema T. It is clear that q ∗ (J ∗ ) = true. Since q ∗ is a Boolean query, we have
q(J ∗ )↓ = q(J ∗ ). So from (3), where q ∗ plays the role of q, we have

                  q∗(J∗) = ⋂ {q∗(J) : J is a universal solution for I}.         (4)

Since q ∗ (J ∗ ) = true, it follows from (4) that q ∗ (J0 ) = true. In turn, q ∗ (J0 ) = true
implies that there is a one-to-one homomorphism h∗ from J ∗ to J0 . At the same
time, there is a one-to-one homomorphism from J0 to J ∗ , by Corollary 3.5.
Consequently, J ∗ is isomorphic to J0 .

   Let us take a closer look at the concept of the certain answers of a query q on
universal solutions. In Fagin et al. [2003], we made a case that the universal
solutions are the preferred solutions to the data exchange problem, since in a
precise sense they are the most general possible solutions and, thus, they rep-
resent the space of all solutions. This suggests that, in the context of data
exchange, the notion of the certain answers on universal solutions may be
more fundamental and more meaningful than that of the certain answers. In
other words, we propose here that u-certain(q, I ) should be used as the se-
mantics of query answering in data exchange settings, instead of certain(q, I ),
because we believe that this notion should be viewed as the “right” semantics
for query answering in data exchange. As pointed out earlier, certain(q, I ) and
u-certain(q, I ) coincide when q is a union of conjunctive queries, but they may
very well be different when q is a conjunctive query with inequalities. The pre-
ceding Example 6.2 illustrates this difference between the two semantics, since
certain(q, I ) = ∅ and u-certain(q, I ) = {(Alice, SD), (Bob, SJ)}, where q is the
query
                 ∃D∃D′ (EmpDept(e, D) ∧ DeptCity(D′, c) ∧ (D ≠ D′)).
We argue that a user should not expect the empty set ∅ as the answer to
the query q, after the data exchange between the source and the target (un-
less, of course, further constraints are added to specify that the nulls must be
equal). Thus, u-certain(q, I ) = {(Alice, SD), (Bob, SJ)} is a more intuitive an-
swer to q than certain(q, I ) = ∅. Furthermore, this answer can be computed as
q(J0 )↓ .


   We now show that for conjunctive queries with inequalities, it may be
easier to compute the certain answers on universal solutions than to com-
pute the certain answers. Abiteboul and Duschka [1998] proved the following
result.
   THEOREM 6.5 [ABITEBOUL AND DUSCHKA 1998]. There is a LAV setting and a
Boolean conjunctive query q with inequalities ≠ such that computing the set
certain(q, I ) of the certain answers of q is a coNP-complete problem.
  By contrast, we prove the following result, which covers not only LAV settings
but even broader settings.

   THEOREM 6.6. Let (S, T, Σst, Σt) be a data exchange setting in which Σst is a
set of tgds and Σt is a set of egds. For every existential query q over the target
schema T, there is a polynomial-time algorithm for computing, given a source
instance I , the set u-certain(q, I ) of the certain answers of q on the universal
solutions for I .

   PROOF. Let q be an existential query, and let J0 be the core of the universal
solutions. We see from Proposition 6.4 that u-certain(q, I ) = q(J0 )↓ . By Theo-
rem 5.2 or Theorem 5.15, there is a polynomial-time algorithm for computing
J0 , and hence for computing q(J0 )↓ .

   Theorems 6.5 and 6.6 show a computational advantage for certain answers
on universal solutions over simply certain answers. Note that the core is used
in the proof of Theorem 6.6 but does not appear in the statement of the theorem
and does not enter into the definitions of the concepts used in the theorem. It
is not at all clear how one would prove this theorem directly, without making
use of our results about the core.
   We close this section by pointing out that Proposition 6.4 is very dependent
on the assumption that q is an existential query. A universal query is taken to
be the negation of an existential query. It is a query of the form ∀xφ(x), where
φ(x) is a quantifier-free formula, with a safety condition that is inherited from
existential queries. Note that each egd and full tgd is a universal query (and in
particular, satisfies the safety condition). For example, the egd ∀x(A1 ∧ A2 →
(x1 = x2 )) satisfies the safety condition, since its negation is ∃x(A1 ∧ A2 ∧
(x1 ≠ x2)), which satisfies the safety condition for existential queries since
every variable in x appears in one of the atomic formulas A1 or A2 .
   We now give a data exchange setting and a universal query q such that
u-certain(q, I ) cannot be obtained by evaluating q on the core of the universal
solutions for I .

   Example 6.6. Referring to our running example, consider again the univer-
sal solutions Jm , for m ≥ 0, from Example 6.1. Among those universal solutions,
the instance J0 is the core of the universal solutions for I .
   Let q be the following Boolean universal query (a functional dependency):

            ∀e∀d 1 ∀d 2 (EmpDept(e, d 1 ) ∧ EmpDept(e, d 2 ) → (d 1 = d 2 )).


It is easy to see that q(J0 ) = true and q(Jm ) = false, for all m ≥ 1. Consequently,
                      certain(q, I) = false = u-certain(q, I) ≠ q(J0).
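
As a quick sanity check, the functional dependency can be tested directly on the EmpDept relations of J0 and J1. The sketch below (ours, with the same illustrative encoding of nulls as strings) confirms that q holds in the core but fails in every Jm with m ≥ 1, so intersecting over all universal solutions indeed yields false.

def fd_holds(emp_dept):
    # True iff EmpDept satisfies the functional dependency employee -> department.
    seen = {}
    for (e, d) in emp_dept:
        if e in seen and seen[e] != d:
            return False
        seen[e] = d
    return True

EmpDept_J0 = {("Alice", "X0"), ("Bob", "Y0")}
EmpDept_J1 = EmpDept_J0 | {("Alice", "X1"), ("Bob", "Y1")}   # J1 adds a copy with fresh nulls

print(fd_holds(EmpDept_J0))  # True:  q(J0) = true
print(fd_holds(EmpDept_J1))  # False: q(J1) = false, hence u-certain(q, I) = false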

7. CONCLUDING REMARKS
In a previous article [Fagin et al. 2003], we argued that universal solutions are
the best solutions in a data exchange setting, in that they are the “most general
possible” solutions. Unfortunately, there may be many universal solutions. In
this article, we identified a particular universal solution, namely, the core of an
arbitrary universal solution, and argued that it is the best universal solution
(and hence the best of the best). The core is unique up to isomorphism, and is the
universal solution of the smallest size, that is, with the fewest tuples. The core
gives the best answer, among all universal solutions, for existential queries. By
“best answer,” we mean that the core provides the best approximation (among
all universal solutions) to the set of the certain answers. In fact, we proposed
an alternative semantics where the set of “certain answers” are redefined to be
those that occur in every universal solution. Under this alternative semantics,
the core gives the exact answer for existential queries.
   We considered the question of the complexity of computing the core. To this
effect, we showed that the complexity of deciding if a graph H is the core of
a graph G is DP-complete. Thus, unless P = NP, there is no polynomial-time
algorithm for producing the core of a given arbitrary structure. On the other
hand, in our case of interest, namely, data exchange, we gave natural conditions
where there are polynomial-time algorithms for computing the core of universal
solutions. Specifically, we showed that the core of the universal solutions is
polynomial-time computable in data exchange settings in which Σst is a set of
source-to-target tgds and Σt is a set of egds.
   These results raise a number of questions. First, there are questions about
the complexity of constructing the core. Even in the case where we prove that
there is a polynomial-time algorithm for computing the core, the exponent may
be somewhat large. Is there a more efficient algorithm for computing the core
in this case and, if so, what is the most efficient such algorithm? There is also
the question of extending the polynomial-time result to broader classes of tar-
get dependencies. To this effect, Gottlob [2005] recently showed that computing
the core may be NP-hard in the case in which Σt consists of a single full tgd,
provided a NULL “built-in” target predicate is available to tell labeled nulls
from constants in target instances; note that, since NULL is a “built-in” predi-
cate, it need not be preserved under homomorphisms. Since our formalization
of data exchange does not allow for such a NULL predicate, it remains an open
problem to determine the complexity of computing the core in data exchange
settings in which the target constraints are egds and tgds.
   On a slightly different note, and given the similarities between the two prob-
lems, it would be interesting to see if our techniques for minimizing univer-
sal solutions can be applied to the problem of minimizing the chase-generated
universal plans that arise in the comprehensive query optimization method
introduced in [Deutsch et al. 1999].


   Finally, the work reported here addresses data exchange only between rela-
tional schemas. In the future we hope to investigate to what extent the results
presented in this article and in Fagin et al. [2003] can be extended to the more
general case of XML/nested data exchange.

ACKNOWLEDGMENTS

Many thanks to Marcelo Arenas, Georg Gottlob, Renée J. Miller, Arnon
Rosenthal, Wang-Chiew Tan, Val Tannen, and Moshe Y. Vardi for helpful sug-
gestions, comments, and pointers to the literature.

REFERENCES

ABITEBOUL, S. AND DUSCHKA, O. M. 1998. Complexity of answering queries using materialized
  views. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS). 254–
  263.
ABITEBOUL, S., HULL, R., AND VIANU, V. 1995. Foundations of Databases. Addison-Wesley, Reading,
  MA.
BEERI, C. AND VARDI, M. Y. 1984. A proof procedure for data dependencies. Journal Assoc. Comput.
  Mach. 31, 4, 718–741.
CHANDRA, A. K. AND MERLIN, P. M. 1977. Optimal implementation of conjunctive queries in re-
  lational data bases. In Proceedings of the ACM Symposium on Theory of Computing (STOC).
  77–90.
COSMADAKIS, S. 1983. The complexity of evaluating relational queries. Inform. Contr. 58, 101–112.
COSMADAKIS, S. S. AND KANELLAKIS, P. C. 1986. Functional and inclusion dependencies: A graph
  theoretic approach. In Advances in Computing Research, vol. 3. JAI Press, Greenwich, CT, 163–
  184.
DEUTSCH, A., POPA, L., AND TANNEN, V. 1999. Physical data independence, constraints and opti-
  mization with universal plans. In Proceedings of the International Conference on Very Large Data
  Bases (VLDB). 459–470.
DEUTSCH, A. AND TANNEN, V. 2003. Reformulation of XML queries and constraints. In Proceedings
  of the International Conference on Database Theory (ICDT). 225–241.
FAGIN, R. 1982. Horn clauses and database dependencies. Journal Assoc. Comput. Mach. 29, 4
  (Oct.), 952–985.
FAGIN, R., KOLAITIS, P. G., MILLER, R. J., AND POPA, L. 2003. Data exchange: Semantics and query
  answering. In Proceedings of the International Conference on Database Theory (ICDT). 207–
  224.
FRIEDMAN, M., LEVY, A. Y., AND MILLSTEIN, T. D. 1999. Navigational plans for data integration. In
  Proceedings of the National Conference on Artificial Intelligence (AAAI). 67–73.
GOTTLOB, G. 2005. Cores for data exchange: Hard cases and practical solutions. In Proceedings
  of the ACM Symposium on Principles of Database Systems (PODS).
GOTTLOB, G. AND FERMÜLLER, C. 1993. Removing redundancy from a clause. Art. Intell. 61, 2,
  263–289.
HALEVY, A. 2001. Answering queries using views: A survey. VLDB J. 10, 4, 270–294.
HELL, P. AND NEŠETŘIL, J. 1992. The core of a graph. Discr. Math. 109, 117–126.
KANELLAKIS, P. C. 1990. Elements of relational database theory. In Handbook of Theoretical Com-
  puter Science, Volume B: Formal Models and Semantics. Elsevier, Amsterdam, The Netherlands,
  and MIT Press, Cambridge, MA, 1073–1156.
LENZERINI, M. 2002. Data integration: A theoretical perspective. In Proceedings of the ACM Sym-
  posium on Principles of Database Systems (PODS). 233–246.
MAIER, D., MENDELZON, A. O., AND SAGIV, Y. 1979. Testing implications of data dependencies. ACM
  Trans. Database Syst. 4, 4 (Dec.), 455–469.
MILLER, R. J., HAAS, L. M., AND HERNÁNDEZ, M. 2000. Schema mapping as query discovery. In
  Proceedings of the International Conference on Very Large Data Bases (VLDB). 77–88.
PAPADIMITRIOU, C. AND YANNAKAKIS, M. 1982. The complexity of facets and some facets of complexity.
  In Proceedings of the ACM Symposium on Theory of Computing (STOC). 229–234.
PAPADIMITRIOU, C. H. 1994. Computational Complexity. Addison-Wesley, Reading, MA.
POPA, L., VELEGRAKIS, Y., MILLER, R. J., HERNÁNDEZ, M. A., AND FAGIN, R. 2002. Translating Web
  data. In Proceedings of the International Conference on Very Large Data Bases (VLDB). 598–609.
SHU, N. C., HOUSEL, B. C., TAYLOR, R. W., GHOSH, S. P., AND LUM, V. Y. 1977. EXPRESS: A data
  EXtraction, Processing, and REStructuring System. ACM Trans. Database Syst. 2, 2, 134–
  174.
VAN DER MEYDEN, R. 1998. Logical approaches to incomplete information: A survey. In Logics for
  Databases and Information Systems. Kluwer, Dordrecht, The Netherlands, 307–356.

Received October 2003; revised May 2004; accepted July 2004




Concise Descriptions of Subsets of
Structured Sets
KEN Q. PU and ALBERTO O. MENDELZON
University of Toronto


We study the problem of economical representation of subsets of structured sets, which are sets
equipped with a set cover or a family of preorders. Given a structured set U , and a language
L whose expressions define subsets of U , the problem of minimum description length in L (L-
MDL) is: “given a subset V of U , find a shortest string in L that defines V .” Depending on the
structure and the language, the MDL-problem is in general intractable. We study the complexity
of the MDL-problem for various structures and show that certain specializations are tractable.
The families of focus are hierarchy, linear order, and their multidimensional extensions; these are
found in the context of statistical and OLAP databases. In the case of general OLAP databases,
data organization is a mixture of multidimensionality, hierarchy, and ordering, which can also be
viewed naturally as a cover-structured ordered set. Efficient algorithms are provided for the MDL-
problem for hierarchical and linearly ordered structures, and we prove that the multidimensional
extensions are NP-complete. Finally, we illustrate the application of the theory to summarization
of large result sets and (multi) query optimization for ROLAP queries.
Categories and Subject Descriptors: H.2.1 [Database Management]: Logical Design—Data mod-
els; normal forms; H.2.3 [Database Management]: Languages
General Terms: Algorithms, Theory
Additional Key Words and Phrases: Minimal description length, OLAP, query optimization,
summarization



1. INTRODUCTION
Consider an OLAP or multidimensional database setting [Kimball 1996], where
a user has requested to view a certain set of cells of the datacube, say in
the form of a 100 × 20 matrix. Typically, the user interacts with a front-end
query tool that ships SQL queries to a back-end database management system
(DBMS). After perusing the output, the user clicks on some of the rows of the
matrix, say 20 of them, and requests further details on these rows. Suppose
each row represents data on a certain city. A typical query tool will translate
the user request to a long SQL query with a WHERE clause of the form city
= city1 OR city = city2 ... OR city = city20. However, if the set of cities
happens to include every city in Ontario except Toronto, an equivalent but much

This work was supported by the Natural Sciences and Engineering Research Council of Canada.
Authors’ address: Department of Computer Science, University of Toronto, 6 King’s College Road,
Toronto, Ont., Canada M5S 3H5; email: {kenpu,mendel}@cs.toronto.edu.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is
granted without fee provided that the copies are not made or distributed for profit or commercial
advantage, the copyright notice, the title of the publication, and its date appear, and notice is given
that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to
redistribute to lists requires prior specific permission and/or a fee.
C 2005 ACM 0362-5915/05/0300-0211 $5.00


                        ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 211–248.

shorter formulation would be province = ‘Ontario’ AND city <> ‘Toronto’.
Minimizing the length of the query that goes to the back end is advantageous
for two reasons. First, many systems1 have difficulty dealing with long queries,
or even hard limits on query length. Second, the shorter query can often be
processed much faster than the longer one (even though an extra join may be
required, e.g., if there is no Province attribute stored in the cube).
   With this problem as motivation, we study the concise representations of
subsets of a structured set. By “structured” we simply mean that we are given
a (finite) set, called the universe, and a (finite) set of symbols, called the alpha-
bet, each of which represents some subset of the universe. We are also given
a language L of expressions on the alphabet, and a semantics that maps ex-
pressions to subsets of the universe. Given a subset V of the universe, we
want to find a shortest expression in the given language that describes V .
We call this the L-MDL (minimum description length) problem. In the ex-
ample above, the universe is the set of city names, the alphabet includes at
least the city name Toronto plus a set of province names, and the seman-
tics provides a mapping from province names to sets of cities. This is the
simplest case, where the symbols in the alphabet induce a partition of the
universe.
   The most general language we consider, called L, is the language of arbitrary
Boolean set expressions on symbols from the alphabet. In Section 2.1 we show
that the L-MDL problem is solvable in polynomial time when the alphabet
forms a partition of the universe. In particular, when the partition is granular,
that is, every element of the universe is represented as one of the symbols in the
alphabet, we obtain a normal form for minimum-length expressions, leading to
a polynomial time algorithm.
   Of course, in addition to cities grouped into provinces, we could have
provinces grouped into regions, regions into countries, etc. That is, the sub-
sets of the universe may form a hierarchy. We consider this case in Section 2.2
and show that the normal forms of the previous section can be generalized,
leading again to a polynomial time L-MDL problem.
   In the full OLAP context, elements of the universe can be grouped according
to multiple independent criteria. If we think of a row in our initial example
as a tuple <city, product, date, sales>, and the universe is the set of such
tuples, then these tuples can be grouped by city into provinces, or by product
into brands, or by date into years, etc. In Section 2.3 we consider the multidi-
mensional case. In particular, we focus on the common situation in which each
of the groupings is a hierarchy. We consider three increasingly powerful sublan-
guages of L, including L itself, and show that the MDL-problem is NP-complete
for each of them.
   In many cases, the universe is naturally ordered, such as the TIME di-
mension. In Section 3, we define order-structures to capture such ordering.
A language L(≤) is defined to express subsets of the ordered universe. The

1 Many commercial relational OLAP engines naively translate user selections into simple SELECT
SQL queries. It is known that sufficiently large user selections must be executed as several SQL
queries.


                                      Fig. 1. A structured set.

MDL-problem is in general NP-complete, but in the case of one linear ordering,
it can be solved in polynomial time.
   Section 4 focuses on two areas of application of the theory: summarization of
query answers and optimization of SELECT queries in OLAP. We consider the
scenario of querying a relational OLAP database using simple SELECT queries,
and show that it is advantageous to rewrite the queries into the corresponding
compact expressions.
   In Section 5.1, we describe some related MDL problems and how they relate to
the various languages presented in this article. We also present some existing
OLAP query optimization techniques and how they are related to our approach.
   Finally, we summarize our findings and outline possibilities for future
research in Section 6.

2. COVER STRUCTURES, LANGUAGES, AND THE MDL PROBLEM
In this section we introduce our model of structured sets and descriptive lan-
guages for subsets of them, and state the minimum description length problem.
  Definition 1 (Cover Structured Set). A structured set is a pair of finite sets
(U, Σ) together with an interpretation function [·] : Σ → Pwr(U) : σ → [σ]
which is injective, and is such that ⋃σ∈Σ [σ] = U. The set U is referred to as
the universe, and Σ as the alphabet.
   Intuitively the cover2 structure of the set U is modeled by the grouping of
its elements; each group is labeled by a symbol in the alphabet Σ. The inter-
pretation of a symbol σ is the set of elements in U belonging to the group labeled
by σ.
   Example 1. Consider a cover structured set depicted in Figure 1. The uni-
verse U = {1, 2, 3, 4, 5}. The alphabet Σ = {A, B, C}. The interpretation func-
tion is [A] = {1, 2}, [B] = {2, 3, 5}, and [C] = {4, 5}.
  Elements of the alphabet can be combined in expressions that describe other
subsets of the universe. The most general language we will consider for these
expressions is the propositional language that consists of all expressions com-
posed of symbols from the alphabet and operators that stand for the usual set
operations of union, intersection and difference.
  Definition 2 (Propositional Language). Given a structured set (U, Σ), its
propositional language L(U, Σ) is defined as follows: ε ∈ L(U, Σ); σ ∈ L(U, Σ) for all
σ ∈ Σ; and if α, β ∈ L(U, Σ), then (α + β), (α − β) and (α · β) are all in L(U, Σ).
Here ε denotes the empty expression.

2 The term cover refers to the fact that the universe U is covered by the interpretation of the alphabet
Σ. Later, in Section 3, we introduce the order-structure in which the universe is ordered.

   Definition 3 (Semantics and Length). The evaluation of L(U, Σ) is a func-
tion [·]* : L(U, Σ) → Pwr(U), defined as

— [ε]* = ∅,
— [σ]* = [σ] for any σ ∈ Σ, and
— [α + β]* = [α]* ∪ [β]*, [α − β]* = [α]* − [β]*, and [α · β]* = [α]* ∩ [β]*.

   The string length of L(U, Σ) is a function ‖·‖ : L(U, Σ) → N, given by

— ‖ε‖ = 0,
— ‖σ‖ = 1 for any σ ∈ Σ, and
— ‖α + β‖ = ‖α − β‖ = ‖α · β‖ = ‖α‖ + ‖β‖.
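
A direct transcription of Definition 3 may help fix ideas. The Python sketch below (ours) represents expressions as nested tuples over the structured set of Example 1 and computes the evaluation [·]* and the length ‖·‖; the AST encoding and the use of None for the empty expression ε are assumptions made for illustration.

# Structured set of Example 1 / Figure 1.
INTERP = {"A": {1, 2}, "B": {2, 3, 5}, "C": {4, 5}}

# An expression is a symbol (a string), None for the empty expression, or a
# tuple (op, left, right) with op in {"+", "-", "."}.

def evaluate(expr):
    # The evaluation [.]* of Definition 3.
    if expr is None:
        return set()
    if isinstance(expr, str):
        return set(INTERP[expr])
    op, a, b = expr
    if op == "+":
        return evaluate(a) | evaluate(b)
    if op == "-":
        return evaluate(a) - evaluate(b)
    return evaluate(a) & evaluate(b)              # op == "."

def length(expr):
    # The string length ||.|| of Definition 3: the number of symbol occurrences.
    if expr is None:
        return 0
    if isinstance(expr, str):
        return 1
    _, a, b = expr
    return length(a) + length(b)

s1 = ("-", ("-", "A", "B"), "C")                  # (A - B) - C
s2 = ("-", "A", "B")                              # A - B
print(evaluate(s1), length(s1))                   # {1} 3
print(evaluate(s2), length(s2))                   # {1} 2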

   Remark. We abuse the definitions in a number of harmless ways. For in-
stance, we may refer to U as a structured set, implying that it is equipped with
an alphabet Σ and an interpretation function [·]. The language L(U, Σ) is some-
times written simply as L when the structured set (U, Σ) is understood from
the context. The evaluation function [·]* supersedes the single-symbol inter-
pretation function [·], so the latter is omitted from discussions and the simpler
form [·] is used in place of [·]*.
   Two expressions s and t in L are equivalent if they evaluate to the same set:
that is, [s] = [t]. (Note that this means equivalence with respect to a particular
structured set (U, Σ) and thus does not coincide with propositional equivalence.)
In case they are equivalent, we say that s is reducible to t if ‖s‖ ≥ ‖t‖. The
expression s is strictly reducible to t if they are equivalent and ‖s‖ > ‖t‖. An
expression is compact if it is not strictly reducible to any other expression in
the language.
   Given a sublanguage K ⊆ L, an expression is K-compact if it belongs to K
and is not strictly reducible to any other expression in K.
   A language K ⊆ L(U, Σ) is granular if it can express every subset, or equiv-
alently, every singleton, that is,

                                    (∀a ∈ U)(∃s ∈ K) [s] = {a}.

We say that a structure is granular if the propositional language L(U, Σ) is
granular.
   If L(U, Σ) is not granular, then certain subsets (specifically singletons) of U
cannot be expressed by any expression. The solution is then to augment the
alphabet Σ to include sufficiently more symbols until it becomes granular.

   Definition 4 (K-Descriptive Length). Given a structured set (U, Σ), con-
sider a sublanguage K ⊆ L(U, Σ), and a subset V ⊆ U. The language K(V)
is all expressions s ∈ K such that [s] = V, and the K-descriptive length of V,
written ‖V‖_K, is defined as
                     ‖V‖_K = min{‖α‖ : α ∈ K(V)} if K(V) ≠ ∅, and
                     ‖V‖_K = ∞ otherwise.
In case K = L(U, Σ), we write ‖V‖_K simply as ‖V‖.

                                   Fig. 2. A partition.

  The K-descriptive length of a subset V is just the minimal length needed to
express it in the language K.
   Example 2. Continuing with the example of the structure shown in
Figure 1, the language L(U, Σ) includes expressions like s1 = (A − B) − C,
s2 = A − B, and s3 = (B − A) − C, with [s1] = [A − B] − [C] = ([A] − [B]) − [C] =
{1} = [s2] and [s3] = [B − A] − [C] = ([B] − [A]) − [C] = {3}. The first two strings s1
and s2 are equivalent, but s2 is shorter in length; therefore s1 is strictly reducible
to s2. It's not difficult to check that s2 is L(U, Σ)-compact, so ‖{1}‖ = 2.
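
The value ‖{1}‖ = 2 can be confirmed by exhaustive search. The sketch below (ours) works at the semantic level: starting from the interpretations of the single symbols, it keeps combining already-reachable subsets with ∪, −, and ∩, adding their lengths, until no subset's best known length improves. This is exponential in |U| in the worst case and is meant only to illustrate the L-MDL problem of Definition 5 below, not the algorithms developed later in the article.

from itertools import product

def mdl(interp, target):
    # Minimum description length of `target` in L(U, Sigma), by exhaustive
    # search over the subsets reachable from the alphabet with +, -, and ".".
    best = {frozenset(s): 1 for s in interp.values()}   # single symbols have length 1
    best[frozenset()] = 0                                # the empty expression
    changed = True
    while changed:
        changed = False
        pairs = list(best.items())
        for (s, ls), (t, lt) in product(pairs, pairs):
            for r in (s | t, s - t, s & t):
                if ls + lt < best.get(r, float("inf")):
                    best[r] = ls + lt
                    changed = True
    return best.get(frozenset(target), float("inf"))

INTERP = {"A": {1, 2}, "B": {2, 3, 5}, "C": {4, 5}}
print(mdl(INTERP, {1}))   # 2, witnessed by A - B
print(mdl(INTERP, {3}))   # 3, e.g. (B - A) - C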
  Our first algorithmic problem is: what is the complexity of determining the
minimum length of a subset in the language K. We pose it as a decision problem.
  Definition 5 (The K-MDL Decision Problem).
— INSTANCE: A structured set (U, Σ), a subset V ⊆ U, and a positive integer
  k > 0.
— QUESTION: ‖V‖_K ≤ k?
  PROPOSITION 1.        The L-MDL decision problem is NP-complete.
   The proof of Proposition 1 requires the simple observation that for any struc-
tured set (U, Σ), there is a naturally induced set cover, written U/Σ, on U given
by U/Σ = {[σ] : σ ∈ Σ}. The general minimum set-cover problem [Garey and
Johnson 1979] easily reduces to the general L-MDL problem.
   The next few sections will focus on some specific structures that are relevant
to realistic databases.

2.1 Partition is in P
In this section we focus our attention on the simple case where the symbols in
Σ form a partition of U.
   Definition 6 (Partition). A structured set (U, Σ) is a partition if the induced
set cover U/Σ partitions U.
  Example 3. Consider these streets: Grand, Canal, Broadway in the city
NewYork, VanNess, Market, Mission in SanFrancisco, and Victoria, DeRivoli in
Paris. The street names form the universe, which is partitioned by the alphabet
consisting of the three city names, as shown in Figure 2.

   PROPOSITION 2. The L-MDL decision problem for a partition (U, Σ) can be
solved in O(|U| · log |U|).
   The L-MDL decision problem for partitions is particularly easy because,
given a subset V, ‖V‖_L is simply the number of cells that cover V exactly.
Given the partition and V, computing the number of cells that cover V exactly
can be done in O(|U| log |V|), and can in fact be further optimized to O(|V|) if
special data structures are used.
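
A sketch of this computation for the partition of Figure 2 is given below (ours; the dictionary encoding of cells is an assumption, and a hash-based implementation is used rather than the sorting-based bound quoted above). It returns the number of cells covering V exactly, and ∞ when V is not a union of cells.

CELLS = {"NewYork": {"Grand", "Canal", "Broadway"},
         "SanFrancisco": {"VanNess", "Market", "Mission"},
         "Paris": {"Victoria", "DeRivoli"}}

def partition_mdl(cells, V):
    # ||V||_L for a pure partition: the number of cells that cover V exactly.
    used = [name for name, cell in cells.items() if cell & V]
    covered = set().union(*(cells[name] for name in used)) if used else set()
    return len(used) if covered == V else float("inf")

print(partition_mdl(CELLS, {"Victoria", "DeRivoli"}))   # 1 (Paris)
print(partition_mdl(CELLS, {"Grand", "Canal", "Broadway", "Victoria", "DeRivoli"}))  # 2
print(partition_mdl(CELLS, {"Grand"}))                  # inf: not a union of cells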
   Of course, in general not all subsets of street names can be expressed only by
using city names—that is, the propositional language L(U, Σ) for a partition is
not, in general, granular. We therefore extend the alphabet Σ to be granular;
this requires having additional symbols in Σ, one for each element of U.

   Definition 7 (Granular Partition). A structured set (U, Σ) is a granular
partition if Σ = Σ0 ∪̇ U, where (U, Σ0) is a partition. The interpretation func-
tion [·] : Σ → Pwr(U) is extended such that [u] = {u} for any u ∈ U.

    The L-MDL decision problem for granular partitions is also solvable in poly-
nomial time. We first define a sublanguage Npar ⊆ L consisting of expressions
which we refer to as normal, and show that all expressions in L are reducible
to ones in Npar , and use this to constructively show that the Npar -MDL decision
problem is solvable in polynomial time.
    Let A = {a1, a2, . . . , an} ⊆ Σ be a set of symbols. We write A⃗ = a1 + a2 + · · · +
an. The ordering of the symbols ai does not change the semantic evaluation nor
its length, so A⃗ can be any of the strings that are equivalent to a1 + a2 + · · · + an
up to the permutations of {ai}. Furthermore, we write [A] to mean [A⃗]. For a
set of expressions {si}, Σi si is the expression formed by concatenating the si by the
+ operator.

   Definition 8 (Normal Form for Granular Partitions). Let (U, Σ0 ∪̇ U) be a
granular partition, and let its propositional language be L. An expression s ∈ L is
in normal form if it is of the form (Γ⃗ + A⃗+) − A⃗− where Γ ⊆ Σ0 and A+ and
A− are sets of elements of U interpreted as symbols in Σ. The normal expression s is
trim if A+ = [s] − [Γ] and A− = [Γ] − [s].
   Let Npar(U, Σ) be all the normal expressions in L(U, Σ) that are trim.

   Intuitively, a normal form expression consists of taking the union of some set
of symbols from the alphabet, adding to it some elements from the universe,
and subtracting some others. The expression is trim if we only add and subtract
exactly those symbols that we need to express a particular subset.
   Note that all normal and trim expressions s ∈ Npar are uniquely determined
by their semantics [s] and the high-level symbols Γ used. Therefore we can write
π(V/Γ) to mean the normal and trim expression of the form Γ⃗ + A⃗+ − A⃗− where
A+ = V − [Γ] and A− = [Γ] − V.
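
The trim normal expression π(V/Γ) is straightforward to construct. The sketch below (ours; the dictionary encoding of the Figure 2 cells is an assumption) computes A+, A−, and the resulting length for a chosen set Γ of high-level symbols. For V = {Grand, Canal}, the choices Γ = {NewYork} and Γ = ∅ both give length 2, anticipating the efficiency criterion of Lemma 2 below.

CELLS = {"NewYork": {"Grand", "Canal", "Broadway"},
         "SanFrancisco": {"VanNess", "Market", "Mission"},
         "Paris": {"Victoria", "DeRivoli"}}

def trim_normal_form(cells, gamma, V):
    # The trim normal expression pi(V/Gamma) = (Gamma + A+) - A- and its length.
    covered = set().union(*(cells[g] for g in gamma)) if gamma else set()
    a_plus = V - covered           # elements added individually
    a_minus = covered - V          # elements subtracted individually
    return (sorted(gamma), sorted(a_plus), sorted(a_minus)), len(gamma) + len(a_plus) + len(a_minus)

print(trim_normal_form(CELLS, {"NewYork"}, {"Grand", "Canal"}))  # ((['NewYork'], [], ['Broadway']), 2)
print(trim_normal_form(CELLS, set(), {"Grand", "Canal"}))        # (([], ['Canal', 'Grand'], []), 2)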
   With the interest of compact expressions, we only need to be concerned with
normal expressions that are trim for the following reasons.
  PROPOSITION 3. A normal expression s = Γ⃗ + A⃗+ − A⃗− is L-compact only if
A+ ∩ [Γ] = A− − [Γ] = ∅.

  PROOF. If A+ ∩ [Γ] is nonempty, say a ∈ A+ ∩ [Γ], then define A′+ = A+ − {a}
and s′ = Γ⃗ + A⃗′+ − A⃗−. It is clear that [s′] = [s] but ‖s′‖ < ‖s‖, so s cannot be
L-compact. Similarly if A− − [Γ] ≠ ∅, we can reduce s strictly as well.

  PROPOSITION 4. A normal expression for a granular partition is L-compact
only if it is trim.
   PROOF. Let s = Γ⃗ + A⃗+ − A⃗− be a normal expression. Say that it is not trim.
Then either A+ ≠ [s] − [Γ] or A− ≠ [Γ] − [s]. We show that in either case, the
expression s can be strictly reduced.
   Say A+ ≠ [s] − [Γ]. There are two possibilities:
— A+ − ([s] − [Γ]) ≠ ∅: Since A+ ∩ [Γ] = ∅ by Proposition 3, we have that
  A+ − [s] ≠ ∅. Let a ∈ A+ − [s]. Define s′ = Γ⃗ + (A+ − {a})⃗ − (A− − {a})⃗. It's
  easy to see that [s′] = [s] but ‖s′‖ < ‖s‖.
— ([s] − [Γ]) − A+ ≠ ∅: Recall that [s] = ([Γ] ∪ A+) − A−; since [Γ] ∪ A+ ⊇ [s],
  ([s] − [Γ]) − A+ = ∅ always, making this case impossible.
   The second case of A− ≠ [Γ] − [s] implies that s is reducible by similar
arguments.
  LEMMA 1 (NORMALIZATION).           Every expression in L is reducible to one in Npar .
   PROOF. The proof is by induction on the construction of the expression s in L.
   The base cases of s = ε and s = σ are trivially reducible to Npar. The expression
s = ε is reducible to ∅⃗ + ∅⃗ − ∅⃗, which also has a length of zero. The expression
s = σ is reducible to {σ}⃗ + ∅⃗ − ∅⃗ if σ ∈ Σ0, and to ∅⃗ + {σ}⃗ − ∅⃗ if σ ∈ U.
   The inductive step has three cases:
 (i) Suppose that s = s1 + s2 where si ∈ Npar. We show that s is reducible to
     Npar.
     Write si = Γ⃗i + A⃗i+ − A⃗i−. Define Γ = Γ1 ∪ Γ2. Then, by Definition 8, we
     have the following,
              A+ = [s] − [Γ] = ([s1] ∪ [s2]) − [Γ]
                 = ([s1] − [Γ]) ∪ ([s2] − [Γ]) ⊆ ([s1] − [Γ1]) ∪ ([s2] − [Γ2])
                 = A1+ ∪ A2+, and
              A− = [Γ] − [s] = ([Γ1] − [s]) ∪ ([Γ2] − [s])
                 ⊆ ([Γ1] − [s1]) ∪ ([Γ2] − [s2])
                 = A1− ∪ A2−.
     So the normal expression π([s]/Γ) = Γ⃗ + A⃗+ − A⃗− is equivalent to s, and
     has its length
              ‖π([s]/Γ)‖ = ‖Γ⃗ + A⃗+ − A⃗−‖ = |Γ| + |A+| + |A−|
                         ≤ |Γ1| + |A1+| + |A1−| + |Γ2| + |A2+| + |A2−|
                         = ‖s1‖ + ‖s2‖ = ‖s‖.
(ii) Suppose that s = s₁ · s₂. Let sᵢ be as in (i), and define Γ = Γ₁ ∩ Γ₂. By standard set manipulations similar to those in (i), we once again get
         A⁺ ⊆ A₁⁺ ∪ A₂⁺   and   A⁻ ⊆ A₁⁻ ∪ A₂⁻.
     Hence s is reducible to π([s]/Γ).


(iii) Finally consider the case that s = s₁ − s₂ with sᵢ in normal form as before. Let Γ = Γ₁ − Γ₂. Then one can show that
         A⁺ ⊆ A₁⁺ ∪ A₂⁻   and   A⁻ ⊆ A₁⁻ ∪ A₂⁺.
      Again s is reducible to π([s]/Γ). This concludes the proof.

   Lemma 1 immediately implies the following.

   THEOREM 1. For all V ⊆ U, we have ‖V‖_{Npar} = ‖V‖_L.

   By Theorem 1, one only needs to focus on the Npar-MDL problem for granular partitions. The necessary and sufficient condition for Npar-compactness can be easily stated in terms of the symbols used.
   Suppose V ⊆ U; let us denote
         Σ⁺(V) = {σ ∈ Σ : |[σ] ∩ V| > |[σ] − V| + 1},
and very similarly
         Σ#(V) = {σ ∈ Σ : |[σ] ∩ V| ≥ |[σ] − V| + 1}.
Intuitively, the interpretation of a symbol in Σ⁺(V) includes more elements in V than elements not in V, by a difference of at least two. Similarly, for a symbol in Σ#(V) the difference is at least one.
   We say that symbols in Σ#(V) are efficient with respect to V and ones in Σ⁺(V) are strictly efficient. Symbols that are not in Σ#(V) are inefficient with respect to V.
   Example 4. Consider the partition in Figure 2. Let V₁ = {Victoria, DeRivoli} and V₂ = {Grand, Canal}. Then Σ#(V₁) = Σ⁺(V₁) = {Paris}, Σ#(V₂) = {NewYork}, and Σ⁺(V₂) = ∅.

   LEMMA 2. Let $s = (\vec{\Gamma} + \vec{A^+}) - \vec{A^-}$ be an expression in Npar representing V. It is Npar-compact if and only if Σ⁺(V) ⊆ Γ ⊆ Σ#(V).
                                                                                 +
   PROOF (ONLY IF). We show by contradiction that if s is Npar-compact then Σ⁺(V) ⊆ Γ ⊆ Σ#(V).
(i) Suppose Σ⁺(V) ⊄ Γ; then there exists a symbol σ ∈ Σ⁺(V) with σ ∉ Γ. Define Γ′ = Γ ∪ {σ} and s′ = π(V/Γ′). We have that
         A′⁺ = V − [Γ′] = V − ([Γ] ∪̇ [σ]) = (V − [Γ]) − [σ] = A⁺ −̇ (V ∩ [σ]), and
         A′⁻ = [Γ′] − V = ([Γ] ∪̇ [σ]) − V = ([Γ] − V) ∪̇ ([σ] − V) = A⁻ ∪̇ ([σ] − V).
    So
         ‖s′‖ = |Γ′| + |A′⁺| + |A′⁻| = ‖s‖ + (|[σ] − V| + 1 − |V ∩ [σ]|) < ‖s‖.
    This contradicts the assumption that s is Npar-compact.

(ii) Say that Γ ⊄ Σ#(V). Let ω ∈ Γ but ω ∉ Σ#(V). Define Γ′ = Γ −̇ {ω} and s′ = π(V/Γ′). We then have
         A′⁺ = V − ([Γ] −̇ [ω]) = A⁺ ∪̇ (V ∩ [ω]), and
         A′⁻ = [Γ′] − V = ([Γ] −̇ [ω]) − V = ([Γ] − V) −̇ ([ω] − V) = A⁻ −̇ ([ω] − V).
    It follows then that
         ‖s′‖ = |Γ′| + |A′⁺| + |A′⁻| = ‖s‖ + (|V ∩ [ω]| − |[ω] − V| − 1) < ‖s‖.
    Again a contradiction.
                                                        +
(IF). It remains to be shown that Σ⁺(V) ⊆ Γ ⊆ Σ#(V) implies that s is Npar-compact.
   Let Γ₀ = Σ⁺(V) and s₀ = π(V/Γ₀). We are going to prove the following fact:
         (∀Γ ⊆ Σ)  [ Σ⁺(V) ⊆ Γ ⊆ Σ#(V)  ⟹  ‖s₀‖ = ‖π(V/Γ)‖ ].      (∗)
Therefore, by Equation (∗), all expressions in Npar with Σ⁺(V) ⊆ Γ ⊆ Σ#(V) have the same length, and since one of them must be Npar-compact by the necessary condition and the guaranteed existence of an Npar-compact expression, all must be Npar-compact.
   Now we prove (∗). Consider any $s = \vec{\Gamma} + \vec{A^+} - \vec{A^-}$ with Σ⁺(V) ⊆ Γ ⊆ Σ#(V). Define Δ = Γ − Σ⁺(V). Then Γ = Γ₀ ∪̇ Δ, and
         A⁺ = V − [Γ] = (V − [Γ₀]) −̇ (V ∩ [Δ]) = A₀⁺ −̇ ([Δ] ∩ V), and
         A⁻ = [Γ] − V = A₀⁻ ∪̇ ([Δ] − V).
It then follows that
         ‖s₀‖ = ‖s‖ + |V ∩ [Δ]| − |[Δ] − V| − |Δ|.      (∗∗)
Furthermore, since Δ ⊆ Σ#(V) − Σ⁺(V) and $[\Delta] = \dot\bigcup_{\gamma\in\Delta}[\gamma]$, we conclude
         $|V \cap [\Delta]| - |[\Delta] - V| = \sum_{\gamma\in\Delta}\bigl(|V \cap [\gamma]| - |[\gamma] - V|\bigr) = \sum_{\gamma\in\Delta} 1 = |\Delta|$.
Substituting into Equation (∗∗), we have the desired result: ‖s₀‖ = ‖s‖.
   Intuitively, Lemma 2 tells us that an expression is Npar-compact if and only if it uses all strictly efficient symbols and never uses any inefficient ones.

   COROLLARY 1. Let (U, Σ) be a granular partition. Given any V ⊆ U, π(V/Σ#(V)) is L-compact.

   Computing π(V/Σ#(V)) can certainly be done in polynomial time.


  THEOREM 2. The L-MDL problem for granular partitions can be solved in
polynomial time.




                                 Fig. 3. The STORE dimension as a tree.

   Example 5. Consider V₁ and V₂ as defined in the previous example. By Lemma 2, both of the following expressions of V₁ ∪ V₂ = {Victoria, DeRivoli, Grand, Canal} are compact:
         s₁ = (NewYork + Paris) − Broadway   and   s₂ = Paris + (Grand + Canal).
Note that π(V/Σ#(V)), for V = V₁ ∪ V₂, is s₁.

2.2 Hierarchy is in P
Partitions have the nice property that their MDL problem is simple. However, they do not adequately express many realistic structures. We shall generalize the notion of (granular) partitions to (granular) multilevel hierarchies.
   Definition 9 (Hierarchy). A structured set (U, Σ) is a hierarchy if
         Σ = Σ₁ ∪̇ Σ₂ ∪̇ Σ₃ ∪̇ ··· ∪̇ Σ_N,
such that for any i ≤ N, (U, Σᵢ) is a partition; furthermore, for any i, j ≤ N, we have i < j ⟹ U/Σᵢ refines U/Σⱼ. The integer N is referred to as the number of levels, or the height, of the hierarchy, and (U, Σᵢ) is the ith level.

   Example 6. We extend the partition in Figure 2 to form a hierarchy with three levels (N = 3), shown in Figure 3. The first level has Σ₁ being the street names, the second has Σ₂ being the city names, and finally the third level has Σ₃ containing only one symbol, STORE.

   Consider a hierarchy (U, Σ₁ ∪̇ Σ₂ ∪̇ ··· ∪̇ Σ_N). First note that it is granular if and only if in the first level Σ₁ = U, that is, (U, Σ₁ ∪̇ Σ₂) is a granular partition.
   For i < N, we define Σ̄ᵢ = Σᵢ₊₁ ∪ ··· ∪ Σ_N. The alphabet Σ̄ᵢ contains all symbols in levels higher than the ith level of the hierarchy. We may view Σᵢ as a universe, and consider (Σᵢ, Σ̄ᵢ) as a new hierarchy, with the interpretation function given by
         [·]ᵢ : Σ̄ᵢ → Pwr(Σᵢ) : λ ↦ {σ ∈ Σᵢ : [σ] ⊆ [λ]}.
Let Lᵢ denote the propositional language L(Σᵢ, Σ̄ᵢ).
  Much of the discussion regarding partitions naturally applies to hierarchies
with some appropriate generalization.
   Definition 10 (Normal Forms). An expression s ∈ Lᵢ is in normal form for the hierarchy if it is of the form $s = \hat{s} + \vec{A_i^+} - \vec{A_i^-}$, where ŝ ∈ Lᵢ₊₁ is the leading subexpression of s, and Aᵢ⁺, Aᵢ⁻ ⊆ Σᵢ. It is trim if ŝ is Lᵢ₊₁-compact, Aᵢ⁺ = [s]ᵢ − [ŝ]ᵢ, and Aᵢ⁻ = [ŝ]ᵢ − [s]ᵢ. We denote by (Nhie)ᵢ = Nhie(Σᵢ, Σ̄ᵢ) the set of all normal and trim expressions of the hierarchy (Σᵢ, Σ̄ᵢ), and let Nhie ≡ (Nhie)₁.




                    Fig. 4. The filled circles are the selected elements.

  Here are some familiar results.
  PROPOSITION 5.       A normal expression in Li is Li -compact only if it is trim.
  The proof of Proposition 5 mirrors that of Proposition 4 exactly.
  LEMMA 3 (NORMALIZATION).         Every expression in Li can be reduced to one in
(Nhie )i .
   PROOF. We prove by induction on the construction of expressions in Lᵢ. The base cases of s = ε and s = σ are trivial.
   Suppose that s = s₁ + s₂ ∈ Lᵢ, where $s_k = \hat{s}_k + \vec{A_k^+} - \vec{A_k^-}$ for k = 1, 2. Then let t̂ be an Lᵢ₊₁-compact expression that ŝ₁ + ŝ₂ reduces to.
   Consider the normal expression $s' = \hat{t} + \vec{A^+} - \vec{A^-}$ where A⁺ = [s]ᵢ − [t̂]ᵢ and A⁻ = [t̂]ᵢ − [s]ᵢ. Repeating the proof of Lemma 1, we have that A⁺ ⊆ A₁⁺ ∪ A₂⁺ and A⁻ ⊆ A₁⁻ ∪ A₂⁻. Therefore s reduces to s′.
   The cases for s = s1 · s2 and s = s1 − s2 are handled similarly.
   THEOREM 3. Let (U, Σ) be a hierarchy; then for any V ⊆ U, ‖V‖_L = ‖V‖_{Nhie}.

Theorem 3 follows immediately from Lemma 3.
   As in the case of partitions, one only needs to focus on the expressions in Nhie, since Nhie-compactness implies L-compactness.
   LEMMA 4 (NECESSARY CONDITION). Let s ∈ (Nhie)ᵢ, and V = [s]ᵢ. It is (Nhie)ᵢ-compact only if Σ⁺_{i+1}(V) ⊆ [ŝ]ᵢ₊₁ ⊆ Σ#_{i+1}(V), where Σ⁺_{i+1}(V) and Σ#_{i+1}(V) are, respectively, the strictly efficient and efficient alphabets in Σᵢ₊₁ with respect to V.

The (only if) half of the proof of Lemma 2 applies with minimal modifications.
   Note that Lemma 4 mirrors Lemma 2. It states that the expression s is compact only when ŝ expresses all the efficient symbols in Σᵢ₊₁ with respect to V, and never any inefficient ones. It is also worth noting that this condition is not sufficient, unlike the case in Lemma 2, as demonstrated in the following example.
   Example 7. Consider the hierarchical structure shown in Figure 4.
   Let V = {1, 2, 4, 5}. The expression s = 1 + 2 + 4 + 5 expressing V is normal. Note that Σ⁺₂(V) is empty, so s is also trim, but it is not compact, as it can be reduced to s′ = D − (3 + 6).
   For any i ≤ N, define a partial order ⊑ over (Nhie)ᵢ such that for any two expressions s, t ∈ (Nhie)ᵢ,
         s ⊑ t  ⟺  [s]ᵢ = [t]ᵢ  and  [ŝ]ᵢ₊₁ ⊇ [t̂]ᵢ₊₁.

   PROPOSITION 6. Let s, t be two equivalent expressions in Nhie which satisfy the necessary condition of Lemma 4. Then s ⊑ t ⟹ ‖s‖ ≤ ‖t‖. In other words, ‖·‖ : (Nhie, ⊑) → (ℕ, ≤) is order preserving.

   PROOF. Write $s = \hat{s} + \vec{A^+} - \vec{A^-}$ and $t = \hat{t} + \vec{B^+} - \vec{B^-}$, and let V = [s]ᵢ = [t]ᵢ. By assumption, [ŝ]ᵢ₊₁ and [t̂]ᵢ₊₁ are subsets of Σ#_{i+1}(V). Define Δ = [ŝ]ᵢ₊₁ −̇ [t̂]ᵢ₊₁, which is also a subset of Σ#_{i+1}(V).
   Recall that A⁺ = V − [ŝ]ᵢ and B⁺ = V − [t̂]ᵢ. Then
         s ⊑ t ⟹ [ŝ]ᵢ ⊇ [t̂]ᵢ ⟹ A⁺ ⊆ B⁺.
Furthermore,
         B⁺ = V − [t̂]ᵢ = V − [ŝ −̇ Δ]ᵢ = (V − [ŝ]ᵢ) ∪ (V ∩ [Δ]ᵢ) = A⁺ ∪̇ (V ∩ [Δ]ᵢ).
Similarly, we can show that A⁻ = B⁻ ∪̇ ([Δ]ᵢ − V). Therefore,
         |B⁺ − A⁺| = |B⁺| − |A⁺| = |V ∩ [Δ]ᵢ|,
         |A⁻ − B⁻| = |A⁻| − |B⁻| = |[Δ]ᵢ − V|.
So ‖s‖ − ‖t‖ = (‖ŝ‖ − ‖t̂‖) + (|A⁺| − |B⁺|) + (|A⁻| − |B⁻|) = (‖ŝ‖ − ‖t̂‖) − |Δ|.
Observe that ŝ is equivalent to $\hat{t} + \vec{\Delta}$, so ‖ŝ‖ ≤ ‖t̂‖ + |Δ|. Therefore ‖s‖ ≤ ‖t‖.
   Therefore, by minimizing with respect to ⊑, we are effectively minimizing the length. It is immediate from the definition of ⊑ that minimization over ⊑ in (Nhie)ᵢ amounts to maximization of [ŝ]ᵢ₊₁, which is bounded by Σ#_{i+1}([s]). This leads to the following recursive description of a minimal expression of a set V.

   COROLLARY 2. Let minexpᵢ : Pwr(Σᵢ) → Lᵢ be defined as
— minexp_N(V) = $\vec{V}$,
— for 0 ≤ i < N, minexpᵢ(V) = πᵢ(V/minexpᵢ₊₁(Σ#_{i+1}(V))), where πᵢ(V/t) denotes the expression $t + \overrightarrow{(V - [t]_i)} - \overrightarrow{([t]_i - V)}$.
Then for any subset V ⊆ U, minexp₀(V) is an Nhie-compact expression for V.
   Here is a bottom-up decomposition procedure to compute a minimal expression in Nhie for a given subset V ⊆ U.

   Definition 11 (Decomposition Operators). Define the following mappings for each i ≤ N:
— Λᵢ : Pwr(Σᵢ) → Pwr(Σᵢ₊₁) : V ↦ Σ#_{i+1}(V),
— Λᵢ⁺ : Pwr(Σᵢ) → Pwr(Σᵢ) : V ↦ V − [Λᵢ(V)]ᵢ, and
— Λᵢ⁻ : Pwr(Σᵢ) → Pwr(Σᵢ) : V ↦ [Λᵢ(V)]ᵢ − V.

   With these operators and Corollary 2, we can construct an Nhie-compact expression for a given set V with respect to a hierarchy in an iterative fashion.

   THEOREM 4. Suppose V ⊆ U. Let
— V₁ = V,
— Vᵢ₊₁ = Λᵢ(Vᵢ), Wᵢ⁺ = Λᵢ⁺(Vᵢ) and Wᵢ⁻ = Λᵢ⁻(Vᵢ), for 1 ≤ i < N.




                   Fig. 5. The decomposition algorithm for hierarchy.

   Define the expressions
— $s_N = \vec{V_N}$,
— $s_{i-1} = s_i + \vec{W_{i-1}^+} - \vec{W_{i-1}^-}$ for 1 < i ≤ N.
Then s₁ is an Nhie-compact expression expressing V.
   Corollary 2 follows from simple induction on the number of levels of the hierarchy, showing that at each level the constructed expression satisfies the sufficient condition stated in Proposition 6.
   Clearly the construction of s₁ takes polynomial time; in fact, it can be done in O(|Σ| · |V| · log |V|). The algorithm is illustrated in Figure 5.
  Example 8. Consider the hierarchy in Figure 3. Let V1 = {Victoria,
DeRivoli, Grand, Broadway, Market}. The algorithm produces:
V2 = {Paris, NewYork}, W1+ = {Market}, W1− = {Canal},
V3 = {STORE}, W2+ = ∅, W2− = {SanFrancisco}.
  The expressions produced by the algorithm are
— s3 = STORE,
— s2 = STORE − SanFrancisco,
— s1 = (STORE − SanFrancisco) + Market − Canal.
Since s1 is guaranteed compact, ‖V1‖ = ‖s1‖ = 4. Note that s1 is not the only compact expression; (NewYork − Canal) + Market + Paris, for instance, is another expression with length 4.

2.3 Multidimensional Partition and Hierarchy
An important family of structures is that of multidimensional structures. The simplest is the multidimensional partition.

   Definition 12 (Multidimensional Partition). A cover structure (U, Σ) is a multidimensional partition if the alphabet Σ = Σ₁ ∪̇ Σ₂ ∪̇ ··· ∪̇ Σ_N where, for every i, (U, Σᵢ) is a partition as defined in Definition 6. The integer N is the dimensionality of the structure. The partition (U, Σᵢ) is the ith dimension.
   Note the subtle difference between a multidimensional partition and a hierarchy. A hierarchy has the additional constraint that the U/Σᵢ are ordered by granularity, and is in fact a special case of the multidimensional partition, but

as one might expect, and as we shall show, the relaxed definition of the multidimensional partition leads to an NP-hard MDL problem.
   A simple extension of the multidimensional partition is the multidimensional
hierarchy.
   Definition 13 (Multidimensional Hierarchy). A cover structure (U, Σ) is a multidimensional hierarchy if the alphabet Σ = Σ₁ ∪̇ Σ₂ ∪̇ ··· ∪̇ Σ_N where, for every i, (U, Σᵢ) is a hierarchy as defined in Definition 9. The integer N is the dimensionality of the structure.
   In this section, we will consider three languages which express subsets of the universe, with successively more grammatical freedom. It will be shown that the MDL decision problem is NP-complete for all three languages. In fact, we will show this on a specific kind of structure that we call product structures. Intuitively, multidimensional partitions and multidimensional hierarchies make sense when the elements of the universe can be thought of as N-dimensional points, and each of the partitions or hierarchies operates along one dimension. Most of our discussion will focus on the two-dimensional (2D) case (N = 2), which is enough to yield the NP-completeness results. We next define product structures for the 2D case.
   Definition 14 (2D Product Structure). We say that (U, Σ) is a 2D product structure if the universe U is the cartesian product of two disjoint sets X and Y: U = X × Y, and the alphabet Σ is the union of X and Y: Σ = X ∪̇ Y. The interpretation function is defined, for any z ∈ Σ, as
         [z] = {z} × Y if z ∈ X,   and   [z] = X × {z} if z ∈ Y.
   Note that the 2D product structure is granular, since the language L(X × Y, Σ) can express every singleton {(x, y)} ∈ Pwr(U) by the expression (x · y).
  The 2D product structure admits two natural expression languages, both
requiring the notion of product expressions.
   Definition 15 (Product Expressions). An expression s ∈ L is a product expression if it is of the form $s = (\vec{A} \cdot \vec{B})$, where A ⊆ X and B ⊆ Y.
     We build up two languages using product expressions.
   Definition 16 (Disjunctive Product Language). The disjunctive product language L_P+ is defined as
— ε ∈ L_P+,
— any product expression s belongs to L_P+,
— if s, t ∈ L_P+, then (s + t) ∈ L_P+.
   It is immediate that any expression s ∈ L_P+ can be written in the form $\sum_{i\in I} s_i$ where, for any i, sᵢ is a product expression. A generalization of the disjunctive product language is to allow other operators to connect the product expressions.

   Definition 17 (Propositional Product Language). The propositional product language L_P is defined as
— ε ∈ L_P,
— any product expression s belongs to L_P,
— if s, t ∈ L_P, then (s + t), (s − t), (s · t) ∈ L_P.
   Obviously L_P+ ⊊ L_P ⊊ L.
   Example 9.       Consider a 2D product structure with
                       CITY = { New York, San Francisco, Paris},
and PRODUCT = { Clothing, Beverage, Automobile}. The universe U = CITY ×
PRODUCT consists of the nine pairs of city name and product family:

U=       (NewYork, Clothing), (NewYork, Beverage), (NewYork, Automobile),
         (SanFrancisco, Clothing), (SanFrancisco, Beverage), (SanFrancisco, Automobile),
         (Paris, Clothing), (Paris, Beverage), (Paris, Automobile) .

   The alphabet Σ consists of six symbols:
         Σ = CITY ∪̇ PRODUCT = {NewYork, SanFrancisco, Paris, Clothing, Beverage, Automobile}.
   The interpretation of a symbol is the set of pairs in U in which the symbol occurs. For instance,
         [Beverage] = {(NewYork, Beverage), (SanFrancisco, Beverage), (Paris, Beverage)}.
   Consider the following expressions in L(U, Σ):
— s1 = ((NewYork + Paris) · Clothing) + (NewYork · Beverage),
—s2 = ((NewYork + Paris) · (Clothing + Beverage)) − (NewYork · Clothing),
— s3 = NewYork − Beverage.
The expressions satisfy s1 ∈ L_P+, s2 ∈ L_P − L_P+, and s3 ∈ L − L_P. They evaluate to [s1] = {(NewYork, Clothing), (Paris, Clothing), (NewYork, Beverage)} and [s2] = {(NewYork, Beverage), (Paris, Clothing), (Paris, Beverage)}. The last expression s3 is a bit tricky: it contains all tuples of NewYork that are not Beverage, so [s3] = {(NewYork, Clothing), (NewYork, Automobile)}.
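As a sanity check of Definition 14 and Example 9, the following Python sketch evaluates expressions of the propositional language over the 2D product structure CITY × PRODUCT. The tuple encoding of expressions is an illustrative choice, not the paper's syntax.

```python
# Evaluator for L(X x Y, X u Y) on the 2D product structure of Example 9.

CITY = {"NewYork", "SanFrancisco", "Paris"}
PRODUCT = {"Clothing", "Beverage", "Automobile"}

def interp(z):
    """[z] = {z} x PRODUCT if z is a city, CITY x {z} otherwise."""
    return {(z, p) for p in PRODUCT} if z in CITY else {(c, z) for c in CITY}

def ev(s):
    """Evaluate a symbol, or a nested tuple (op, lhs, rhs) with op in '+-.'."""
    if isinstance(s, str):
        return interp(s)
    op, l, r = s
    L, R = ev(l), ev(r)
    return L | R if op == "+" else L - R if op == "-" else L & R

s1 = ("+", (".", ("+", "NewYork", "Paris"), "Clothing"),
           (".", "NewYork", "Beverage"))
s3 = ("-", "NewYork", "Beverage")
print(ev(s1) == {("NewYork", "Clothing"), ("Paris", "Clothing"),
                 ("NewYork", "Beverage")})                             # True
print(ev(s3) == {("NewYork", "Clothing"), ("NewYork", "Automobile")})  # True
```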
  We will see that the MDL decision problem for each of these languages is
NP-complete.

2.4 The L_P-MDL Decision Problem is NP-Complete
In this section, we prove that the MDL problems for L_P+ and L_P are NP-complete. It is obvious that they are both in NP. The proof of NP-hardness is by a reduction from the minimal three-set cover problem.




                                 Fig. 6. A set cover with three cells.




         Fig. 7. The transformed instance of the MDL problem of 2D product structure.

   Recall that an instance of the minimal three-set cover problem consists of a set cover C = {C₁, C₂, . . . , Cₙ} where (∀C ∈ C) |C| = 3, and an integer k > 0. The question is whether there exists a subcover D ⊆ C such that ∪D = ∪C and |D| ≤ k. This is known to be NP-complete [Garey and Johnson 1979].
   From this point on, we fix the instance (C, k) of the minimum cover problem. Write C = {C₁, C₂, . . . , Cₙ}. Define X = ∪C, and for each i ≤ n let Yⁱ be a set such that |Yⁱ| = m > 3. The family {Yⁱ : i ≤ n} is made disjoint. Let Y = (∪ᵢ≤ₙ Yⁱ) ∪̇ {y*}, where y* does not belong to any Yⁱ. The structure is the 2D product structure of X × Y. The subset to be represented is given by V = ∪ᵢ≤ₙ(Cᵢ × Yⁱ) ∪ (X × {y*}). It is not difficult to see that this is a polynomial-time reduction.
   Example 10. Consider a set X = {A, B, C, D, E} and a cover C = {C₁, C₂, C₃}, where C₁ = {A, B, C}, C₂ = {C, D, E} and C₃ = {A, C, D}, as shown in Figure 6.
   It is transformed by first constructing Y¹, Y², and Y³, all disjoint and each with four elements. Then let Y = Y¹ ∪̇ Y² ∪̇ Y³ ∪̇ {y*}. The structure is the 2D product structure of X and Y. The subset is V = (C₁ × Y¹) ∪̇ (C₂ × Y²) ∪̇ (C₃ × Y³) ∪̇ (X × {y*}). It is shown as the shaded boxes in Figure 7.
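The reduction just described is easy to mechanize. The sketch below builds, for the cover of Example 10, the sets X, Y and the target subset V; the concrete names of the Yⁱ elements and the choice m = 4 are assumptions made for illustration.

```python
# Sketch of the reduction of Section 2.4, instantiated on Example 10:
# from a three-set cover instance build the 2D product structure X x Y
# and the target subset V.

def build_reduction(cover, m=4):
    """cover: list of 3-element sets C_1..C_n.  Returns (X, Y, V)."""
    X = set().union(*cover)
    Yblocks = [{f"y{i}_{k}" for k in range(m)} for i in range(len(cover))]
    ystar = "y*"
    Y = set().union(*Yblocks) | {ystar}
    V = {(x, y) for C, Yi in zip(cover, Yblocks) for x in C for y in Yi}
    V |= {(x, ystar) for x in X}          # the extra row X x {y*}
    return X, Y, V

C = [{"A", "B", "C"}, {"C", "D", "E"}, {"A", "C", "D"}]
X, Y, V = build_reduction(C)
print(len(X), len(Y), len(V))             # 5 13 41
```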
      ˙
   It turns out that for this very specific subset V, one can characterize the form of the compact expressions that express V in L_P.

   LEMMA 5. Let V be a subset resulting from the reduction from a set cover problem (depicted in Figure 7). Then all L_P-compact expressions of V are of the form
         $s = \sum_{i\in I}(\vec{C_i} \cdot \vec{Y^i}) + \sum_{j\in J}(\vec{C_j} \cdot \vec{Y^{j*}})$,
where Y^{j*} = Yʲ ∪̇ {y*}, I ∩ J = ∅, and I ∪̇ J = {1, 2, . . . , n}.

   Note that, by Lemma 5, the L_P-compact expressions of V do not make use of the negation "−" and conjunction "·" operators between product expressions; hence they belong to L_P+.

   Example. For the subset V in Figure 7, the expression
         $s = (\vec{C_1} \cdot \vec{Y^{1*}}) + (\vec{C_2} \cdot \vec{Y^{2*}}) + (\vec{C_3} \cdot \vec{Y^3})$
is both L_P- and L_P+-compact. Therefore ‖V‖_{L_P} = ‖V‖_{L_P+} = (3 + 5) + (3 + 5) + (3 + 4) = 23.
   The proof of Lemma 5 is by ruling out all other possible forms. Before delving into the details of that proof, let us use the lemma to prove the NP-hardness of the L_P+-MDL and L_P-MDL problems.

   THEOREM 5. L_P+-MDL and L_P-MDL are NP-complete for multidimensional partitions.

   PROOF. This follows from Lemma 5. As we mentioned, ‖V‖_{L_P+} = ‖V‖_{L_P}. Let s be an L_P-compact expression of V. Since
         $s = \sum_{i\in I}(\vec{C_i} \cdot \vec{Y^i}) + \sum_{j\in J}(\vec{C_j} \cdot \vec{Y^{j*}})$,
its length is ‖s‖ = $\sum_{i\le n}(|C_i| + |Y^i|)$ + |J| = (3 + m)n + |J|. Since [s] = V, it is necessarily the case that X × {y*} ⊆ [$\sum_{j\in J}(\vec{C_j} \cdot \vec{Y^{j*}})$], or that {Cⱼ}_{j∈J} covers X.
   Minimizing ‖s‖ with s in the given form is therefore equivalent to minimizing |J|, or finding a minimal cover of X, which is of course the objective of the minimum set cover problem.
     The proof of Lemma 5 makes use of the following results.
   Definition 18 (Expression Rewriting). Let σ be a symbol, and t an expression. The rewriting, written ⟨· : σ → t⟩, is a function L → L : s ↦ ⟨s : σ → t⟩, defined inductively as
— ⟨ε : σ → t⟩ = ε,
— for any symbol σ′ ∈ Σ, ⟨σ′ : σ → t⟩ = t if σ′ = σ, and σ′ otherwise,
— for any two strings s, s′ ∈ L, ⟨s ⊙ s′ : σ → t⟩ = ⟨s : σ → t⟩ ⊙ ⟨s′ : σ → t⟩, where ⊙ can be +, −, or ·.
   Basically, ⟨s : σ → t⟩ replaces all occurrences of σ in s by the expression t.

   Definition 19 (Extended Expression Rewriting). Given a set of symbols Σ₀ ⊆ Σ, and t an expression that does not make use of symbols in Σ₀, ⟨s : Σ₀ → t⟩ is the expression obtained by replacing every occurrence of a symbol in Σ₀ by the expression t.
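Definitions 18 and 19 translate directly into a recursive procedure on expression trees. The following Python sketch uses the same tuple encoding as in the earlier sketch, with None standing for the empty expression ε; these encoding choices are assumptions for illustration only.

```python
# Rewriting of Definitions 18 and 19 on expression trees.

def rewrite(s, sigma, t):
    """<s : sigma -> t>: replace every occurrence of the symbol sigma by t."""
    if s is None:                      # the empty expression
        return None
    if isinstance(s, str):             # a single symbol
        return t if s == sigma else s
    op, l, r = s                       # op is '+', '-' or '.'
    return (op, rewrite(l, sigma, t), rewrite(r, sigma, t))

def rewrite_all(s, symbols, t):
    """<s : Sigma0 -> t>; per Definition 19, t must not use symbols of Sigma0,
    so the sequential replacement below is safe."""
    for sigma in symbols:
        s = rewrite(s, sigma, t)
    return s

# e.g. <(x + y) . z : y -> eps>  ==  ('.', ('+', 'x', None), 'z')
print(rewrite((".", ("+", "x", "y"), "z"), "y", None))
```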

   PROPOSITION 7 (SYMBOL REMOVAL). For any expression s ∈ L_P, we have that
         [⟨s : z → ε⟩] = [s] − [z]
for any symbol z ∈ X ∪ Y. In other words, ⟨s : z → ε⟩ ≡ s − z.

   PROOF. We prove by induction on the number of product expressions in s.
   Suppose $s = \vec{A} \cdot \vec{B}$ where A ⊆ X and B ⊆ Y. Without loss of generality, say z ∈ A; then
         [⟨s : z → ε⟩] = (A − {z}) × B = A × B − [z] = [s] − [z].
   The induction goes as follows:
         [⟨t + t′ : z → ε⟩] = [⟨t : z → ε⟩ + ⟨t′ : z → ε⟩]
                            = ([t] − [z]) ∪ ([t′] − [z]) = ([t] ∪ [t′]) − [z]
                            = [t + t′] − [z].
Similar arguments apply to the cases t − t′ and t · t′.
   We need to emphasize that Proposition 7 does not apply to expressions in L in general. For instance, if s = x and z = y, we have that ⟨x : y → ε⟩ = x ≢ x − y.
   PROPOSITION 8 (SYMBOL ADDITION). Let s ∈ L_P and x, x′ ∈ X where x′ does not occur in s. Then
         [⟨s : x → x + x′⟩] = [s] ∪̇ ({x′} × [s](x)),
where [s](x) = {y ∈ Y : (x, y) ∈ [s]}. Similarly,
         [⟨s : y → y + y′⟩] = [s] ∪̇ ([s](y) × {y′}).

   PROOF. As a notational convenience, let us fix x, x′ ∈ X and write ↑s = ⟨s : x → x + x′⟩ and d(s) = {x′} × [s](x).
   Let ⊙ be +, − or ·, and suppose x′ does not occur in s or s′; then by simple arguments, [s ⊙ s′](x) = [s](x) ⊙ [s′](x). It follows, then, that
         d(s ⊙ s′) = {x′} × [s ⊙ s′](x) = ({x′} × [s](x)) ⊙ ({x′} × [s′](x)) = d(s) ⊙ d(s′).
So d(·) distributes over +, − and ·.
So d (·) distributes over +, − and ·.
   We now prove Proposition 8 by induction on the number of product expressions in s. For s = ε or $s = \vec{A} \cdot \vec{B}$, it is obvious.
   Suppose that s = t + t′; then
         [↑s] = [↑t] ∪ [↑t′] = ([t] ∪̇ d(t)) ∪ ([t′] ∪̇ d(t′)) = [t + t′] ∪ d(t + t′).
This is not sufficient yet, since we need to show that the union of [t + t′] and d(t + t′) is a disjoint one. It is not too difficult: recall that d(t + t′) = {x′} × [t + t′](x), but x′ does not occur in t nor in t′, hence not in s, and since t, t′ ∈ L_P, [t + t′] ∩ [x′] = ∅.
   The cases for s = t − t′ and s = t · t′ are handled similarly. We only wish to remark that, for these two cases, it is important to have the disjointness of d(t) from both [t] and [t′].
   Again, Proposition 8 does not generalize to L. As a counterexample, say s = x + y. Then ↑s = x + x′ + y, so [s](x) = Y. Indeed [↑s] = [s] ∪ d(s), but it is not a disjoint union: [↑s] ≠ [s] ∪̇ d(s), since d(s) ∩ [s] = {(x′, y)}.
   We now prove Lemma 5 using the results in Proposition 7 and Proposition 8.

   PROOF. Let us first define #z(s) to be the number of occurrences of the symbol z in the expression s.
   Consider the reduction from an instance of the minimum three-set cover problem. We have the instance C = {Cᵢ}_{i∈I} where |Cᵢ| = 3 for all i ∈ I. The reduction produces the universe X × Y where X = ∪ᵢ Cᵢ and Y = (∪_{i∈I} Yⁱ) ∪̇ {y*}, where the Yⁱ are disjoint and |Yⁱ| > 3. The subset to be represented is V = (∪ᵢ(Cᵢ × Yⁱ)) ∪̇ (X × {y*}). Let s be an L_P-compact expression.
   Claim I: (∀i ≤ n)(∃y ∈ Yⁱ) #y(s) = 1.
   By contradiction, suppose that (∃i)(∀y ∈ Yⁱ) #y(s) > 1; then let $s' = \langle s : Y^i \to \varepsilon\rangle + (\vec{C_i} \cdot \vec{Y^i})$. By Proposition 7,
         [s′] = ([s] − [Yⁱ]) ∪ (Cᵢ × Yⁱ) = ([s] − [Yⁱ]) ∪ ([s] ∩ [Yⁱ]) = [s].
So s′ is equivalent to s, but it is shorter in length:
         ‖s′‖ = ‖s‖ − $\sum_{y\in Y^i}$ #y(s) + |Cᵢ| + |Yⁱ|
              ≤ ‖s‖ − $\sum_{y\in Y^i}$ 2 + |Cᵢ| + |Yⁱ|
              = ‖s‖ − 2|Yⁱ| + |Cᵢ| + |Yⁱ| = ‖s‖ + (|Cᵢ| − |Yⁱ|)
              < ‖s‖.
Therefore s strictly reduces to s′, which contradicts the compactness of s.
   Claim II: (∀i)(∀y ∈ Yⁱ) #y(s) = 1.
   For contradiction, assume (∃i)(∃y ∈ Yⁱ) #y(s) ≥ 2. By Claim I, for this i there exists at least one z ∈ Yⁱ such that #z(s) = 1. Define s₁ = ⟨s : y → ε⟩ and s′ = ⟨s₁ : z → z + y⟩. We show that s reduces strictly to s′. First note that [s₁] = [s] − [y] and [s′] = [s₁] ∪̇ ([s₁](z) × {y}). However, [s₁](z) = ([s] − [y])(z) = [s](z) − [y](z) = [s](z), since [y](z) = ∅. So
         [s′] = ([s] − [y]) ∪ ([s](z) × {y}) = ([s] − [y]) ∪ ([s] ∩ [y]) = [s],
since [s](z) = Cᵢ = [s](y). In terms of its length,
         ‖s′‖ = ‖s₁‖ + 1 = ‖s‖ − #y(s) + 1 < ‖s‖.
Again a contradiction. Since each y ∈ ∪ᵢ Yⁱ must occur exactly once in s, s must then be of the form claimed. This proof works for both L_P- and L_P+-compactness.

2.5 The General L-MDL Problem is NP-Complete
As mentioned, the symbol removal and addition rules do not hold in general for expressions in L and, as a result, it is not guaranteed that the minimal expression for V is of the form prescribed in Lemma 5. Here is an example.
   Example. Consider once again the subset V in Figure 7, and an expression in L but not in L_P:
         $s = (A + B) \cdot \vec{Y^1} + (D + E) \cdot \vec{Y^2} + (A + D) \cdot \vec{Y^3} + C + y^*$.

Note that [s] = V, but certainly s is not of the form given in Lemma 5. Its length is ‖s‖ = (2 + 4) + (2 + 4) + (2 + 4) + 1 + 1 = 20. Therefore in this case we have that ‖V‖_L < ‖V‖_{L_P}.
   The richness of L prevents us from using Lemma 5 to arrive at the NP-
hardness of the L-MDL decision problem. We have to modify the reduction
from the minimum three-set cover problem, and deal with the expressions in
greater detail.
   Definition 20 (Domain Dependency). Let X₀ = X and Y₀ = Y as defined in the reduction from a minimum cover problem. Define a sequence of sets X₀, X₁, X₂, . . . , and Y₀, Y₁, Y₂, . . . , such that for all k ≥ 0, X_{k+1} = X_k ∪̇ {α_k} and Y_{k+1} = Y_k ∪̇ {β_k}, where α_k and β_k are two symbols that do not belong to X_k and Y_k, respectively. We therefore have a family of 2D product structures {X_k × Y_k} with the propositional languages L₀ ⊊ L₁ ⊊ L₂ ⊊ ···. For s ∈ L_k and k′ ≥ k, we write [s]_{k′} for the evaluation of the expression s in the language L_{k′}.
   For any k ≥ 0, we say that s ∈ L_k is domain independent if
         ∀k′ > k . [s]_{k′} = [s]_k.
If s ∈ L_k is not domain independent, then it is domain dependent.
If s ∈ Lk is not domain independent, then it’s domain dependent.
   The notion of domain dependency naturally bipartitions the languages. Let L_k^I = {s ∈ L_k : s is domain independent} and L_k^D = {s ∈ L_k : s is domain dependent}.
   Given an expression s, whether it is domain dependent or not depends on its set of unbounded symbols, defined below.
   Definition 21 (Bounded Expressions). Let s be an expression in a propositional language. The set of unbounded symbols of s, U(s), is a set of symbols that appear in s, defined as
— U(ε) = ∅,
— U(σ) = {σ}, and
— U(t + t′) = U(t) ∪ U(t′), U(t − t′) = U(t) − U(t′), U(t · t′) = U(t) ∩ U(t′).
   In case U(s) = ∅, we say that s is a bounded expression, or that it is bounded; otherwise s is unbounded.
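The unbounded-symbol computation of Definition 21 is a one-liner per case; here is a minimal sketch, reusing the illustrative tuple encoding of the earlier sketches.

```python
# Unbounded symbols U(s) of Definition 21.

def unbounded(s):
    """U(s): empty expr -> {}, symbol -> {symbol}, then union / difference /
    intersection according to the operator."""
    if s is None:
        return set()
    if isinstance(s, str):
        return {s}
    op, l, r = s
    L, R = unbounded(l), unbounded(r)
    return L | R if op == "+" else L - R if op == "-" else L & R

print(unbounded(("+", "x", "y")))   # {'x', 'y'}: x + y is unbounded
print(unbounded((".", "x", "y")))   # set():      x . y is bounded
```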
   An expression s ∈ L_k can be demoted to an expression in L_{k−1} by erasing the symbols α_{k−1} and β_{k−1}, so that the resulting expression is one in L_{k−1}. Let us write $\downarrow^k_{k-1} s = \langle\langle s : \alpha_{k-1} \to \varepsilon\rangle : \beta_{k-1} \to \varepsilon\rangle$. Therefore $\downarrow^k_{k-1}$ : L_k → L_{k−1}.
   The following is a useful fact.

   PROPOSITION 9. For s ∈ L_k, [$\downarrow^k_{k-1}$ s]_{k−1} = [s]_k ∩ U_{k−1}, where U_{k−1} = X_{k−1} × Y_{k−1}.
   The proof of Proposition 9 is by straightforward induction on s in Lk .
   While s ∈ Lk can be demoted to Lk−1 , it can also be promoted to Lk+1 without
any syntactic modification. Of course, when treated as an expression in Lk+1 ,
it has a different evaluation.

   PROPOSITION 10. For s ∈ L_k, [s]_{k+1} = [s]_k ∪̇ (U_X(s) × {β_k}) ∪̇ ({α_k} × U_Y(s)), where U_X(s) = U(s) ∩ X_k and U_Y(s) = U(s) ∩ Y_k.

   PROOF. We write δ_k(s) = (U_X(s) × {β_k}) ∪̇ ({α_k} × U_Y(s)), so the result reads [s]_{k+1} = [s]_k ∪̇ δ_k(s). Note that [s]_k = [$\downarrow^{k+1}_k$ s]_k = [s]_{k+1} ∩ U_k, which is disjoint from δ_k(s); hence the union is disjoint. The union inside δ_k(s) is disjoint for the obvious reason that α_k ∉ X_k and β_k ∉ Y_k.
   By straightforward set manipulations, we can show that δ_k(t ⊙ t′) = δ_k(t) ⊙ δ_k(t′) for any t, t′ ∈ L_k and ⊙ being +, − or ·.
   The rest of the proof is by induction on the construction of s, mirroring exactly that of Proposition 8 with δ_k(s) in the place of d(s).

   COROLLARY 3. An expression is domain independent if and only if it is bounded, that is, ∀k . s ∈ L_k^I ⟺ U(s) = ∅.

   PROOF. [s]_{k+1} = [s]_k ⟺ U_X(s) = U_Y(s) = ∅ ⟺ U(s) = ∅.
   Another result that follows from Proposition 9 and Corollary 3 is the following.

   COROLLARY 4. If s is domain independent in L_k, then for all (x, y) ∈ [s]_k, both x and y must appear in s.

   PROOF. Let s ∈ L_k^I. We show the contrapositive statement: if x or y does not appear in s, then (x, y) ∉ [s]_k. Say x does not appear in s (the case for y is symmetric). Since s is domain independent and U($\downarrow^k_{k-1}$ s) ⊆ U(s) = ∅, $\downarrow^k_{k-1}$ s is also domain independent. We can take the arbitrarily removed symbols α_{k−1} and β_{k−1} to be x and some z that does not appear in s either, respectively. This means that $\downarrow^k_{k-1}$ s = s, and (x, y) ∉ U_{k−1} ⊇ [$\downarrow^k_{k-1}$ s]_{k−1} = [s]_k by Proposition 9.

  The importance of domain dependency of expressions is demonstrated by
the following results.
   LEMMA 6. Let V ⊆ X₀ × Y₀, and let ‖V‖_{L_k^I} and ‖V‖_{L_k^D} be the lengths of its shortest domain-independent and domain-dependent expressions in L_k, respectively. We have
         ∀k ≥ 0 . ‖V‖_{L_k^I} ≥ ‖V‖_{L_{k+1}^I},   and
         ∀k ≥ 0 . ‖V‖_{L_k^D} < ‖V‖_{L_{k+1}^D}.

   PROOF. It is easy to see why ‖V‖_{L_k^I} ≥ ‖V‖_{L_{k+1}^I}: let s be an L_k^I-compact expression of V. Since s ∈ L_{k+1} and [s]_{k+1} = [s]_k = V, it also is the case that s ∈ L_{k+1}^I; hence ‖V‖_{L_{k+1}^I} ≤ ‖s‖ = ‖V‖_{L_k^I}.
   To show ‖V‖_{L_k^D} < ‖V‖_{L_{k+1}^D}, let s be an L_{k+1}^D-compact expression for V. By Proposition 9, [$\downarrow^{k+1}_k$ s]_k = [s]_{k+1} ∩ U_k = V. It is not difficult to see that α_{k+1} and β_{k+1} do not appear in s, so U($\downarrow^{k+1}_k$ s) = U(s) ≠ ∅. That is, $\downarrow^{k+1}_k$ s expresses V and is domain dependent.
   Next we show that ‖$\downarrow^{k+1}_k$ s‖ < ‖s‖ by contradiction: if ‖$\downarrow^{k+1}_k$ s‖ = ‖s‖, then $\downarrow^{k+1}_k$ s = s, since $\downarrow^{k+1}_k$ s is formed by removing symbols from s. Therefore we have [$\downarrow^{k+1}_k$ s]_{k+1} = [s]_{k+1} = [$\downarrow^{k+1}_k$ s]_k. But by Proposition 10, this means
that U($\downarrow^{k+1}_k$ s) = ∅, which is a contradiction. Therefore,
         ‖V‖_{L_k^D} ≤ ‖$\downarrow^{k+1}_k$ s‖ < ‖s‖ = ‖V‖_{L_{k+1}^D}.
This concludes the proof.
   COROLLARY 5. For any V ⊆ X₀ × Y₀,
         ∀k > 2|V| . ‖V‖_{L_k^I} < ‖V‖_{L_k^D}.
In other words, ∀k > 2|V| . ‖V‖_{L_k} = ‖V‖_{L_k^I}.

   Therefore, by enlarging the dimensions X and Y with more than 2|V| new symbols each, we are guaranteed that all L-compact expressions are domain independent, and hence bounded. The reason to force the compact expressions to be domain independent is that we can then reuse the symbol removal and addition rules of Proposition 7 and Proposition 8.
   From this point on, it is understood that the domain has been enlarged to U_k for some k > 2|V|, and the subscript k is dropped; for instance, we write L for L_k.
   PROPOSITION 11. Let s ∈ L_I. Then
(1) [⟨s : z → ε⟩] = [s] − [z].
(2) If z′ does not occur in s, then [⟨s : z → z + z′⟩] = [s] ∪̇ ({z′} × [s](z)) (for z, z′ ∈ X; symmetrically for z, z′ ∈ Y).

   PROOF. For (1), suppose z is the symbol to be replaced with ε. One can show, by induction on the subexpressions of s, that for every subexpression s′ of s and all x ∈ X and y ∈ Y, if x ≠ z and y ≠ z then (x, y) ∈ [s′] ⟺ (x, y) ∈ [⟨s′ : z → ε⟩]. Therefore we immediately have [s] − [z] ⊆ [⟨s : z → ε⟩].
   For the other containment, observe that U(⟨s : z → ε⟩) ⊆ U(s) = ∅, so ⟨s : z → ε⟩ is also domain independent by Corollary 3. It follows, then, that every point (x, y) ∈ [⟨s : z → ε⟩] cannot be in [z] by Corollary 4, so x ≠ z and y ≠ z; therefore (x, y) ∈ [s] − [z].
   For (2), z is to be replaced with z + z′, where z′ does not occur in s. Without loss of generality, say z ∈ X. By induction, we can show that for every subexpression s′ of s and all y ∈ Y, we have (z, y) ∈ [s′] ⟺ (z′, y) ∈ [⟨s′ : z → z + z′⟩]. It then follows that [⟨s : z → z + z′⟩] = [s] ∪ ({z′} × [s](z)). The disjointness of the union comes from the fact that, since s is domain independent and z′ does not appear in s, [s] ∩ [z′] = ∅.
   This allows us to repeat the arguments of Lemma 5 to obtain the following.

   LEMMA 7. There exists an L_I-compact expression for V of the form
         $s = \sum_{i\in I}(\vec{C_i} \cdot \vec{Y^i}) + \sum_{j\in J}(\vec{C_j} \cdot \vec{Y^{j*}})$.      (∗)

   SKETCH OF PROOF. Let s be an L_I-compact expression for V. Following the arguments presented in the proof of Lemma 5, using the symbol addition and removal rules of Proposition 11, we obtain: (∀i)(∀y ∈ Yⁱ) #y(s) = 1.
   It is still possible that s is not of the form (∗), for L_I is flexible enough that there is no guarantee that, for each i, all y ∈ Yⁱ occur consecutively to form $\vec{Y^i}$.

   But we can always rewrite s into that form. For each i, pick a yᵢ ∈ Yⁱ, and let Y′ⁱ = Yⁱ − {yᵢ}. First rewrite s to s′ = ⟨s : Y′ⁱ → ε⟩, so s′ results from s by replacing every occurrence of a y ∈ Yⁱ with y ≠ yᵢ by the empty expression ε. Then construct s″ = ⟨s′ : yᵢ → $\vec{Y^i}$⟩. One can easily show that [s″] = [s] and ‖s″‖ = ‖s‖, and that in s″ all occurrences of symbols of Yⁱ occur inside $\vec{Y^i}$ or $\overrightarrow{Y^i \cup \{y^*\}}$. Each $\vec{Y^i}$ or $\overrightarrow{Y^i \cup \{y^*\}}$ is necessarily individually bounded by $\vec{C_i}$. Therefore s″ is of the form (∗).

   Finally we arrive at the more or less expected result:

   THEOREM 6. The L-MDL decision problem is NP-complete for multidimen-
sional partitions.

3. THE ORDER-STRUCTURE AND LANGUAGES
So far, all aforementioned structures are cover structures, namely, structures
characterized by a set cover on the universe. Another important family of
structures is the order-structure where structures are characterized by a
family of partial orders on the universe.

  Definition 22 (Order-Structured Set and Its Language). An order struc-
tured set is a set equipped with partial order relations (U, ≤1 , ≤2 , . . . , ≤ N ).
The language L(U, ≤1 , . . . , ≤ N ) is given by

— is an expression in L(U, ≤1 , . . . , ≤ N ),
— for any a ∈ U , a is an expression in L(U, ≤1 , . . . , ≤ N ),
— for any a, b ∈ U and 1 ≤ i ≤ N , (a →i b) is an expression in L(U, ≤1 , . . . , ≤ N ),
— (s + t), (s − t) and (s · t) are all expressions in L(U, ≤1 , . . . , ≤ N ) given that
  s, t ∈ L(U, ≤1 , . . . , ≤ N ), and
— nothing else is in L(U, ≤1 , . . . , ≤ N ).

When no ambiguity arises, we write L(U, ≤1 , . . . , ≤ N ) as L.

  Similar to the proposition language for cover structured sets, we define the
expression evaluation and length for the language L(U, ≤1 , . . . , ≤ N ).

   Definition 23 (Semantics and Length).              The evaluation function [·] : L(U,
≤1 , . . . , ≤ N ) → Pwr(U ) is defined as

— [ ] = ∅,
— [a] = {a} for any a ∈ U ,
— [a →i b] = {c ∈ U : a ≤i c and c ≤i b},
— [s + t] = [s] ∪ [t], [s − t] = [s] − [t] and [s · t] = [s] ∩ [t].

The length ‖·‖ : L(U, ≤₁, . . . , ≤_N) → ℕ is given by ‖ε‖ = 0, ‖a‖ = 1, ‖a →ᵢ b‖ = 2, and ‖s + t‖ = ‖s − t‖ = ‖s · t‖ = ‖s‖ + ‖t‖.

  Example 11. Consider a universe of names for cities: Toronto (TO),
San Francisco (SF), New York City (NYC), and Los Angeles (LA); U =
{TO, SF, NYC, LA}. We consider three orders. First, they are ordered from east




                                          Fig. 8. A set cover.

to west:
                                   NYC ≤1 TO ≤1 LA ≤1 SF.
Independently, they are also ordered from south to north:
                                   LA ≤2 SF ≤2 NYC ≤2 TO.
Finally, we know that San Francisco (SF) is much smaller in population than Toronto (TO) and Los Angeles (LA), which are comparable, and in turn New York City (NYC) has the largest population by far. Therefore, by population, we order them partially as
         SF ≤₃ TO,  SF ≤₃ LA,   and   TO ≤₃ NYC,  LA ≤₃ NYC,
but TO and LA are incomparable with respect to ≤₃.
  The following are expressions in L(U, ≤1 , ≤2 , ≤3 ):
— s1 = LA →2 TO; the cities north of LA and south of TO inclusively, and
  [s1 ] = U .
— s2 = (SF →3 NYC) − (SF + NYC); the cities larger than SF but smaller than
  NYC, so [s2 ] = {TO, LA}.
— s3 = (NYC →1 LA) · (LA →2 NYC) − (NYC + LA); the cities strictly between NYC and LA in both latitude and longitude, and [s3] = ∅.
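A small interpreter for L(U, ≤₁, . . . , ≤_N), following Definition 23 and checked against Example 11, may help; the encoding of each order as a set of (smaller, larger) pairs and of expressions as tuples is an illustrative assumption.

```python
# Interpreter for the order-structure language of Definitions 22 and 23,
# instantiated on the cities of Example 11.

U = {"TO", "SF", "NYC", "LA"}

def chain(*xs):
    """Reflexive-transitive pairs of a chain listed from least to greatest."""
    return {(xs[i], xs[j]) for i in range(len(xs)) for j in range(i, len(xs))}

le1 = chain("NYC", "TO", "LA", "SF")                 # east to west
le2 = chain("LA", "SF", "NYC", "TO")                 # south to north
le3 = {(a, a) for a in U} | {("SF", "TO"), ("SF", "LA"), ("SF", "NYC"),
                             ("TO", "NYC"), ("LA", "NYC")}   # by population
ORD = {1: le1, 2: le2, 3: le3}

def ev(s):
    """Expressions: None, 'a', ('arrow', i, a, b), or ('+'/'-'/'.', l, r)."""
    if s is None:
        return set()
    if isinstance(s, str):
        return {s}
    if s[0] == "arrow":
        _, i, a, b = s
        return {c for c in U if (a, c) in ORD[i] and (c, b) in ORD[i]}
    op, l, r = s
    L, R = ev(l), ev(r)
    return L | R if op == "+" else L - R if op == "-" else L & R

s2 = ("-", ("arrow", 3, "SF", "NYC"), ("+", "SF", "NYC"))
print(ev(s2))          # {'TO', 'LA'}, as claimed for [s2] above
```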
   The notion of compactness and the MDL-problem naturally extend to
expressions of order structures. Unfortunately, the general L(U, ≤)-MDL is
intractable even with one order relation.
  PROPOSITION 12. Even with one partial order ≤, the L(U, ≤)-MDL decision
problem is NP-complete.
   SKETCH OF PROOF. We reduce from the minimum set cover problem. Let C = {Cᵢ}_{i∈I} where, without loss of generality, we assume that each Cᵢ has at least five elements that are not covered by the other {Cⱼ : j ≠ i}. This can always be ensured by duplicating each element of the set into five distinct copies.
   The universe of our order-structured set is U = ∪_{i∈I}(Cᵢ ∪̇ {⊤ᵢ, ⊥ᵢ}): for each cover set Cᵢ, we introduce two new symbols ⊤ᵢ and ⊥ᵢ. The ordering ≤ is defined by (∀i ∈ I)(∀c ∈ Cᵢ) c < ⊤ᵢ and ⊥ᵢ < c. Nothing else is comparable.
   Consider the instance of a set-cover problem shown in Figure 8.
   We first duplicate each element into five copies, and obtain another instance
shown in Figure 9.
   Finally the order-structure is shown in Figure 10.
   The subset to be expressed is ∪_{i∈I} Cᵢ, and its L(U, ≤)-compact expression is always of the form
         $s = \sum_{j\in J}\bigl((\bot_j \to \top_j) - \overrightarrow{\{\bot_j, \top_j\}}\bigr)$.





                          Fig. 9. Each element is duplicated.




                       Fig. 10. The transformed order-structure.

It will not mention individual elements of any of the Cᵢ, since by the symmetry of the problem, if one element were mentioned then its copies would be mentioned too, and that would use five symbols, which is longer than $(\bot_i \to \top_i) - \overrightarrow{\{\bot_i, \top_i\}}$. The length of s is then 4|J|, where |J| is the number of cover sets needed to cover ∪_{i∈I} Cᵢ. Minimizing |J| is thus equivalent to minimizing ‖s‖.
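The construction in this proof sketch is easy to reproduce; the following Python fragment builds the order-structure from an (already duplicated) cover, with top_i and bot_i as illustrative names for the two added symbols of each cover set.

```python
# Sketch of the order-structure built in the proof of Proposition 12.

def build_order(cover):
    """Return (U, le) where le is the reflexive order of the construction."""
    U, le = set(), set()
    for i, C in enumerate(cover):
        top, bot = f"top{i}", f"bot{i}"
        U |= C | {top, bot}
        le |= {(bot, c) for c in C} | {(c, top) for c in C} | {(bot, top)}
    le |= {(u, u) for u in U}
    return U, le

# One interval per chosen cover set then expresses its elements with
# 4 symbols, (bot_i -> top_i) - (bot_i + top_i), so a subcover J costs 4|J|.
```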

3.1 Linear Ordering is in P
We say that an order-structure (U, ≤) is linear if there is only one ordering and it is linear, that is, if every two elements u, u′ ∈ U are comparable. Therefore,
(U, ≤) forms a chain, and in this case, not surprisingly, the MDL-problem
is solvable in polynomial time. The formal argument for this statement is
analogous to that for partitions.
    In this section, we fix the structure (U, ≤) to be linear.
   Definition 24 (Closure and Segments). Let A ⊆ U. Its closure Ā is defined as Ā = {u ∈ U : (∃a, b ∈ A) a ≤ u and u ≤ b}. A segment is a subset A of U such that A = Ā. The length of a segment is simply |A|.
   Segments are particularly easy to express: if A is a segment of length greater than 2, then ‖A‖_{L(U,≤)} = 2 always, since A can be expressed by the expression (min A → max A) using only two symbols.
   A segment of V is simply a segment A such that A ⊆ V. We denote the set of maximal segments in V by SEG(V). Note that maximal segments are pairwise disjoint. The set SEG(V) also has a natural compact expression, $\sum_{A\in SEG(V)}(\min A \to \max A)$, which from now on we call $\overrightarrow{SEG}(V)$.
   Example. Consider a universe U with 10 elements linearly ordered by ≤.
We simply call them 1 to 10, and ≤ is the ordering of the natural numbers. Let
V be {2, 4, 5, 7, 8} shown in Figure 11.
   The maximal segments of V are {2}, {4, 5} and {7, 8}, and $\overrightarrow{SEG}(V) = (2 \to 2) + (4 \to 5) + (7 \to 8)$.
   PROPOSITION 13. For any two subsets A and B we have, for ⊙ being any of ∪, ∩, or −, |SEG(A ⊙ B)| ≤ |SEG(A)| + |SEG(B)|. Therefore, ‖$\overrightarrow{SEG}(A ⊙ B)$‖ ≤ ‖$\overrightarrow{SEG}(A)$‖ + ‖$\overrightarrow{SEG}(B)$‖.




                 Fig. 11. A subset V of the universe: filled elements belong to V .

   One might at first be tempted to just express a set V by its segment decomposition $\overrightarrow{SEG}(V)$. But we can in general do better than that. For instance, consider the previous example with V shown in Figure 11. The expression $\overrightarrow{SEG}(V)$ has a length of 6, but V can be expressed with only four symbols, by s = (4 → 8) + 2 − 6 or by (2 → 8) − (3 + 6).
   For the remainder of this section, we fix the subset V and assume that V does
not contain the extrema max U, min U of U . This restriction on V relieves us
from considering some trivial cases, and can be lifted without loss of generality.
   Definition 25 (Normal Form for Linear Order-Structures). The normal form is the sublanguage Nlin of L(U, ≤) consisting of expressions of the form
         $s = t + \vec{A^+} - \vec{A^-}$,
where the subexpression $t = \sum_i (a_i \to a_i')$ is a union of segments, A⁺ = [s] − [t], and A⁻ = [t] − [s].
  LEMMA 8. For the linear order-structure (U, ≤), every expression of L(U, ≤)
can be reduced to an expression in Nlin .
   OUTLINE OF PROOF. The proof is very similar to that of Lemma 1 and is by induction. The base cases of s = ε and s = u for u ∈ U are trivial.
   Let $s_1 = t_1 + \vec{A_1^+} - \vec{A_1^-}$ and $s_2 = t_2 + \vec{A_2^+} - \vec{A_2^-}$ be two expressions already in Nlin. We need to show that s₁ + s₂, s₁ − s₂ and s₁ · s₂ are all reducible to Nlin.
   s = s₁ + s₂: Let $t = \overrightarrow{SEG}([t_1] \cup [t_2])$, A⁺ = [s] − [t], and A⁻ = [t] − [s]. Then we have that ‖t‖ ≤ ‖t₁‖ + ‖t₂‖, |A⁺| ≤ |A₁⁺| + |A₂⁺|, and |A⁻| ≤ |A₁⁻| + |A₂⁻| (as was the case in the proof of Lemma 1). The other two cases are handled similarly.

   COROLLARY 6. ‖V‖_L = ‖V‖_{Nlin}.

  Therefore the L(U, ≤) MDL-problem reduces to the Nlin MDL-problem when
the ordering is linear. We only need to show that the latter is tractable.
   Definition 26 (Neighbors, Holes, Isolated, Interior, and Exterior Points). Consider an element u in the universe U . We define

               u − 1 = max{u' ∈ U : u' < u}   if u ≠ min U   (undefined if u = min U),
               u + 1 = min{u' ∈ U : u' > u}   if u ≠ max U   (undefined if u = max U),

to be the immediate predecessor and the immediate successor of u, respectively.
   We say that u ∈ U is a hole in V if u ∉ V but {u − 1, u + 1} ⊆ V . The set of all holes of V is denoted by Hol(V ).
   An element u ∈ U is an isolated point in V if u ∈ V but neither u − 1 nor u + 1 is in V . The set of all isolated points is denoted by Pnt(V ). An interior point u of V is an element of V for which at least one of u − 1 or u + 1 is also in V . The set of all interior points of V is Int(V ). Conversely, an exterior point of V is an element u ∈ U such that u ∉ V and u is not a hole of V . Ext(V ) is the set of all exterior points of V .
  Example. Consider the subset V in the universe in Figure 11. Observe that
Hol(V ) = {3, 6}, Pnt(V ) = {2}, Int(V ) = {4, 5, 7, 8} and Ext(V ) = {1, 9, 10}.
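The four classes of Definition 26 are easy to compute directly; the sketch below does so for the running example, treating an undefined neighbor (at an extremum of U) as lying outside V . This is an illustration, not code from the article.

    # Sketch: classifying every element of a linear universe U = {1,...,n} as a
    # hole, isolated point, interior point, or exterior point of V.
    # A missing neighbor (beyond the extrema of U) simply tests as "not in V".

    def classify(U, V):
        V = set(V)
        holes, isolated, interior, exterior = set(), set(), set(), set()
        for u in U:
            left, right = (u - 1) in V, (u + 1) in V
            if u in V:
                (interior if (left or right) else isolated).add(u)
            else:
                (holes if (left and right) else exterior).add(u)
        return holes, isolated, interior, exterior

    print(classify(range(1, 11), {2, 4, 5, 7, 8}))
    # ({3, 6}, {2}, {4, 5, 7, 8}, {1, 9, 10})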
  Note that the universe is partitioned into holes, isolated, interior, and
exterior points of V . These concepts allow us to define extended segments of V
which are very useful in constructing a compact expression of V .
   Definition 27 (Extended Segments). A subset A is an extended segment of V if A ⊆ V ∪ Hol(V ), $\bar{A} = A$, and A ∩ Int(V ) ≠ ∅.
   So an extended segment is a segment that can only contain elements of V and holes in V , and must contain at least one interior point of V . Observe that the maximally extended segments of V are pairwise disjoint. The set of the maximally extended segments is denoted by XSEG(V ). The expression $\sum_{A \in XSEG(V)} (\min A \to \max A)$ is denoted by $\overrightarrow{XSEG}(V)$.

     Example. Again, consider V in Figure 11. The extended segments in V are {2, 3, 4, 5}, {4, 5, 6, 7, 8}, {5, 6, 7}, and so on. In general, there could be many maximally extended segments, but in this case there is only one: {2, 3, 4, 5, 6, 7, 8}. Therefore $\overrightarrow{XSEG}(V) = (2 \to 8)$.
     THEOREM 7. The expression $s_* = t_* + \vec{A_*^+} - \vec{A_*^-}$, where $t_* = \overrightarrow{XSEG}(V)$, $A_*^+ = V - [t_*]$, and $A_*^- = [t_*] - V$, is compact for V in Nlin .

    SKETCH OF PROOF. We show that any expression s ∈ Nlin for V can be reduced to s∗ . The proof is by describing explicitly a set of rewrite procedures that take any expression of V in the normal form and reduce it to s∗ . Without loss of generality, we assume that all segments in t are of length at least two.
(1) First we make sure that all the segments a → a' in t are such that a, a' ∈ V , and all segments are disjoint: this can be done without increasing the length of the expression.
(2) Remove exterior points from [t]: if there is an exterior point u in [t], then it appears in some a → a' in t. Since u ∈ Ext(V ), at least one of its neighbors u' ∈ {u − 1, u + 1} must also be exterior to V and appear in a → a'. They must then appear in A− . Rewrite a → a' into at most two segments a → b and b' → a' such that u and its neighbor u' are no longer included in t. This increases the length of t by at most 2. We then remove u, u' from A− . The overall expression length is not increased.
(3) Add all interior points to [t]: if there is an interior point u that is not in [t], then it must appear in A+ . Since u ∈ Int(V ), there is a neighbor u' ∈ {u − 1, u + 1} ∩ V . If u' ∉ [t], then it is in A+ as well. In this case, create a new segment u → u' (or u' → u if u' = u − 1) in t, and delete u, u' from A+ . If u' ∈ [t], then it must appear in a segment a → a' in t. Extend that segment to include u, and delete u from A+ .
(4) Remove all segments in [t] not containing interior points: if there is a segment a → a' in t and [a → a'] ∩ Int(V ) = ∅, then it must contain only isolated points and holes of V , but not exterior points (by step 2). Furthermore, since the end points a, a' ∈ V (by step 1), there is one more isolated point than there are holes in [a → a']. The holes appear in A− . Delete a → a' from t, the holes from A− , and add the isolated points to A+ . The overall expression length is then reduced by 1.
       At this point, observe that all segments in t contain some interior points and none of the exterior points, and hence are extended segments of V . Therefore, [t] ⊆ ⋃ XSEG(V ).
(5) Add ⋃ XSEG(V ) − [t] to [t]: consider u ∈ ⋃ XSEG(V ) − [t]. Let u ∈ A ∈ XSEG(V ). The segment A must contain an interior point v which must appear in some segment [a → b] in t. It is always possible to extend a → b (and possibly merge it with neighboring segments in t) to cover u. The extension will include some holes and isolated points, which need to be added to A− and removed from A+ , respectively. This can always be done without increasing the length of the expression.
   By the end of the rewriting, we have [t] = ⋃ XSEG(V ), and clearly the minimal expression for [t] is $\overrightarrow{XSEG}(V)$.
   COROLLARY 7. The L(U, ≤) MDL-problem can be solved in linear time for
linear order-structure.
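The construction behind Theorem 7 and Corollary 7 can be sketched in a few lines: form the maximally extended segments, then record the isolated points they miss (A+) and the holes they absorb (A−). The code below assumes the integer-universe model used above; it is a sketch of this idea only, not the article's implementation.

    # Sketch of the linear-time construction of the compact normal-form expression
    # s* = XSEG->(V) + A+ - A-  (Theorem 7), for a subset V of an integer universe
    # U = {1,...,n} avoiding the extrema of U. Illustrative code only.

    def compact_linear_expression(U, V):
        U, V = list(U), set(V)
        interior = {u for u in V if (u - 1) in V or (u + 1) in V}
        holes = {u for u in U if u not in V and (u - 1) in V and (u + 1) in V}
        # Maximally extended segments: maximal runs inside V | Hol(V) that
        # contain at least one interior point.
        allowed, xseg, run = V | holes, [], []
        for u in U:
            if u in allowed:
                run.append(u)
            else:
                if any(x in interior for x in run):
                    xseg.append((run[0], run[-1]))
                run = []
        if run and any(x in interior for x in run):
            xseg.append((run[0], run[-1]))
        covered = {u for a, b in xseg for u in range(a, b + 1)}
        plus = sorted(V - covered)        # A+ : points of V missed by t*
        minus = sorted(covered - V)       # A- : holes swallowed by t*
        terms = [f"({a} -> {b})" for a, b in xseg]
        if plus:
            terms.append("(" + " + ".join(map(str, plus)) + ")")
        expr = " + ".join(terms)
        if minus:
            expr += " - (" + " + ".join(map(str, minus)) + ")"
        return expr

    print(compact_linear_expression(range(1, 11), {2, 4, 5, 7, 8}))
    # (2 -> 8) - (3 + 6)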

3.2 Multilinear Ordering Is “Hard”
It is not terribly realistic to consider only a single ordering of the universe. There are often many: we may order people by age, by name, or by other attributes. In this section, we introduce multiorder structures and the corresponding language. In this case, the MDL-problem is hard even when we have only two linear orders.
   Definition 28 (2-Linear Order-Structure). Consider the universe U = X × Y where X and Y are linearly ordered by ≤1 and ≤2 , respectively. We define two orderings ≤X and ≤Y over the universe U as the lexicographical orderings along X and Y , respectively. Formally,

               (x, y) ≤X (x', y') ⟺ (x ≤1 x') ∧ ((x <1 x') ∨ (y ≤2 y')),
               (x, y) ≤Y (x', y') ⟺ (y ≤2 y') ∧ ((y <2 y') ∨ (x ≤1 x')).

We refer to this specific structure (U, ≤X , ≤Y ) as the 2-linear order-structure since both ≤X and ≤Y are linear.
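Read as predicates, the two orderings are simply nested lexicographic comparisons. The small sketch below, with X and Y modeled as integers, is only an illustration of Definition 28.

    # Sketch: the two lexicographic orderings of Definition 28 for U = X x Y,
    # with X and Y modeled as integers in their usual order. Illustrative only.

    def le_X(p, q):
        (x, y), (xp, yp) = p, q
        return x <= xp and (x < xp or y <= yp)

    def le_Y(p, q):
        (x, y), (xp, yp) = p, q
        return y <= yp and (y < yp or x <= xp)

    print(le_X((1, 5), (2, 0)))   # True: compared first on the X coordinate
    print(le_Y((1, 5), (2, 0)))   # False: compared first on the Y coordinate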
   The 2-linear order-structure is the counterpart of the 2D product structure
defined in Definition 14. Recall that we have shown that the MDL-problem for
a 2D structure is NP-hard even though the cover structure is made up of two
simple partitions each of which is on its own tractable. We will see that the
same type of complexity increase seems to hold for the 2-linear order-structure.
Though each linear order on its own yields a tractable MDL-problem, the L MDL-problem for the 2-linear order-structure is hard. We first identify
a sublanguage $L_{P\le}^+ \subseteq L$ for the 2-linear structure that has an NP-complete MDL-problem. It is very similar to the disjunctive product language (Definition 16).

   Definition 29 (Some Product Sublanguages). We define $L_{P\le}^+(U, \le_X, \le_Y)$, or simply $L_{P\le}^+$, as follows:

— $\emptyset \in L_{P\le}^+$,
— $((a \to_X b) \cdot (a \to_Y b)) \in L_{P\le}^+$ if a and b are in X × Y ,
— $(s + t) \in L_{P\le}^+$ if $s, t \in L_{P\le}^+$,
— nothing else is in $L_{P\le}^+$.

A natural generalization of $L_{P\le}^+$ is to allow (s + t), (s − t), and (s · t) as part of the language. We call the more general language $L_{P\le}$.

   Expressions of the form (a → X b) · (a →Y b) are really descriptions of
rectangles. Since a ∈ X × Y , it is a pair (though still one symbol) (a1 , a2 ), and
the same holds for b, b = (b1 , b2 ). The points expressed by (a → X b) · (a →Y b)
are exactly these {(x, y) ∈ X × Y : a1 ≤1 x ≤1 b1 and a2 ≤2 y ≤2 b2 }. The
points a and b are then the bounding corners of this rectangle. This connection
of expressions to unions of rectangles leads to an immediate result.
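A two-line enumeration makes the rectangle reading explicit; the corner coordinates below are invented for illustration.

    # Sketch: the point set of the rectangle expression (a ->X b) . (a ->Y b),
    # with a = (a1, a2) and b = (b1, b2) the bounding corners. Illustrative only.

    def rectangle(a, b):
        (a1, a2), (b1, b2) = a, b
        return {(x, y) for x in range(a1, b1 + 1) for y in range(a2, b2 + 1)}

    print(sorted(rectangle((1, 2), (2, 3))))
    # [(1, 2), (1, 3), (2, 2), (2, 3)]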

  PROPOSITION 14. The $L_{P\le}^+$ MDL-problem is NP-complete.

  PROOF. It follows by a direct reduction from the rectangle covering problem [Keil 1999; Garey and Johnson 1979].

   The expressions in the more general language $L_{P\le}$ also have a geometric interpretation: they are general rectangle decompositions of axis-aligned polygons allowing set union, difference, and intersection. Generalized polygon decomposition has been studied in Tor and Middleditch [1984] and Batchelor [1980] in the context of using only components that are convex polygons. So far, we are not aware of any proof of whether this more general decomposition problem is NP-hard or not.
   We believe that the MDL-problems for $L_{P\le}$ and the most general language L for the 2-linear order-structure are NP-hard.

4. APPLICATIONS OF COMPACT EXPRESSIONS
We give two examples of practical applications of compact expressions: sum-
marization of large query answers, and the application used as motivation in
the Introduction, reduction in length of SELECT queries in a relational-OLAP
system.
   The MDL principle has been proposed as a guiding principle for summariza-
tion of large query results in data mining applications [Agrawal et al. 1998;
Lakshmanan et al. 1999]. Our theory of compact expressions is clearly appli-
cable to concise summarization of hierarchical query results. We demonstrate
how compact expressions can be used in summarizing keyword search results
in hierarchically organized data format, such as XML.
                Fig. 12. An XML document with each node having a unique name.




                        Fig. 13. The tree representing the XML document.


   We then propose to view an OLAP data cube as a mapping whose domain is
a structured set, and queries as expressions of subsets of the domain. We argue
that it is generally beneficial to rewrite the query to be as short as possible,
that is, to express the subset of the domain using a compact expression.
Although, in general, shorter queries are not necessarily faster, for the family
of simple SELECT queries in a setting of typical relational OLAP storage,
compact expressions have a performance advantage.

4.1 Summarizing Keyword-Search
We view an XML document as a labeled tree with the leaf-nodes being the
content. We assume that each node has a unique name. For instance, for the
XML document in Figure 12, the corresponding tree is as shown in Figure 13.
   The structured set corresponding to the tree has the leaf-nodes as elements
of the universe, and the names of the higher nodes as the alphabet Σ. The
interpretation of a symbol is simply the set of its descendant leaf-nodes.
   The result of a keyword search for the word “WINTER” is {Section:1.1, Section:1.2, Section:1.3}, which is compactly expressed by “Chapter:1”, while a keyword search for the word “IS” results in {Section:1.1, Section:1.3, Section:2.1}. By the decomposition algorithm (Theorem 4), its compact expression is “Book:root − Section:1.2”; the word “IS” is found everywhere except Section:1.2 in the book. One way of summarizing hierarchical data is by the lowest common ancestors (LCA) of the answer set [Lakshmanan et al. 1999], that is, the set of closest ancestor nodes whose descendants are exactly the answer set. LCA-based summarization would give the representation “Section:1.1 + Section:1.3 + Chapter:2”, which is less informative in this case.
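As an illustration of this example, the sketch below reconstructs the book hierarchy from the node names quoted above and summarizes an answer set by trying a few simple expression shapes. The interpretation table is an assumption based on Figures 12 and 13, and the shortening step is a simplified stand-in for the decomposition algorithm of Theorem 4, not the article's algorithm.

    # Sketch: summarizing keyword-search answers over the book example.
    # Node names are taken from the text; the hierarchy is reconstructed.

    interp = {  # interpretation: each symbol denotes its descendant leaf nodes
        "Book:root": {"Section:1.1", "Section:1.2", "Section:1.3", "Section:2.1"},
        "Chapter:1": {"Section:1.1", "Section:1.2", "Section:1.3"},
        "Chapter:2": {"Section:2.1"},
    }

    def summarize(answer):
        """Return a short expression for `answer`, trying three simple forms:
        an exactly matching symbol, a symbol minus a few exceptions, or the
        plain union of the answer's elements."""
        best = " + ".join(sorted(answer))            # fallback: explicit union
        best_len = len(answer)
        for name, leaves in interp.items():
            if leaves == answer and 1 < best_len:
                best, best_len = name, 1
            elif leaves >= answer and 1 + len(leaves - answer) < best_len:
                extra = leaves - answer
                best = name + " - " + " - ".join(sorted(extra))
                best_len = 1 + len(extra)
        return best

    print(summarize({"Section:1.1", "Section:1.2", "Section:1.3"}))
    # Chapter:1
    print(summarize({"Section:1.1", "Section:1.3", "Section:2.1"}))
    # Book:root - Section:1.2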
4.2 Compact Expressions as Better OLAP SELECT Queries
In this section, we focus on SELECT queries in relational OLAP (ROLAP). In
ROLAP scenarios, an OLAP cube is stored in databases using tables with a
star-schema. A SELECT query specifying the restrictions on each dimension
is then an SQL query of the form:
       SELECT measure FROM cube
       WHERE dim1 IN (· · ·) AND dim2 IN (· · ·) AND dim3 IN (· · ·) · · ·
In many cases, this SQL statement can be very long, and consequently problematic for many back-end relational database management systems (RDBMSs).3 We argue that rewriting the predicates using compact
expressions will alleviate this problem.
  First we model OLAP dimensions as hierarchically structured sets, and
OLAP cubes as functions whose domain is a multidimensional hierarchy.

   4.2.1 Modeling OLAP Cubes and Selection Queries Using Structures and
Expressions. There has been a plethora of formal models of OLAP databases:
Gyssens and Lakshmanan [1997], Hurtado and Mendelzon [2002], Agrawal
et al. [1997], Cabibbo and Torlone [1997, 1998]. These models are formal in
the sense that they provide precise semantics for the data model and the
query language. The focus is on the expressiveness of the model and the query
language but not on the performance issues of query execution. In this section,
we show how structured sets and their languages can be used to model basic
multidimensional databases and their query languages.
   Similarly to the approaches in Agrawal et al. [1997] and Cabibbo and Torlone [1998], we model an OLAP dataset as a function. Formally, a dataset is a function D : U → M ∪ {null}, where the domain U is the universe of a structured set (U, Σ) and the codomain M contains the measure values. If a point u ∈ U does not have a measure, then D(u) = null.
   If (U, Σ) is a multidimensional hierarchy, then we say that D is a multidimensional hierarchical cube, or simply a cube. In this case, U = D1 × D2 × · · · × DN and Σ is the disjoint union of alphabets Σ1 , . . . , ΣN such that each (Di , Σi ), and hence (U, Σi ), is a hierarchy. We call (Di , Σi ) the dimensions and N the dimensionality of D.
   To better illustrate how this captures an OLAP cube, let us consider the
following example.
   Consider an OLAP cube with three dimensions: TIME, PRODUCT, STORE,
as shown in Figure 14. There are two measures: Total Sales and Sales Count.
We will refer to this cube as SALES.
   To model this OLAP cube, the dimensions themselves are structured sets: Product is a granular hierarchy with alphabet ΣProduct = Name ∪ Family. Similarly, the alphabet for the store dimension is ΣStore = Street ∪ City. Finally, the time dimension is not a hierarchy, but a linear order-structure. To represent a data cube, we define the domain (U, Σ) to be the product structure:

3 In the experience of one of the authors, relational back ends will often execute overly long SQL queries poorly, or even reject them, and thus some ROLAP implementations break them down into multiple queries with a consequent increase in overhead.

                                Fig. 14. The dimensions of SALES.

U = Month × Name × Street. The alphabet is Σ = ΣTime ∪ ΣProduct ∪ ΣStore . The interpretation is the usual one when forming a product structure, as was done in Section 2.3. The codomain is M = R × N, containing pairs of real and natural numbers for the Total Sales and Sales Count. A data cube is then D : U → M.
   We consider only simple selection queries: a subset V of the domain is specified in the query q, and the answer to the query is the function D|V —the restriction of the function D to the subset V . The propositional language L(U, Σ), or the product sublanguages $L_P^+(U, \Sigma)$ and $L_P(U, \Sigma)$, can be used to describe the subset V . We write q(s) to indicate that the expression s represents the region of interest for the query q. The answer for q(s) is then D|[s].
   For example, let s1 = 01 · Beverage · Canal, and s2 = 01 · (Soda + Beer + Spirits) · Canal. Since s1 and s2 are equivalent, the queries q(s1 ) and q(s2 ) have the same answer: the sales information on beverage products for January for the store on Canal street.

                     Month   Name      Street   Total Sales   Sales Count
                     01      Soda      Canal         –             –
                     01      Beer      Canal         –             –
                     01      Spirits   Canal         –             –
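A dataset and the evaluation of a selection query can be pictured as follows; the rows and measure values are invented for illustration, and the region [s1] is spelled out by hand using the Beverage interpretation from Figure 14. This is a sketch, not the article's implementation.

    # Sketch: an OLAP dataset D as a partial function from the product universe
    # to (TotalSales, SalesCount) pairs, and q(s) evaluated as the restriction D|[s].
    # All data values are invented.

    D = {   # (Month, Name, Street) -> (TotalSales, SalesCount)
        ("01", "Soda",    "Canal"): (120.0, 30),
        ("01", "Beer",    "Canal"): ( 80.0, 16),
        ("01", "Spirits", "Canal"): ( 45.0,  5),
        ("02", "Soda",    "Grand"): ( 60.0, 12),
    }

    def restrict(D, region):
        """D|V : keep only the points of the domain that fall inside the region."""
        return {point: measure for point, measure in D.items() if point in region}

    # [s1] for s1 = 01 . Beverage . Canal, with Beverage interpreted as
    # {Soda, Beer, Spirits} as in the dimension hierarchy.
    region = {("01", name, "Canal") for name in ("Soda", "Beer", "Spirits")}
    print(restrict(D, region))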

   4.2.2 Single and Multi-OLAP Query Optimization. Given a query q(V ) for
some subset V ⊆ U , the objective of optimization is to find a compact expres-
sion s for V such that the query can be expressed as q(s). A region of interest
V is typically generated automatically by report writing software4 and is often
expressed as an explicit list of elements in U . When the backend storage of the
OLAP cube is a relational database, evaluating the query q(V ) can be cumber-
some and inefficient. We shall soon see that there is a performance advantage in
evaluating a compact expression of V when the storage is a relational database.
   Since (U, Σ) is a multidimensional hierarchical structure, computing s is in itself an NP-complete problem (Theorem 6). However, in the special case that V = A1 × A2 × · · · × AN , where Ai is a subset of the dimension Di , the problem can be

4 Available from Microstrategy and Brio. Go online to www.microstrategy.com and www.brio.com.


solved efficiently (Theorem 4). We can compute a compact expression si for each
Ai , and then express V by s1 ·s2 · · · sN . This expression of V is economical in both
length and in its evaluation, since it makes use of higher-level symbols in each
dimension as much as possible, which is beneficial to the query execution. A minor problem is that, by its strict definition, $L_P(U, \Sigma)$ does not include the range operator →, and hence we cannot officially take advantage of an ordered dimension Dk such as TIME for the cube SALES. This problem can be overcome if we carefully introduce → into the language, allowing only σ → σ' when σ and σ' are in the universe of an ordered dimension Dk . Therefore we would have the subexpression sk in s be L(Dk , ≤)-compact. In practice, we only consider dimensions
that are linearly ordered structures, so sk is easily constructed (by Corollary 7).
   Consider our sample cube SALES with the dimensions in Figure 14.
Typically, when stored in a relational database, the OLAP cube would be
mapped to a star-schema with dimension tables for TIME(Month), PROD-
UCT(Family, Name), and STORE(City, Street). The dimension tables would
look as in Figure 14. The measures are stored in a fact table FACT(Month,
Name, Street, TotalSales, SalesCount). For convenience, we assume that
all the dimension tables are joined with the fact table to form a full view
OLAPVIEW(Month, Family, Name, City, Street, TotalSales, SalesCount). A report
on the sales information for the products in Beverage for stores on streets
Canal and Grand in New York in the first four months of the year has a
region V = {Jan, Feb, Mar, Apr} × {Soda, Beer, Spirits} × {Canal, Grand}. A naive
translation of q(V ) into SQL would result in a needlessly long statement:

     SELECT * FROM OLAPVIEW
     WHERE Month IN (‘01’, ‘02’, ‘03’, ‘04’) AND
           Name   IN (‘Soda’, ‘Beer’, ‘Spirits’ ) AND
           Street IN (‘Canal’, ‘Grand’);

The $L_P$-compact expression of V is

               s = (01 → 04) · (Beverage) · (NewYork − Broadway).

The corresponding SQL to q(s) is

     SELECT * FROM OLAPVIEW
     WHERE (Month BETWEEN ‘01’ AND ‘04’) AND
           Family = ‘Beverage’ AND
           City = ‘NewYork’ AND Street <> ‘Broadway’;

Note that the SQL statement for q(s) makes use of higher-level symbols (such
as Beverage and NewYork) instead of the lower-level symbols (Soda, Beer, . . .).
This is because the algorithm will try to rewrite the expression
using as many higher-level symbols as possible without increasing the length
of the expression. Since there are fewer symbols in the higher level than in
the lower level, the higher-level indices are smaller and can be accessed more
quickly. Therefore, the index-access time for the rewritten query is cut down.
For instance, there are three symbols in Family level, but 10 in Name level,
which means that evaluating Family = ‘Beverage’ is faster than evaluating
Name IN (‘Soda’, ‘Beer’, ‘Spirits’) provided that proper indices are built.
The same applies to City and Street predicates.
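The rewriting from a per-dimension compact description into the shorter WHERE clause can be mechanized. The sketch below handles only the three shapes used in the running example (a range over an ordered dimension, a single higher-level symbol, and a symbol minus a few exceptions) and refers to the OLAPVIEW columns above; it is an illustration, not the system described in the article.

    # Sketch: generating the WHERE clause of q(s) from a per-dimension compact
    # description. Only the three constructor shapes of the running example are
    # handled. Illustrative code only.

    def predicate(desc):
        kind = desc[0]
        if kind == "range":                      # (column, min, max): ordered dimension
            _, col, lo, hi = desc
            return f"({col} BETWEEN '{lo}' AND '{hi}')"
        if kind == "symbol":                     # (column, value): one higher-level symbol
            _, col, val = desc
            return f"{col} = '{val}'"
        if kind == "minus":                      # symbol minus a few exceptions
            _, (col, val), exceptions = desc
            cuts = " AND ".join(f"{c} <> '{v}'" for c, v in exceptions)
            return f"{col} = '{val}' AND {cuts}"
        raise ValueError(kind)

    s = [
        ("range", "Month", "01", "04"),
        ("symbol", "Family", "Beverage"),
        ("minus", ("City", "NewYork"), [("Street", "Broadway")]),
    ]
    where = " AND\n      ".join(predicate(d) for d in s)
    print("SELECT * FROM OLAPVIEW\nWHERE " + where + ";")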
   Given a family of queries {q(Vi )}, it is advantageous to consider them simultaneously and to form the amalgamated query q(⋃i Vi ). From the minimal length point of view, this is motivated by the simple fact that $\|\bigcup_i V_i\| \le \sum_i \|V_i\|$, that is, the descriptive length of the union is always at most as long as the sum of the descriptive lengths of the individual parts. The possibility of reduction in length lies in the potential overlap among the Vi . Practically, computing a compact expression for ⋃i Vi is difficult since, even though each Vi is a Cartesian product, ⋃i Vi is hardly ever a Cartesian product, rendering our single-query optimization ineffective. We must resort to heuristic approaches. As mentioned in Section 5.1, some existing heuristics can be applied, such as the greedy growth algorithm in Agrawal et al. [1998] or the polygon covering algorithm in Kumar and Ramesh [1999]. Unfortunately, they all assume linearly ordered dimensions, so the approximation factor given in Kumar and Ramesh [1999] does not hold.

4.3 Performance Considerations
It is clear that our proposed optimization techniques require additional and
possibly intensive access to the dimensional structures. Often these structures
are mapped to dimensional tables, which, along with the fact table, are
stored in a relational database. This means that traversal of the dimension
hierarchies is costly. So to make the optimization overhead minimal and this
approach practical, we need to index the dimensions using tree-based native
data structures. The dimensions are usually of manageable size and slowly varying, making it possible to index them more heavily. The fact table, however, is fast changing and much larger in size, potentially spanning multiple remote storage servers. With this in mind, it is reasonable to expend effort on optimizing the retrieval queries by exploiting the structural information of the dimensions. Indeed, many commercially available OLAP systems follow this type of architecture. For instance, ESSBASE5 has a separate dimension
index which is much smaller and faster to access than the data page file.

5. RELATED WORK

5.1 Some Related MDL Problems
The idea of minimal descriptive length has been a classical theme in areas
of machine learning [Lam and Bacchus 1994] and statistics [Hansen and Yu
2001]. There, the motivation is to select a model that adequately explains the
observations while having an economical representation. In computational
geometry, the interest behind the polygon covering problem is to represent a given polygon using a minimal number of simpler components [Keil 1999]. Work more relevant to this article has been done with an emphasis on the compact representation of data sets [Lodi et al. 1979; Edmonds et al. 2001; Agrawal et al. 1998; Lakshmanan et al. 1999].

5 Available   from Hyperion. Go online to www.essbase.com.

   In Agrawal et al. [1998] and Lakshmanan et al. [1999], the authors were interested in the succinct summarization of query answers for multidimensional databases.
This is especially important for complex queries such as those in data mining
because the user needs to easily comprehend the result of the query. The
general guideline in improving the comprehensibility of the query result is to
reduce its descriptive length. In Agrawal et al. [1998], a set of clusters is identified, and each cluster is basically a subset of a Cartesian product space. It
is assumed that each dimension is numerical and discrete, and hence linearly
ordered. The authors proposed to describe each cluster using a disjunctive
normal form (DNF) expression. For instance: ((30 ≤ age ≤ 50) ∧ (4K ≤
salary ≤ 8K )) ∨ ((40 ≤ age ≤ 60) ∧ (2K ≤ salary ≤ 6K )) is a cluster in the two
dimensional space of age and salary. In the framework of structured sets, the universe is the Cartesian product of the values of age and salary, equipped with the two orders ≤age and ≤salary . A cluster is then a subset to be represented, and the
DNF representations are expressions in $L_P^+(U, \le_{age}, \le_{salary})$. Minimizing the
expression is of course NP-hard as it coincides with the well-studied rectilinear
polygon covering problem. Dimensions (such as geography or product) that
do not have natural orderings but are categorically structured are treated as
ordered dimensions. A simple greedy growth algorithm was proposed in which
rectangles are grown until they are bounded by the boundary of the dataset.
   Lakshmanan and colleagues continued to examine the compact representation of multidimensional subsets in Lakshmanan et al. [1999], where they relaxed the accuracy of the representation in order to gain a reduction in the
descriptive length. Some points in the product space are “blue”; these are the
points to be represented. Some are “red”; these must not be included in the pre-
sentation. The rest are “white,” which are considered harmless but unnecessary
when included in the presentation. As part of the problem, there is an upper
limit to the number of white points that can be included. When the limit is set
to zero, the problem reduces to the minimal DNF expressions in Agrawal et al.
[1998]. A number of heuristic algorithms were given to solve the MDL-problem
when dimensions are spatial. Also in Lakshmanan et al. [1999], they considered
the case when all the dimensions are hierarchical. A polynomial time algorithm
was given to solve the MDL-problem for hierarchical dimensions. Interestingly
enough, this algorithm finds the “optimal” expression. This is in sharp contrast
with our NP-hardness results of multidimensional hierarchical structures in
Theorem 6. This seemingly paradoxical disagreement comes from the fact that,
in Lakshmanan et al. [1999], the language is much more restricted than even
$L_P^+(U, \Sigma)$ for multidimensional hierarchical structures. Specifically, it does
not allow general product expressions which are the source of the complexity.
   These algorithms, such as greedy growth in Agrawal et al. [1998], can be
adapted to handle unordered dimensions, and therefore can serve as heuristics
for our $L_P^+(U, \Sigma)$ MDL-problem for multidimensional cover structures.
   For the multilinear order-structures, which correspond to the case when the dimensions are linearly ordered, the MDL-problem is essentially the polygon
decomposition problem. Much is known about the approximation of rectilinear
polygon covering [Kumar and Ramesh 1999; Levcopoulos and Gudmundsson
1997]. Kumar and Ramesh provided an approximation algorithm that covers a
given rectilinear polygon with rectangles. It was shown to approximate the optimum within a factor of O(√log n), where n is the minimum of the number of vertical and horizontal edges. Since the trivial reduction of the $L_P^+(U, \le_X, \le_Y)$ MDL-problem from the rectilinear polygon covering problem preserves the approximation factor, it too can be approximated to within a factor of O(√log n). We do not know the exact complexity of the MDL-problem for $L_P(U, \le_X, \le_Y)$ (nor for the even more general language L(U, ≤X , ≤Y )), but it corresponds to the
generalized polygon covering problem where both set union and set difference
are allowed. Some algorithms ([Batchelor 1980; Tor and Middleditch 1984;
Keil 1999]) exist for the generalized polygon decomposition, but none necessarily gives the minimal cover.

5.2 Some Related Query Optimization Techniques
Multi-OLAP query optimization has been considered by Liang and Orlowska
[2000], Zhao et al. [1998], and Kalnis and Papadias [2001]. The underlying
model for an OLAP cube is relational, and they considered the very-low-level
costs of physical data access such as input/output cost, table join, and table
scan cost. The motivation for multiquery optimization under such a setting
is that the access plan may share a common set of physical operations which
can be executed once if all queries are evaluated simultaneously. The authors
have indicated, though not explicitly proved, that this is in general NP-hard.
Our discussion in Section 4.2.2 relies on the given model of multidimensional
database and the very simplified cost model: the length of the expression for
the query. But the conclusion is the same: redundancy ought to be removed,
but to do so maximally is intractable. This indicates that optimizing on the
query length, though not so rigorously justified, is a good measure and can
serve as a guideline in OLAP query optimization.
   It is important to point out that in Liang and Orlowska [2000], Zhao et al.
[1998], and Kalnis and Papadias [2001], the content of the dimensions is
not considered by the optimizer as it is thought of as part of the database.
But our computation of query expression reduction makes explicit use of
the dimensional structures. We argue that, in OLAP applications, this is valid because of the slowly varying nature of dimensions. Using dimensions to rewrite
a multidimensional query also appeared in Park et al. [2001], in which the
authors took into account the dimension tables while rewriting the query.

6. CONCLUSION AND FUTURE WORK
We have defined structured sets, languages expressing their subsets, and the
corresponding MDL-problems. The two types of structures we introduced are
cover structures and order-structures; the former corresponds to categorical
classification and the latter to sequential or partial ordering. In both cases, the
MDL-problem is NP-complete in the most general setting. We further studied
specialized instances of these structures. We restricted the cover structures
to partitions and hierarchies, and the order-structures to linear orders, and
showed that these restricted structures are simple in the sense that they
enjoy enough algebraic regularity that their MDL-problems can be solved
in polynomial time. However, with two dimensions, each of which is simple, the resulting MDL-problem becomes hard. We have shown that the MDL-problem associated with the two-dimensional partition is NP-complete. In the case of a two-dimensional linearly ordered structure, we demonstrated that the MDL-problem with respect to the syntactically restricted language $L_{P\le}^+$ corresponds to the rectangular covering problem, which is well known to be NP-complete; but the complexity of the MDL-problem over the unrestricted language $L_{\le}$ remains unknown. We summarize as follows (P = polynomial time, NPC = NP-complete):

                              Cover structure                    Order-structure
                       Partition   Hierarchy   General          Linear   Partial
     One-dimensional       P           P          NPC              P        NPC
     Multidimensional     NPC         NPC         NPC              ?        NPC

   Structured sets arise naturally in databases. We have seen that simple XML
documents can be viewed as hierarchically structured sets and OLAP cubes as
multidimensionally structured sets. We showed that compact expressions are
useful in succinctly summarizing query answers and in query optimization of
SELECT queries in OLAP.
   So far, only very simple multidimensional database structures in which
dimensions are either hierarchical or linearly ordered have been considered.
It would be desirable to generalize this. For instance, in the TIME dimension,
we only considered the chronological linear order for months, but months are
also hierarchically organized into quarters and then years, making TIME a
hybrid between a cover structure and an order-structure. Of course, the most
general case of such structures has an intractable MDL-problem, and we are
interested in finding a reasonably relaxed class of tractable hybrid structures
and providing an algorithm for computing compact expressions in these
structures.
   We have only considered the simple selection OLAP queries. Our future
work will include incorporating aggregation and enriching the language of
structured sets with more constructs such that a larger class of queries can be
encompassed by the framework.

ACKNOWLEDGMENTS

We thank the anonymous referees for their helpful comments.

REFERENCES

AGRAWAL, R., GEHRKE, J., GUNOPULOS, D., AND RAGHAVAN, P. 1998. Automatic subspace clustering
  of high dimensional data for data mining applications. In Proceedings of SIGMOD 1998. ACM
  Press, New York, NY, 94–105.
AGRAWAL, R., GUPTA, A., AND SARAWAGI, S. 1997. Modeling multidimensional databases. In
  Proceedings of ICDE 1997. 232–243.
BATCHELOR, B. 1980. Hierarchical shape description based on convex hulls of concavities. J.
  Cybernet. 10, 205–210.
CABIBBO, L. AND TORLONE, R. 1997. Querying multidimensional databases. In Proceedings of the
  6th DBPL Workshop. 319–335.

CABIBBO, L. AND TORLONE, R. 1998. A logical approach to multidimensional databases. In
  Proceedings of EDBT 1998. 183–197.
EDMONDS, J., GRYZ, J., LIANG, D., AND MILLER, R. J. 2001. Mining for empty rectangles in large
  data sets. In Proceedings of ICDT 2001. 174–188.
GAREY, M. R. AND JOHNSON, D. S. 1979. Computers and Intractability: A Guide to the Theory of
  NP-Completeness. W. H. Freeman, New York, NY.
GYSSENS, M. AND LAKSHMANAN, L. 1997. A foundation for multi-dimensional databases. In
  Proceedings of VLDB 1997. 106–115.
HANSEN, M. H. AND YU, B. 2001. Model selection and the principle of minimum description
  length. J. Amer. Statist. Assoc. 96, 454, 746–774.
HURTADO, C. AND MENDELZON, A. 2002. OLAP dimensional constraints. In Proceedings of PODS
  2002. 169–179.
KALNIS, P. AND PAPADIAS, D. 2001. Optimization Algorithms for Simultaneous Multidimensional
  Queries in OLAP Environments. Lecture Notes in Computer Science, vol. 2114. Springer-Verlag,
  Berlin, Germany, 264–273.
KEIL, J. 1999. Polygon decomposition. In Handbook of Computational Geometry. Elsevier
  Sciences, Amsterdam, The Netherlands, Chap. 11, 491–518.
KIMBALL, R. 1996. The Data Warehouse Toolkit. Wiley, New York, NY.
KUMAR, V. S. A. AND RAMESH, H. 1999. Covering rectilinear polygons with axis-parallel rectangles.
  In Proceedings of the ACM Symposium on Theory of Computing 1999. ACM Press, New York,
  NY, 445–454.
LAKSHMANAN, L., NG, R. T., WANG, C. X., ZHOU, X., AND JOHNSON, T. J. 1999. The generalized MDL
  approach for summarization. In Proceedings of VLDB 1999. 445–454.
LAM, W. AND BACCHUS, F. 1994. Learning Bayesian belief networks: An approach based on the
  MDL principle. Comput. Intel. 10, 269–293.
LEVCOPOULOS, C. AND GUDMUNDSSON, J. 1997. Approximation algorithms for covering polygons
  with squares and similar problems. In Proceedings of the International Workshop on Randomiza-
  tion and Approximation Techniques in Computer Science. Lecture Notes in Computer Science,
  vol. 1269. Springer, Berlin, Germany, 27–41.
LIANG, W. AND ORLOWSKA, M. 2000. Optimizing multiple dimensional queries simultaneously in
  multidimensional databases. VLDB J. 8, 319–338.
LODI, E., LUCCIO, F., MUGNAI, C., AND PAGLI, L. 1979. On two-dimensional data organization I.
  Fundam. Inform. 2, 211–226.
PARK, C., KIM, M., AND LEE, Y. 2001. Rewriting OLAP queries using materialized views and
  dimension hierarchies in data warehouses. In Proceedings of ICDE 2001. 515–523.
TOR, S. AND MIDDLEDITCH, A. 1984. Convex decomposition of simple polygons. ACM Trans.
  Graph. 3, 244–265.
ZHAO, Y., DESHPANDE, P., NAUGHTON, J., AND SHUKLA, A. 1998. Simultaneous optimization and
  evaluation of multiple dimensional queries. In Proceedings of SIGMOD 1998. 271–282.

Received November 2003; revised June 2004; accepted September 2004




What’s Hot and What’s Not: Tracking Most
Frequent Items Dynamically
GRAHAM CORMODE and S. MUTHUKRISHNAN
Rutgers University


Most database management systems maintain statistics on the underlying relation. One of the
important statistics is that of the “hot items” in the relation: those that appear many times (most
frequently, or more than some threshold). For example, end-biased histograms keep the hot items
as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers
in data mining, and in anomaly detection in many applications.
    We present new methods for dynamically determining the hot items at any time in a relation
which is undergoing deletion operations as well as inserts. Our methods maintain small space data
structures that monitor the transactions on the relation, and, when required, quickly output all
hot items without rescanning the relation in the database. With user-specified probability, all hot
items are correctly reported. Our methods rely on ideas from “group testing.” They are simple to
implement, and have provable quality, space, and time guarantees. Previously known algorithms
for this problem that make similar quality and performance guarantees cannot handle deletions,
and those that handle deletions cannot make similar guarantees without rescanning the database.
Our experiments with real and synthetic data show that our algorithms are accurate in dynamically
tracking the hot items independent of the rate of insertions and deletions.
Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications
General Terms: Algorithms, Measurement
Additional Key Words and Phrases: Data stream processing, approximate query answering.




1. INTRODUCTION
One of the most basic statistics on a database relation is which items are hot, that is, which items occur frequently; note that the set of hot items can change over time. This gives a useful measure of the skew of the data. High-biased and end-biased
histograms [Ioannidis and Christodoulakis 1993; Ioannidis and Poosala 1995]
specifically focus on hot items to summarize data distributions for selectivity

The first author was supported by NSF ITR 0220280 and NSF EIA 02-05116; the second author
was supported by NSF EIA 0087022, NSF ITR 0220280, and NSF EIA 02-05116.
This is an extended version of an article which originally appeared as Cormode and Muthukrishnan
[2003].
Authors’ current addresses: G. Cormode, Room 2B-315, Bell Laboratories, 600 Mountain Avenue,
Murray Hill, NJ 07974; email: graham@dimacs.rutgers.edu; S. Muthukrishnan, Room 319, CoRE
Building, Department of Computer and Information Sciences, 110 Frelinghuysen Road, Piscataway,
NJ 08854; email: muthu@cs.rutgers.edu.


estimation. Iceberg queries generalize the notion of hot items in relation to
aggregate functions over an attribute (or set of attributes) in order to find ag-
gregate values above a specified threshold. Hot item sets in market data are in-
fluential in decision support systems. They also influence caching, load balanc-
ing, and other system performance issues. There are other areas—such as data
warehousing, data mining, and information retrieval—where hot items find
applications. Keeping track of hot items also arises in application domains out-
side traditional databases. For example, in telecommunication networks such
as the Internet and telephone, it is of great importance for network operators to
see meaningful statistics about the operation of the network. Keeping track of
which network addresses are generating the most traffic allows management
of the network, as well as giving a warning sign if this pattern begins to change
unexpectedly. This has been studied extensively in the context of anomaly de-
tection [Barbara et al. 2001; Demaine et al. 2002; Gilbert et al. 2001; Karp et al.
2003].
     Our focus in this article is on dynamically maintaining hot items in the
presence of delete and insert transactions. In many of the motivating ap-
plications above, the underlying data distribution changes, sometimes quite
rapidly. Transactional databases undergo insert and delete operations, and
it is important to propagate these changes to the statistics maintained on
the database relations in a timely and accurate manner. In the context of
continuous iceberg queries, this is apt since the iceberg aggregates have
to reflect new data items that modify the underlying relations. In the net-
working application cited above, network connections start and end over
time, and hot items change over time significantly. A thorough discussion
by Gibbons and Matias [1999] described many appli-
cations for finding hot items and the challenges in maintaining them over
a changing database relation. Also, Fang et al. [1998] presented an influen-
tial case for finding and maintaining hot items and, more generally, iceberg
queries.
     Formally, the problem is as follows. We imagine that we observe a sequence
of n transactions on items. Without loss of generality, we assume that the item
identifiers are integers in the range 1 to m. Throughout, we will assume the
RAM model of computation, where all quantities and item identifiers can be
encoded in one machine word. The net occurrence of any item x at time t, de-
noted nx (t), is the number of times it has been inserted less the number of
times it has been deleted. The current frequency of any item is then given by
 f x (t) = nx (t)/ m n y (t). The most frequent item at time t is the one with
                      y=1
 f x (t) = max y f y (t). The k most frequent items at time t are those with the k
largest f x (t)’s. We are interested in the related notion of frequent items that
we call hot items. An item x is said to be a hot item if f x (t) > 1/(k + 1), that is,
if it appears a significant fraction of the entire dataset; here k is a parameter.
Clearly, there can be at most k hot items, and there may be none. We assume
throughout that a basic integrity constraint is maintained, that nx (t) for every
item is nonnegative (the number of deletions never exceeds the number of in-
sertions). From now on, we drop the index t, and all occurrences will be treated
as being taken at the current timestep t.
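For reference, these definitions translate directly into the exact, large-space baseline below. This is the trivial solution that keeps a counter per item, not the small-space algorithms developed in this article, and the example stream is invented.

    # Sketch: the exact semantics of hot items as a baseline. n[x] is the net
    # occurrence of item x; an item is hot if its frequency exceeds 1/(k+1).
    # This uses space proportional to the number of distinct items.

    from collections import defaultdict

    def hot_items(transactions, k):
        n = defaultdict(int)
        for op, x in transactions:              # op is +1 (insert) or -1 (delete)
            n[x] += op
        total = sum(n.values())
        return {x for x, c in n.items() if total and c / total > 1.0 / (k + 1)}

    stream = [(+1, "a")] * 6 + [(+1, "b")] * 3 + [(+1, "c")] + [(-1, "a")] * 2
    print(hot_items(stream, k=2))   # {'a', 'b'}: frequencies 4/8 and 3/8 both exceed 1/3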
   Our main results are highly efficient, randomized algorithms for main-
taining hot items. There are three important characteristics to consider: the
space used, the time to update the data structure following each transaction
(the update time), and the time to produce the hot items (the query time).
Our algorithms monitor the changes to the data distribution and maintain
O(k log(k) log(m)) space summary data structures. Processing each transaction
takes time O(log(k) log(m)). When queried, we can find all hot items in time
O(k log(k) log(m)) from the summary data structure, without scanning the un-
derlying relation. Additionally, given a user-specified parameter ε, the algorithms return no items whose frequency is less than 1/(k + 1) − ε. More formally, for any user-specified probability δ, the algorithm succeeds with probability at
least 1 − δ, as is standard in randomized algorithms.
   Since k is typically very small compared to the size of the data, our results
here maintain small summary data structures—significantly sublinear in the
dataset size—and accurately detect hot items at any time in the presence of the
full repertoire of inserts and deletes. Despite extensive work on this problem
(which will be summarized in Section 2), most of the prior work with comparable
guarantees works only for insert-only transactions. Prior work that deals with
the fully general situation where both inserts and deletes are present cannot
provide the guarantees we give, without rescanning the underlying database
relation. Thus, our result is the first provable result for maintaining hot items,
with small space.
   A common approach to summarizing data distribution or finding hot items
relies on keeping samples on the underlying database relation. These samples—
deterministic or randomized—can be updated if data items are only inserted.
Samples can then faithfully represent the underlying data relation. However, in
the presence of deletes, in particular in cases where the data distribution changes significantly over time, samples cannot be maintained without rescanning the database relation. For example, the entire set of sampled values may be erased from the relation by a sufficiently long sequence of deletions.
   We present two different approaches for solving the problem. Our first result
here relies on random sampling to construct groups (O(k log(k)) sets) of items,
but we further group such sets deterministically into a small number (log m) of
subgroups. Our summary data structure comprises a sum of the items in each
group and subgroup. The grouping is based on error-correcting codes, and the
entire procedure may be thought of as “group testing,” which is described in
more detail later. The second result makes use of log m small space “sketches”
to act as oracles to approximate the count of any item or certain groups of
items, and uses an intuitive divide and conquer approach to find the hot items.
This is a different style of group testing, and the two methods give different
guarantees for the problem. We also give additional time and space tradeoffs
for both methods, where the time to process each update can be reduced by
constant factors, at the cost of devoting extra space to the data structures. We
perform a set of experiments on large datasets, which allow us to characterize
further the advantages of each approach. We also see that, in practice, the
methods given outperform their theoretical guarantees, and can operate very
quickly using a small amount of space but still give almost perfect results.
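The second approach can be pictured with an exact range-count oracle in place of the small-space sketches: starting from the whole item range, recursively split any dyadic range whose count is large enough to hide a hot item. The code below uses exact counts purely for illustration; the article's method replaces this oracle with approximate, sketch-based counts, and the example numbers are invented.

    # Sketch: divide-and-conquer search for hot items using an *exact* range
    # count oracle for clarity (the article uses small-space approximate oracles).

    def find_hot(counts, m, k):
        total = sum(counts.values())
        threshold = total / (k + 1)
        def range_count(lo, hi):                     # oracle: total count of items in [lo, hi]
            return sum(c for x, c in counts.items() if lo <= x <= hi)
        hot, stack = [], [(1, m)]
        while stack:
            lo, hi = stack.pop()
            if range_count(lo, hi) <= threshold:     # no hot item can hide in this range
                continue
            if lo == hi:
                hot.append(lo)
            else:
                mid = (lo + hi) // 2
                stack += [(lo, mid), (mid + 1, hi)]
        return hot

    counts = {3: 9, 7: 5, 12: 1, 14: 1}
    print(sorted(find_hot(counts, m=16, k=2)))       # [3]: only item 3 exceeds 16/3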
   Once the hot items have been identified, a secondary problem is to approxi-
mate the counts nx of these items. We do not focus on this problem, since there
are many existing solutions which can be applied to the problem of, given x,
estimate nx , in the presence of insertions and deletions [Gilbert et al. 2002b;
Charikar et al. 2002; Cormode and Muthukrishnan 2004a]. However, we ob-
serve that for the solutions we propose, no additional storage is needed, since the
information needed to make estimates of the count of items is already present
in the data structures that we propose. We will show how to estimate the counts
of individual items, but we do not give experimental results since experiments
for these estimators can be found in prior work.
   The rest of the article is organized as follows. In Section 2, we summarize pre-
vious work, which is rather extensive. In Section 3 and Section 4 we present our
algorithms and prove their guarantees, and compare the different approaches
in Section 5. In Section 6, we present an experimental study of our algorithms
using synthetic data as well as real network data addressing the application
domain cited earlier and show that our algorithms are effective and practical.
Conclusions and closing remarks are given in Section 7.

2. PRELIMINARIES
If one is allowed O(m) space, then a simple heap data structure will process each
insert or delete operation in O(log m) time and find the hot items in O(k log m)
time in the worst case [Aho et al. 1987]. Our focus here is on algorithms that
only maintain a summary data structure, that is, one that uses sublinear space
as it monitors inserts and deletes to the data.
     In a fundamental article, Alon et al. [1996] proved that estimating
$f^*(t) = \max_x f_x(t)$ is impossible with o(m) space. Estimating the k most
frequent items is at least as hard. Hence, research in this area studies related,
relaxed versions of the problems. For example, finding hot items, that is, items
each of which has frequency above 1/(k + 1), is one such related problem. The
lower bound of Alon et al. [1996] does not directly apply to this problem. But a
simple information theory argument suffices to show that solving this problem
exactly requires the storage of a large amount of information if we give a
strong guarantee about the output. We provide the simple argument here for
completeness.
   LEMMA 2.1. Any algorithm which guarantees to find all and only items which have frequency greater than 1/(k + 1) must store Ω(m) bits.
   PROOF. Consider a set S ⊆ {1, . . . , m}. Transform S into a sequence of n = |S| insertions of items by including x exactly once if and only if x ∈ S. Now process these transactions with the proposed algorithm. We can then use the algorithm to extract whether x ∈ S or not: for some x, insert ⌊n/k⌋ copies of x. Suppose x ∉ S; then the frequency of x is ⌊n/k⌋/(n + ⌊n/k⌋) = ⌊n/k⌋/⌊n(k + 1)/k⌋ ≤ ⌊n/k⌋/((k + 1)⌊n/k⌋) = 1/(k + 1), and so x will not be output. On the other hand, if x ∈ S, then (⌊n/k⌋ + 1)/(n + ⌊n/k⌋) > (n/k)/(n + n/k) = 1/(k + 1), and so x will be output. Hence, we can extract the set S, and so the space stored must be Ω(m) since, by an information-theoretic argument, the space to store an arbitrary subset S is m bits.
           Table I. Summary of Previous Results on Insert-Only Methods (LV (Las
            Vegas) and MC (Monte Carlo) are types of randomized algorithms. See
                        Motwani and Raghavan [1995] for details.)

          Algorithm                     Type                   Time per item           Space
          Lossy Counting                Deterministic          O(log(n/k)) amortized   Ω(k log(n/k))
            [Manku and Motwani 2002]
          Misra-Gries                   Deterministic          O(log k) amortized      O(k)
            [Misra and Gries 1982]
          Frequent                      Randomized (LV)        O(1) expected           O(k)
            [Demaine et al. 2002]
          Count Sketch                  Approximate,           O(log(1/δ))             Ω((k/ε²) log n)
            [Charikar et al. 2002]      randomized (MC)


   This also applies to randomized algorithms. Any algorithm which guarantees
to output all hot items with probability at least 1 − δ, for some constant δ,
must also use Ω(m) space. This follows by observing that the above reduction
corresponds to the Index problem in communication complexity [Kushilevitz
and Nisan 1997], which has one-round communication complexity Ω(m). If the
data structure stored was o(m) in size, then it could be sent as a message, and
this would contradict the communication complexity lower bound.
   This argument suggests that, if we are to use o(m) space, then
we must sometimes output items which are not hot, since we will endeavor
to include every hot item in the output. In our guarantees, we will instead
guarantee that (with arbitrary probability) all hot items are output and no
items which are far from being hot will be output. That is, no item which has frequency less than 1/(k + 1) − ε will be output, for some user-specified parameter ε.



2.1 Prior Work
Finding which items are hot is a problem that has a history stretching back
over two decades. We divide the prior results into groups: those which find
frequent items by keeping counts of particular items; those which use a filter to
test each item; and those which accommodate deletions in a heuristic fashion.
Each of these approaches is explained in detail below. The most relevant works
mentioned are summarized in Table I.

   2.1.1 Insert-Only Algorithms with Item Counts. The earliest work on find-
ing frequent items considered the problem of finding an item which occurred
more than half of the time [Boyer and Moore 1982; Fischer and Salzberg 1982].
This procedure can be viewed as a two-pass algorithm: after one pass over the
data, a candidate is found, which is guaranteed to be the majority element if
any such element exists. A second pass verifies the frequency of the item. Only
a constant amount of space is used. A natural generalization of this method to
find items which occur more than n/k times in two passes was given by Misra
and Gries [1982]. The total time to process n items is O(n log k), with space O(k)
(recall that we assume throughout that any item label or counter can be stored
in constant space). In the Misra and Gries implementation, the time to process
any item is bounded by O(k log k) but this time is only incurred O(n/k) times,
giving the amortized time bound. The first pass generates a set of at most k
candidates for the hot items, and the second pass computes the frequency of
each candidate exactly, so the infrequent items can be pruned out. It is possible
to drop the second pass, in which case at most k items will be output, among
which all hot items are guaranteed to be included.
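A compact rendering of this candidate-finding pass is given below. With c counters, every item occurring more than n/(c + 1) times is guaranteed to survive with a nonzero counter; taking c = k − 1 gives the more-than-n/k guarantee described above. This is an illustrative sketch of the classical algorithm, not the later constant-time variants, and it handles inserts only.

    # Sketch of the candidate-finding pass of the Misra-Gries algorithm: keep at
    # most c counters; when an unseen item arrives and no counter is free,
    # decrement every counter. A second pass would verify exact counts.

    def misra_gries(stream, c):
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < c:
                counters[x] = 1
            else:                        # no free counter: decrement all of them
                for y in list(counters):
                    counters[y] -= 1
                    if counters[y] == 0:
                        del counters[y]
        return counters                  # candidate hot items with lower-bound counts

    stream = list("abacabadabacabae")
    print(misra_gries(stream, c=2))      # 'a' (8 of 16 occurrences) is certain to appear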
   Recent interest in processing data streams, which can be viewed as one-
pass algorithms with limited storage, has reopened interest in this problem
(see surveys such as those by Muthukrishnan [2003] and Garofalakis et al.
[2002]). Several authors [Demaine et al. 2002; Karp et al. 2003] have redis-
covered the algorithm of Misra and Gries [1982], and using more sophisticated
data structures have been able to process each item in expected O(1) time while
still keeping only O(k) space. As before, the output guarantees to include all hot
items, but some others will be included in the output, about which no guarantee
of frequency is made. A similar idea was used by Manku and Motwani [2002] with the stronger guarantee of finding all items which occur more than n/k times and not reporting any that occur fewer than n(1/k − ε) times. The space required is bounded by O((1/ε) log(εn))—note that ε ≤ 1/k, and so the space is effectively Ω(k log(n/k)). If we set ε = c/k for some small constant c, then it requires time at worst O(k log(n/k)) per item, but this cost is incurred only once every 1/ε items, and so the
total time is O(n log(n/k)). Another recent contribution was that of Babcock
and Olston [2003]. This is not immediately comparable to our work, since their
focus was on maintaining the top-k items in a distributed environment, and
the goal was to minimize communication. Counts of all items were maintained
exactly at each location, so the memory space was Ω(m). All of these mentioned
algorithms are deterministic in their operation: the output is solely a function
of the input stream and the parameter k.
   All the methods discussed thus far have certain features in common: in
particular, they all hold some number of counters, each of which counts the
number of times a single item is seen in the sequence. These counters are
incremented whenever their corresponding item is observed, and are decre-
mented or reallocated under certain circumstances. As a consequence, it is
not possible to directly adapt these algorithms to the dynamic case where
items are deleted as well as inserted. We would like the data structure to
have the same contents following the deletion of an item, as if that item had
never been inserted. But it is possible to insert an item so that it takes up
a counter, and then later delete it: it is not possible to decide which item
would otherwise have taken up this counter. So the state of the algorithm
will be different from that reached without the insertions and deletions of the
item.

   2.1.2 Insert-Only Algorithms with Filters. An alternative approach to find-
ing frequent items is based on constructing a data structure which can be
used as a filter. This has been suggested several times, with different ways
to construct such filters being suggested. The general procedure is as follows:
as each item arrives, the filter is updated to reflect this arrival and then the
filter is used to test whether this item is above the threshold. If it is, then it is
retained (for example, in a heap data structure). At output time, all retained
items can be rechecked with the filter, and those which pass the filter are out-
put. An important point to note is that, in the presence of deletions, this filter
approach cannot work directly, since it relies on testing each item as it arrives.
In some cases, the filter can be updated to reflect item deletions. However, it
is important to realize that this does not allow the current hot items to be
found from this: after some deletions, items seen in the past may become hot
items. But the filter method can only pick up items which are hot when they
reach the filter; it cannot retrieve items from the past which have since become
frequent.
   The earliest filter method appears to be due to Fang et al. [1998], where it was
used in the context of iceberg queries. The authors advocated a second pass over
the data to count exactly those items which passed the filter. An article which
has stimulated interest in finding frequent items in the networking community
was by Estan and Varghese [2002], who proposed a variety of filters to detect
network addresses which are responsible for a large fraction of the bandwidth.
In both these articles, the analysis assumed very strong hash functions which
exhibit “perfect” randomness. An important recent result was that of Charikar
et al. [2002], who gave a filter-based method using only limited (pairwise) inde-
pendent hash functions. These were used to give an algorithm to find k items
whose frequency was at least (1 − ε) times the frequency of the kth most frequent
item, with probability 1 − δ. If we wish to only find items with count greater than
n/(k + 1), then the space used is O((k/ε^2) log(n/δ)). A heap of frequent items is kept,
and if the current items exceed the threshold, then the least frequent item in
the heap is ejected, and the current item inserted. We shall return to this work
later in Section 4.1, when we adapt and use the filter as the basis of a more ad-
vanced algorithm to find hot items. We will describe the algorithm in full detail,
and give an analysis of how it can be used as part of a solution to the hot items
problem.

   2.1.3 Insert and Delete Algorithms. Previous work that studied hot items
in the presence of both inserts and deletes is sparse [Gibbons and Matias
1998, 1999]. These articles proposed methods to maintain a sample of
items and a count of the number of times each item occurs in the data set, and
focused on the harder problem of monitoring the k most frequent items. These
methods work provably for the insert-only case, but provide no guarantees
for the fully dynamic case with deletions. However, the authors studied how
effective these samples are for the deletion case through experiments. Gibbons
et al. [1997] presented methods to maintain various histograms in the presence
of inserts and deletes using a “backing sample,” but these methods too need
access to a large portion of the data periodically in the presence of deletes.
   A recent theoretical work presented provable algorithms for maintaining
histograms with guaranteed accuracy and small space [Gilbert et al. 2002a].
The methods in that article can yield algorithms for maintaining hot items,
but the methods are rather sophisticated and use powerful range summable
random variables, resulting in k log^O(1) n space and time algorithms where the
O(1) term is quite large. We draw some inspiration from the methods in that
article: we will use ideas similar to the “sketching” developed in Gilbert et al.
[2002a], but our overall methods are much simpler and more efficient. Finally,
recent work in maintaining quantiles [Gilbert et al. 2002b] is similar to ours
since it keeps the sum of items in random subsets. However, our result is, of
necessity, more involved: it requires a random group generation phase based on
group testing, which was not needed in Gilbert et al. [2002b]. Also, once such
groups are generated, we maintain sums of deterministic sets (in contrast to
the random sets as in Gilbert et al. [2002b]), given again by error-correcting
codes. Finally, our algorithm is more efficient than the Ω(k^2 log^2 m) space and
time algorithms given in Gilbert et al. [2002b].


2.2 Our Approach
We propose some new approaches to this problem, based on ideas from group
testing and error-correcting codes. Our algorithms depend on ideas drawn from
group testing [Du and Hwang 1993]. The idea of group testing is to arrange a
number of tests, each of which groups together a number of the m items in order
to find up to k items which test “positive.” Each test reports either “positive”
or “negative” to indicate whether there is a positive item among the group,
or whether none of them is positive. The familiar puzzle of how to use a pan
balance to find one “positive” coin among n good coins, of equal weight, where
the positive coin is heavier than the good coins, is an example of group testing.
The goal is to minimize the number of tests, where each test in the group testing
is applied to a subset of the items (a group). Our goal of finding up to k hot items
can be neatly mapped onto an instance of group testing: the hot items are the
positive items we want to find.
   Group testing methods can be categorized as adaptive or nonadaptive. In
adaptive group testing, the members of the next set of groups to test can be
specified after learning the outcome of the previous tests. Each set of tests is
called a round, and adaptive group testing methods are evaluated in terms of
the number of rounds, as well as the number of tests, required. By contrast,
nonadaptive group testing has only one round, and so all groups must be chosen
without any information about which groups tested positive. We shall give two
main solutions for finding frequent items, one based on nonadaptive and the
other on adaptive group testing. For each, we must describe how the groups
are formed from the items, and how the tests are performed. An additional
challenge is that our tests here are not perfect, but have some chance of failure
(reporting the wrong result). We will prove that, in spite of this, our algorithms
can guarantee finding all hot items with high probability. The algorithms we
propose differ in the nature of the guarantees that they give, and result in
different time and space guarantees. In our experimental studies, we were
able to explore these differences in more detail, and to describe the different
situations which each of these algorithms is best suited to.

3. NONADAPTIVE GROUP TESTING
Our general procedure is as follows: we divide all items up into several (over-
lapping) groups. For each transaction on an item x, we determine which groups
it is included in (denoting these G(x)). Each group is associated with a counter,
and for an insertion we increment the counter for all G(x); for a deletion, we
correspondingly decrement these counters. The test will be whether the count
for a subset exceeds a certain threshold: this is evidence that there may be a hot
item within the set. Identifying the hot items is a matter of putting together
the information from the different tests to find an overall answer.
    There are a number of challenges involved in following this approach:
(1) bounding the number of groups required; (2) finding a concise represen-
tation of the groups; and (3) giving an efficient way to go from the results of
tests to the set of hot items. We shall be able to address all of these issues.
To give greater insight into this problem, we first give a simple solution to the
k = 1 case, which is to find an item that occurs more than half of the time.
Later, we will consider the more general problem of finding k > 1 hot items,
which will use the procedure given below as a subroutine.

3.1 Finding the Majority Item
If an item occurs more than half the time, then it is said to be the majority item.
While finding the majority item is mostly straightforward in the insertions-
only case (it is solved in constant space and constant time per insertion by the
algorithms of Boyer and Moore [1982] and Fischer and Salzberg [1982]), in the
dynamic case, it looks less trivial. We might have identified an item which is
very frequent, only for this item to be the subject of a large number of deletions,
meaning that some other item is now in the majority.
   We give an algorithm to solve this problem by keeping log2 m + 1 counters.
The first counter, c0, merely keeps track of n(t) = Σ_x n_x(t), which is how many
items are “live”: in other words, we increment this counter on every insert, and
decrement it on every deletion. The remaining counters are denoted c1 · · · c_{log2 m}.
We make use of the function bit(x, j), which reports the value of the jth bit
of the binary representation of the integer x; and gt(x, y), which returns 1 if
x > y and 0 otherwise. Our procedures are as follows:

Insertion of item x: increment each counter c_j such that bit(x, j) = 1, in time
O(log m).
Deletion of x: decrement each counter c_j such that bit(x, j) = 1, in time O(log m).
Search: if there is a majority, then it is given by Σ_{j=1}^{log2 m} 2^(j−1) gt(c_j, n/2),
computed in time O(log m).
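
To make these procedures concrete, the following is a minimal C sketch of the
counters and the three operations, assuming item identifiers fit in 32 bits (so
log2 m = 32); the struct and function names are our own and are not part of the
implementation described later in the article.

```c
/* Minimal sketch of the dynamic majority algorithm: log2(m) + 1 counters,
 * assuming 32-bit item identifiers.  c0 tracks n; c[j] counts live items
 * whose bit j is set. */
#include <stdint.h>

#define BITS 32

typedef struct {
    long long c0;        /* n(t): number of live items */
    long long c[BITS];   /* c[j]: live items with bit(x, j) = 1 */
} majority_t;

/* delta = +1 for an insertion of x, -1 for a deletion of x */
static void majority_update(majority_t *s, uint32_t x, int delta)
{
    s->c0 += delta;
    for (int j = 0; j < BITS; j++)
        if ((x >> j) & 1)
            s->c[j] += delta;
}

/* Returns the majority item if one exists (otherwise some arbitrary item). */
static uint32_t majority_search(const majority_t *s)
{
    uint32_t x = 0;
    for (int j = 0; j < BITS; j++)
        if (2 * s->c[j] > s->c0)          /* gt(c_j, n/2) */
            x |= (uint32_t)1 << j;
    return x;
}
```

Both operations clearly take O(log m) time, matching Theorem 3.1 below.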

   The arrangement of the counters is shown graphically in Figure 1. The two
procedures of this method—one to process updates, another to identify the ma-
jority element—are given in Figure 2 (where trans denotes whether the trans-
action is an insertion or a deletion).

  THEOREM 3.1. The algorithm in Figure 2 finds a majority item if there is one
with time O(log m) per update and search operation.




Fig. 1. Each test includes half of the range [1 · · · m], corresponding to the binary representation
of values.




              Fig. 2. Algorithm to find the majority element in a sequence of updates.

    PROOF. We make two observations: first, that the state of the data structure
is equivalent to that following a sequence of c0 insertions only, and second, that
in the insertions only case, this algorithm identifies a majority element. For the
first point, it suffices to observe that the effect of each deletion of an element x is
to precisely cancel out the effect of a prior insertion of that element. Following
a sequence of I insertions and D deletions, the state is precisely that obtained
if there had been I − D = n insertions only.
    The second part relies on the fact that if there is an item whose count is
greater than n/2 (that is, it is in the majority), then for any way of dividing
the elements into two sets, the set containing the majority element will have
weight greater than n/2, and the other will have weight less than n/2. The tests
are arranged so that each test determines the value of a particular bit of the
index of the majority element. For example, the first test determines whether
its index is even or odd by dividing on the basis of the least significant bit. The
log m tests with binary outcomes are necessary and sufficient to determine the
index of the majority element.
   Note that this algorithm is completely deterministic, and guarantees always
to find the majority item if there is one. If there is no such item, then still some
item will be returned, and it will not be possible to distinguish the difference
based on the information stored. The simple structure of the tests is standard
in group testing, and also resembles the structure of the Hamming single error-
correcting code.

3.2 Finding k Hot Items
When we perform a test based on comparing the count of items in two buck-
ets, we extract from this a single bit of information: whether there is a hot item
present in the set or not. This leads immediately to a lower bound on the number
of tests necessary: to locate k items among m locations requires log2 (m choose k) ≥
k log(m/k) bits.
    We make the following observation: suppose we selected a group of items
to monitor which happened to contain exactly one hot item. Then we could
apply the algorithm of Section 3.1 to this group (splitting it into a further log m
subsets) and, by keeping log m counters, identify which item was the hot one. We
would simply have to “weigh” each bucket, and, providing that the total weight
of other items in the group were not too much, the hot item would always be in
the heavier of the two buckets.
    We could choose each group as a completely random subset of the items, and
apply the algorithm for finding a single majority item described at the start of
this section. But for a completely random selection of items, in order to store
the description of the groups, we would have to list every member of every group
explicitly. This would consume a very large amount of space, at least linear
in m. So instead, we shall look for a concise way to describe each group,
so that given an item we can quickly determine which groups it is a member of.
We shall make use of hash functions, which will map items onto the integers
1 · · · W , for some W that we shall specify later. Each group will consist of all
items which are mapped to the same value by a particular hash function. If the
hash functions have a concise representation, then this describes the groups in
a concise fashion. It is important to understand exactly how strong the hash
functions need to be to guarantee good results.

   3.2.1 Hash Functions. We will make use of universal hash functions de-
rived from those given by Carter and Wegman [1979]. We define a family of
hash functions f a,b as follows: fix a prime P > m > W , and draw a and b
uniformly at random in the range [0 · · · P − 1]. Then set
                      f a,b(x) = ((ax + b mod P )         mod W ).
Using members of this family of functions will define our groups. Each hash
function is defined by a and b, which are integers less than P . P itself is chosen
to be O(m), and so the space required to represent each hash function is O(log m)
bits.
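
As an illustration, drawing and evaluating one member of this family might look
as follows in C; fixing P = 2^31 − 1 (and hence assuming m < 2^31) and the use of
rand() are simplifying assumptions of this sketch, not choices made in the article.

```c
/* Sketch of the universal hash family f_{a,b}(x) = ((a*x + b) mod P) mod W,
 * with P = 2^31 - 1 (prime) and a, b drawn uniformly from [0, P-1]. */
#include <stdint.h>
#include <stdlib.h>

#define PRIME 2147483647ULL   /* 2^31 - 1; assumes m < P */

typedef struct { uint64_t a, b; } hashfn_t;

static hashfn_t draw_hash(void)
{
    hashfn_t h;
    h.a = (((uint64_t)rand() << 16) ^ (uint64_t)rand()) % PRIME;  /* crude randomness */
    h.b = (((uint64_t)rand() << 16) ^ (uint64_t)rand()) % PRIME;
    return h;
}

static uint32_t hash_eval(hashfn_t h, uint32_t x, uint32_t W)
{
    return (uint32_t)(((h.a * x + h.b) % PRIME) % W);
}
```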

   Fact 3.2 (Proposition 7 of Carter and Wegman [1979]). Over all choices
of a and b, for x ≠ y, Pr[ f a,b(x) = f a,b( y)] ≤ 1/W.
  We can now describe the data structures that we will keep in order to allow
us to find up to k hot items.

   3.2.2 Nonadaptive Group Testing Data Structure. The group testing data
structure is initialized with two parameters W and T , and has three
components:
— a three-dimensional array of counters c, of size T × W × (log(m) + 1);
— T universal hash functions h, defined by a[1 · · · T ] and b[1 · · · T ] so hi =
   f a[i],b[i] ;
— the count n of the current number of items.




              Fig. 3. Procedures for finding hot items using nonadaptive group testing.


    The data structure is initialized by setting all the counters, c[1][0][0] to
c[T ][W − 1][log m], to zero, and by choosing values for each entry of a and b
uniformly at random in the range [0 · · · P −1]. The space used by the data struc-
ture is O(T W log m). We shall specify values for W and T later. We will write
hi to indicate the ith hash function, so hi (x) = a[i] ∗ x + b[i] mod P mod W . Let
G i, j = {x|hi (x) = j } be the (i, j )th group. We will use c[i][ j ][0] to keep the count
of the current number of items within the G i, j . For each such group, we shall also
keep counts for log m subgroups, defined as G i, j,l = {x|x ∈ G i, j ∧ bit(x, l ) = 1}.
These correspond to the groups we kept for finding a majority item. We will use
c[i][ j ][l ] to keep count of the current number of items within subgroup G i, j,l .
This leads to the following update procedure.

   3.2.3 Update Procedure. Our procedure in processing an input item x is
to determine which groups it belongs to, and to update the log m counters for
each of these groups based on the bit representation of x in exactly the same
way as the algorithm for finding a majority element. If the transaction is an
insertion, then we add one to the appropriate counters, and subtract one for
a deletion. The current count of items is also maintained. This procedure is
shown in pseudocode as PROCESSITEM (x, trans, T , W ) in Figure 3. The time to
perform an update is the time taken to compute the T hash functions, and to
modify O(T log m) counters.
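
One possible C rendering of this update step is sketched below; it reuses the
hashfn_t/hash_eval sketch given earlier, the array layout mirrors the counters
c[i][j][l] of Section 3.2.2, and PROCESSITEM in Figure 3 remains the authoritative
description.

```c
/* Sketch of the nonadaptive group testing update (cf. PROCESSITEM, Figure 3).
 * counts[i][j][0] is the count of group G_{i,j}; counts[i][j][l], l = 1..LOGM,
 * is the count of the subgroup whose bit (l-1) equals 1.  Reuses hashfn_t and
 * hash_eval from the earlier sketch; 32-bit item identifiers are assumed. */
#define LOGM 32

typedef struct {
    int T, W;
    long long n;           /* current number of live items */
    long long ***counts;   /* counts[T][W][LOGM + 1], allocated elsewhere */
    hashfn_t *h;           /* the T universal hash functions */
} gt_sketch_t;

/* delta = +1 for an insertion, -1 for a deletion */
static void process_item(gt_sketch_t *s, uint32_t x, int delta)
{
    s->n += delta;
    for (int i = 0; i < s->T; i++) {
        uint32_t j = hash_eval(s->h[i], x, (uint32_t)s->W);
        s->counts[i][j][0] += delta;
        for (int l = 1; l <= LOGM; l++)
            if ((x >> (l - 1)) & 1)
                s->counts[i][j][l] += delta;
    }
}
```
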
   At any point, we can search the data structure to find hot items. Various
checks are made to avoid including in the output any items which are not hot.
In group testing terms, the test that we will use is whether the count for a
group or subgroup exceeds the threshold needed for an item to be hot, which
is n/(k + 1). Note that any group which contains a hot item will pass this test,
but that it is possible that a group which does not contain a hot item can also
pass this test. We will later analyze the probability of such an event, and show
that it can be made quite small.

   3.2.4 Search Procedure. For each group, we will use the information about
the group and its subgroups to test whether there is a hot item in the group,
and if so, to extract the identity of the hot item. We process each group G i, j in
turn. First, we test whether there can be a hot item in the group. If c[i][ j ][0] ≤
n/(k + 1) then there cannot be a hot item in the group, and so the group is
rejected. Then we look at the count of every subgroup, compared to the count
of the whole group, and consider the four possible cases:
   c[i][j][l] > n/(k+1)?   c[i][j][0] − c[i][j][l] > n/(k+1)?   Conclusion
   No                      No                                   Cannot be a hot item in the group, so reject group
   No                      Yes                                  If a hot item x is in group, then bit(l, x) = 0
   Yes                     No                                   If a hot item x is in group, then bit(l, x) = 1
   Yes                     Yes                                  Not possible to identify the hot item, so reject group

  If the group is not rejected, then the identity of the candidate hot item, x,
can be recovered from the tests. Some verification of the hot items can then be
carried out.
— The candidate item must belong to the group it was found in, so check hi (x) =
  j.
— If the candidate item is hot, then every group it belongs in should be above
  the threshold, so check that c[i][hi (x)][0] > n/(k + 1) for all i.
The time to find all hot items is O(T^2 W log m). There can be at most T W can-
didates returned, and checking them all takes worst-case time O(T ) each. The
full algorithms are illustrated in Figure 3. We now show that for appropriate
choices of T and W we can first ensure that all hot items are found, and second
ensure that no items are output which are far from being hot.
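
A sketch of the per-group decoding and verification, written against the
gt_sketch_t structure assumed above, is given below; the helper name and the
return convention are ours.

```c
/* Sketch of decoding one group (i, j) at query time; thresh = n/(k+1).
 * Reuses gt_sketch_t and hash_eval from the sketches above.  Returns 1 and
 * stores the candidate in *out if the group yields one, 0 if it is rejected. */
static int decode_group(const gt_sketch_t *s, int i, int j, long long thresh, uint32_t *out)
{
    long long *c = s->counts[i][j];
    if (c[0] <= thresh) return 0;                 /* no hot item can be here */
    uint32_t x = 0;
    for (int l = 1; l <= LOGM; l++) {
        int hi = c[l] > thresh;                   /* subgroup with bit (l-1) = 1 */
        int lo = (c[0] - c[l]) > thresh;          /* complementary subgroup */
        if (hi && lo) return 0;                   /* cannot identify a single item */
        if (!hi && !lo) return 0;                 /* no hot item in this group */
        if (hi) x |= (uint32_t)1 << (l - 1);
    }
    /* verification: the candidate must hash back to this group and pass the
     * threshold test under every hash function */
    if (hash_eval(s->h[i], x, (uint32_t)s->W) != (uint32_t)j) return 0;
    for (int t = 0; t < s->T; t++)
        if (s->counts[t][hash_eval(s->h[t], x, (uint32_t)s->W)][0] <= thresh) return 0;
    *out = x;
    return 1;
}
```

Scanning all T·W groups with this helper and de-duplicating the candidates gives
the full search procedure.
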
   LEMMA 3.3. Choosing W ≥ 2k and T = log2 (k/δ) for a user chosen parameter
δ ensures that the probability of all hot items being output is at least 1 − δ.
   PROOF. Consider each hot item x, in turn, remembering that there are at
most k of these. Using Fact 3.2 about the hash functions, the probability of
any other item falling into the same group as x under the ith hash function
is given by 1/W ≤ 1/(2k). Using linearity of expectation, the expectation of
the total frequency of other items which land in the same group as item x is

   E[ Σ_{y≠x, h_i(y)=h_i(x)} f_y ] = Σ_{y≠x} f_y · Pr[h_i(y) = h_i(x)] ≤ Σ_{y≠x} f_y/(2k) ≤ (1 − f_x)/(2k) ≤ 1/(2(k + 1)).    (1)
Our test cannot fail if the total weight of other items which fall in the same
bucket is less than 1/(k + 1). This is because each time we compare the counts
of items in the group we conclude that the hot item is in the half with greater
count. If the total frequency of other items is less than 1/(k + 1), then the hot
item will always be in the heavier half, and so, using a similar argument to
that for the majority case, we will be able to read off the index of the hot item
using the results of log m groups. The probability of failing due to the weight
of other items in the same bucket being more than 1/(k + 1) is bounded by
the Markov inequality as 1/2, since this is at least twice the expectation. So
the probability that we fail on every one of the T independent tests is less
than (1/2)^log2(k/δ) = δ/k. Using the union bound, then, over all hot items, the prob-
ability of any of them failing is less than δ, and so each hot item is output with
probability at least 1 − δ.
   LEMMA 3.4. For any user specified fraction ε ≤ 1/(k + 1), if we set W ≥ 2/ε
and T = log2 (k/δ), then the probability of outputting any item y with
f_y < 1/(k + 1) − ε is at most δ/k.
   PROOF. This lemma follows because of the checks we perform on every item
before outputting it. Given a candidate item, we check that every group it is a
member of is above the threshold. Suppose the frequency of the item y is less
than (1/(k + 1) − ε). Then the frequency of items which fall in the same group
under hash function i must be at least ε, to push the count for the group over the
threshold for the test to return positive. By the same argument as in the above
lemma, the probability of this event is at most 1/2. So the probability that this
occurs in all groups is bounded by (1/2)^log(k/δ) = δ/k.
  Putting these two lemmas together allows us to state our main result on
nonadaptive group testing:
   THEOREM 3.5. With probability at least 1 − δ, then we can find all hot items
whose frequency is more than 1/(k + 1), and, given ε ≤ 1/(k + 1), with probability at
least 1 − δ/k each item which is output has frequency at least 1/(k + 1) − ε, using
space O((1/ε) log(m) log(k/δ)) words. Each update takes time O(log(m) log(k/δ)).
Queries take time no more than O((1/ε) log^2 (k/δ) log m).

   PROOF. This follows by setting W = 2/ε and T = log(k/δ), and applying the
above two lemmas. To process an item, we compute T hash functions, and
update T log m counters, giving the time cost. To extract the hot items involves
a scan over the data structure in linear time, plus a check on each hot item
found that takes time at most O(T ), giving total time O(T^2 W log m).
   Next, we describe additional properties of our method which imply its sta-
bility and resilience.
   COROLLARY 3.6. The data structure created with T = log(k/δ) can be used
to find hot items with parameter k' for any k' < k with the same probability of
success 1 − δ.
   PROOF. Observe in Lemma 3.3 that, to find k' hot items, we required W ≥
2k'. If we use a data structure created with W ≥ 2k, then W ≥ 2k > 2k',
and so the data structure can be used for any value of k' less than the value it
was created for. Similarly, we have more tests than we need, which can only
help the accuracy of the group testing. All other aspects of the data structure
are identical. So, if we run the procedure with a higher threshold, then with
probability at least 1 − δ, we will find the hot items.
  This property means that we can fix k to be as large as we want, and are then
able to find hot items with any frequency greater than 1/(k + 1) determined at
query time.
   COROLLARY 3.7.    The output of the algorithm is the same for any reordering
of the input data.
   PROOF. During any insertion or deletion, the algorithm takes the same ac-
tion and does not inspect the contents of the memory. It just adds or subtracts
values from the counters, as a function solely of the item value. Since addition
and subtraction commute, the corollary follows.

   3.2.5 Estimation of Count of Hot Items. Once the hot items have been iden-
tified, we may wish to additionally estimate the count, nx , of each of these items.
One approach would be to keep a second data structure enabling the estimation
of the counts to be made. Such data structures are typically compact, fast to
update, and give accurate answers for items whose count is large, that is, hot
items [Gilbert et al. 2002b; Charikar et al. 2002; Cormode and Muthukrishnan
2004a]. However, note that the data structure that we keep embeds a structure
that allows us to compute an estimate of the weight of each item [Cormode and
Muthukrishnan 2004a].
   COROLLARY 3.8. Computing min_i c[i][h_i(x)][0] gives a good estimate for n_x
with probability at least 1 − (δ/k).
   PROOF. This follows from the proofs of Lemma 3.3 and Lemma 3.4. Each
estimate c[i][h_i(x)][0] = n_x + Σ_{y≠x, h_i(y)=h_i(x)} n_y. But by Lemma 3.3, this
additional noise is bounded by εn with constant probability at least 1/2, as shown
in Equation (1). Taking the minimum over all estimates amplifies this probability
to 1 − (δ/k).
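
In code, this estimate is a single scan over the T group counts for x; the sketch
below assumes the gt_sketch_t layout used in the earlier sketches.

```c
/* Estimate n_x as the minimum, over the T hash functions, of the count of the
 * group that x falls into (Corollary 3.8).  Reuses gt_sketch_t / hash_eval. */
static long long estimate_count(const gt_sketch_t *s, uint32_t x)
{
    long long est = -1;
    for (int i = 0; i < s->T; i++) {
        long long c = s->counts[i][hash_eval(s->h[i], x, (uint32_t)s->W)][0];
        if (est < 0 || c < est)
            est = c;
    }
    return est;
}
```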

3.3 Time-Space Tradeoff
In certain situations when transactions are occurring at very high rates, it is
vital to make the update procedure as fast as possible. One of the drawbacks of
the current procedure is that it depends on the product of T and log m, which
can be slow for items with large identifiers. For reducing the time dependency
on T , note that the data structure is intrinsically parallelizable: each of the
T hash functions can be applied in parallel, and the relevant counts modified
separately. In the experimental section we will show that good results are ob-
served even for very small values of T ; therefore, the main bottleneck is the
dependence on log m.
   The dependency on log m arises because we need to recover the identifier
of each hot item, and we do this 1 bit at a time. Our observation here is that
we can find the identifier in different units, for example, 1 byte at a time, at
the expense of extra space usage. Formally, define dig(x, i, b) to be the ith digit
in the integer x when x is written in base b ≥ 2. Within each group, we keep
(b − 1) × logb m subgroups: the i, j th subgroup counts how many items have
dig(x, i, b) = j for i = 1 · · · logb m and j = 1 · · · b − 1. We do not need to keep a
subgroup for j = 0 since this count can be computed from the other counts for
that group. Note that b = 2 corresponds to the binary case discussed already,
and b = m corresponds to the simple strategy of keeping a count for every item.
   THEOREM 3.9. Using the above procedure, with probability at least 1 − δ, then
we can find all hot items whose frequency is more than 1/(k + 1), and with probability
at least 1 − (δ/k), each item which is output has frequency at least 1/(k + 1) − ε, using
space O((b/ε) logb(m) log(k/δ)) words. Each update takes time O(logb(m) log(k/δ))
and queries take O((b/ε) logb(m) log^2 (k/δ)) time.
   PROOF. Each subgroup now allows us to read off one digit in the base-b
representation of the identifier of any hot item x. Lemma 3.3 applies to this
situation just as before, as does Lemma 3.4. This leads us to set W and T as
before. We have to update one counter for each digit in the base b representation
of each item for each transaction, which corresponds to logb m counters per test,
giving an update time of O(T logb(m)). The space required is for the counters
to record the subgroups of T W groups, and there are (b − 1) logb(m) subgroups
of every group, giving the space bounds.
   For efficient implementations, it will generally be preferable to choose b to
be a power of 2, since this allows efficient computation of indices using bit-
level operations (shifts and masks). The space cost can be relatively high for
speedups: choosing b = 28 means that each update operation is eight times
faster than for b = 2, but requires 32 times more space. A more modest value of
b may strike the right balance: choosing b = 4 doubles the update speed, while
the space required increases by 50%. We investigate the effects of this tradeoff
further in our experimental study.
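
For example, when b = 2^r, the digit function can be computed with a shift and a
mask, as in the following sketch (the helper name is ours):

```c
/* dig(x, i, b) for b = 2^r, via shifts and masks: digits are numbered from 1,
 * so digit i occupies bits (i-1)*r .. i*r - 1 of x.  Illustrative only. */
#include <stdint.h>

static uint32_t dig_pow2(uint32_t x, int i, int r)   /* base b = 1 << r */
{
    return (x >> ((i - 1) * r)) & ((1u << r) - 1);
}
```

With r = 8 (b = 256) and 32-bit identifiers, each update touches logb m = 4 digit
positions rather than 32 bit positions.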

4. ADAPTIVE GROUP TESTING
The more flexible model of adaptive group testing allows conceptually simpler
choices of groups, although the data structures required to support the tests
become more involved. The idea is a very natural “divide-and-conquer” style
approach, and as such may seem straightforward. We give the full details here
to emphasize the relation between viewing this as an adaptive group testing
procedure and the above nonadaptive group testing approach. Also, this method
does not seem to have been published before, so we give the full description for
completeness.
   Consider again the problem of finding a majority item, assuming that one
exists. Then an adaptive group testing strategy is as follows: test whether
the count of all items in the range {1 · · · m/2} is above n/2, and also whether
the count of all items in the range {m/2 + 1 · · · m} is over the threshold. Recurse
on whichever half contains more than half the items, and the majority item is
found in log2 m rounds.

Fig. 4. Adaptive group testing algorithms.

   The question is: how to support this adaptive strategy as transactions are
seen? As counts increase and decrease, we do not know in advance which queries
will be posed, and so the solution seems to be to keep counts for every test that
could be posed—but there are Ω(m) such tests, which is too much to store. The
solution comes by observing that we do not need to know counts exactly, but
rather it suffices to use approximate counts, and these can be supported using
a data structure that is much smaller, with size dependent on the quality of
approximation. We shall make use of the fact that the range of items can be
mapped onto the integers 1 · · · m. We will initially describe an adaptive group
testing method in terms of an oracle that is assumed to give exact answers, and
then show how this oracle can be realized approximately.
   Definition 4.1. A dyadic range sum oracle returns the (approximate) sum
of the counts of items in the range l = i·2^j + 1 · · · r = (i + 1)·2^j for 0 ≤ j ≤ log m
and 0 ≤ i ≤ m/2^j.
    Using such an oracle, which reflects the effect of items arriving and departing,
it is possible to find all the hot items, with the following binary search divide-
and-conquer procedure. For simplicity of presentation, we assume that m, the
range of items, is a power of 2. Beginning with the full range, recursively split in
two. If the total count of any range is less than n/(k+1), then do not split further.
Else, continue splitting until a hot item is found. It follows that O(k log(m/k))
calls are made to the oracle. The procedure is presented as ADAPTIVEGROUPTEST
on the right in Figure 4.
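
The recursion can be sketched as follows; oracle() stands in for the dyadic range
sum oracle of Definition 4.1 (exact here, approximate in practice), and the names
are illustrative rather than the ADAPTIVEGROUPTEST procedure of Figure 4 itself.

```c
/* Sketch of the adaptive search: report items in index*2^level + 1 ... (index+1)*2^level
 * whose (approximate) count exceeds thresh.  oracle(level, index) returns the
 * count of that dyadic range. */
#include <stdio.h>
#include <stdint.h>

typedef long long (*oracle_fn)(int level, uint64_t index);

static void adaptive_search(oracle_fn oracle, int level, uint64_t index, long long thresh)
{
    if (oracle(level, index) <= thresh)
        return;                         /* no hot item in this range: stop splitting */
    if (level == 0) {                   /* a single item above the threshold */
        printf("hot item: %llu\n", (unsigned long long)(index + 1));
        return;
    }
    adaptive_search(oracle, level - 1, 2 * index, thresh);      /* left half */
    adaptive_search(oracle, level - 1, 2 * index + 1, thresh);  /* right half */
}

/* Usage (under these assumptions): adaptive_search(my_oracle, log2_m, 0, n / (k + 1)); */
```
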
    In order to implement dyadic range sum oracles, define an approximate count
oracle to return the (approximate) count of the item x. A dyadic range sum oracle
can be implemented using j = 0 · · · log m approximate count oracles: for each
item in the stream x, insert ⌊x/2^j⌋ into the jth approximate count oracle, for all j.
Recent work has given several methods of implementing the approximate count
oracle, which can be updated to reflect the arrival or departure of any item. We
now list three examples of these and give their space and update time bounds:
— The “tug of war sketch” technique of Alon et al. [1999] uses space and time
  O((1/ε^2) log(1/δ)) to approximate any count up to εn with a probability of at
  least 1 − δ.
— The method of random subset sums described in Gilbert et al. [2002b] uses
  space and time O((1/ε^2) log(1/δ)).
— The method of Charikar et al. [2002] builds a structure which can be used
  to approximate the count of any item correct up to εn in space O((1/ε^2) log(1/δ))
  and time per update O(log(1/δ)).

  The fastest of these methods is that of Charikar et al. [2002], and so we shall
adopt this as the basis of our adaptive group testing solution. In the next section
we describe and analyze the data structure and algorithms for our purpose of
finding hot items.

4.1 CCFC Count Sketch
We shall briefly describe and analyze the CCFC count sketch.1 This is a different
and shorter analysis compared to that given in Charikar et al. [2002], since here
the goal is to estimate each count to within an error in terms of the total count
of all items rather than in the count of the kth most frequent item, as was the
case in the original article.

   4.1.1 Data Structure. The data structure used consists of a table of coun-
ters t, with width W and height T , initialized to zero. We also keep T pairs of
universal hash functions: h1 · · · hT , which map items onto 1 · · · W , and g 1 · · · g T ,
which map items onto {−1, +1}.

  4.1.2 Update Routine. When an insert transaction of item x occurs, we
update t[i][hi (x)] ← t[i][hi (x)] + g i (x) for all i = 1 · · · T . For a delete transaction,
we update t[i][hi (x)] ← t[i][hi (x)] − g i (x) for all i = 1 · · · T .

   4.1.3 Estimation. To estimate the count of x, compute median_i (t[i][hi (x)] ·
g i (x)).
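
The update/estimate pair can be sketched in C as below; deriving the ±1 signs
g_i from a second set of universal hash functions (taking a hash modulo 2) is our
simplification, and the sketch reuses hashfn_t/hash_eval from Section 3.2.1.

```c
/* Sketch of the CCFC count sketch: a T x W table of counters, hashes h_i
 * mapping items to columns and sign functions g_i mapping items to {-1,+1}. */
#include <stdlib.h>

typedef struct {
    int T, W;
    long long **t;        /* t[T][W], initialized to zero */
    hashfn_t *h, *g;      /* T column hashes and T sign hashes */
} ccfc_t;

static int sign(const ccfc_t *s, int i, uint32_t x)
{
    return (hash_eval(s->g[i], x, 2) == 0) ? -1 : +1;
}

static void ccfc_update(ccfc_t *s, uint32_t x, int delta)  /* +1 insert, -1 delete */
{
    for (int i = 0; i < s->T; i++)
        s->t[i][hash_eval(s->h[i], x, (uint32_t)s->W)] += delta * sign(s, i, x);
}

static int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a, y = *(const long long *)b;
    return (x > y) - (x < y);
}

static long long ccfc_estimate(const ccfc_t *s, uint32_t x)
{
    long long est[s->T];                      /* C99 variable-length array */
    for (int i = 0; i < s->T; i++)
        est[i] = s->t[i][hash_eval(s->h[i], x, (uint32_t)s->W)] * sign(s, i, x);
    qsort(est, s->T, sizeof(long long), cmp_ll);
    return est[s->T / 2];                     /* median of the T estimates */
}
```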

   4.1.4 Analysis. Use the random variable X_i to denote t[i][h_i(x)] · g_i(x). The
expectation of each estimate is

   E(X_i) = n_x + Σ_{y≠x} n_y · Pr[h_i(y) = h_i(x)] · (Pr[g_i(x) = g_i(y)] − Pr[g_i(x) ≠ g_i(y)]) = n_x

since Pr[g_i(x) = g_i(y)] = 1/2. The variance of each estimate is

   Var(X_i) = E(X_i^2) − E(X_i)^2                                                       (2)
            = E(g_i(x)^2 (t[i][h_i(x)])^2) − n_x^2                                      (3)
            = 2 Σ_{y≠x,z} n_y n_z Pr[h_i(y) = h_i(z)] (Pr[g_i(x) = g_i(y)] − Pr[g_i(x) ≠ g_i(y)])   (4)
              + n_x^2 + Σ_{y≠x} g_i(y)^2 n_y^2 Pr[h_i(y) = h_i(x)] − n_x^2              (5)
            = Σ_{y≠x} n_y^2 / W ≤ n^2 / W.                                              (6)

   Using the Chebyshev inequality, it follows that Pr[|X_i − n_x| > √2 · n/√W] < 1/2.
Taking the median of T estimates amplifies this probability to 2^(−T/4), by a stan-
dard Chernoff bounds argument [Motwani and Raghavan 1995].

1 CCFC   denotes the initials of the authors of Charikar et al. [2002].


  4.1.5 Space and Time. The space used is for the W T counters and the 2T
hash functions. The time taken for each update is the time to compute the 2T
hash functions, and update T counters.
   THEOREM 4.2. By setting W = 2/ε^2 and T = 4 log(1/δ), then we can estimate the
count of any item up to error ±εn with probability at least 1 − δ.

4.2 Adaptive Group Testing Using CCFC Count Sketch
We can now implement an adaptive group testing solution to finding hot items.
The basic idea is to apply the adaptive binary search procedure using the above
count sketch to implement the dyadic range sum oracle. The full procedure is
shown in Figure 4.
   THEOREM 4.3. Setting W = 2/ε^2 and T = log(k log m / δ) allows us to find every
item with frequency greater than 1/(k + 1) + ε, and report no item with frequency
less than 1/(k + 1) − ε, with a probability of at least 1 − δ. The space used is
O((1/ε^2) log(m) log(k log m / δ)) words, and the time to perform each update is
O(log(m) log(k log m / δ)). The query time is O(k log m log(k log m / δ)) with a
probability of at least 1 − δ.
   PROOF. We set the probability of failure of each oracle call to be low (δ/(k log m)),
so that for the O(k log m) queries that we pose to the oracle, there is probability
at most δ of any of them failing, by the union bound. Hence, we can assume that
with a probability of at least 1 − δ, all approximations are within the ±εn error
bound. Then, when we search for hot items, any range containing a hot item will
have its approximate count reduced by at most εn. This will allow us to find the
hot item, and output it if its frequency is at least 1/(k + 1) + ε. Any item which is
output must pass the final test, based on the count of just that item, which will
not happen if its frequency is less than 1/(k + 1) − ε.

   Space is needed for log(m) sketches, each of which has size O(T W ) words. For
these settings of T and W , we obtain the space bounds listed in the theorem.
The time per update is that needed to compute 2T log(m) hash values, and then
to update up to this many counters, which gives the stated update time.

   4.2.1 Hot Item Count Estimation. Note that we can immediately extract
the estimated counts for each hot item using the data structure, since the count
of item x is given by using the lowest-level approximate count. Hence, the count
n_x is estimated with error at most εn in time O(log(m) log(k log m / δ)).


4.3 Time-Space Tradeoffs
As with the nonadaptive group testing method, the time cost for updates de-
pends on T and log m. Again, in practice we found that small values of T could
be used, and that computation of the hash functions could be parallelized for
extra speedup. Here, the dependency on log m is again the limiting factor. A
similar trick to the nonadaptive case is possible, to change the update time
dependency to logb m for arbitrary b: instead of basing the oracle on dyadic
ranges, base it on b-adic ranges. Then only logb m sketches need to be updated
for each transaction. However, under this modification, the same guarantees
do not hold. In order to extract the hot items, many more queries are needed:
instead of making at most two queries per hot item per level, we make at most
b queries per hot item per level, and so we need to reduce the probability of
making a mistake to reflect this. One solution would be to modify T to give a
guarantee—but this can lose the point of the exercise, which is to reduce the
cost of each update. So instead we treat this as a heuristic to try out in practice,
and to see how well it performs.
   A more concrete improvement to space and time bounds comes from observ-
ing that it is wasteful to keep sketches for high levels in the hierarchy, since
there are very few items to monitor. It is therefore an improvement to keep
exact counts for items at high levels in the hierarchy.

5. COMPARISON BETWEEN METHODS AND EXTENSIONS
We have described two methods to find hot items after observing a sequence of
insertion and deletion transactions, and proved that they can give guarantees
about the quality of their output. These are the first methods to be able to give
such guarantees in the presence of deletions, and we now go on to compare
these two different approaches. We will also briefly discuss how they can be
adapted when the input may come in other formats.
   Under the theoretical analysis, it is clear that the adaptive and nonadap-
tive methods have some features in common. Both make use of universal hash
functions to map items to counters where counts are maintained. However, the
theoretical bounds on the adaptive search procedure look somewhat weaker
than those on the nonadaptive methods. To give a guarantee of not outputting
items which are more than ε from being hot items, the adaptive group testing
depends on 1/ε^2 in space, whereas nonadaptive testing uses 1/ε. The update
times look quite similar, depending on the product of the number of tests, T ,
and the bit depth of the universe, logb(m). It will be important to see how these
methods perform in practice, since these are only worst-case guarantees. In or-
der to compare these methods in concrete terms, we shall use the same values
of T and W for adaptive and nonadaptive group testing in our tests, so that
both methods are allocated approximately the same amount of space.
   Another difference is that adaptive group testing requires many more hash
function evaluations to process each transaction compared to nonadaptive
group testing. This is because adaptive group testing computes a different hash
for each of log m prefixes of the item, whereas nonadaptive group testing com-
putes one hash function to map the item to a group, and then allocates it to
subgroups based on its binary representation. Although the universal hash
functions can be implemented quite efficiently [Thorup 2000], this extra pro-
cessing time can become apparent for high transaction rates.

5.1 Other Update Models
In this work we assume that we modify counts by one each time to model in-
sertions or deletions. But there is no reason to insist on this: the above proofs
work for arbitrary count distributions; hence it is possible to allow the counts
to be modified by arbitrary increments or decrements, in the same update time
bounds. The counts can even include fractional values if so desired. This holds
for both the adaptive and nonadaptive methods. Another feature is that it is
straightforward to combine the data structures for the merge of two distribu-
tions: providing both data structures were created using the same parameters
and hash functions, then summing the counters coordinatewise gives the same
set of counts as if the whole distribution had been processed by a single data
structure. This should be contrasted to other approaches [Babcock and Olston
2003], which also compute the overall hot items from multiple sources, but keep
a large amount of space at each location: instead the focus is on minimizing the
amount of communication. Immediate comparison of the approaches is not pos-
sible, but for periodic updates (say, every minute) it would be interesting to
compare the communication used by the two methods.
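
For the nonadaptive structure of Section 3.2, this merge is just a coordinatewise
sum of counters, as in the sketch below (which assumes both sketches share T, W,
and the hash functions, and follows the gt_sketch_t layout used earlier).

```c
/* Merge sketch b into sketch a.  Valid only when both were created with the
 * same T, W, and hash functions; the result summarizes both streams. */
static void merge_sketches(gt_sketch_t *a, const gt_sketch_t *b)
{
    a->n += b->n;
    for (int i = 0; i < a->T; i++)
        for (int j = 0; j < a->W; j++)
            for (int l = 0; l <= LOGM; l++)
                a->counts[i][j][l] += b->counts[i][j][l];
}
```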

6. EXPERIMENTS

6.1 Evaluation
To evaluate our approach, we implemented our group testing algorithms in
C. We also implemented two algorithms which operate on nondynamic data,
the algorithms Lossy Counting [Manku and Motwani 2002] and Frequent [De-
maine et al. 2002]. Neither algorithm is able to cope with the case of the dele-
tion of an item, and there is no obvious modification to accommodate dele-
tions and still guarantee the quality of the output. We instead performed
a “best effort” modification: since both algorithms keep counters for certain
items, which are incremented when that item is inserted, we modified the
algorithms to decrement the counter whenever the corresponding item was
deleted. When an item without a counter was deleted, then we took no action.2
This modification ensures that when the algorithms encounter an inserts-only
dataset, then their action is the same as the original algorithms. Code for
our implementations is available on the Web, from http://www.cs.rutgers.
edu/˜muthu/massdal-code-index.html.

   6.1.1 Evaluation Criteria. We ran tests on both synthetic and real data,
and measured time and space usage of all four methods. Evaluation was carried
out on a 2.4-GHz desktop PC with 512-MB RAM. In order to evaluate the quality
of the results, we used two standard measures: the recall and the precision.
  Definition 6.1. The recall of an experiment to find hot items is the pro-
portion of the hot items that are found by the method. The precision is the
proportion of items identified by the algorithm which are hot items.
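
For reference, these two measures can be computed directly from the output and
the ground truth, as in this small helper (illustrative only, not taken from the
article's test harness):

```c
/* Recall and precision as used in the evaluation: given the set of true hot
 * items and the algorithm's output, count the overlap. */
#include <stdint.h>

typedef struct { double recall, precision; } quality_t;

static quality_t score(const uint32_t *hot, int n_hot, const uint32_t *out, int n_out)
{
    int correct = 0;
    for (int i = 0; i < n_out; i++)
        for (int j = 0; j < n_hot; j++)
            if (out[i] == hot[j]) { correct++; break; }
    quality_t q;
    q.recall    = n_hot ? (double)correct / n_hot : 1.0;
    q.precision = n_out ? (double)correct / n_out : 1.0;
    return q;
}
```
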
   It will be interesting to see how these properties interact. For example, if
an algorithm outputs every item in the range 1 · · · m then it clearly has perfect
recall (every hot item is indeed included in the output), but its precision is
very poor. At the other extreme, an algorithm which is able to identify only the

2 Many   variations of this theme are possible. Our experimental results here that compare our
algorithms to modifications of Lossy Counting [Manku and Motwani 2002] and Frequent [Demaine
et al. 2002] should be considered proof-of-concept only.





Fig. 5. Experiments on a sequence of 10^7 insertion-only transactions. Left: testing recall (propor-
tion of the hot items reported). Right: testing precision (proportion of the output items which were
hot).

most frequent item will have perfect precision, but may have low recall if there
are many hot items. For example, the Frequent algorithm gives guarantees on
the recall of its output, but does not strongly bound the precision, whereas,
for Lossy Counting, the parameter ε affects the precision indirectly (depending
on the properties of the sequence). Meanwhile, our group testing methods give
probabilistic guarantees of perfect recall and good precision.

   6.1.2 Setting of Parameters. In all our experiments, we set ε = 1/(k + 1) and
hence set W = 2(k + 1), since this keeps the memory usage quite small. In practice,
we found that this setting of ε gave quite good results for our group testing
methods, and that smaller values of ε did not significantly improve the results.
In all the experiments, we ran both group testing methods with the same val-
ues of W and T , which ensured that on most base experiments they used the
same amount of space. In our experiments, we looked at the effect of varying
the value of the parameters T and b. We gave the parameter ε to each algo-
rithm and saw how much space it used to give a guarantee based on this ε.
In general, the deterministic methods used less space than the group testing
methods. However, when we made additional space available to the determin-
istic methods equivalent to that used by the group testing approaches, we did
not see any significant improvement in their precision and we saw a similar
pattern of dependency on the Zipf parameter.

6.2 Insertions-Only Data
Although our methods have been designed for the challenges of transaction
sequences that contain a mix of insertions and deletions, we first evaluated a
sequence of transactions which contained only insertions. These were gener-
ated by a Zipf distribution, whose parameter was varied from 0 (uniform) to 3
(highly skewed). We set k = 1000, so we were looking for all items with fre-
quency 0.1% and higher. Throughout, we worked with a universe of size m = 2^32.
Our first observation on the performance of group testing-based methods is
that they gave good results with very small values of T . The plots in Figure 5
show the precision and recall of the methods with T = 2, meaning that each
item was placed in two groups in nonadaptive group testing, and two estimates
were computed for each count in adaptive group testing. Nonadaptive group




            Fig. 6. Experiments on synthetic data consisting of 10^7 transactions.

testing is denoted as algorithm “NAGT,” and adaptive group testing as algo-
rithm “Adapt.” Note that, on this data set, the algorithms Lossy Counting and
Frequent both achieved perfect recall, that is, they returned every hot item.
This is not surprising: the deterministic guarantees ensure that they will find
all hot items when the data consists of inserts only. Group testing approaches
did pretty well here: nonadaptive got almost perfect recall, and adaptive missed
only a few for near uniform distributions. On distributions with a small Zipf
parameter, many items had counts which were close to the threshold for be-
ing a hot item, meaning that adaptive group testing can easily miss an item
which is just over the threshold, or include an item which is just below. This is
also visible in the precision results: while nonadaptive group testing included
no items which were not hot, adaptive group testing did include some. How-
ever, the deterministic methods also did quite badly on precision, frequently
including many items which were not hot in their output, while, for this value
of ε, Lossy Counting did much better than Frequent, but consistently worse
than group testing. As we increased T , both nonadaptive and adaptive group
testing got perfect precision and recall on all distributions. For the experiment
illustrated, the group testing methods both used about 100 KB of space each,
while the deterministic methods used a smaller amount of space (around half as
much).

6.3 Synthetic Data with Insertions and Deletions
We created synthetic datasets designed to test the behavior when confronted
with a sequence including deletes. The datasets were created in three equal
parts: first, a sequence of insertions distributed uniformly over a small range;
next, a sequence of inserts drawn from a Zipf distribution with varying param-
eters; last, a sequence of deletes distributed uniformly over the same range as
the starting sequence. The net effect of this sequence should be that the first
and last groups of transactions would (mostly) cancel out, leaving the “true”
signal from the Zipf distribution. The dataset was designed to test whether the
algorithms could find this signal from the added noise. We generated a dataset
of 10,000,000 items, so it was possible to compute the exact answers in order
to compare, and searched for the k = 1000 hot items while varying the Zipf pa-
rameter of the signal. The results are shown in Figure 6, with the recall plotted
on the left and the precision on the right. Each data point comes from one trial,
rather than averaging over multiple repetitions.

   The purpose of this experiment was to demonstrate a scenario where insert-
only algorithms would not be able to cope when the dataset included many
deletes (in this case, one in three of the transactions was a deletion). Lossy
Counting performed worst on both recall and precision, while Frequent man-
aged to get good recall only when the signal was very skewed, meaning the
hot items had very high frequencies compared to all other items. Even when
the recall of the other algorithms was reasonably good (finding around three-
quarters of the hot items), their precision was very poor: for every hot item that
was reported, around 10 infrequent items were also included in the output,
and we could not distinguish between these two types. Meanwhile, both group
testing approaches succeeded in finding almost all hot items, and outputting
few infrequent items.
   There is a price to pay for the extra power of the group testing algorithms: they
take longer to process each item under our implementation, and require more
memory. However, these memory requirements are all very small compared
to the size of the dataset: both group testing methods used 187 KB, while Lossy
Counting allocated 40 KB on average and Frequent used 136 KB.3 In a later
section, we look at the time and space costs of the group testing methods in
more detail.

6.4 Real Data with Insertions and Deletions
We obtained data from one of AT&T's networks for part of a day, totaling around
100 MB. This consisted of a sequence of new telephone connections being initi-
ated, and subsequently closed. The duration of the connections varied consid-
erably, meaning that at any one time there were huge numbers of connections
in place. In total, there were 3.5 million transactions. We ran the algorithms
on this dynamic sequence in order to test their ability to operate on naturally
occurring sequences. After every 100,000 transactions we posed the query to
find all (source, destination) pairs with a current frequency greater than 1%.
We grouped connections by their regional codes, giving many millions of
possible pairs, m, although we discovered that geographically neighboring ar-
eas generated the most communication. This meant that there were significant
numbers of pairings achieving the target frequency. Again, we computed recall
and precision for the three algorithms, with the results shown in Figure 7: we
set T = 2 again and ran nonadaptive group testing (NAGT) and adaptive group
testing (Adapt).
   The nonadaptive group testing approach is shown to be justified here on real
data. In terms of both recall and precision, it is nearly perfect. On one occasion,
it overlooked a hot item, and a few times it included items which were not
hot. Under certain circumstances this may be acceptable if the items included
are “nearly hot,” that is, are just under the threshold for being considered hot.
However, we did not pursue this line. In the same amount of space, adaptive
group testing did almost as well, although its recall and precision were both

3 These reflected the space allocated for the insert-only algorithms based on upper bounds on the
space needed. This was done to avoid complicated and costly memory allocation while processing
transactions.

less good overall than nonadaptive. Both methods reached perfect precision and
recall as T was increased: nonadaptive group testing achieved perfect scores
for T = 3, and adaptive for T = 7.

Fig. 7. Performance results on real data.

Fig. 8. Choosing the frequency level at query time: the data structure was built for queries at the
0.5% level, but was then tested with queries ranging from 10% to 0.01%.

   Lossy Counting performed generally poorly on this dynamic dataset: its
result quality swung wildly between readings, and on average it found only
half the hot items. The recall of the Frequent algorithm looked reasonably
good, especially as time progressed, but its precision, which began poorly,
appeared to degrade further. One possible explanation is that the algorithm
was collecting all items which were ever hot, and outputting these whether
they were hot or not. Certainly, it output between two and three times as many
items as were currently hot, meaning that its output necessarily contained
many infrequent items.
   Next, we ran tests which demonstrated the flexibility of our approach. As
noted in Section 3.2, if we create a set of counters for nonadaptive group testing
for a particular frequency level f = 1/(k + 1), then we can use these counters to
answer a query for a higher frequency level without any need for recomputation.
To test this, we computed the data structure for the first million items of the
real data set based on a frequency level of 0.5%. We then asked for all hot items
for a variety of frequencies between 10% and 0.5%. The results are shown
in Figure 8. As predicted, the recall level was the same (100% throughout),
and precision was high, with a few nonhot items included at various points.
We then examined how much below the designed capability we could push the
group testing algorithm, and ran queries asking for hot items with progressively
lower frequencies. For nonadaptive group testing with T = 1, the quality of the
recall began deteriorating after the query frequency descended below 0.5%, but
for T = 3 the results maintained an impressive level of recall down to around
the 0.05% level, after which the quality deteriorated (around this point, the
threshold for being considered a hot item was down to having a count in single
figures, due to deletions removing previously inserted items). Throughout, the
precision of both sets of results was very high, close to perfect even when used
far below the intended range of operation.

Fig. 9. Timing results on real data.

6.5 Timing Results
On the real data, we timed how long it took to process transactions, as we
varied certain parameters of the methods. We also plotted the time taken by
the insert-only methods for comparison. Timing results are shown in Figure 9.
On the left are timing results for working through the whole data set. As we
would expect, the time scaled roughly linearly with the number of transac-
tions processed. Nonadaptive group testing was a few times slower than
the insert-only methods, which were very fast. With T = 2, nonadaptive
group testing processed over a million transactions per second. Adaptive group
testing was somewhat slower. Although asymptotically the two methods have
the same update cost, here we see the effect of the difference in the methods:
since adaptive group testing computes many more hash functions than non-
adaptive (see Section 5), the cost of this computation is clear. It is therefore
desirable to look at how to reduce the number of hash function computations
done by adaptive group testing. Applying the ideas discussed in Sections 3.3
and 4.3, we tried varying the parameter b from 2.
   The results for this are shown on the right in Figure 9. Here, we plot the
time to process two million transactions for different values of b against T , the
number of repetitions of the process. It can be seen that increasing b does indeed
bring down the cost of adaptive and nonadaptive group testing. For T = 1,
nonadaptive group testing becomes competitive with the insert-only methods in
terms of time to process each transaction. We also measured the output time
for each method. The adaptive group testing approach took an average 5 ms
per query, while the nonadaptive group testing took 2 ms. The deterministic
approaches took less than 1 ms per query.

6.6 Time-Space Tradeoffs
To see in more detail the effect of varying b, we plotted the time to process two
million transactions for eight different values of b (2, 4, 8, 16, 32, 64, 128, and
256) and three values of T (1, 2, 3) at k = 100.

Fig. 10. Time and space costs of varying b.

Fig. 11. Precision and recall on real data as b and T vary.

The results are shown in Figure 10. Although increasing b does improve the update time for every
method, the effect becomes much less pronounced for larger values of b, sug-
gesting that the most benefit is to be had for small values of b. The benefit seems
strongest for adaptive group testing, which has the most to gain. Nonadaptive
group testing still computes T functions per item, so eventually the benefit of
larger b is insignificant compared to this fixed cost.
   For nonadaptive group testing, the space must increase as b increases. We
plotted this on the right in Figure 10. It can be seen that the space increases
quite significantly for large values of b, as predicted. For b = 2 and T = 1, the
space used is about 12 kB, while for b = 256, the space has increased to 460 kB.
For T = 2 and T = 3, the space used is twice and three times this, respectively.
   It is important to see the effect of this tradeoff on accuracy as well. For non-
adaptive group testing, the precision and recall remained the same (100% for
both) as b and T were varied. For adaptive group testing, we kept the space
fixed and looked at how the accuracy varied for different values of T . The results
are given in Figure 11. It can be seen that there is little variation in the recall
with b, but it increases slightly with T , as we would expect. For precision, the
difference is more pronounced. For small values of T , increasing b to speed up
processing has an immediate effect on the precision: more items which are not
hot are included in the output as b increases. For larger values of T , this effect
is reduced: increasing b does not affect precision by as much. Note that the
transaction processing time is proportional to T/ log(b), so it seems that good
tradeoffs are achieved for T = 1 and b = 4 and for T = 3 and b = 8 or 16.
Looking at Figure 10, we see that these points achieve similar update times, of
approximately one million items per second in our experiments.

7. CONCLUSIONS
We have proposed two new methods for identifying hot items which occur more
than some frequency threshold. These are the first methods which can cope with
dynamic datasets, that is, the removal as well as the addition of items. They
perform to a high degree of accuracy in practice, as guaranteed by our analysis of
the algorithm, and are quite simple to implement. In our experimental analysis,
it seemed that an approach based on nonadaptive group testing was slightly
preferable to one based on adaptive group testing, in terms of recall, precision,
and time.
    Recently, we have taken these ideas of using group testing techniques to
identify items of interest in small space, and applied them to other problems.
For example, consider finding items which have the biggest frequency differ-
ence between two datasets. Using a similar arrangement of groups but a dif-
ferent test allows us to find such items while processing transactions at very
high rates and keeping only small summaries for each dataset [Cormode and
Muthukrishnan 2004b]. This is of interest in a number of scenarios, such as
trend analysis, financial datasets, and anomaly detection [Yi et al. 2000]. One
point of interest is that, for that scenario, it is straightforward to generalize the
nonadaptive group testing approach, but the adaptive group testing approach
cannot be applied so easily.
    Our approach of group testing may have application to other problems, no-
tably in designing summary data structures for the maintenance of other statis-
tics of interest and in data stream applications. An interesting open problem
is to find combinatorial designs which can achieve the same properties as our
randomly chosen groups, in order to give a fully deterministic construction for
maintaining hot items. The main challenge here is to find good “decoding” meth-
ods: given the result of testing various groups, how to determine what the hot
items are. We need such methods that work quickly in small space.
    A significant problem that we have not approached here is that of continu-
ously monitoring the hot items—that is, to maintain a list of all items that are
hot, and keep this updated as transactions are observed. A simple solution is to
keep the same data structure, and to run the query procedure when needed, say
once every second, or whenever n has changed by more than k. (After an item
is inserted, it is easy to check whether it is now a hot item. Following deletions,
other items can become hot, but the threshold of n/(k + 1) only changes when
n has decreased by k + 1.) In our experiments, the cost of running queries is
a matter of milliseconds and so is quite a cheap operation to perform. In some
situations this is sufficient, but a more general solution is needed for the full
version of this problem.
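
As an illustration of this simple policy, the Python sketch below wraps a dynamic
hot-items structure behind insert and delete operations and re-runs the full query
only when the answer can have changed. The structure interface (insert, delete,
is_hot, query) is assumed for the sake of the example and is not part of the
algorithms described above.

    class ContinuousMonitor:
        """Maintain an (approximately) current hot-item list over a dynamic
        hot-items structure, re-running the query procedure only when needed."""

        def __init__(self, structure, k):
            self.s, self.k = structure, k
            self.n = 0            # net number of live items
            self.n_last = 0       # value of n at the last full query
            self.hot = set()

        def _requery(self):
            self.hot = set(self.s.query())   # full query: milliseconds in practice
            self.n_last = self.n

        def insert(self, x):
            self.s.insert(x)
            self.n += 1
            if self.s.is_hot(x):             # an insert can only make x itself newly hot
                self.hot.add(x)
            if abs(self.n - self.n_last) > self.k:
                self._requery()

        def delete(self, x):
            self.s.delete(x)
            self.n -= 1
            # the threshold n/(k+1) only moves once n has changed by more than k
            if abs(self.n - self.n_last) > self.k:
                self._requery()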

ACKNOWLEDGMENTS

We thank the anonymous referees for many helpful suggestions.

REFERENCES

AHO, A. V., HOPCROFT, J. E., AND ULLMAN, J. D. 1987. Data Structures and Algorithms. Addison-
 Wesley, Reading, MA.


ALON, N., GIBBONS, P., MATIAS, Y., AND SZEGEDY, M. 1999. Tracking join and self-join sizes in limited
  storage. In Proceedings of the Eighteenth ACM Symposium on Principles of Database Systems.
  10–20.
ALON, N., MATIAS, Y., AND SZEGEDY, M. 1996. The space complexity of approximating the frequency
  moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Com-
  puting. 20–29. Journal version in J. Comput. Syst. Sci., 58, 137–147, 1999.
BABCOCK, B. AND OLSTON, C. 2003. Distributed top-k monitoring. In Proceedings of ACM SIGMOD
  International Conference on Management of Data.
BARBARA, D., WU, N., AND JAJODIA, S. 2001. Detecting novel network intrusions using Bayes esti-
  mators. In Proceedings of the First SIAM International Conference on Data Mining.
BOYER, B. AND MOORE, J. 1982. A fast majority vote algorithm. Tech. Rep. 35. Institute for Com-
  puter Science, University of Texas at Austin, Austin, TX.
CARTER, J. L. AND WEGMAN, M. N. 1979. Universal classes of hash functions. J. Comput. Syst.
  Sci. 18, 2, 143–154.
CHARIKAR, M., CHEN, K., AND FARACH-COLTON, M. 2002. Finding frequent items in data streams. In
  Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP).
  693–703.
CORMODE, G. AND MUTHUKRISHNAN, S. 2003. What’s hot and what’s not: Tracking most frequent
  items dynamically. In Proceedings of ACM Conference on Principles of Database Systems. 296–
  306.
CORMODE, G. AND MUTHUKRISHNAN, S. 2004a. An improved data stream summary: The count-min
  sketch and its applications. J. Algorithms. In press.
CORMODE, G. AND MUTHUKRISHNAN, S. 2004b. What’s new: Finding significant differences in net-
  work data streams. In Proceedings of IEEE Infocom.
DEMAINE, E., LÓPEZ-ORTIZ, A., AND MUNRO, J. I. 2002. Frequency estimation of Internet packet
  streams with limited space. In Proceedings of the 10th Annual European Symposium on Algo-
  rithms. Lecture Notes in Computer Science, vol. 2461. Springer, Berlin, Germany, 348–360.
DU, D.-Z. AND HWANG, F. 1993. Combinatorial Group Testing and Its Applications. Series on Ap-
  plied Mathematics, vol. 3. World Scientific, Singapore.
ESTAN, C. AND VARGHESE, G. 2002. New directions in traffic measurement and accounting. In
  Proceedings of ACM SIGCOMM. Journal version in Comput. Commun. Rev. 32, 4, 323–338.
FANG, M., SHIVAKUMAR, N., GARCIA-MOLINA, H., MOTWANI, R., AND ULLMAN, J. D. 1998. Computing
  iceberg queries efficiently. In Proceedings of the International Conference on Very Large Data
  Bases. 299–310.
FISCHER, M. AND SALZBERG, S. 1982. Finding a majority among n votes: Solution to problem 81-5.
  J. Algorithms 3, 4, 376–379.
GAROFALAKIS, M., GEHRKE, J., AND RASTOGI, R. 2002. Querying and mining data streams: You only
  get one look. In Proceedings of the ACM SIGMOD International Conference on Management of
  Data.
GIBBONS, P. AND MATIAS, Y. 1998. New sampling-based summary statistics for improving approx-
  imate query answers. In Proceedings of the ACM SIGMOD International Conference on Manage-
  ment of Data. Journal version in ACM SIGMOD Rec. 27, 331–342.
GIBBONS, P. AND MATIAS, Y. 1999. Synopsis structures for massive data sets. DIMACS Series in
  Discrete Mathematics and Theoretical Computer Science A.
GIBBONS, P. B., MATIAS, Y., AND POOSALA, V. 1997. Fast incremental maintenance of approximate
  histograms. In Proceedings of the International Conference on Very Large Data Bases. 466–
  475.
GILBERT, A., GUHA, S., INDYK, P., KOTIDIS, Y., MUTHUKRISHNAN, S., AND STRAUSS, M. 2002a. Fast,
  small-space algorithms for approximate histogram maintenance. In Proceedings of the 34th ACM
  Symposium on the Theory of Computing. 389–398.
GILBERT, A., KOTIDIS, Y., MUTHUKRISHNAN, S., AND STRAUSS, M. 2001. QuickSAND: Quick summary
  and analysis of network data. DIMACS Tech. Rep. 2001-43. Available online at http://dimacs.
  rutgers.edu/TechnicalReports/.
GILBERT, A. C., KOTIDIS, Y., MUTHUKRISHNAN, S., AND STRAUSS, M. 2002b. How to summarize the
  universe: Dynamic maintenance of quantiles. In Proceedings of the International Conference on
  Very Large Data Bases. 454–465.


IOANNIDIS, Y. E. AND CHRISTODOULAKIS, S. 1993. Optimal histograms for limiting worst-case error
  propagation in the size of join results. ACM Trans. Database Syst. 18, 4, 709–748.
IOANNIDIS, Y. E. AND POOSALA, V. 1995. Balancing histogram optimality and practicality for query
  result size estimation. In Proceedings of the ACM SIGMOD International Conference on the
  Management of Data. 233–244.
KARP, R., PAPADIMITRIOU, C., AND SHENKER, S. 2003. A simple algorithm for finding frequent ele-
  ments in sets and bags. ACM Trans. Database Syst. 28, 51–55.
KUSHILEVITZ, E. AND NISAN, N. 1997. Communication Complexity. Cambridge University Press,
  Cambridge, U.K.
MANKU, G. AND MOTWANI, R. 2002. Approximate frequency counts over data streams. In Proceed-
  ings of the International Conference on Very Large Data Bases. 346–357.
MISRA, J. AND GRIES, D. 1982. Finding repeated elements. Sci. Comput. Programm. 2, 143–152.
MOTWANI, R. AND RAGHAVAN, P. 1995. Randomized Algorithms. Cambridge University Press,
  Cambridge, U.K.
MUTHUKRISHNAN, S. 2003. Data streams: Algorithms and applications. In Proceedings of the
  14th Annual ACM-SIAM Symposium on Discrete Algorithms. Available online at http://
  athos.rutgers.edu/∼muthu/stream-1-1.ps.
THORUP, M. 2000. Even strongly universal hashing is pretty fast. In Proceedings of the 11th
  Annual ACM-SIAM Symposium on Discrete Algorithms. 496–497.
YI, B.-K., SIDIROPOULOS, N., JOHNSON, T., JAGADISH, H., FALOUTSOS, C., AND BILIRIS, A. 2000. Online
  data mining for co-evolving time sequences. In Proceedings of the 16th International Conference
  on Data Engineering (ICDE’ 00). 13–22.

Received October 2003; revised June 2004; accepted September 2004




XML Stream Processing Using Tree-Edit
Distance Embeddings
MINOS GAROFALAKIS
Bell Labs, Lucent Technologies
and
AMIT KUMAR
Indian Institute of Technology


We propose the first known solution to the problem of correlating, in small space, continuous
streams of XML data through approximate (structure and content) matching, as defined by a
general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously
embedding tree-edit distance metrics into an L1 vector space while guaranteeing a (worst-case)
upper bound of O(log² n log∗ n) on the distance distortion between any data trees with at most n
nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known
random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data
tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance
computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous
XML document streams. Experimental results from an empirical study with both synthetic and
real-life XML data trees validate our approach, demonstrating that the average-case behavior of
our embedding techniques is much better than what would be predicted from our theoretical worst-
case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-
distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity
joins) XML data in the streaming model.
Categories and Subject Descriptors: H.2.4 [Database Management]: Systems—Query processing;
G.2.1 [Discrete Mathematics]: Combinatorics—Combinatorial algorithms
General Terms: Algorithms, Performance, Theory
Additional Key Words and Phrases: XML, data streams, data synopses, approximate query pro-
cessing, tree-edit distance, metric-space embeddings



1. INTRODUCTION
The Extensible Markup Language (XML) is rapidly emerging as the new
standard for data representation and exchange on the Internet. The simple,

A preliminary version of this article appeared in Proceedings of the 22nd Annual ACM
SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (San Diego, CA, June)
[Garofalakis and Kumar 2003].
Authors’ addresses: M. Garofalakis, Bell Labs, Lucent Technologies, 600 Mountain Ave., Murray
Hill, NJ 07974; email: minos@research.bell-labs.com; A. Kumar, Department of Computer Sci-
ence and Engineering, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India; email:
amitk@cse.iitd.ernet.in.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is
granted without fee provided that the copies are not made or distributed for profit or commercial
advantage, the copyright notice, the title of the publication, and its date appear, and notice is given
that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to
redistribute to lists requires prior specific permission and/or a fee.
 C 2005 ACM 0362-5915/05/0300-0279 $5.00


                        ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 279–332.

self-describing nature of the XML standard promises to enable a broad suite of
next-generation Internet applications, ranging from intelligent Web searching
and querying to electronic commerce. In many respects, XML documents are
instances of semistructured data: the underlying data model comprises an or-
dered, labeled tree of element nodes, where each element can be either an atomic
data item or a composite data collection consisting of references (represented
as edges) to child elements in the XML tree. Further, labels (or tags) stored
with XML data elements describe the actual semantics of the data, rather than
simply specifying how elements are to be displayed (as in HTML). Thus, XML
data is tree-structured and self-describing.
   The flexibility of the XML data model makes it a very natural and powerful
tool for representing data from a wide variety of Internet data sources. Of
course, given the typical autonomy of such sources, identical or similar data
instances can be represented using different XML-document tree structures.
For example, different online news sources may use distinct document type
descriptor (DTD) schemas to export their news stories, leading to different node
labels and tree structures. Even when the same DTD is used, the resulting XML
trees may not have the same structure, due to the presence of optional elements
and attributes [Guha et al. 2002].
   Given the presence of such structural differences and inconsistencies, it is
obvious that correlating XML data across different sources needs to rely on
approximate XML-document matching, where the approximation is quanti-
fied through an appropriate general distance metric between XML data trees.
Such a metric for comparing ordered labeled trees has been developed by the
combinatorial pattern matching community in the form of tree-edit distance
[Apostolico and Galil 1997; Zhang and Shasha 1989]. In a nutshell, the tree-
edit distance metric is the natural generalization of edit distance from the string
domain; thus, the tree-edit distance between two tree structures represents the
minimum number of basic edit operations (node inserts, deletes, and relabels)
needed to transform one tree to the other.
   Tree-edit distance is a natural metric for correlating and discovering approx-
imate matches in XML document collections (e.g., through an appropriately de-
fined similarity-join operation).1 The problem becomes particularly challeng-
ing in the context of streaming XML data sources, that is, when such cor-
relation queries must be evaluated over continuous XML data streams that
arrive and need to be processed on a continuous basis, without the benefit
of several passes over a static, persistent data image. Algorithms for corre-
lating such XML data streams would need to work under very stringent con-
straints, typically providing (approximate) results to user queries while (a) look-
ing at the relevant XML data only once and in a fixed order (determined by the
stream-arrival pattern) and (b) using a small amount of memory (typically, log-
arithmic or polylogarithmic in the size of the stream) [Alon et al. 1996, 1999;

1 Specific semantics associated with XML node labels and tree-edit operations can be captured
using a generalized, weighted tree-edit distance metric that associates different weights/costs with
different operations. Extending the algorithms and results in this article to weighted tree-edit
distance is an interesting open problem.

Dobra et al. 2002; Gilbert et al. 2001]. Of course, such streaming-XML tech-
niques are more generally applicable in the context of huge, terabyte XML
databases, where performing multiple passes over the data to compute an ex-
act result can be prohibitively expensive. In such scenarios, having single-pass,
space-efficient XML query-processing algorithms that produce good-quality ap-
proximate answers offers a very viable and attractive alternative [Babcock et al.
2002; Garofalakis et al. 2002].

Fig. 1. Example DTD fragments (a) and (b) and XML Document Trees (c) and (d) for autonomous
bibliographic Web sources.

   Example 1.1. Consider the problem of integrating XML data from two au-
tonomous, bibliographic Web sources WS1 and WS2 . One of the key issues in
such data-integration scenarios is that of detecting (approximate) duplicates
across the two sources [Dasu and Johnson 2003]. For autonomously managed
XML sources, such duplicate-detection tasks are complicated by the fact that
the sources could be using different DTD structures to describe their entries. As
a simple example, Figures 1(a) and 1(b) depict the two different DTD fragments
employed by WS1 and WS2 (respectively) to describe XML trees for academic
publications; clearly, WS1 uses a slightly different set of tags (i.e., article in-
stead of paper) as well as a “deeper” DTD structure (by adding the type and
authors structuring elements).
   Figures 1(c) and 1(d) depict two example XML document trees T1 and T2
from WS1 and WS2 , respectively; even though the two trees have structural
differences, it is obvious that T1 and T2 represent the same publication. In
fact, it is easy to see that T1 and T2 are within a tree-edit distance of 3 (i.e.,
one relabel and two delete operations on T1 ). Approximate duplicate detection
across WS1 and WS2 can be naturally expressed as a tree-edit distance simi-
larity join operation that returns the pairs of trees (T1 , T2 ) ∈ WS1 × WS2 that
are within a tree-edit distance of τ , where the user/application-defined simi-
larity threshold τ is set to a value ≥ 3 to perhaps account for other possible
differences in the joining tree structures (e.g., missing or misspelled coauthor

names). A single-pass, space-efficient technique for approximating such simi-
larity joins as the document trees from the two XML data sources are streaming
in would provide an invaluable data-integration tool; for instance, estimates of
the similarity-join result size (i.e., the number of approximate duplicate entries)
can provide useful indicators of the degree of overlap (i.e., “content similarity”)
or coverage (i.e., “completeness”) of autonomous XML data sources [Dasu and
Johnson 2003; Florescu et al. 1997].


1.1 Prior Work
Techniques for data reduction and approximate query processing for both re-
lational and XML databases have received considerable attention from the
database research community in recent years [Acharya et al. 1999; Chakrabarti
et al. 2000; Garofalakis and Gibbons 2001; Ioannidis and Poosala 1999;
Polyzotis and Garofalakis 2002; Polyzotis et al. 2004; Vitter and Wang 1999].
The vast majority of such proposals, however, rely on the assumption of a static
data set which enables either several passes over the data to construct effec-
tive data synopses (such as histograms [Ioannidis and Poosala 1999] or Haar
wavelets [Chakrabarti et al. 2000; Vitter and Wang 1999]); clearly, this as-
sumption renders such solutions inapplicable in a data-stream setting. Mas-
sive, continuous data streams arise naturally in a variety of different applica-
tion domains, including network monitoring, retail-chain and ATM transaction
processing, Web-server record logging, and so on. As a result, we are witnessing
a recent surge of interest in data-stream computation, which has led to several
(theoretical and practical) studies proposing novel one-pass algorithms with
limited memory requirements for different problems; examples include quan-
tile and order-statistics computation [Greenwald and Khanna 2001; Gilbert
et al. 2002b]; distinct-element counting [Bar-Yossef et al. 2002; Cormode et al.
2002a]; frequent itemset counting [Charikar et al. 2002; Manku and Motwani
2002]; estimating frequency moments, join sizes, and difference norms [Alon
et al. 1996, 1999; Dobra et al. 2002; Feigenbaum et al. 1999; Indyk 2000]; and,
computing one- or multidimensional histograms or Haar wavelet decomposi-
tions [Gilbert et al. 2002a; Gilbert et al. 2001; Thaper et al. 2002]. All these
articles rely on an approximate query-processing model, typically based on an
appropriate underlying stream-synopsis data structure. (A different approach,
explored by the Stanford STREAM project [Arasu et al. 2002], is to character-
ize subclasses of queries that can be computed exactly with bounded memory.)
The synopses of choice for a number of the above-cited data-streaming articles
are based on the key idea of pseudorandom sketches which, essentially, can
be thought of as simple, randomized linear projections of the underlying data
item(s) (assumed to be points in some numeric vector space).
   Recent work on XML-based publish/subscribe systems has dealt with XML
document streams, but only in the context of simple, predicate-based filtering of
individual documents [Altinel and Franklin 2000; Chan et al. 2002; Diao et al.
2003; Gupta and Suciu 2003; Lakshmanan and Parthasarathy 2002]; more re-
cent work has also considered possible transformations of the XML documents
in order to produce customized output [Diao and Franklin 2003]. Clearly, the

problem of efficiently correlating XML documents across one or more input
streams gives rise to a drastically different set of issues. Guha et al. [2002]
discussed several different algorithms for performing tree-edit distance joins
over XML databases. Their work introduced easier-to-compute bounds on the
tree-edit distance metric and other heuristics that can significantly reduce the
computational cost incurred due to all-pairs tree-edit distance computations.
However, Guha et al. focused solely on exact join computation and their al-
gorithms require multiple passes over the data; this obviously renders them
inapplicable in a data-stream setting.

1.2 Our Contributions
All earlier work on correlating continuous data streams (through, e.g., join or
norm computations) in small space has relied on the assumption of flat, rela-
tional data items over some appropriate numeric vector space; this is certainly
the case with the sketch-based synopsis mechanism (discussed above), which
has been the algorithmic tool of choice for most of these earlier research efforts.
Unfortunately, this limitation renders earlier streaming results useless for di-
rectly dealing with streams of structured objects defined over a complex metric
space, such as XML-document streams with a tree-edit distance metric.
   In this article, we propose the first known solution to the problem of approx-
imating (in small space) the result of correlation queries based on tree-edit dis-
tance (such as the tree-edit distance similarity joins described in Example 1.1)
over continuous XML data streams. The centerpiece of our solution is a novel
algorithm for effectively (i.e., “obliviously” [Indyk 2001]) embedding streaming
XML and the tree-edit distance metric into a numeric vector space equipped
with the standard L1 distance norm, while guaranteeing a worst-case upper
bound of O(log² n log∗ n) on the distance distortion between any data trees with
at most n nodes.2 Our embedding is completely deterministic and relies on
parsing an XML tree into a hierarchy of special subtrees. Our parsing makes
use of a deterministic coin-tossing process recently introduced by Cormode and
Muthukrishnan [2002] for embedding a variant of the string-edit distance (that,
in addition to standard string edits, includes an atomic “substring move” op-
eration) into L1 ; however, since we are dealing with general trees rather than
flat strings, our embedding algorithm and its analysis are significantly more
complex, and result in different bounds on the distance distortion.3
   We also demonstrate how our vector-space embedding construction can be
combined with earlier sketching techniques [Alon et al. 1999; Dobra et al. 2002;
Indyk 2000] to obtain novel algorithms for (1) constructing a small sketch syn-
opsis of a massive, streaming XML data tree that can be used as a concise
2 All log’s in this article denote base-2 logarithms; log∗ n denotes the number of log applications
required to reduce n to a quantity that is ≤ 1, and is a very slowly increasing function of n.
3 Note that other known techniques for approximating string-edit distance based on the decom-

position of strings into q-grams [Ukkonen 1992; Gravano et al. 2001] only give one-sided error
guarantees, essentially offering no guaranteed upper bound on the distance distortion. For in-
stance, it is not difficult to construct examples of very distinct strings with nearly identical q-gram
sets (i.e., arbitrarily large distortion). Furthermore, to the best of our knowledge, the results in
Ukkonen [1992] have not been extended to the case of trees and tree-edit distance.


surrogate for the full tree in tree-edit distance computations, and (2) estimat-
ing the result size of a tree-edit-distance similarity join over two streams of
XML documents. Finally, we present results from an empirical study of our
embedding algorithm with both synthetic and real-life XML data trees. Our ex-
perimental results offer some preliminary validation of our approach, demon-
strating that the average-case behavior of our techniques over realistic data sets
is much better than what our theoretical worst-case distortion bounds would
predict, and revealing several interesting characteristics of our algorithms in
practice. To the best of our knowledge, ours are the first algorithmic results on
oblivious tree-edit distance embeddings, and on effectively correlating contin-
uous, massive streams of XML data.
   We believe that our embedding algorithm also has other important ap-
plications. For instance, exact tree-edit distance computation is typically a
computationally-expensive problem that can require up to O(n⁴) time (for the
conventional tree-edit distance metric [Apostolico and Galil 1997; Zhang and
Shasha 1989]), and is, in fact, NP-hard for the variant of tree-edit distance
considered in this article (even for the simpler case of flat strings [Shapira and
Storer 2002]). In contrast, our embedding scheme can be used to provide an
approximate tree-edit distance (to within a guaranteed O(log² n log∗ n) factor)
in near-linear, that is, O(n log∗ n), time.

1.3 Organization
The remainder of this article is organized as follows. Section 2 presents back-
ground material on XML, tree-edit distance and data-streaming techniques.
In Section 3, we present an overview of our approach for correlating XML data
streams based on tree-edit distance embeddings. Section 4 presents our embed-
ding algorithm in detail and proves its small-time and low distance-distortion
guarantees. We then discuss two important applications of our algorithm for
XML stream processing, namely (1) building a sketch synopsis of a massive,
streaming XML data tree (Section 5), and (2) approximating similarity joins
over streams of XML documents (Section 6). We present the results of our
empirical study with synthetic and real-life XML data in Section 7. Finally,
Section 8 outlines our conclusions. The Appendix provides ancillary lemmas
(and their proofs) for the upper bound result.

2. PRELIMINARIES

2.1 XML Data Model and Tree-Edit Distance
An XML document is essentially an ordered, labeled tree T , where each node
in T represents an XML element and is characterized by a label taken from a
fixed alphabet of string literals σ . Node labels capture the semantics of XML
elements, and edges in T capture element nesting in the XML data. Without
loss of generality, we assume that the alphabet σ captures all node labels,
literals, and atomic values that can appear in an XML tree (e.g., based on the
underlying DTD(s)); we also focus on the ordered, labeled tree structure of the
XML data and ignore the raw-character data content inside nodes with string
labels (PCDATA, CDATA, etc.). We use |T| and |σ| to denote the number of nodes
in T and the number of symbols in σ, respectively.

Fig. 2. Example XML tree and tree-edit operation.
    Given two XML document trees T1 and T2 , the tree-edit distance between T1
and T2 (denoted by d (T1 , T2 )) is defined as the minimum number of tree-edit op-
erations needed to transform one tree into the other. The standard set of tree-edit opera-
tions [Apostolico and Galil 1997; Zhang and Shasha 1989] includes (1) relabeling
(i.e., changing the label) of a tree node v; (2) deleting a tree node v (and moving
all of v’s children under its parent); and (3) inserting a new node v under a node
w and moving a contiguous subsequence of w’s children (and their descendants)
under the new node v. (Note that the node-insertion operation is essentially the
complement of node deletion.) An example XML tree and tree-edit operation are
depicted in Figure 2. In this article, we consider a variant of the tree-edit dis-
tance metric, termed tree-edit distance with subtree moves, that, in addition to
the above three standard edit operations, allows a subtree to be moved under a
new node in the tree in one step. We believe that subtree moves make sense as
a primitive edit operation in the context of XML data—identical substructures
can appear in different locations (for example, due to a slight variation of the
DTD), and rearranging such substructures should probably be considered as
basic an operation as node insertion or deletion. In the remainder of this article,
the term tree-edit distance assumes the four primitive edit operations described
above, namely, node relabelings, deletions, insertions, and subtree moves.4
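
To make the four primitive operations concrete, the following Python sketch applies
them to a minimal ordered, labeled tree representation; the Node class and function
names are illustrative only and are not part of the formal development.

    class Node:
        """Ordered, labeled tree node: a label plus an ordered list of children."""
        def __init__(self, label, children=None):
            self.label = label
            self.children = list(children or [])

    def relabel(v, new_label):                  # (1) relabel node v
        v.label = new_label

    def delete(parent, i):                      # (2) delete parent's ith child,
        v = parent.children.pop(i)              #     moving its children under parent
        parent.children[i:i] = v.children

    def insert(parent, i, j, label):            # (3) insert a new node under parent,
        new = Node(label, parent.children[i:j]) #     adopting the contiguous children i..j-1
        parent.children[i:j] = [new]

    def subtree_move(old_parent, i, new_parent, j):   # (4) move a whole subtree in one step
        v = old_parent.children.pop(i)
        new_parent.children.insert(j, v)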

2.2 Data Streams and Basic Pseudorandom Sketching
In a data-streaming environment, data-processing algorithms are allowed to
see the incoming data records (e.g., relational tuples or XML documents) only
once as they are streaming in from (possibly) different data sources [Alon et al.
1996, 1999; Dobra et al. 2002]. Backtracking over the stream and explicit access
to past data records are impossible. The data-processing algorithm is also al-
lowed a small amount of memory, typically logarithmic or polylogarithmic in the
data-stream size, in order to maintain concise synopsis data structures for the
input stream(s). In addition to their small-space requirement, these synopses
should also be easily computable in a single pass over the data and with small
per-record processing time. At any point in time, the algorithm can combine the
maintained collection of synopses to produce an approximate result.
4 The problem of designing efficient (i.e., “oblivious”), guaranteed-distortion embedding schemes
for the standard tree-edit distance metric remains open; of course, this is also true for the much
simpler standard string-edit distance metric (i.e., without “substring moves”) [Cormode and
Muthukrishnan 2002].


   We focus on one particular type of stream synopses, namely, pseudorandom
sketches; sketches have provided effective solutions for several streaming prob-
lems, including join and multijoin processing [Alon et al. 1996, 1999; Dobra
et al. 2002], norm computation [Feigenbaum et al. 1999; Indyk 2000], distinct-
element counting [Cormode et al. 2002a], and histogram or Haar-wavelet con-
struction [Gilbert et al. 2001; Thaper et al. 2002]. We describe the basics of
pseudorandom sketching schemes using a simple binary-join cardinality es-
timation query [Alon et al. 1999]. More specifically, assume that we want to
estimate Q = COUNT(R1 ⋈_A R2), that is, the cardinality of the binary equi-
join of two streaming relations R1 and R2 over a (numeric) attribute (or,
set of attributes) A, whose values we assume (without loss of generality) to
range over {1, . . . , N }. (Note that, by the definition of the equijoin operator,
the two join attributes have identical value domains.) Letting f k (i) (k = 1, 2;
i = 1, . . . , N) denote the frequency of the ith value in Rk, it is easy to see
that Q = Σ_{i=1}^{N} f_1(i) f_2(i). Clearly, estimating this join size exactly requires
at least Ω(N) space, making an exact solution impractical for a data-stream
setting.
   In their seminal work, Alon et al. [Alon et al. 1996, 1999] proposed a ran-
domized technique that can offer strong probabilistic guarantees on the quality
of the resulting join-size estimate while using space that can be significantly
smaller than N . Briefly, the key idea is to (1) build an atomic sketch X k (essen-
tially, a randomized linear projection) of the distribution vector for each input
stream Rk (k = 1, 2) (such a sketch can be easily computed over the streaming
values of Rk in only O(log N ) space) and (2) use the atomic sketches X 1 and X 2
to define a random variable X Q such that (a) X Q is an unbiased (i.e., correct on
expectation) randomized estimator for the target join size, so that E[X Q ] = Q,
and (b) X Q ’s variance (Var[X Q ]) can be appropriately upper-bounded to allow
for probabilistic guarantees on the quality of the Q estimate. More formally,
this random variable X Q is constructed on-line from the two data streams as
follows:

— Select a family of four-wise independent binary random variates {ξi : i =
  1, . . . , N }, where each ξi ∈ {−1, +1} and P [ξi = +1] = P [ξi = −1] = 1/2 (i.e.,
  E[ξi ] = 0). Informally, the four-wise independence condition means that, for
  any 4-tuple of ξi variates and for any 4-tuple of {−1, +1} values, the probabil-
  ity that the values of the variates coincide with those in the {−1, +1} 4-tuple
  is exactly 1/16 (the product of the equality probabilities for each individual
  ξi ). The crucial point here is that, by employing known tools (e.g., orthogonal
  arrays) for the explicit construction of small sample spaces supporting four-
  wise independence, such families can be efficiently constructed on-line using
  only O(log N ) space [Alon et al. 1996].
— Define X_Q = X_1 · X_2, where the atomic sketch X_k is defined simply as
  X_k = Σ_{i=1}^{N} f_k(i)ξ_i, for k = 1, 2. Again, note that each X_k is a simple randomized
  linear projection (inner product) of the frequency vector of R_k.A with the
  vector of ξ_i's, and that it can be efficiently maintained over the streaming values of
  A as follows: start a counter with X_k = 0 and simply add ξ_i to X_k whenever
  the ith value of A is observed in the R_k stream (a code sketch of this construction
  is given after the list).
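
As an illustration, the Python sketch below maintains such atomic sketches online
and averages X_1 · X_2 over iid copies. The degree-3 polynomial hash used to
generate the four-wise independent ±1 variates is one standard realization and an
assumption of this sketch; the median-selection step of the boosting procedure
described next is omitted for brevity.

    import random

    class AtomicAMSSketch:
        """One atomic AMS sketch X_k = sum_i f_k(i) * xi(i), maintained online."""

        _PRIME = (1 << 61) - 1                      # large Mersenne prime

        def __init__(self, seed):
            rng = random.Random(seed)
            # degree-3 polynomial hash => four-wise independent xi in {-1, +1}
            self.coeffs = [rng.randrange(self._PRIME) for _ in range(4)]
            self.x = 0

        def _xi(self, i):
            h = 0
            for c in self.coeffs:                   # Horner evaluation at i
                h = (h * i + c) % self._PRIME
            return 1 if h & 1 else -1

        def update(self, i, count=1):               # observe `count` occurrences of value i
            self.x += count * self._xi(i)

    def estimate_join_size(stream1, stream2, copies=64, seed=0):
        """Average X_1 * X_2 over iid sketch pairs to estimate COUNT(R1 join_A R2)."""
        s1 = [AtomicAMSSketch(seed + j) for j in range(copies)]
        s2 = [AtomicAMSSketch(seed + j) for j in range(copies)]   # same xi family per pair
        for i in stream1:
            for s in s1:
                s.update(i)
        for i in stream2:
            for s in s2:
                s.update(i)
        return sum(a.x * b.x for a, b in zip(s1, s2)) / copies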

   The quality of the estimation guarantees can be improved using a standard
boosting technique that maintains several independent identically distributed
(iid) instantiations of the above process, and uses averaging and median-
selection operators over the X Q estimates to boost accuracy and probabilis-
tic confidence [Alon et al. 1996]. (Independent instances can be constructed by
simply selecting independent random seeds for generating the families of four-
wise independent ξi ’s for each instance.) We use the term (atomic) AMS sketch
to describe a randomized linear projection computed in the above-described
manner over a data stream. Letting SJ_k (k = 1, 2) denote the self-join size of
R_k.A (i.e., SJ_k = Σ_{i=1}^{N} f_k(i)^2), the following theorem [Alon et al. 1999] shows
how sketching can be applied for estimating binary-join sizes in limited space.
(By standard Chernoff bounds [Motwani and Raghavan 1995], using median-
selection over O(log(1/δ)) of the averages computed in Theorem 2.1 allows
the confidence in the estimate to be boosted to 1 − δ, for any pre-specified
δ < 1.)
   THEOREM 2.1 [ALON ET AL. 1999]. Let the atomic AMS sketches X_1 and X_2 be
as defined above. Then, E[X_Q] = E[X_1 X_2] = Q and Var(X_Q) ≤ 2 · SJ_1 · SJ_2.
Thus, averaging the X_Q estimates over O(SJ_1 · SJ_2 / (ε^2 Q^2)) iid instantiations of
the basic scheme guarantees an estimate that lies within a relative error of at most ε from
Q with constant probability > 1/2.
   It should be noted that the space-usage bounds stated in Theorem 2.1 capture
the worst-case behavior of AMS-sketching-based estimation—empirical results
with synthetic and real-life data sets have demonstrated that the average-case
behavior of the AMS scheme is much better [Alon et al. 1999]. More recent
work has led to improved AMS-sketching-based estimators with provably better
space-usage guarantees (that actually match the lower bounds shown by Alon
et al. [1999]) [Ganguly et al. 2004], and has demonstrated that AMS-sketching
techniques can be extended to effectively handle one or more complex multi-
join aggregate SQL queries over a collection of relational streams [Dobra et al.
2002, 2004].
   Indyk [2000] discussed a different type of pseudorandom sketch, which is,
once again, defined as a randomized linear projection X_k = Σ_{i=1}^{N} f_k(i)ξ_i of
a streaming input frequency vector for the values in Rk , but using random
variates {ξi } drawn from a p-stable distribution (which can again be generated
in small space, i.e., O(log N ) space) in the X k computation. The class of p-
stable distributions has been studied for some time (see, e.g., Nolan [2004];
[Uchaikin and Zolotarev 1999])—they are known to exist for any p ∈ (0, 2],
and include well-known distribution functions, for example, the Cauchy distri-
bution (for p = 1) and the Gaussian distribution (for p = 2). As the following
theorem demonstrates, such p-stable sketches can provide accurate probabilis-
tic estimates for the L p -difference norm of streaming frequency vectors in small
space, for any p ∈ (0, 2].

   THEOREM 2.2 [INDYK 2000]. Let p ∈ (0, 2], and define the p-stable sketch for
the R_k stream as X_k = Σ_{i=1}^{N} f_k(i)ξ_i, where the {ξ_i} variates are drawn from a
p-stable distribution (k = 1, 2). Assume that we have built l = O(log(1/δ)/ε^2) iid
pairs of p-stable sketches {X_1^j, X_2^j} (j = 1, . . . , l), and define

                        X = median(|X_1^1 − X_2^1|, . . . , |X_1^l − X_2^l|).

Then, X lies within a relative error of at most ε of the L_p-difference norm ||f_1 −
f_2||_p = [Σ_i |f_1(i) − f_2(i)|^p]^{1/p} with probability ≥ 1 − δ.
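
For p = 1, the stable variates can be drawn from the standard Cauchy distribution.
The Python sketch below illustrates the construction and the median estimator of
Theorem 2.2; deriving each ξ_j(i) from a per-(j, i) seed, so that both streams see
the same variates without storing them, is an implementation convenience assumed
here rather than part of the theorem.

    import math
    import random

    def standard_cauchy(rng):
        """A standard Cauchy (1-stable) variate."""
        return math.tan(math.pi * (rng.random() - 0.5))

    class StableL1Sketch:
        """l independent 1-stable sketches X^j = sum_i f(i) * xi_j(i)."""

        def __init__(self, l, seed=0):
            self.l, self.seed = l, seed
            self.x = [0.0] * l

        def _xi(self, j, i):
            return standard_cauchy(random.Random(hash((self.seed, j, i))))

        def update(self, i, count=1):               # observe `count` occurrences of value i
            for j in range(self.l):
                self.x[j] += count * self._xi(j, i)

    def l1_estimate(a, b):
        """median_j |X_a^j - X_b^j| approximates ||f_a - f_b||_1 (the median of the
        absolute value of a standard Cauchy variate is 1)."""
        diffs = sorted(abs(u - v) for u, v in zip(a.x, b.x))
        return diffs[len(diffs) // 2]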

   More recently, Cormode et al. [2002a] have also shown that, with small values
of p (i.e., p → 0), p-stable sketches can provide very effective estimates for the
Hamming (i.e., L0 ) norm (or, the number of distinct values) over continuous
streams of updates.

3. OUR APPROACH: AN OVERVIEW
The key element of our methodology for correlating continuous XML data
streams is a novel algorithm that embeds ordered, labeled trees and the tree-
edit distance metric as points in a (numeric) multidimensional vector space
equipped with the standard L1 vector distance, while guaranteeing a small dis-
tortion of the distance metric. In other words, our techniques rely on mapping
each XML tree T to a numeric vector V (T ) such that the tree-edit distances be-
tween the original trees are well-approximated by the L1 vector distances of the
tree images under the mapping; that is, for any two XML trees S and T , the L1
distance ||V(S) − V(T)||_1 = Σ_j |V(S)[j] − V(T)[j]| gives a good approximation
of the tree-edit distance d (S, T ).
   Besides guaranteeing a small bound on the distance distortion, to be appli-
cable in a data-stream setting, such an embedding algorithm needs to satisfy
two additional requirements: (1) the embedding should require small space and
time per data tree in the stream; and, (2) the embedding should be oblivious,
that is, the vector image V (T ) of a tree T cannot depend on other trees in the
input stream(s) (since we cannot explicitly store or backtrack to past stream
items). Our embedding algorithm satisfies all these requirements.
   There is an extensive literature on low-distortion embeddings of metric
spaces into normed vector spaces; for an excellent survey of the results in this
area, please see the recent article by Indyk [2001]. A key result in this area is
Bourgain’s lemma proving that an arbitrary finite metric space is embeddable
in an L2 vector space with logarithmic distortion; unfortunately, Bourgain’s
technique is neither small space nor oblivious (i.e., it requires knowledge of the
complete metric space), so there is no obvious way to apply it in a data-stream
setting [Indyk 2001]. To the best of our knowledge, our algorithm gives the
first oblivious, small space/time vector-space embedding for a complex tree-edit
distance metric.
   Given our algorithm for approximately embedding streaming XML trees and
tree-edit distance in an L1 vector space, known streaming techniques (like the
sketching methods discussed in Section 2.2) now become relevant. In this ar-
ticle, we focus on two important applications of our results in the context of
streaming XML, and propose novel algorithms for (1) building a small sketch
synopsis of a massive, streaming XML data tree, and (2) approximating the size
of a similarity join over XML streams. Once again, these are the first results on

correlating (in small space) massive XML data streams based on the tree-edit
distance metric.

3.1 Technical Roadmap
The development of the technical material in this article is organized as fol-
lows. Section 4 describes our embedding algorithm for the tree-edit distance
metric (termed TREEEMBED) in detail. In a nutshell, TREEEMBED constructs a
hierarchical parsing of an input XML tree by iteratively contracting edges to
produce successively smaller trees; our parsing makes repeated use of a re-
cently proposed label-grouping procedure [Cormode and Muthukrishnan 2002]
for contracting chains and leaf siblings in the tree. The bulk of Section 4 is de-
voted to proving the small-time and low distance-distortion guarantees of our
TREEEMBED algorithm (Theorem 4.2). Then, in Section 5, we demonstrate how
our embedding algorithm can be combined with the 1-stable sketching tech-
nique of Indyk [2000] to build a small sketch synopsis of a massive, streaming
XML tree that can be used as a concise surrogate for the tree in approximate
tree-edit distance computations. Most importantly, we show that the proper-
ties of our embedding allow us to parse the tree and build this sketch in
small space and in one pass, as nodes of the tree are streaming by without
ever backtracking on the data (Theorem 5.1). Finally, Section 6 shows how
to combine our embedding algorithm with both 1-stable and AMS sketching
in order to estimate (in limited space) the result size of an approximate tree-
edit-distance similarity join over two continuous streams of XML documents
(Theorem 6.1).

4. OUR TREE-EDIT DISTANCE EMBEDDING ALGORITHM

4.1 Definitions and Overview
In this section, we describe our embedding algorithm for the tree-edit distance
metric (termed TREEEMBED) in detail, and prove its small-time and low distance-
distortion guarantees. We start by introducing some necessary definitions and
notational conventions.
   Consider an ordered, labeled tree T over alphabet σ , and let n = |T |. Also,
let v be a node in T , and let s denote a contiguous subsequence of children of
node v in T . If the nodes in s are all leaves, then we refer to s as a contiguous
leaf-child subsequence of v. (A leaf child of v that is not adjacent to any other
leaf child of v is called a lone leaf child of v.) We use T [v, s] to denote the subtree
of T obtained as the union of all subtrees rooted at nodes in s and node v itself,
retaining all node labels. We also use the notation T [v, s] to denote exactly the
same subtree as T [v, s], except that we do not associate any label with the root
node v of the subtree. We define a valid subtree of T as any subtree of the form
T [v, s], T [v, s], or a path of degree-2 nodes (i.e., a chain) possibly ending in a leaf
node in T .
   At a high level, our TREEEMBED algorithm produces a hierarchical parsing of
T into a multiset T (T ) of special valid subtrees by stepping through a number of
edge-contraction phases producing successively smaller trees. A key component

of our solution (discussed later in this section) is the recently proposed de-
terministic coin tossing procedure of Cormode and Muthukrishnan [2002] for
grouping symbols in a string—TREEEMBED employs that procedure repeatedly
during each contraction phase to merge tree nodes in a chain as well as sibling
leaf nodes. The vector image V (T ) of T is essentially the “characteristic vector”
for the multiset T (T ) (over the space of all possible valid subtrees). Our analysis
shows that the number of edge-contraction phases in T ’s parsing is O(log n),
and that, even though the dimensionality of V (T ) is, in general, exponential in
n, our construction guarantees that V (T ) is also very sparse: the total number
of nonzero components in V (T ) is only O(n). Furthermore, we demonstrate that
our TREEEMBED algorithm runs in near-linear, that is, O(n log∗ n) time. Finally,
we prove the upper and lower bounds on the distance distortion guaranteed by
our embedding scheme.
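
Since V(T) has only O(n) nonzero components, it can be stored simply as a counter
keyed by the names of the valid subtrees produced by the parsing, and the
approximate distance computation reduces to an L1 difference of two such counters.
The Python sketch below assumes the parsing itself (not shown) has already produced
the subtree-name multisets.

    from collections import Counter

    def characteristic_vector(subtree_names):
        """V(T): multiplicity of each valid-subtree name in the parsing of T."""
        return Counter(subtree_names)

    def l1_distance(v_s, v_t):
        """||V(S) - V(T)||_1 = sum_j |V(S)[j] - V(T)[j]| over the nonzero coordinates."""
        return sum(abs(v_s[name] - v_t[name]) for name in v_s.keys() | v_t.keys())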

4.2 The Cormode-Muthukrishnan Grouping Procedure
Clearly, the technical crux lies in the details of our hierarchical parsing pro-
cess for T that produces the valid-subtree multiset T (T ). A basic element of
our solution is the string-processing subroutine presented by Cormode and
Muthukrishnan [2002] that uses deterministic coin tossing to find landmarks
in an input string S, which are then used to split S into groups of two or
three consecutive symbols. A landmark is essentially a symbol y (say, at lo-
cation j ) of the input string S with the following key property: if S is trans-
formed into S′ by an edit operation (say, a symbol insertion) at location l far
away from j (i.e., |l − j| ≫ 1), then the Cormode-Muthukrishnan string-
processing algorithm ensures that y is still designated as a landmark in S′.
Due to space constraints, we do not give the details of their elegant landmark-
based grouping technique (termed CM-Group in the remainder of this arti-
cle) in our discussion—they can be found in Cormode and Muthukrishnan
[2002]. Here, we only summarize a couple of the key properties of CM-Group
that are required for the analysis of our embedding scheme in the following
theorem.
   THEOREM 4.1 [CORMODE AND MUTHUKRISHNAN 2002]. Given a string of length
k, the CM-Group procedure runs in time O(k log∗ k). Furthermore, the closest
landmark to any symbol x in the string is determined by at most log∗ k + 5
consecutive symbols to the left of x, and at most five consecutive symbols to the
right of x.

   Intuitively, Theorem 4.1 states that, for any given symbol x in a string of
length k, the group of (two or three) consecutive symbols chosen (by CM-Group)
to include x depends only on the symbols lying in a radius of at most log∗ k + 5
to the left and right of x. Thus, a string-edit operation occurring outside this
local neighborhood of symbol x is guaranteed not to affect the group formed
containing x. As we will see, this property of the CM-Group procedure is crucial in
proving the distance-distortion bounds for our TREEEMBED algorithm. Similarly,
the O(k log∗ k) complexity of CM-Group plays an important role in determining
the running time of TREEEMBED.
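   To make the local nature of this grouping concrete, here is a minimal Python sketch of the deterministic coin-tossing (alphabet reduction) step that CM-Group builds on. It is a simplified illustration only: it assumes no two adjacent symbols are equal, treats the first symbol naively, and omits the run handling, landmark selection, and 2/3-grouping rules of the actual procedure, so it should not be read as the CM-Group algorithm itself.

def coin_toss_round(labels):
    """One deterministic coin-tossing round: every position (except the first,
    which the real procedure treats specially) is relabeled by the index of the
    lowest bit in which it differs from its left neighbor, plus that bit's value."""
    out = [labels[0]]
    for prev, cur in zip(labels, labels[1:]):
        x = prev ^ cur                       # nonzero, since adjacent labels differ
        j = (x & -x).bit_length() - 1        # index of the lowest differing bit
        out.append(2 * j + ((cur >> j) & 1))
    return out

def reduce_alphabet(labels, target=6, max_rounds=32):
    """Iterate coin-tossing rounds until all (non-initial) labels fall below target.
    The key point: after r rounds, the label at position i depends only on the r
    symbols to its left -- the locality property behind Theorem 4.1; O(log* k)
    rounds suffice."""
    rounds = 0
    while len(labels) > 1 and max(labels[1:]) >= target and rounds < max_rounds:
        labels = coin_toss_round(labels)
        rounds += 1
    return labels, rounds

s = [ord(c) for c in "abracadabra"]          # no two adjacent symbols are equal
print(reduce_alphabet(s))                    # small labels after a handful of rounds

Landmarks are then chosen from these constant-size labels by purely local rules (e.g., local maxima), which is what guarantees that an edit far away from a symbol cannot change the group that the symbol ends up in.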

4.3 The TREEEMBED Algorithm
As mentioned earlier, our TREEEMBED algorithm constructs a hierarchical pars-
ing of T in several phases. In phase i, the algorithm builds an ordered, labeled
tree T i that is obtained from the tree of the previous phase T i−1 by contract-
ing certain edges. (The initial tree T 0 is exactly the original input tree T .)
Thus, each node v ∈ T i corresponds to a connected subtree of T —in fact, by
construction, our TREEEMBED algorithm guarantees that this subtree will be a
valid subtree of T . Let v(T ) denote the valid subtree of T corresponding to node
v ∈ T i . Determining the node label for v uses a hash function h() that maps the
set of all valid subtrees of T to new labels in a one-to-one fashion with high prob-
ability; thus, the label of v ∈ T i is defined as the hash-function value h(v(T )).
As we demonstrate in Section 7.1, such a valid-subtree-naming function can be
computed in small space/time using an adaptation of the Karp-Rabin string fin-
gerprinting algorithm [Karp and Rabin 1987]. Note that the existence of such
an efficient naming function is crucial in guaranteeing the small space/time
properties for our embedding algorithm since maintaining the exact valid sub-
trees v(T ) is infeasible; for example, near the end of our parsing, such subtrees
are of size O(|T |).5
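   As a rough illustration of such a naming function (a sketch with arbitrary constants, not the incremental construction of Section 7.1), one can serialize a valid subtree unambiguously—using, say, the (label, children) representation from the earlier sketch—and fingerprint the serialization with a Karp-Rabin-style polynomial hash:

P = (1 << 61) - 1        # a large Mersenne prime (arbitrary choice for this sketch)
BASE = 1_000_003         # hash base; the real scheme would pick this at random

def serialize(tree) -> str:
    """Unambiguous parenthesized serialization of a (label, children) tree;
    a None label marks an unlabeled root."""
    label, children = tree
    return "(" + (label or "") + "".join(serialize(c) for c in children) + ")"

def h(tree) -> int:
    """Karp-Rabin-style fingerprint of the subtree's serialization."""
    fp = 0
    for ch in serialize(tree):
        fp = (fp * BASE + ord(ch)) % P
    return fp

# Identical valid subtrees get identical names; with a randomly chosen base,
# distinct subtrees collide only with small probability.
print(h(("a", [("b", []), ("c", [])])) == h(("a", [("b", []), ("c", [])])))   # True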
   The pseudocode description of our TREEEMBED embedding algorithm is de-
picted in Figure 3. As described above, our algorithm builds a hierarchical
parsing structure (i.e., a hierarchy of contracted trees T i ) over the input tree
T , until the tree is contracted to a single node (|T i | = 1). The multiset T (T ) of
valid subtrees produced by our parsing for T contains all valid subtrees corre-
sponding to all nodes of the final hierarchical parsing structure tagged with a
phase label to distinguish between subtrees in different phases; that is, T (T )
comprises all < v(T i ), i > for all nodes v ∈ T i over all phases i (Step 18). Finally,
we define the L1 vector image V (T ) of T to be the “characteristic vector” of the
multi-set T (T ); in other words,
  V (T )[< t, i >] := number of times the < t, i > subtree-phase combination
                           appears in T (T ).
(We use the notation Vi (T ) to denote the restriction of V (T ) to only subtrees
occurring at phase i.) A small example execution of the hierarchical tree parsing
in our embedding algorithm is depicted pictorially in Figure 4.
   The L1 distance between the vector images of two trees S and T is defined
in the standard manner, that is, ‖V(T) − V(S)‖₁ = Σ_{x ∈ T(T) ∪ T(S)} |V(T)[x] −
V(S)[x]|. In the remainder of this section, we prove our main theorem on the
near-linear time complexity of our L1 embedding algorithm and the logarithmic
distortion bounds that our embedding guarantees for the tree-edit distance
metric.
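   In code, V(T) is naturally a sparse counter keyed by (subtree name, phase) pairs—for instance, the fingerprints h(v(T)) tagged with the phase index i—and the L1 distance above is a sum over the union of nonzero components. The following minimal Python sketch (with made-up keys, standing in for the output of the parsing phases) shows just this bookkeeping:

from collections import Counter

def l1_distance(V_T: Counter, V_S: Counter) -> int:
    """||V(T) - V(S)||_1 over the union of their nonzero components."""
    return sum(abs(V_T[k] - V_S[k]) for k in set(V_T) | set(V_S))

# Hypothetical sparse images of two small trees; keys are (subtree-name, phase).
V_T = Counter({("x", 0): 3, ("y", 0): 1, ("xy", 1): 1})
V_S = Counter({("x", 0): 2, ("z", 0): 2, ("xz", 1): 1})
print(l1_distance(V_T, V_S))    # |3-2| + |1-0| + |1-0| + |0-2| + |0-1| = 6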

5 An implicit assumption made in our running-time analysis of TREEEMBED (which is also present in
the complexity analysis of CM-Group in Cormode and Muthukrishnan [2002]—see Theorem 4.1) is
that the fingerprints produced by the naming function h() fit in a single memory word and, thus, can
be manipulated in constant (i.e., O(1)) time. If that is not the case, then an additional multiplicative
factor of O(log |T |) must be included in the running-time complexity to account for the length of
such fingerprints (see Section 7.1).





                               Fig. 3. Our tree-embedding algorithm.




                            Fig. 4. Example of hierarchical tree parsing.

   THEOREM 4.2. The TREEEMBED algorithm constructs the vector image V(T)
of an input tree T in time O(|T| log∗ |T|); further, the vector V(T) contains at
most O(|T|) nonzero components. Finally, given two trees S and T with n =
max{|S|, |T|}, we have

              d(S, T) ≤ 5 · ‖V(T) − V(S)‖₁ = O(log² n log∗ n) · d(S, T).

It is important to note here that, for certain special cases (i.e., when T is a
simple chain or a "star"), our TREEEMBED algorithm essentially degrades to
the string-edit distance embedding algorithm of Cormode and Muthukrishnan
[2002]. This, of course, implies that, for such special cases, their even tighter
O(log n log∗ n) bounds on the worst-case distance distortion are applicable. As
another special-case example, Figure 5 depicts the initial steps in the parsing
of a full binary tree T; note that, after two contraction phases, our parsing
essentially reduces a full binary tree of depth h to one of depth h − 1 (thus
decreasing the size of the tree by a factor of about 1/2).

           Fig. 5. Example of parsing steps for the special case of a full binary tree.
   As a first step in the proof of Theorem 4.2, we demonstrate the following
lemma which bounds the number of parsing phases. The key here is to show
that the number of tree nodes goes down by a constant factor during each
contraction phase of our embedding algorithm (Steps 3–16).
   LEMMA 4.3. The number of phases for our TREEEMBED algorithm on an input
tree T is O(log |T |).

  PROOF.    We partition the node set of T into several subsets as follows. First,
define
A(T ) = {v ∈ T : v is a nonroot node with degree 2 (i.e., with only one child) or
          v is a leaf child of a non-root node of degree 2}, and
B(T ) = {v ∈ T : v is a node of degree ≥ 3 (i.e., with at least two children) or
            v is the root node of T }.
Clearly, A(T ) ∪ B(T ) contains all internal (i.e., nonleaf) nodes of T ; in particular,
A(T ) contains all nodes appearing in (degree-2) chains in T (including potential
leaf nodes at the end of such chains). Thus, the set of remaining nodes of T ,
say L(T ), comprises only leaf nodes of T which have at least one sibling or
are children of the root. Let v be a leaf child of some node u, and let sv denote
the maximal contiguous set of leaf children of u which contains v. We further
partition the leftover set of leaf nodes L(T ) as follows:
L1 (T ) = {v ∈ L(T ) : |sv | ≥ 2},
L2 (T ) = {v ∈ L(T ) : |sv | = 1 and v is the leftmost such child of its parent}, and
L3 (T ) = L(T ) − L1 (T ) − L2 (T )
       = {v ∈ L(T ) : |sv | = 1 and v is not the leftmost such child of its parent}.
For notational convenience, we also use A(T ) to denote the set cardinality
|A(T )|, and similarly for other sets. We first prove the following ancillary claim.

  CLAIM 4.4.        For any rooted tree T with at least two nodes, L3 (T ) ≤ L2 (T ) +
A(T )/2 − 1.
    PROOF. We prove this claim by induction on the number of nodes in T . Sup-
pose T has only two nodes. Then, clearly, L3 (T ) = 0, L2 (T ) = 1, and A(T ) = 0.
Thus, the claim is true for the base case.
    Suppose the claim is true for all rooted trees with fewer than n nodes. Let T
have n nodes and let r be the root of T. First, consider the case when r has
only one child node (say, s), and let T′ be the subtree rooted at s. By induction,
L3(T′) ≤ L2(T′) + A(T′)/2 − 1. Clearly, L3(T) = L3(T′). Is L2(T) equal to L2(T′)?
It is not hard to see that the only case when a node u can occur in L2(T′) but not
in L2(T) is when s has only one child, u, which also happens to be a leaf. In this
case, obviously, u ∈ L2(T′) (since it is the sole leaf child of the root of T′), whereas in
T, u is the end-leaf of a chain, so it is counted in A(T) and, thus, u ∉ L2(T).
On the other hand, it is easy to see that both s and u are in A(T) − A(T′) in this
case, so that L2(T) + A(T)/2 = L2(T′) + A(T′)/2. Thus, the claim is true in this
case as well.
    Now, consider the case when the root node r of T has at least two children.
We construct several smaller subtrees, each of which is rooted at r (but contains
only a subset of r’s descendants). Let u1 , . . . , uk be the leaf children of r such that
sui = {ui } (i.e., have no leaf siblings); thus, by definition, u1 ∈ L2 (T ), whereas
ui ∈ L3 (T ) for all i = 2, . . . , k. We define the subtrees T1 , . . . , Tk+1 as follows.
For each i = 1, . . . , k + 1, Ti is the set of all descendants of r (including r itself)
that lie to the right of leaf ui−1 and to the left of leaf ui (as special cases, T1 is
the subtree to the left of u1 and Tk+1 is the subtree to the right of uk ). Note that
T1 and Tk+1 may not contain any nodes (other than the root node r), but, by the
definition of ui ’s, all other Ti subtrees are guaranteed to contain at least one
node other than r. Now, by induction, we have that
                                 L3 (Ti ) ≤ L2 (Ti ) + A(Ti )/2 − 1
for all subtrees Ti , except perhaps for T1 and Tk+1 (if they only comprise a sole
root node, in which case, of course, the L2 , L3 , and A subsets above are all
empty). Adding all these inequalities, we have
                           Σ_i L3(Ti) ≤ Σ_i L2(Ti) + Σ_i A(Ti)/2 − (k − 1),      (1)

where we only have k − 1 on the right-hand side since T1 and Tk+1 may not
contribute a −1 to this summation.
    Now, it is easy to see that, if u ∈ A(Ti), then u ∈ A(T) as well; thus, A(T) =
Σ_i A(Ti). Suppose u ∈ L2(Ti), and let w denote the parent of u. Note that w
cannot be the root node r of T (and of Ti). Indeed, suppose that w = r; then, since
u ∉ {u1, . . . , uk}, su contains a leaf node other than u which is also not in Ti
(since u ∈ L2(Ti)). But then, it must be the case that u is adjacent to one of
the leaves u1, . . . , uk, which is impossible; thus, w ≠ r which, of course, implies
that u ∈ L2(T) as well. Conversely, suppose that u ∈ L2(T); then, either u = u1
or the parent of u is in one of the subtrees Ti. In the latter case, u ∈ L2(Ti).
Thus, L2(T) = Σ_i L2(Ti) + 1.

   Finally, we can argue in a similar manner that, for each i = 1, . . . , k + 1,
L3(Ti) ⊂ L3(T). Furthermore, if u ∈ L3(T), then either u ∈ {u2, . . . , uk} or
u ∈ L3(Ti). Thus, L3(T) = Σ_i L3(Ti) + k − 1. Putting everything together, we
have

           L3(T) = Σ_i L3(Ti) + k − 1
                 ≤ Σ_i L2(Ti) + Σ_i A(Ti)/2             (by Inequality (1))
                 = L2(T) + A(T)/2 − 1.

This completes the inductive proof argument.

   With Claim 4.4 in place, we now proceed to show that the number of nodes
in the tree goes down by a constant factor after each contraction phase of our
parsing. Recall that T i is the tree at the beginning of the (i + 1)th phase, and let
L′(T i+1) ⊆ L(T i+1) denote the subset of leaf nodes in L(T i+1) that are created
by contracting a chain in T i. We claim that

   B(T i+1) ≤ B(T i)    and    B(T i+1) + A(T i+1) + L′(T i+1) ≤ B(T i) + A(T i)/2.      (2)
Indeed, it is easy to see that all nodes with degree at least three (i.e., ≥ two chil-
dren) in T i+1 must have had degree at least three in T i as well; this obviously
proves the first inequality. Furthermore, note that any node in B(T i+1 ) corre-
sponds to a unique node in B(T i). Now, consider a node u in A(T i+1) ∪ L′(T i+1).
There are two possible cases depending on how node u is formed. In the first
case, u is formed by collapsing some degree-2 (i.e., chain) nodes (and, possibly,
a chain-terminating leaf) in A(T i )—then, by virtue of the CM-Group procedure,
u corresponds to at least two distinct nodes of A(T i ). In the second case, there
is a node w ∈ B(T i ) and a leaf child of w that is collapsed into w to get u—then,
u corresponds to a unique node of B(T i ). The second inequality follows easily
from the above discussion.
   During the (i + 1)th contraction phase, the number of leaves in L1 (T i ) is
clearly reduced by at least one-half (again, due to the properties of CM-Group).
Furthermore, note that all leaves in L2 (T i ) are merged into their parent nodes
and, thus, disappear. Now, the leaves in L3 (T i ) do not change; so, we need
to bound the size of this leaf-node set. By Claim 4.4, we have that L3 (T i ) ≤
L2 (T i ) + A(T i )/2—adding 2 · L3 (T i ) on both sides and multiplying across with
1/3, this inequality gives

                             L3(T i) ≤ L2(T i)/3 + (2/3)·L3(T i) + A(T i)/6.

Thus, the number of leaf nodes in L(T i+1) − L′(T i+1) can be upper-bounded as
follows:

    L(T i+1) − L′(T i+1) ≤ L1(T i)/2 + L2(T i)/3 + (2/3)·L3(T i) + A(T i)/6 ≤ (2/3)·L(T i) + A(T i)/6.

Combined with Inequality (2), this implies that the total number of nodes in
T i+1 is
    A(T i+1) + B(T i+1) + L(T i+1) ≤ A(T i)/2 + B(T i) + (2/3)·L(T i) + A(T i)/6
                                   ≤ B(T i) + (2/3)·(A(T i) + L(T i)).

Now, observe that B(T i) ≤ A(T i) + L(T i) (the number of nodes of degree more
than two is at most the number of leaves in any tree)—the above inequality
then gives

    A(T i+1) + B(T i+1) + L(T i+1) ≤ (5/6)·B(T i) + (2/3)·(A(T i) + L(T i)) + (1/6)·B(T i)
                                   ≤ (5/6)·(A(T i) + B(T i) + L(T i)).
Thus, when going from tree T i to T i+1, the number of nodes goes down by a
constant factor ≤ 5/6. This obviously implies that the number of parsing phases
for our TREEEMBED algorithm is O(log |T|), and completes the proof.
   The proof of Lemma 4.3 immediately implies that the total number of nodes
in the entire hierarchical parsing structure for T is only O(|T |). Thus, the vector
image V (T ) built by our algorithm is a very sparse vector. To see this, note that
the number of all possible ordered, labeled trees of size at most n that can be
built using the label alphabet σ is O((4|σ|)^n) (see, e.g., Knuth [1973]); thus,
by Lemma 4.3, the dimensionality needed for our vector image V() to capture
input trees of size n is O((4|σ|)^n log n). However, for a given tree T, only O(|T|)
of these dimensions can contain nonzero counts. Lemma 4.3, in conjunction
with the fact that the CM-Group procedure runs in time O(k log∗ k) for a string
of size k (Theorem 4.1), also implies that our TREEEMBED algorithm runs in
O(|T | log∗ |T |) time on input T . The following two subsections establish the
distance-distortion bounds stated in Theorem 4.2.
   An immediate implication of the above results is that we can use our
embedding algorithm to compute the approximate (to within a guaranteed
O(log² n log∗ n) factor) tree-edit distance between T and S in O(n log∗ n) (i.e.,
near-linear) time. The time complexity of exact tree-edit distance computation
is significantly higher: conventional tree-edit distance (without subtree moves)
is solvable in O(|T| · |S| · dT · dS) time (where dT (dS) is the depth of T (respec-
tively, S)) [Apostolico and Galil 1997; Zhang and Shasha 1989], whereas in the
presence of subtree moves the problem becomes NP-hard even for the simple
case of flat strings [Shapira and Storer 2002].

4.4 Upper-Bound Proof
Suppose we are given a tree T with n nodes, and let ℓ denote the quantity
log∗ n + 5. As a first step in our proof, we demonstrate that showing the upper-
bound result in Theorem 4.2 can actually be reduced to a simpler problem,
namely, that of bounding the L1 distance between the vector image of T and
the vector image of a 2-tree forest created when removing a valid subtree from

T. More formally, consider a (valid) subtree of T of the form T◦[v, s] for some
contiguous subset of children s of v (recall that the root of T◦[v, s] has no label).
Let us delete T◦[v, s] from T, and let T2 denote the resulting subtree; further-
more, let T1 denote the deleted subtree T◦[v, s]. Thus, we have broken T into a
2-tree forest comprising T1 = T◦[v, s] and T2 = T − T1 (see the leftmost portion
of Figure 8 for an example).
   We now compare the following two vectors. The first vector V (T ) is obtained
by applying our TREEEMBED parsing procedure to T . For the second vector, we
apply TREEEMBED to each of the trees T1 and T2 individually, and then add the
corresponding vectors V (T1 ) and V (T2 ) component-wise—call this vector V (T1 +
T2 ) = V (T1 ) + V (T2 ). (Throughout this section, we use (T1 + T2 ) to denote the
2-tree forest composed of T1 and T2 .) Our goal is to prove the following theorem.
  THEOREM 4.5. The L1 distance between vectors V(T) and V(T1 + T2) is at
most O(log² n log∗ n).
  Let us first see how this result directly implies the upper bound stated in
Theorem 4.2.
   PROOF OF THE UPPER BOUND IN THEOREM 4.2. It is sufficient to consider the
case when the tree-edit distance between S and T is 1 and show that, in this
case, the L1 distance between V(S) and V(T) is ≤ O(log² n log∗ n). First, assume
that T is obtained from S by deleting a leaf node v. Let the parent of v be w.
Define s = {v}, and delete S◦[w, s] from S. This splits S into T and S◦[w, s]—
call this S1. Theorem 4.5 then implies that ‖V(S) − V(T + S1)‖₁ = ‖V(S) −
(V(T) + V(S1))‖₁ ≤ O(log² n log∗ n). But, it is easy to see that the vector V(S1)
only has three nonzero components, all equal to 1; this is since S1 is basically
a 2-node tree that is reduced to a single node after one contraction phase of
TREEEMBED. Thus, ‖V(S1)‖₁ = ‖(V(T) + V(S1)) − V(T)‖₁ ≤ 3. Then, a simple
application of the triangle inequality for the L1 norm gives ‖V(S) − V(T)‖₁ ≤
O(log² n log∗ n). Note that, since insertion of a leaf node is the inverse of a leaf-
node deletion, the same holds for this case as well.
   Now, let v be a node in S and s be a contiguous set of children of v. Suppose T
is obtained from S by moving the subtree S◦[v, s], that is, deleting this subtree
and making it a child of another node x in S.6 Let S1 denote S◦[v, s], and let
S2 denote the tree obtained by deleting S1 from S. Theorem 4.5 implies that
‖V(S) − V(S1 + S2)‖₁ ≤ O(log² n log∗ n). Note, however, that we can also picture
(S1 + S2) as the forest obtained by deleting S1 from T. Thus, ‖V(T) − V(S1 +
S2)‖₁ is also ≤ O(log² n log∗ n). Once again, the triangle inequality for L1 easily
implies the result.
   Finally, suppose we delete a nonleaf node v from S. Let the parent of v be
w. All children of v now become children of w. We can think of this process as
follows. Let s be the children of v. First, we move S◦[v, s] and make it a child of
w. At this point, v is a leaf node, so we are just deleting a leaf node now. Thus,

6 This is a slightly “generalized” subtree move, since it allows for a contiguous (sub)sequence of
sibling subtrees to be moved in one step. However, it is easy to see that it can be simulated with
only three simpler edit operations, namely, a node insertion, a single-subtree move, and a node
deletion. Thus, our results trivially carry over to the case of “single-subtree move” edit operations.


the result for this case follows easily from the arguments above for deleting a
leaf node and moving a subtree.
   As a consequence, it is sufficient to prove Theorem 4.5. Our proof proceeds
along the following lines. We define an influence region for each tree T i in
our hierarchical parsing (i = 0, . . . , O(log n))—the intuition here is that the
influence region for T i captures the complete set of nodes in T i whose parsing
could have been affected by the change (i.e., the splitting of T into (T1 + T2 )).
Initially (i.e., tree T 0 ), this region is just the node v at which we deleted the T1
subtree. But, obviously, this region grows as we proceed to subsequent phases
in our parsing. We then argue that, if we ignore this influence region in T i and
the corresponding region in the parsing of the (T1 +T2 ) forest, then the resulting
sets of valid subtrees look very similar (in any phase i). Thus, if we can bound
the rate at which this influence region grows during our hierarchical parsing, we
can also bound the L1 distance between the two resulting characteristic vectors.
The key intuition behind bounding the size of the influence region is as follows:
when we effect a change at some node v of T , nodes far away from v in the
tree remain unaffected, in the sense that the subtree in which such nodes are
grouped during the next phase of our hierarchical parsing remains unchanged.
As we will see, this fact hinges on the properties of the CM-Group procedure
used for grouping nodes during each phase of TREEEMBED (Theorem 4.1). The
discussion of our proof in the remainder of this section is structured as follows.
First, we formally define influence regions, giving the set of rules for “growing”
such regions of nodes across consecutive phases of our parsing. Second, we
demonstrate that, for any parsing phase i, if we ignore the influence regions
in the current (i.e., phase-(i + 1)) trees produced by TREEEMBED on input T and
(T1 + T2 ), then we can find a one-to-one, onto mapping between the nodes in the
remaining portions of the current T and (T1 + T2 ) that pairs up identical valid
subtrees. Third, we bound the size of the influence region during each phase of
our parsing. Finally, we show that the upper bound on the L1 distance of V (T )
and V (T1 + T2 ) follows as a direct consequence of the above facts.
   We now proceed with the proof of Theorem 4.5. Define (T1 + T2 )i as the 2-tree
forest corresponding to (T1 + T2 ) at the beginning of the (i + 1)th parsing phase.
We say that a node x ∈ T i+1 contains a node x′ ∈ T i if the set of nodes in T i
which are merged to form x contains x′. As earlier, any node w in T i corresponds
to a valid subtree w(T) of T; furthermore, it is easy to see that if w and w′ are
two distinct nodes of T i, then the w(T) and w′(T) subtrees are disjoint. (The
same obviously holds for the parsing of each of T1, T2.)
   For each tree T i , we mark certain nodes; intuitively, this node-marking de-
fines the influence region of T i mentioned above. Let M i be the set of marked
nodes (i.e., influence region) in T i (see Figure 6(a) for an example). The generic
structure of the influence region M i satisfies the following: (1) M i is a connected
subtree of T i that always contains the node v (at which the T1 subtree was re-
moved), that is, the node in T i which contains v (denoted by vi ) is always in
M i ; (2) there is a center node ci ∈ M i , and M i may contain some ancestor nodes
of ci —but all such ancestors (except perhaps for ci itself) must be of degree
2 only, and should form a connected path; and (3) M i may also contain some





Fig. 6. (a) The subtree induced by the bold edges corresponds to the nodes in M i . (b) Node z
becomes the center of N i .

descendants of the center node ci . Finally, certain (unmarked) nodes in T i − M i
are identified as corner nodes—intuitively, these are nodes whose parsing will
be affected when they are shrunk down to a leaf node.
   Once again, the key idea is that the influence region M i captures the set
of those nodes in T i whose parsing in TREEEMBED may have been affected by
the change we made at node v. Now, in the next phase, the changes in M i can
potentially affect some more nodes. Thus, we now try to determine which nodes
M i can affect; that is, assuming the change at v has influenced all nodes in
M i , which are the nodes in T i whose parsing (during phase (i + 1)) can change
as a result of this. To capture this newly affected set of nodes, we define an
extended influence region N i in T i —this intuitively corresponds to the (worst-
case) subset of nodes in T i whose parsing can potentially be affected by the
changes in M i .
   First, add all nodes in M i to N i . We define the center node z of the extended
influence region N i as follows. We say that a descendant node u of vi (which
contains v) in T i is a removed descendant of vi if and only if its corresponding
subtree u(T ) in the base tree T is entirely contained within the removed subtree
T [v, s]. (Note that, initially, v0 = v is trivially a removed descendant of v0 .) Now,
let w be the highest node in M i —clearly, w is an ancestor of the current center
node ci as well as the vi node in T i . If all the descendants of w are either in
M i or are removed descendants of vi , then define the center z to be the parent
of node w, and add z to N i (see Figure 6(b)); otherwise, define the center z of
N i to be the same as ci . The idea here is that the grouping of w's parent in the
next phase can change only if the entire subtree under w has been affected by
the removal of the T [v, s] subtree. Otherwise, if there exist nodes under w in
T i whose parsing remains unchanged and that have not been deleted by the
subtree removal, then the mere existence of these nodes in T i means that it is
impossible for TREEEMBED to group w’s parent in a different manner during the
next phase of the (T1 + T2 ) parsing in any case. Once the center node z of N i
has been fixed, we also add nodes to N i according to the following set of rules
(see Figures 7(a) and (b) for examples).





Fig. 7. (a) The nodes in dotted circles get added to N i due to Rules (i), (ii), and (iii). (b) The nodes in
the dotted circle get added to N i due to Rule (iv)—note that all descendants of the center z which are
not descendants of u are in M i . (c) Node u moves up to z, turning nodes a and b into corner nodes.


  (i) Suppose u is a leaf child of the (new) center z or the vi node in T i; furthermore, assume there is some sibling u′ of u such that the following conditions are satisfied: u′ ∈ M i or u′ is a corner leaf node, the set of nodes s(u, u′) between u and u′ are leaves, and |s(u, u′)| ≤ ℓ. Then, add u to N i. (In particular, note that any leaf child of z which is a corner node gets added to N i.)
 (ii) Let u be the leftmost lone leaf child of the center z which is not already in M i (if such a child exists); then, add u to N i. Similarly, for the vi node in T i, let u be a leaf child of vi such that one of the following conditions is satisfied: (a) u is the leftmost lone leaf child of vi when considering only the removed descendants of vi; or (b) u is the leftmost lone leaf child of vi when ignoring all removed descendants of vi. Then, add u to N i.
(iii) Let w be the highest node in M i ∪ {z} (so it is an ancestor of the center node z). Let u be an ancestor of w. Suppose it is the case that all nodes between u and w, except perhaps w, have degree 2, and the length of the path joining u and w is at most ℓ; then, add u to N i.
 (iv) Suppose there is a child u of the center z or the vi node in T i such that one of the following conditions is satisfied: (a) u is not a removed descendant of vi and all descendants of all siblings of u (other than u itself) are either already in M i or are removed descendants of vi; or (b) u is a removed descendant of vi (and, hence, a child of vi) and all removed descendants of vi which are not descendants of u are in M i. Then, let u′ be the lowest descendant of u which is in M i. If u′′ is any descendant of u′ such that the path joining them contains degree-2 nodes only (including the end-points), and has length at most ℓ, then add u′′ to N i.

  Let us briefly describe why we need these four rules. We basically want to
make sure that we include all those nodes in N i whose parsing can potentially

be affected if we delete or modify the nodes in M i (given, of course, the removal
of the T [v, s] subtree). The first three rules, in conjunction with the properties
of our TREEEMBED parsing, are easily seen to capture this fact. The last rule is
a little more subtle. Suppose u is a child of z (so that we are in clause (a) of
Rule (iv)); furthermore, assume that all descendants of z except perhaps those
of u are either already in M i or have been deleted with the removal of T [v, s].
Remember that all nodes in M i have been modified due to the change effected
at v, so they may not be present at all in the corresponding picture for (T1 + T2 )
(i.e., the (T1 +T2 )i forest). But, if we just ignore M i and the removed descendants
of vi , then z becomes a node of degree 2 only, which would obviously affect how u
and its degree-2 descendants are parsed in (T1 + T2 )i (compared to their parsing
in T i ). Rule (iv) is designed to capture exactly such scenarios; in particular, note
that clauses (a) and (b) in the rule are meant to capture the potential creation
of such degree-2 chains in the remainder subtree T2^i and the deleted subtree
T1^i, respectively.
     We now consider the rule for marking corner nodes in T i . Once again, the
intuition is that certain (unaffected) nodes in T i − M i (actually, in T i − N i ) are
marked as corner nodes so that we can “remember” that their parsing will be
affected when they are shrunk down to a leaf. Suppose the center node z has
at least two children, and a leftmost lone leaf child u—note that, by Rule (ii),
u ∈ N i . If any of the two immediate siblings of u are not in N i , then we mark
them as corner nodes (see Figure 7(c)). The key observation here is that, when
parsing T i , u is going to be merged into z and disappear; however, we need
to somehow “remember” that a (potentially) affected node u was there, since
its existence could affect the parsing of its sibling nodes when they are shrunk
down to leaves. Marking u’s immediate siblings in T i as corner nodes essentially
achieves this effect.
     Having described the (worst-case) extended influence region N i in T i , let us
now define M i+1 , that is, the influence region at the next level of our parsing
of T . M i+1 is precisely the set of those nodes in T i+1 which contain a node of
N i . The center of M i+1 is the node which contains the center node z of N i ;
furthermore, any node in T i+1 which contains a corner node is again marked
as a corner node.
     Initially, define M 0 = {v} (and, obviously, v0 = c0 = v). Furthermore, if v has
a child node immediately on the left (right) of the removed child subsequence s,
then that node as well as the leftmost (respectively, rightmost) node in s are
marked as corner nodes. The reason, of course, is that these ≤ 4 nodes may be
parsed in a different manner when they are shrunk down to leaves during the
parsing of T1 and T2 . Based on the above set of rules, it is easy to see that M i
and N i are always connected subtrees of T i . It is also important to note that
the extended influence region N i is defined in such a manner that the parsing
of all nodes in T i − N i cannot be affected by the changes in M i . This fact should
become clear as we proceed with the details of the proofs in the remainder of
this section.

  Example 4.6. Figure 8 depicts the first three phases of a simple example
parsing for T and (T1 + T2 ), in the case of a 4-level full binary tree T that




Fig. 8. Example of TREEEMBED parsing phases for T and (T1 + T2 ) in the case of a full binary tree,
highlighting the influence regions M i in T i and the corresponding P i regions in (T1 + T2 )i (“o”
denotes an unlabeled node).

is split by removing the right subtree of the root (i.e., T1 = T [x3 , {x6 , x7 }],
T2 = T − T1 ). We use subscripted x’s and y’s to label the nodes in T i and
(T1 + T2 )i to emphasize the fact that these tree nodes are parsed independently
by TREEEMBED; furthermore, we employ the subscripts to capture the original
subtrees of T and (T1 + T2 ) represented by nodes in later phases of our parsing.
Of course, it should be clear that x and y nodes with identical subscripts refer
to identical (valid) subtrees of the original tree T ; for instance, both x4,8,9 ∈ T 2
and y 4,8,9 ∈ T22 represent the same subtree T [x4 , {x8 , x9 }] = {x4 , x8 , x9 } of T .
   As depicted in Figure 8, the initial influence region of T is simply M 0 = {x3 }
(with v0 = c0 = x3 ). Since, clearly, all descendants of x3 are removed descendants
of v0 , the center z for the extended influence region N 0 moves up to the parent
node x1 of x3 (and none of our other rules are applicable); thus, N 0 = {x1 , x3 }
and, obviously, M 1 = {x1 , x3 }. This is crucial since (as shown in Figure 8), due to
the removal of T1 , nodes y 1 and y 3 are processed in a very different manner in
the remainder subtree T20 (i.e., y 3 is merged up into y 1 as its leftmost lone leaf
child). Now, for T 1 , none of our rules for extending the influence region apply
and, consequently, N 1 = M 2 = {x1 , x3 }. The key thing to note here is that, for
each parsing phase i, ignoring the nodes in the influence region M i (and the
“corresponding” nodes in (T1 + T2 )i ), the remaining nodes of T i and (T1 + T2 )i
have been parsed in an identical manner by TREEEMBED (and correspond to an
identical subset of valid subtrees in T ); in other words, their corresponding
characteristic vectors in our embedding are exactly the same. We now proceed
to formalize these observations.
   Given the influence region M i of T i, we define a corresponding node set,
P i, in the (T1 + T2)i forest. In what follows, we prove that the nodes in T i −
M i and (T1 + T2)i − P i can be matched in some manner, so that each pair
of matched nodes correspond to identical valid subtrees in T and (T1 + T2 ),




                     Fig. 9.   f maps from T i − M i to (T1 + T2 )i − P i .

respectively. The node set P i in (T1 + T2)i is defined as follows (see Figure 8
for examples). P i always contains the root node of T1^i. Furthermore, a node
u ∈ (T1 + T2)i is in P i if and only if there exists a node u′ ∈ M i such that the
intersection u(T1 + T2) ∩ u′(T) is nonempty (as expected, u(T1 + T2) denotes the
valid subtree corresponding to u in (T1 + T2)). We demonstrate that our solution
always maintains the following invariant.
   INVARIANT 4.7. Given any node x ∈ T i − M i , there exists a node y = f (x)
in (T1 + T2 )i − P i such that x(T ) and y(T1 + T2 ) are identical valid subtrees on
the exact same subset of nodes in the original tree T . Conversely, given a node
y ∈ (T1 + T2 )i − P i , there exists a node x ∈ T i − M i such that x(T ) = y(T1 + T2 ).
   Thus, there always exists a one-to-one, onto mapping f from T i − M i to
(T1 + T2 )i − P i (Figure 9). In other words, if we ignore M i and P i from T i and
(T1 + T2 )i (respectively), then the two remaining forests of valid subtrees in this
phase are identical.
   Example 4.8. Continuing with our binary-tree parsing example in
Figure 8, it is easy to see that, in this case, the mapping f : T i − M i −→
(T1 +T2 )i − P i simply maps every x node in T i − M i to the y node in (T1 +T2 )i − P i
with the same subscript that, obviously, corresponds to exactly the same valid
subtree of T ; for instance, y 10,11 = f (x10,11 ) and both nodes correspond to the
same valid subtree T [x5 , {x10 , x11 }]. Thus, the collections of valid subtrees for
T i − M i and (T1 + T2 )i − P i are identical (i.e., the L1 distance of their cor-
responding characteristic vectors is zero); this implies that, for example, the
contribution of T 1 and (T1 + T2 )1 to the difference of the embedding vectors
V (T ) and V (T1 + T2 ) is upper-bounded by |M 1 | = 2.
   Clearly, Invariant 4.7 is true in the beginning (i.e., M 0 = {v}, P 0 = {v,
root(T1 )}). Suppose our invariant remains true for T i and (T1 + T2 )i . We now
need to prove it for T i+1 and (T1 + T2 )i+1 . As previously, let N i ⊇ M i be the
extended influence region in T i. Fix a node w in T i − N i, and let w′ be the
corresponding node in (T1 + T2)i − P i (i.e., w′ = f(w)). Suppose w is contained
in node q ∈ T i+1 and w′ is contained in node q′ ∈ (T1 + T2)i+1.
  LEMMA 4.9. Given a node w in T i − N i, let q, q′ be as defined above. If q(T)
and q′(T1 + T2) are identical subtrees for any node w ∈ T i − N i, then Invariant 4.7
holds for T i+1 and (T1 + T2)i+1 as well.

   PROOF. We have to demonstrate the following facts. If x is a node in T i+1 −
M i+1 , then there exists a node y ∈ (T1 +T2 )i+1 −P i+1 such that y(T1 +T2 ) = x(T ).
Conversely, given a node y ∈ (T1 + T2 )i+1 − P i+1 , there is a node x ∈ T i+1 − M i+1
such that x(T ) = y(T1 + T2 ).
   Suppose the condition in the lemma holds. Let x be a node in T i+1 − M i+1.
Let x′ be a node in T i such that x contains x′. Clearly, x′ ∉ N i, otherwise x
would be in M i+1. Let y′ = f(x′), and let y be the node in (T1 + T2)i+1 which
contains y′. By the hypothesis of the lemma, x(T) and y(T1 + T2) are identical
subtrees. It remains to check that y ∉ P i+1. Since y(T1 + T2) = x(T), y(T1 + T2)
is disjoint from z(T) for any z ∈ T i+1, z ≠ x. By the definition of the P i node
sets, since x ∉ M i+1, we have that y ∈ (T1 + T2)i+1 − P i+1.
   Let us prove the converse now. Suppose y ∈ (T1 + T2)i+1 − P i+1. Let y′ be
a node in (T1 + T2)i such that y contains y′. If y′ ∈ P i, then (by definition)
there exists a node x′ ∈ M i such that x′(T) ∩ y′(T1 + T2) ≠ ∅. Let x be the node
in T i+1 which contains x′. Since x′ ∈ N i, x ∈ M i+1. Now, x(T) ∩ y(T1 + T2) ⊇
x′(T) ∩ y′(T1 + T2) ≠ ∅. But then y should be in P i+1, a contradiction. Therefore,
y′ ∉ P i. By the invariant for T i, there is a node x′ ∈ T i − M i such that y′ = f(x′).
   Let x be the node in T i+1 containing x′. Again, if x′ ∈ N i, then x ∈ M i+1.
But then x(T) ∩ y(T1 + T2) ⊇ x′(T) ∩ y′(T1 + T2), which is nonempty because
x′(T) = y′(T1 + T2). This would imply that y ∈ P i+1. So, x′ ∉ N i. But then, by
the hypothesis of the lemma, x(T) = y(T1 + T2). Further, x cannot be in M i+1,
otherwise y would be in P i+1. Thus, the lemma is true.
   It is, therefore, sufficient to prove that, for any pair of nodes w ∈ T i − N i,
w′ = f(w) ∈ (T1 + T2)i − P i, the corresponding encompassing nodes q ∈ T i+1
and q′ ∈ (T1 + T2)i+1 map to identical valid subtrees, that is, q(T) = q′(T1 + T2).
This is what we seek to do next. Our proof uses a detailed, case-by-case analysis
of how node w gets parsed in T i. For each case, we demonstrate that w′ will also
get parsed in exactly the same manner in the forest (T1 + T2)i. In the interest
of space and continuity, we defer the details of this proof to the Appendix.
   Thus, we have established the fact that, if we look at the vectors V (T ) and
V (T1 + T2 ), the nodes corresponding to phase i of V (T ) which are not present
in V (T1 + T2 ) are guaranteed to be a subset of M i . Our next step is to bound
the size of M i .
  LEMMA 4.10. The influence region M i for tree T i consists of at most
O(i log∗ n) nodes.
   PROOF. Note that, during each parsing phase, Rule (iii) adds at most ℓ nodes
of degree at most 2 to the extended influence region N i. It is not difficult to see
that Rule (iv) also adds at most 4ℓ nodes of degree at most 2 to N i during
each phase; indeed, note that, for instance, there is at most one child node u
of z which is not in M i and satisfies one of the clauses of Rule (iv). So, adding
over the first i stages of our algorithm the number of such nodes in M i can be
at most O(i log∗ n). Thus, we only need to bound the number of nodes that get
added to the influence region due to Rules (i) and (ii).
   We now want to count the number of leaf children of the center node ci which
are in M i . Let ki be the number of children of ci which become leaves for the

first time in T i and are marked as corner nodes. Let Ci be the nodes in M i
which were added as the leaf children of the center node of T i′, for some i′ < i.
Then, we claim that Ci can be partitioned into at most 1 + Σ_{j=1}^{i−1} k_j contiguous
sets such that each set has at most 4ℓ elements. We prove this by induction on
i. So, suppose it is true for T i.
   Consider such a contiguous set of leaves in Ci, call it C1^i, where |C1^i| ≤ 4ℓ.
We may add up to ℓ consecutive leaf children of ci on either side of C1^i to the
extended influence region N i. Thus, this set may grow to a size of 6ℓ contiguous
leaves. But when we parse this set (using CM-Group), we reduce its size by at
least half. Thus, this set will now contain at most 3ℓ leaves (which is at most
4ℓ). Therefore, each of the 1 + Σ_{j=1}^{i−1} k_j contiguous sets in Ci corresponds to a
contiguous set in T i+1 of size at most 4ℓ.
   Now, we may add other leaf children of ci to N i . This can happen only if a
corner node becomes a leaf. In this case, at most ℓ consecutive leaves on either
side of this node are added to N i (by Rule (i)); thus, we may add ki more such
sets of consecutive leaves to N i . This completes our inductive argument.
   But note that, in any phase, at most two new corner nodes (i.e., the immediate
siblings of the center node’s leftmost lone leaf child) can be added. (And, of
course, we also start out with at most four nodes marked as corners inside and
next to the removed child subsequence s.) So, Σ_{j=1}^{i} k_j ≤ 2i + 2. This shows that
the number of nodes in Ci is O(i log∗ n). The contribution toward M i of the leaf
children of the vi node can also be upper bounded by O(i log∗ n) using a very
similar argument. This completes the proof.

   We now need to bound the nodes in (T1 + T2 )i which are not in T i . But
this can be done in an exactly analogous manner if we switch the roles of T and
T1 + T2 in the proofs above. Thus, we can define a subset Q i of (T1 + T2 )i and
a one-to-one, onto mapping g from (T1 + T2 )i − Q i to a subset of T i such that
g (w)(T ) = w(T1 + T2 ) for every w ∈ (T1 + T2 )i − Q i . Furthermore, we can show
in a similar manner that |Q i | ≤ O(i log∗ n).
   We are now ready to complete the proof of Theorem 4.5.
    PROOF OF THEOREM 4.5. Fix a phase i. Consider those subtrees t such that
Vi (T )[< t, i >] ≥ Vi (T1 +T2 )[< t, i >]. In other words, t appears more frequently
in the parsed tree T i than in (T1 + T2 )i . Let the set of such subtrees be denoted
by S. We first observe that

                |M i| ≥ Σ_{t∈S} ( Vi(T)[<t, i>] − Vi(T1 + T2)[<t, i>] ).


Indeed, consider a tree t ∈ S. Let V1 be the set of vertices u in T i such that
u(T ) = t. Similarly, define the set V2 in (T1 + T2 )i . So, |V1 | − |V2 | = Vi (T )[<
t, i >] − Vi (T1 + T2 )[< t, i >]. Now, the function f must map a vertex in V1 − M i
to a vertex in V2 . Since f is one-to-one, V1 − M i can have at most |V2 | nodes.
In other words, M i must contain |V1 |− |V2 | nodes from V1 . Adding this up for
all such subtrees in S gives us the inequality above.

   We can write a similar inequality for Q i. Adding these up, we get

                |M i| + |Q i| ≥ Σ_t |Vi(T)[<t, i>] − Vi(T1 + T2)[<t, i>]|,

where the summation is over all subtrees t. Adding over all parsing phases i,
we have

               ‖V(T) − V(T1 + T2)‖₁ ≤ Σ_{i=1}^{O(log n)} O(i log∗ n) = O(log² n log∗ n).

This completes our proof argument.


4.5 Lower-Bound Proof
Our proof follows along the lower-bound proof of Cormode and Muthukrishnan
[2002], in that it does not make use of any special properties of our hier-
archical tree parsing; instead, we only assume that the parsing structure
built on top of the data tree is of bounded degree k (in our case, of course,
k = 3). The idea is then to show how, given two data trees S and T ,
we can use the “credit” from the L1 difference of their vector embeddings
‖V(T) − V(S)‖₁ to transform S into T. As in Cormode and Muthukrishnan
[2002], our proof is constructive and shows how the overall parsing structure
for S (including S itself at the leaves) can be transformed into that for T ;
the transformation is performed level-by-level in a bottom-up fashion (start-
ing from the leaves of the parsing structure). (The distance-distortion lower
bound for our embedding is an immediate consequence of Lemma 4.11 with
k = 3.7 )
   LEMMA 4.11. Assuming a hierarchical parsing structure with degree at most
k (k ≥ 2), the overall parsing structure for tree S can be transformed into exactly
that of tree T with at most (2k − 1) · ‖V(T) − V(S)‖₁ tree-edit operations (node
inserts, deletes, relabels, and subtree moves).
  PROOF. As in Cormode and Muthukrishnan [2002], we first perform a top-
down pass over the parsing structure of S, marking all nodes x whose subgraph
appears in both parse-tree structures, making sure that the number of
marked x nodes at level (i.e., phase) i of the parse tree does not exceed Vi (T )[x]
(we use x instead of v(x) to also denote the valid subtree corresponding to x in
order to simplify the notation). Descendants of marked nodes are also marked.
Marked nodes are “protected” during the parse-tree transformation process
described below, in the sense that we do not allow an edit operation to split a
marked node.
  We proceed bottom-up over the parsing structure for S in O(log n) rounds
(where n = max{|S|, |T |}), ensuring that after the end of round i we have created
an Si such that ‖Vi(T) − Vi(Si)‖₁ = 0. The base case (i.e., level 0) deals with

7 It is probably worth noting at this point that the subtree-move operation is needed only to establish

the distortion lower-bound result in this section; that is, the upper bound shown in Section 4.4 holds
for the standard tree-edit distance metric as well.





                            Fig. 10. Forming a level-i node x.


simple node labels and creates S0 in a fairly straightforward way: for each
label a, if V0 (S)[a] > V0 (T )[a] then we delete (V0 (S)[a]− V0 (T )[a]) unmarked
copies of a; otherwise, if V0 (S)[a] < V0 (T )[a], then we add (V0 (T )[a] − V0 (S)[a])
leaf nodes labeled a at some location of S. In each case, we perform |V0 (S)[a] −
V0(T)[a]| edit operations which is exactly the contribution of label a to ‖V0(T) −
V0(S)‖₁. It is easy to see that, at the end of the above process, we have ‖V0(T) −
V0(S0)‖₁ = 0.
   Inductively, assume that, when we start the transformation at level i, we
have enough nodes at level i − 1; that is, ‖Vi−1(T) − Vi−1(Si−1)‖₁ = 0. We show
how to create Si using at most (2k − 1) · ‖Vi(T) − Vi(Si)‖₁ subtree-move operations.
Consider a node x at level i (again, to simplify the notation, we also use x to
denote the corresponding valid subtree). If Vi (S)[x] > Vi (T )[x], then we have
exactly Vi (T )[x] marked x nodes at level i of S’s parse tree that we will not
alter; the remaining copies will be split to form other level-i nodes as described
next. If Vi (S)[x] < Vi (T )[x], then we need to build an extra (Vi (T )[x] − Vi (S)[x])
copies of the x node at level i. We demonstrate how each such copy can be
built by using ≤ (2k − 1) subtree move operations in order to bring together
≤ k level-(i − 1) nodes to form x (note that the existence of these level-(i −
1) nodes is guaranteed by the fact that ‖Vi−1(T) − Vi−1(Si−1)‖₁ = 0). Since
(Vi(T)[x] − Vi(S)[x]) is exactly the contribution of x to ‖Vi(T) − Vi(Si)‖₁, the
overall transformation for level i requires at most (2k − 1) · ‖Vi(T) − Vi(Si)‖₁ edit
operations.
   To see how we form the x node at level i note that, based on our embedding
algorithm, there are three distinct cases for the formation of x from level-(i − 1)
nodes, as depicted in Figures 10(a)–10(c). In case (a), x is formed by “folding”
the (no-siblings) leftmost leaf child v2 of a node v1 into its parent; we can create
the scenario depicted in Figure 10(a) easily with two subtree moves: one to
remove any potential subtree rooted at the level-(i − 1) node v2 (we can place
it under v2 ’s original parent at the level-(i − 1) tree), and one to move the (leaf)
v2 under the v1 node. Similarly, for the scenarios depicted in cases (b) and (c),
we basically need at most k subtree moves to turn the nodes involved into
leaves, and at most k − 1 additional moves to move these leaves into the right
formation around one of these ≤ k nodes. Thus, we can create each copy of x
with ≤ (2k − 1) subtree move operations. At the end of this process, we have
‖Vi(T) − Vi(Si)‖₁ = 0. Note that we do not care where in the level-i tree we
create the x node; the exact placement will be taken care of at higher levels of
the parsing structure. This completes the proof.

5. SKETCHING A MASSIVE, STREAMING XML DATA TREE
In this section, we describe how our tree-edit distance embedding algorithm
can be used to obtain a small, pseudorandom sketch synopsis of a massive XML
data tree in the streaming model. This sketch synopsis requires only small
(logarithmic) space, and it can be used as a much smaller surrogate for the entire
data tree in approximate tree-edit distance computations with guaranteed error
bounds on the quality of the approximation based on the distortion bounds
guaranteed from our embedding. Most importantly, as we show in this section,
the properties of our embedding algorithm are the key that allows us to build
this sketch synopsis in small space as nodes of the tree are streaming by without
ever backtracking on the data.
    More specifically, consider the problem of embedding a data tree T of size
n into a vector space, but this time assume that T is truly massive (i.e., n far
exceeds the amount of available storage). Instead, we assume that we see the
nodes of T as a continuous data stream in some a priori determined order. In the
theorem below, we assume that the nodes of T arrive in the order of a preorder
(i.e., depth-first and left-to-right) traversal of T . (Note, for example, that this is
exactly the ordering of XML elements produced by the event-based SAX parsing
interface (sax.sourceforge.net/).) The theorem demonstrates that the vector
V (T ) constructed for T by our L1 embedding algorithm can then be constructed
in space O(d log² n log∗ n), where d denotes the depth of T. The sketch of T is
essentially a sketch of the V (T ) vector (denoted by sketch(V (T ))) that can be
used for L1 distance calculations in the embedding vector space. Such an L1
sketch of V (T ) can be obtained (in small space) using the 1-stable sketching
algorithms of Indyk [2000] (see Theorem 2.2).

    THEOREM 5.1. A sketch sketch(V(T)) to allow approximate tree-edit dis-
tance computations can be computed over the stream of nodes in the preorder
traversal of an n-node XML data tree T using O(d log² n log∗ n) space and
O(log d log² n (log∗ n)²) time per node, where d denotes the depth of T. Then, as-
suming sketch vectors of size O(log(1/δ)) and for an appropriate combining function
f(), f(sketch(V(S)), sketch(V(T))) gives an estimate of the tree-edit distance
d(S, T) to within a relative error of O(log² n log∗ n) with probability of at least
1 − δ.

    The proof of Theorem 5.1 hinges on the fact that, based on our proof in Sec-
tion 4.4, given a node v on a root-to-leaf path of T and for each of the O(log n) lev-
els of the parsing structure above v, we only need to retain a local neighborhood
(i.e., influence region) of nodes of size at most O(log n log∗ n) to determine the
effect of adding an incoming subtree under T . The O(d ) multiplicative factor is
needed since, as the tree is streaming in preorder, we do not really know where
a new node will attach itself to T ; thus, we have to maintain O(d ) such influence
regions. Given that most real-life XML data trees are reasonably “bushy,” we
expect that, typically, d << n, or d = O(polylog(n)). The f () combining function
is basically a median-selection over the absolute component-wise differences of
the two sketch vectors (Theorem 2.2). The details of the proof for Theorem 5.1
follow easily from the above discussion and the results of Indyk [2000].
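   To make the streaming construction concrete, the following minimal C++ sketch (an illustration under our own naming, not the actual system code) shows one way a 1-stable (Cauchy) sketch of the characteristic vector V (T ) could be maintained incrementally: each update to a component of V (T ) touches only the k sketch counters, and two sketches built from the same seed yield a median-based L1 distance estimate in the spirit of Theorem 2.2. The on-demand variate generation is a simplified stand-in for the pseudorandom constructions of Indyk [2000].

```cpp
// Illustrative only: maintain a k-dimensional 1-stable (Cauchy) sketch of V(T)
// as components are updated during streaming parsing, and estimate L1 distances
// between two such sketches via the median (Theorem 2.2 style). Class and method
// names are assumptions of this sketch, not the system's actual interfaces.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

class SketchL1 {
 public:
  explicit SketchL1(int k, uint64_t seed) : k_(k), seed_(seed), x_(k, 0.0) {}

  // Called whenever component `i` of V(T) changes by `delta`
  // (e.g., a valid subtree with fingerprint i is created during parsing).
  void update(uint64_t i, double delta) {
    for (int j = 0; j < k_; ++j) x_[j] += delta * cauchy(i, j);
  }

  // Median-based estimate of ||V(S) - V(T)||_1 from two sketches
  // built with the same k and the same seed.
  static double estimateL1(const SketchL1& a, const SketchL1& b) {
    std::vector<double> diffs(a.k_);
    for (int j = 0; j < a.k_; ++j) diffs[j] = std::fabs(a.x_[j] - b.x_[j]);
    std::nth_element(diffs.begin(), diffs.begin() + a.k_ / 2, diffs.end());
    return diffs[a.k_ / 2];  // median of |X_S^j - X_T^j|
  }

 private:
  // Pseudorandom 1-stable (Cauchy) variate xi_i^j, regenerated on demand from
  // (i, j) so that no O(N)-sized table of variates has to be stored explicitly.
  double cauchy(uint64_t i, int j) const {
    std::mt19937_64 gen(seed_ ^ (i * 0x9e3779b97f4a7c15ULL + static_cast<uint64_t>(j)));
    std::cauchy_distribution<double> dist(0.0, 1.0);
    return dist(gen);
  }

  int k_;
  uint64_t seed_;
  std::vector<double> x_;
};
```

Both trees must be sketched with the same seed (i.e., with shared ξ variates) for the median estimate to be meaningful; in the streaming setting of Theorem 5.1, the O(d ) maintained influence regions determine which components of V (T ) still need to be updated as new nodes arrive.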

6. APPROXIMATE SIMILARITY JOINS OVER XML DOCUMENT STREAMS
We now consider the problem of computing (in limited space) the cardinality
of an approximate tree-edit-distance similarity join over two continuous data
streams of XML documents S1 and S2 . Note that this is a distinctly different
streaming problem from the one examined in Section 5: we now assume mas-
sive, continuous streams of short XML documents that we want to join based
on tree-edit distance; thus, the limiting factor is no longer the size of an indi-
vidual data tree (which is assumed small and constant), but rather the number
of trees in the stream(s). The documents in each Si stream can arrive in any
order, and our goal is to produce an accurate estimate for the similarity-join
cardinality |SimJoin(S1 , S2 , τ )| = |{(S, T ) ∈ S1 × S2 : d (S, T ) ≤ τ }|, that is, the
number of pairs in S1 × S2 that are within a tree-edit distance of τ from each
other (where the similarity threshold τ is a user/application-defined param-
eter). Such a space-efficient, one-pass approximate similarity join algorithm
would obviously be very useful in processing huge XML databases, integrating
streaming XML data sources, and so on.
   Once again, the first key step is to utilize our tree-edit distance embedding
algorithm on each streaming document tree T ∈ Si (i = 1, 2) to construct a
(low-distortion) image V (T ) of T as a point in an appropriate multidimensional
vector space. We then obtain a lower-dimensional vector of 1-stable sketches of
V (T ) that approximately preserves L1 distances in the original vector space, as
described by Indyk [2000]. Our tree-edit distance similarity join has now essen-
tially been transformed into an L1 -distance similarity join in the embedding,
low-dimensional vector space. The final step then performs an additional level
of AMS sketching over the stream of points in the embedding L1 vector space
in order to build a randomized, sketch-based estimate for |SimJoin(S1 , S2 , τ )|.8
The following theorem shows how an atomic sketch-based estimate can be
constructed in small space over the streaming XML data trees; to boost ac-
curacy and probabilistic confidence, several independent atomic-estimate in-
stances can be used (as in Alon et al. [1996, 1999]; Dobra et al. [2002]; see also
Theorem 2.1).
   THEOREM 6.1. Let |SimJoin(S1 , S2 , τ )| denote the cardinality of the tree-edit
distance similarity join between two XML document streams S1 and S2 , where
document distances are approximated to within a factor of $O(\log^2 b \log^* b)$ with
constant probability, and b is a (constant) upper bound on the size of each
document tree. Define $k = k(\delta, \epsilon) = O(\log(1/\delta))^{O(1/\epsilon)}$. An atomic, sketch-based
estimate for |SimJoin(S1 , S2 , τ )| can be constructed in $O(b + k(\delta,\epsilon)\log N)$ space
and $O(b \log^* b + k(\delta,\epsilon)\log N)$ time per document, where δ, ε are constants < 1
that control the accuracy of the distance estimates and N denotes the length of
the input stream(s).

8 Assuming constant-sized trees, a straightforward approach to our similarity-join problem would be
to exhaustively build all trees within a τ -radius of an incoming tree, and then just sketch (the finger-
prints of) these trees directly using AMS for the similarity-join estimate. The key problem with such
a “direct” approach is the computational cost per incoming tree: given a tree T with b nodes and an
edit-distance radius of τ , the cost of the brute-force enumeration of all trees in the τ -neighborhood
of T would be at least $O(b^{\tau})$, which is probably prohibitive (except for very small values of b and τ ).


    PROOF. Our algorithm for producing an atomic sketch estimate for the simi-
larity join cardinality uses two distinct levels of sketching. Assume an input tree
T (in one of the input streams). The first level of sketching uses our L1 embed-
ding algorithm in conjunction with the L1 -sketching technique of Indyk [2000]
(i.e., with 1-stable (Cauchy) random variates) to map T to a lower-dimensional
vector of O(k(δ, ε)) iid sketching values sketch(V (T )). This mapping of an input
tree T to a point in an O(k(δ, ε))-dimensional vector space can be done in space
O(b + k(δ, ε) log N ): this covers the O(b) space to store and parse the tree,9 and
the O(log N ) space required to generate the 1-stable random variates for each
of the O(k(δ, ε)) sketch-value computations (and store the sketch values them-
selves). (Note that O(log N ) space is sufficient since we know that there are
at most O(N b) nonzero components in all the V (T ) vectors in the entire data
stream.) A key property of this mapping is that the L1 distances of the V (T )
vectors are approximately preserved in this new O(k(δ, ε))-dimensional vector
space with constant probability, as stated in the following theorem from Indyk
[2000].
   THEOREM 6.2 (INDYK 2000). Let $f_1$ and $f_2$ denote N -dimensional numeric
vectors rendered as a stream of updates, and let $\{X_1^j, X_2^j : j = 1, \ldots, k\}$ denote
$k = k(\delta, \epsilon) = O(\log(1/\delta))^{O(1/\epsilon)}$ iid pairs of 1-stable sketches $X_l^j = \sum_{i=1}^{N} f_l(i)\,\xi_i^j$;
also, define $X_l$ as the k-dimensional vector $(X_l^1, \ldots, X_l^k)$ ($l = 1, 2$; $\{\xi_i^j\}$ are
1-stable (Cauchy) random variates). Then, the $L_1$-difference norm of the k-
dimensional sketch vectors $\|X_1 - X_2\|_1$ satisfies
(1) $\|X_1 - X_2\|_1 \geq \|f_1 - f_2\|_1$ with probability $\geq 1 - \delta$; and
(2) $\|X_1 - X_2\|_1 \leq (1 + \epsilon) \cdot \|f_1 - f_2\|_1$ with probability $\geq \epsilon$.

   Intuitively, Theorem 6.2 states that, if we use 1-stable sketches as a
dimensionality-reduction tool for L1 (that is, for mapping a point in a high,
O(N)-dimensional L1 -normed space to a lower, k-dimensional L1 -normed space,
instead of using median selection as in Theorem 2.2), then we can only provide
weaker, asymmetric guarantees on the L1 distance distortion. In short, we can
guarantee small distance contraction with high probability (i.e., 1 − δ), but we
can guarantee small distance expansion only with constant probability (i.e., ε).
(Note that the exact manner in which the δ, ε parameters control the error and
confidence in the approximate L1 -distance estimates is formally stated in The-
orem 6.2.) The reason for using this version of Indyk’s results in our similarity-
join scenario is that, as mentioned earlier, we need to perform an (approxi-
mate) streaming similarity-join computation over the mapped space of sketch
vectors, which appears to be infeasible when the median-selection operator is
used.
   The second level of sketching in our construction will produce a
pseudorandom AMS sketch (Section 2.2) of the point-distribution (in the embed-
ding vector space) for each input data stream. To deal with an L1 τ -similarity
join, the basic equi-join AMS-sketching technique discussed in Section 2.2 needs
9 Of course, for large trees, the small-space optimizations of Section 5 can be used (assuming pre-
order node arrivals).


to be appropriately adapted. The key idea here is to view each incoming “point”
sketch(V (T )) in one of the two data streams, say S1 , as an L1 region of points (i.e.,
a multidimensional hypercube) of radius τ centered around sketch(V (T )) in the
embedding O(k(δ, ε))-dimensional vector space when building an AMS-sketch
synopsis for stream S1 . Essentially, this means that when T (i.e., sketch(V (T )))
is seen in the S1 input, instead of simply adding the random variate ξi (where
the index i = sketch(V (T ))) to the atomic AMS-sketch estimate X S1 for S1 ,
we update X S1 by adding $\sum_{j \in n(i,\tau)} \xi_j$, where n(i, τ ) denotes the L1 neighbor-
hood of radius τ of i = sketch(V (T )) in the embedding vector space (i.e.,
$n(i,\tau) = \{j : \|i - j\|_1 \leq \tau\}$). Note that this special processing is only carried
out on the S1 stream; the AMS-sketch X S2 for the second XML stream S2 is up-
dated in the standard manner. It is then fairly simple to show (see Section 2.2)
that the product X S1 · X S2 gives an unbiased, atomic sketching estimate for the
cardinality of the L1 τ -similarity join of S1 and S2 in the embedding O(k(δ, ε))-
dimensional vector space.
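   The neighborhood-style update just described can be illustrated with the following C++ sketch; all identifiers are ours, the hash-based ξ variate is only a stand-in for the properly (e.g., four-wise) independent variates required for the AMS guarantees, and the enumeration of n(i, τ ) is the naive variant whose cost is discussed in the next paragraph.

```cpp
// Illustrative only: the adapted AMS update for the L1 tau-similarity join.
// Points from stream S1 contribute the sum of xi variates over their L1
// neighborhood n(i, tau); points from S2 contribute a single variate.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Point = std::vector<int64_t>;  // discretized sketch(V(T)) in k dimensions

// {+1, -1} variate attached to a grid point (hash-based stand-in).
int xi(const Point& p, uint64_t seed) {
  uint64_t h = seed;
  for (int64_t c : p) h = h * 1099511628211ULL + std::hash<int64_t>{}(c);
  return (h & 1) ? +1 : -1;
}

// Sum xi over all grid points within L1 radius `radius` of `center`.
int64_t sumOverL1Ball(const Point& center, int64_t radius, uint64_t seed,
                      std::size_t dim = 0, Point cur = {}) {
  if (dim == center.size()) return xi(cur, seed);
  int64_t total = 0;
  for (int64_t d = -radius; d <= radius; ++d) {  // spend |d| of the radius on this axis
    cur.push_back(center[dim] + d);
    total += sumOverL1Ball(center, radius - (d < 0 ? -d : d), seed, dim + 1, cur);
    cur.pop_back();
  }
  return total;
}

struct SimJoinAtomicEstimate {
  uint64_t seed;
  int64_t tau;
  double xS1 = 0.0, xS2 = 0.0;

  void addToS1(const Point& p) { xS1 += sumOverL1Ball(p, tau, seed); }  // region update
  void addToS2(const Point& p) { xS2 += xi(p, seed); }                  // standard update
  double estimate() const { return xS1 * xS2; }  // atomic estimate of |SimJoin(S1,S2,tau)|
};
```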
   In terms of processing time per document, note that, in addition to time cost
of our embedding process, the first level of (1-stable) sketching can be done in
small time using the techniques discussed by Indyk [2000]. The second level of
(AMS) sketching can also be implemented using standard AMS-sketching tech-
niques, with the difference that (for one of the two streams) updating would
require summation of ξ variates over an L1 neighborhood of radius τ in an
O(k(δ, ε))-dimensional vector space. Thus, a naive, brute force technique that
simply iterates over all these variates would increase the per-document sketch-
ing cost by a multiplicative factor of $O(|n(i,\tau)|) = O(\tau^{k(\delta,\epsilon)}) \approx O((1/\delta)^{k(\delta,\epsilon)})$ in
the worst case; however, efficiently range-summable sketching variates, as in
Feigenbaum et al. [1999], can be used to reduce this multiplicative factor to
only $O(\log |n(i,\tau)|) = O(k(\delta,\epsilon))$.

   Again, note that, by Indyk’s L1 dimensionality-reduction result (Theo-
rem 6.2), Theorem 6.1 only guarantees that our estimation algorithm ap-
proximates tree-edit distances with constant probability. In other words, this
means that a constant fraction of the points in the τ -neighborhood of a given
point could be missed. Furthermore, the very recent results of Charikar and
Sahai [2002] prove that no sketching method (based on randomized linear
projections) can provide a high-probability dimensionality-reduction tool for
L1 ; in other words, there is no analogue of the Johnson-Lindenstrauss (JL)
lemma [Johnson and Lindenstrauss 1984] for the L1 norm. Thus, there seems
to be no obvious way to strengthen Theorem 6.1 with high-probability distance
estimates.
   The following corollary shows that high-probability estimates are possible
if we allow for an extra $O(\sqrt{b})$ multiplicative factor in the distance distortion.
The idea here is to use L2 vector norms to approximate L1 norms, exploiting the
fact that each V (T ) vector has at most O(b) nonzero components, and then use
standard, high-probability L2 dimensionality reduction (e.g., through the JL
construction). Of course, a different approach that could give stronger results
would be to try to embed tree-edit distance directly into L2 , but this remains
an open problem.

   COROLLARY 6.3. The tree-edit distances for the estimation of the similarity-join
cardinality |SimJoin(S1 , S2 , τ )| in Theorem 6.1 can be estimated with high
probability to within a factor of $O(\sqrt{b}\,\log^2 b\,\log^* b)$.

7. EXPERIMENTAL STUDY
In this section, we present the results of an empirical study that we have
conducted using the oblivious tree-edit distance embedding algorithm devel-
oped in this article. Several earlier studies have verified (both analytically and
experimentally) the effectiveness of the pseudorandom sketching techniques
employed in Sections 5 and 6 in approximating join cardinalities and differ-
ent vector norms; see, for example, Alon et al. [1996, 1999]; Cormode et al.
[2002a, 2002b]; Dobra et al. [2002]; Gilbert et al. [2002a]; Indyk et al. [2000],
Indyk [2000]; Thaper et al. [2002]. Thus, the primary focus of our experimental
study here is to quantify the average-case behavior of our embedding algorithm
(TREEEMBED) in terms of the observed tree-edit distance distortion on realistic
(both synthetic and real-life) XML data trees. As our findings demonstrate, the
average-case behavior of our TREEEMBED algorithm is indeed significantly better
than that predicted by the theoretical (worst-case) distortion bounds shown ear-
lier in this article. Furthermore, our experimental results reveal several other
properties and characteristics of our embedding scheme with interesting impli-
cations for its potential use in practice. Our implementation was carried out in
C++; all experiments reported in this section were performed on a 1-GHz Intel
Pentium-IV machine with 256 MB of main memory running RedHat Linux 9.0.

7.1 Implementation, Testbed, and Methodology
   7.1.1 Implementation Details: Subtree Fingerprinting. A key point in
our implementation was the use of Karp-Rabin (KR) probabilistic finger-
prints [Karp and Rabin 1987] for assigning hash labels h(t) to valid subtrees
t of the input tree T in a one-to-one manner (with high probability). The KR
algorithm was originally designed for strings so, in order to use it for trees,
our implementation makes use of the flattened, parenthesized string repre-
sentation of valid subtrees of T to obtain the corresponding tree fingerprint
(treating parentheses as special delimiter labels in the underlying alphabet).
An important property of the KR string-fingerprinting scheme is its ability to
easily produce the fingerprint h(s1 s2 ) of the concatenation of two strings s1 and
s2 given only their individual fingerprints h(s1 ) and h(s2 ) [Karp and Rabin 1987].
This is especially important in the context of our data-stream processing algo-
rithms since, clearly, we cannot afford to retain entire subtrees of the original
(streaming) XML data tree T in order to compute the corresponding fingerprint
in the current phase of our hierarchical tree parsing—the result would be space
requirements linear in |T | for each parsing phase. Thus, we need to be able to
compute the fingerprints of valid subtrees corresponding to nodes v ∈ T i using
only the fingerprints from the nodes in T i−1 that were contracted by TREEEMBED
to obtain node v. This turns out to be nontrivial since, unlike the string case
where the only possible options are left or right concatenation, TREEEMBED can
merge the underlying subtrees of T in several different ways (Figures 3 and 4).
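   As a small illustration of this concatenation property (not the actual implementation), a KR fingerprint can carry r^{|s|} alongside the hash value so that h(s1 s2 ) is obtained from h(s1 ) and h(s2 ) in constant time; the prime and base in the sketch below are arbitrary illustrative choices.

```cpp
// Illustrative only: Karp-Rabin fingerprints over the parenthesized string form
// of a subtree, carrying R^{|s|} so that the fingerprint of a concatenation can
// be formed from the two fingerprints alone.
#include <cstdint>
#include <string>

constexpr uint64_t P = 2147483647ULL;  // prime modulus (2^31 - 1), illustrative choice
constexpr uint64_t R = 1000003ULL;     // fingerprint base, illustrative choice

struct KRFingerprint {
  uint64_t hash = 0;  // sum_i s[i] * R^i  (mod P)
  uint64_t rlen = 1;  // R^{|s|}           (mod P), needed for concatenation
};

uint64_t mulmod(uint64_t a, uint64_t b) { return (a % P) * (b % P) % P; }

KRFingerprint fingerprint(const std::string& s) {
  KRFingerprint f;
  for (unsigned char c : s) {
    f.hash = (f.hash + mulmod(c, f.rlen)) % P;  // character lands at position |prefix|
    f.rlen = mulmod(f.rlen, R);
  }
  return f;
}

// h(s1 s2) from h(s1) and h(s2) alone -- no access to the underlying strings.
KRFingerprint concat(const KRFingerprint& a, const KRFingerprint& b) {
  return {(a.hash + mulmod(b.hash, a.rlen)) % P, mulmod(a.rlen, b.rlen)};
}
```

By construction, concat(fingerprint("(a"), fingerprint("(b))")) produces the same fingerprint as fingerprint("(a(b))"), which is exactly what is needed when a contracted node inherits fingerprints from the nodes merged into it.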
Fig. 11. Example of subtree fingerprint propagation. (“⊥” denotes an empty fingerprint and “∗”
separates the left and right part of an incomplete fingerprint.)

   The solution we adopted in our TREEEMBED implementation is based on the
idea of maintaining, for each node in the current phase v ∈ T i , a collection
of subtree fingerprints corresponding to the child subtrees of v in the original
data tree T . Briefly, all of v’s fingerprints start out in an empty state, and a fin-
gerprint becomes complete (meaning that it contains the complete fingerprint
for the corresponding child subtree) once the last node along that branch is
merged into node v. Child fingerprints of v can also be in an incomplete state,
meaning that the subtree along the corresponding branch has only been par-
tially merged into v; in order to correctly merge in the remaining subtree, an
incomplete fingerprint consists of both a left and a right part that will eventu-
ally enclose the fingerprint propagated up by the rest of the subtree. The key to
our solution is, of course, that we can always compute the fingerprint of a valid
subtree at phase i by simple concatenations of the fingerprints from nodes in
phase i − 1. (Note that a sequence of complete child fingerprints can always be
concatenated to save space, if necessary.) Figure 11 illustrates the key ideas in
our subtree fingerprinting scheme following a simple example scenario of edge
contractions. To simplify the exposition, the figure uses parenthesized nodela-
bel strings instead of actual numeric KR fingerprints of these strings; again,
the key here is that fingerprints for new nodes (obtained through contractions
in the current phase) are computed by simply concatenating existing KR fin-
gerprints. Fingerprinting and merging for subtrees rooted at unlabeled nodes
(Figure 4) can also be easily handled in our scheme.
   The KR-fingerprinting scheme maps each string in an input collection of
strings to a number in the range [0, p], where p is a prime number large enough
to ensure that distinct input strings are mapped to distinct numbers with suf-
ficiently high probability. Given that the total number of valid subtrees created
during our hierarchical parsing of an input XML tree T is guaranteed to be only
O(|T |), we chose the prime p for our subtree fingerprinting to be $p = \Theta(|T|^2)$—
this clearly suffices to ensure high-probability one-to-one fingerprints in our
scheme.

   7.1.2 Experimental Methodology. One of the main metrics used in our
study to gauge the effectiveness of our tree-edit distance embedding scheme is
the distance-distortion ratio, which is defined, for a given pair of XML data trees
S and T , as the quantity $\mathrm{DDR}(S,T) = \frac{\|V(S)-V(T)\|_1}{d(S,T)}$ (where d (S, T ) is the tree-
edit distance of S and T and V (S), V (T ) are the vector embeddings computed

by our TREEEMBED algorithm). Based on our initial experimental results, we
also decided to employ a heuristic, normalized distance-distortion ratio metric
NormDDR(S, T ) in which the L1 vector distance $\|V(S) - V(T)\|_1$ is normalized by
the maximum of the depths of the parse trees (produced by TREEEMBED) for S
and T ; in other words, letting ρ(S), ρ(T ) denote the number of TREEEMBED pars-
ing phases for S and T (respectively), we define $\mathrm{NormDDR}(S,T) = \frac{\mathrm{DDR}(S,T)}{\max\{\rho(S),\rho(T)\}}$.

(We discuss the rationale behind our NormDDR metric later in this section.)
   Unfortunately, the problem of computing the exact tree-edit distance with
subtree-move operations (i.e., d (S, T ) above) turns out to be NP-hard—this is
a direct implication of the recent NP-hardness result of Shapira and Storer
[2002] for the simpler string-edit distance problem in the presence of substring
moves. Furthermore, to the best of our knowledge, no other efficient approxi-
mation algorithms have been proposed for our tree-edit distance computation
problem. Given the intractability of exact d (S, T ) computation and the lack of
other viable alternatives (the sizes of our data sets preclude any brute-force,
exhaustive technique), we decided to base our experimental methodology on the
idea of performing random tree-edit perturbations on input XML trees. Briefly,
given an XML tree T , we apply a script rndEdits() of random tree-edit opera-
tions (inserts, deletes, relabels, and subtree moves) on randomly selected nodes
of T to obtain a perturbed tree rndEdits(T ). Special care is taken in the creation
of the rndEdits() edit-script in order to avoid redundant operations. Specifically,
the key idea is to grow the rndEdits() script incrementally, storing a signature
for each randomly chosen (node, operation) combination inside a set data struc-
ture; then, once a new random (node, operation) pair is selected, we employ our
stored set of signatures together with a simple set of rules to check that the new
edit operation is not redundant before entering it into rndEdits(). Examples of
such redundant-operation checks include the following: (1) do not relabel the
same node more than once, (2) do not move the same subtree more than once, (3)
do not delete a previously inserted node, (4) do not insert a node in exactly the
same location as a previously-deleted node, and so on. Even though our set of
rules is not guaranteed to eliminate all possible redundancies in rndEdits(), we
have found it to be quite effective in practice. Finally, we compute an (approx-
imate) distance-distortion ratio DDR(T, rndEdits(T )), where d (T, rndEdits(T ))
is approximated as d (T, rndEdits(T )) ≈ |rndEdits()|, that is, the number of
tree-edit operations in our random script—since we explicitly try to avoid re-
dundant edits, this is bound to be a reasonably good approximation of the true
tree-edit distance (with moves) between the original and modified tree.
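   The signature-based redundancy checks can be sketched roughly as follows; the Edit representation, integer node identifiers, and the encoding of rule (4) via a location identifier are illustrative simplifications, not the actual rndEdits() code.

```cpp
// Illustrative only: grow the random edit script while rejecting redundant
// (node, operation) combinations, in the spirit of rules (1)-(4) above.
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

enum class Op { Relabel, Move, Insert, Delete };

struct Edit {
  Op op;
  int node;  // target node id (for Insert: the id of the insertion location)
};

class RndEditScript {
 public:
  // Records the edit only if it passes the redundancy rules; returns success.
  bool tryAdd(const Edit& e) {
    switch (e.op) {
      case Op::Relabel:
        if (seen_.count({Op::Relabel, e.node})) return false;  // rule (1)
        break;
      case Op::Move:
        if (seen_.count({Op::Move, e.node})) return false;     // rule (2)
        break;
      case Op::Delete:
        if (seen_.count({Op::Insert, e.node})) return false;   // rule (3)
        break;
      case Op::Insert:
        if (seen_.count({Op::Delete, e.node})) return false;   // rule (4), simplified
        break;
    }
    seen_.insert({e.op, e.node});
    script_.push_back(e);
    return true;
  }

  // |rndEdits()|, used as the approximation of d(T, rndEdits(T)).
  std::size_t size() const { return script_.size(); }

 private:
  std::set<std::pair<Op, int>> seen_;  // signatures of accepted (operation, node) pairs
  std::vector<Edit> script_;
};
```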

   7.1.3 Data Sets. We used both synthetic and real-life XML data trees
of varying sizes in our empirical study. These trees were obtained from (1)
XMark [Schmidt et al. 2002], a synthetic XML data benchmark intended to
model the activities of an on-line auction site (www.xml-benchmark.org/), and
(2) SwissProt, a real-life XML data set comprising curated protein sequences
and accompanying annotations (us.expasy.org/sprot/). We controlled the size
of the XMark data trees using the “scaling factor” input to the XMark data gener-
ator. SwissProt is a fairly large real-life XML data collection (of total size over
165 MB)—in order to control the size of our input SwissProt trees, we used a
Fig. 12. TREEEMBED distance distortion ratios for small (a) XMark, and (b) SwissProt data trees.

Fig. 13. TREEEMBED distance distortion ratios for medium (a) XMark, and (b) SwissProt data trees.

simple sampling procedure that randomly selects a subset of top-level <Entry>
nodes from SwissProt’s full tree with a certain sampling probability (where, of
course, larger sampling probabilities imply larger generated subtrees). For both
data sets, we partitioned the set of input data trees into three broad classes: (1)
a small class comprising trees with sizes approximately between 400 and 1200
nodes; (2) a medium class with trees of sizes approximately between 2000 and
20,000 nodes; and (3) a large class with trees of sizes approximately between
100,000 and 600,000 nodes. The number of random tree-edit operations in our
edit scripts (|rndEdits()|) was typically varied between 20–200 for small trees,
20–600 for medium trees, and 200–20,000 for large trees; in order to smooth
out randomization effects, our results were averaged over five distinct runs of
our algorithms using different random seeds for generating the random tree-
edit script. The numbers presented in the following section are indicative of our
results on all data sets tested.

7.2 Experimental Results
   7.2.1 TREEEMBED Distance Distortions for Varying Data-Set Sizes. The
plots in Figures 12, 13, and 14 depict several observed tree-edit distance-
distortion ratios obtained through our TREEEMBED algorithm for (a) XMark and
 Fig. 14. TREEEMBED distance distortion ratios for large (a) XMark, and (b) SwissProt data trees.

(b) SwissProt, in the case of small, medium, and large data trees (respectively).
We plot the distance-distortion ratio as a function of the number of random
tree edits in our edit script; thus, based on our discussion in Section 7.1, the
x axis in our plots essentially corresponds to the true tree-edit distance value
between the original and modified input trees. Our numbers clearly show that
the distortions imposed by our L1 vector embedding scheme on the true tree-
edit distance typically vary between a factor of 4–20 on small inputs, a factor of
5–30 on medium inputs, and a factor of 10–35 on the large XMark and SwissProt
trees. It is important to note that these experimental distortion ratios are obvi-
ously much better (by an order of magnitude or more) than what the pessimistic
worst-case bounds in our analysis would predict for TREEEMBED. More specifi-
cally, based on the size of the trees (n) in our experiments, it is easy to verify
that our worst-case distortion bound of $\log^2 n \log^* n$ (even ignoring all the con-
stant factors in our analysis and those in Cormode and Muthukrishnan [2002])
gives values in the (approximate) ranges 230–300 (for small trees), 360–600
(for medium trees), and 850–1,100 (for large trees); our experimental distor-
tion numbers are clearly much better.
   An additional interesting finding in all of our experiments (with both XMark
and SwissProt data) is that our tree-edit distance estimates based on the L1 dif-
ference of the embedding vector images consistently overestimate (i.e., expand)
the actual distance; in other words, for all of our experimental runs, DDR(T, S) ≥
1. Furthermore, note that the range of our experimental (over)estimation errors
appears to grow quite slowly over a wide range of values for the tree-size pa-
rameter n (for instance, when moving from trees with n ≈ 4,000 nodes to trees
with n ≈ 600,000 nodes). These observations along with a closer examination
of some of our experimental results and the specifics of our TREEEMBED embed-
ding procedure motivate the introduction of our normalized distance-distortion
ratio metric (discussed below).
  7.2.2 A Heuristic for Normalizing the L1 Difference: The NormDDR Metric.
Our experimental distance-distortion ratio numbers clearly demonstrate that
our TREEEMBED algorithm satisfies the theoretical worst-case distortion guar-
antees shown in this article, typically improving on these worst-case bounds by


well over an order of magnitude on synthetic and real-life data. Still, it is not en-
tirely clear how to interpret the importance of these numbers for real-life, prac-
tical XML-processing scenarios. Distance overestimation ratios in the range
of 5–30 are obviously quite high and could potentially lead to poor sketching-
based query estimates (e.g., for a streaming XML similarity join). Based on our
experimental observations and the details of our TREEEMBED algorithm, we now
propose a simple heuristic rule for normalizing the L1 difference of the image
vectors that could potentially be used to provide more useful tree-edit distance
estimates.
    Consider an input tree T to our TREEEMBED procedure, and let ρ(T ) denote
the number of phases in the parsing of T . Now, assume that we effect a sin-
gle edit operation (e.g., a node relabel) at the bottom level (i.e., tree T 0 = T )
of our parsing to convert T to a new tree S. It is not difficult to see that this
one edit operation is going to “hit” (i.e., affect the corresponding valid subtree
of) at least one node at each of the ρ(T ) parsing phases of T , thus resulting
in an L1 difference $\|V(T) - V(S)\|_1$ on the order of ρ(T ). In other words, even
though d (T, S) = 1, just by going through the different parsing phases, the
effect of that single edit operation on T is amplified by a factor of O(ρ(T ))
in the resulting L1 distance. Generalizing from this simple scenario, consider
a situation where T is modified by a relatively small number of edit opera-
tions (with respect to the size of T ) applied to nodes randomly spread through-
out T . The key observation here is that, since we have a small number of
changes at locations spread throughout T , the effects of these changes on the
different parsing phases of T will remain pretty much independent until near
the end of the parsing; in other words, the nodes “hit”/affected by different
edit operations will not be merged until the very late stages of our hierarchi-
cal parsing. Thus, under this scenario, we would once again expect the orig-
inal edit distance to be amplified by a factor of O(ρ(T )) in the resulting L1
distance.
    A closer examination of some of our experimental results validated the above
intuition. Remember that our rndEdits() script does in fact choose the target
nodes for tree-edit operations randomly throughout the input tree T ; further-
more, as expected, the impact of the parse-tree depth ρ(T ) on the approximate
tree-edit distance estimates is more evident when the number of edit opera-
tions in rndEdits() is relatively small compared to the size of T . This obvi-
ously explains the clear downward trend for the distance-distortion ratios in
Figures 12–14.
    Based on the above discussion, we propose normalizing the L1 distance of
the image vectors in our embedding by the maximum parse-tree depth; that
is, we estimate d (S, T ) using the ratio $\frac{\|V(S)-V(T)\|_1}{\max\{\rho(S),\rho(T)\}}$. Figure 15 depicts our ex-
perimental numbers for the corresponding normalized distance-distortion ra-
tio $\mathrm{NormDDR}(S,T) = \frac{\mathrm{DDR}(S,T)}{\max\{\rho(S),\rho(T)\}}$ for several XMark and SwissProt data trees of
varying sizes. Clearly, the normalized L1 distance gives us much better tree-
edit distance estimates in our experimental setting, typically ranging between
a factor of 0.5 and 2.0 of the true tree-edit distance. Such distortions could be
acceptable for several real-life application scenarios, especially when dealing

Fig. 15. TREEEMBED normalized distance distortion ratios for (a) XMark, and (b) SwissProt data
trees.


with data collections with well-defined, well-separated structural clusters of
XML documents (as we typically expect to be the case in practice).
   Of course, we should stress that normalizing the L1 distance estimate by the
parse-tree depth is only a heuristic solution that is not directly supported by the
theoretical analysis of TREEEMBED (Section 4). This heuristic may work well for
the case of a small number of randomly-spread edit operations; however, when
such operations are “clustered” in T or their number is fairly large with respect
to |T |, dividing by ρ(T ) may result in significantly underestimating the actual
tree-edit distance (see the clear trend in Figure 15). Still, our normalization
heuristic may prove useful in certain scenarios, for example, when dealing with
streams of large XML documents that, based on some prior knowledge, cannot
be radically different from each other (i.e., they are all within an edit-distance
radius which is much smaller than the document sizes).

   7.2.3 Effect of Tree Depth. SwissProt is a fairly shallow XML data set (of
maximum depth ≤ 5); thus, to study the potential effect of tree depth on the
estimation accuracy of our embedding we concentrate solely on trees produced
from the XMark data generator. More specifically, our methodology is as follows.
We generate large (400,000-node) XMark data trees and, for a given value of
the tree-depth parameter, we prune all nodes below that depth. Then, we make
sure that the resulting pruned trees T at different depths all have the same
approximate target size t using the following iterative rule: while |T | is larger
(smaller) than t pick a random node x in T and delete (respectively, replicate)
its subtree at x’s parent (making, of course, sure that the tree resulting from
this operation is not too far from our target size and that the depth of the tree
does not change). Finally, we run our rndEdits() scripts on these pruned trees
with varying numbers of specified edit operations, and measure the observed
normalized and unnormalized distance-distortion ratios for each depth value.
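   The pruning and size-adjustment steps of this methodology could look roughly as follows; the Node representation, the 5% size slack, the attempt cap, and the omission of the depth-preservation check are our own simplifications for illustration.

```cpp
// Illustrative only: prune an input tree below a given depth, then randomly
// delete or replicate subtrees until the tree is close to a target size.
#include <cstddef>
#include <cstdlib>
#include <random>
#include <utility>
#include <vector>

struct Node {
  int label = 0;
  std::vector<Node> children;
};

int subtreeSize(const Node& n) {
  int s = 1;
  for (const Node& c : n.children) s += subtreeSize(c);
  return s;
}

// Drop all nodes strictly below `maxDepth` (the root is at depth 1).
void pruneBelow(Node& n, int maxDepth, int depth = 1) {
  if (depth >= maxDepth) { n.children.clear(); return; }
  for (Node& c : n.children) pruneBelow(c, maxDepth, depth + 1);
}

// Handles (parent pointer, child index) for every non-root subtree.
void collectHandles(Node& n, std::vector<std::pair<Node*, std::size_t>>& out) {
  for (std::size_t i = 0; i < n.children.size(); ++i) {
    out.push_back({&n, i});
    collectHandles(n.children[i], out);
  }
}

// Move |T| toward `target` by deleting (if too large) or replicating (if too
// small) randomly chosen subtrees, rejecting steps that would overshoot.
void adjustToTargetSize(Node& root, int target, std::mt19937& gen) {
  int slack = target / 20;  // "not too far" from the target size (5%)
  for (int attempts = 0; attempts < 100000; ++attempts) {
    int size = subtreeSize(root);
    if (std::abs(size - target) <= slack) return;
    std::vector<std::pair<Node*, std::size_t>> handles;
    collectHandles(root, handles);
    if (handles.empty()) return;
    std::uniform_int_distribution<std::size_t> pick(0, handles.size() - 1);
    auto [parent, idx] = handles[pick(gen)];
    int delta = subtreeSize(parent->children[idx]);
    if (size > target && size - delta >= target - slack) {
      parent->children.erase(parent->children.begin() + idx);   // delete subtree at x
    } else if (size < target && size + delta <= target + slack) {
      Node copy = parent->children[idx];                        // replicate subtree...
      parent->children.push_back(std::move(copy));              // ...under x's parent
    }
  }
}
```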
   The plots in Figure 16 depict the observed unnormalized and normalized
distance-distortion ratios as a function of tree depth for a pruned-tree target size
of 100,000 nodes and for different numbers of tree-edit operations. Our experi-
mental numbers clearly indicate that the estimation accuracy of our embedding
Fig. 16. TREEEMBED unnormalized (a) and normalized (b) distance distortion ratios as a function
of tree depth for 100,000-node (pruned) XMark data trees.


scheme does not have any direct dependence on the depth of the input tree(s)—
the key experimental parameters affecting the quality of our estimates appear
to be the size of the tree and the number of tree-edit operations.

   7.2.4 Using TREEEMBED for Approximate Document Ranking. We now ex-
perimentally explore a different potential use for our XML-tree signatures in
the context of approximate XML-document ranking based on the tree-edit dis-
tance similarity metric. In this setup, we are given a target XML document
T and a number of incoming XML documents that are within different tree-
edit distance ranges from T . The goal is to quickly rank incoming documents
based on their tree-edit distance from T , such that if d (S1 , T ) < d (S2 , T ) then
S1 is ranked “higher” than S2 . Since computing the exact tree-edit distances
can be very expensive computationally, we would like to have efficient, easy-to-
compute tree-edit distance estimates that can be used to approximately rank
incoming documents. Our idea is to use TREEEMBED to produce L1 vector signa-
tures for both the target document T and each incoming document Si , and use
the L1 distances $\|V(T) - V(S_i)\|_1$ for the approximate ranking of Si ’s. The key
observation here, of course, is that, for effective document ranking, it is cru-
cial for our estimation techniques to preserve the relative ranking of individual
tree-edit distances (rather than to accurately estimate each distance). Our ex-
perimental results demonstrate that our embedding schemes could provide a
useful tool in this context.
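   A minimal rendering of this ranking procedure is sketched below, assuming the TREEEMBED image vectors are available as sparse maps of nonzero components; the Doc structure and function names are illustrative assumptions, not part of the actual implementation.

```cpp
// Illustrative only: rank incoming documents by the (optionally normalized) L1
// distance between their TREEEMBED image vectors and that of the target document.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using ImageVec = std::unordered_map<uint64_t, double>;  // nonzero components of V(.)

double l1Distance(const ImageVec& a, const ImageVec& b) {
  double d = 0.0;
  for (const auto& [k, v] : a) {
    auto it = b.find(k);
    d += std::fabs(v - (it == b.end() ? 0.0 : it->second));
  }
  for (const auto& [k, v] : b)
    if (!a.count(k)) d += std::fabs(v);  // components present only in b
  return d;
}

struct Doc {
  std::string name;
  ImageVec image;   // V(S_i) produced by TREEEMBED
  int parseDepth;   // rho(S_i), for the normalized variant
};

// Returns the documents ordered from "closest" to "farthest" from the target.
std::vector<Doc> rankByDistance(const ImageVec& target, int targetDepth,
                                std::vector<Doc> docs, bool normalized) {
  std::vector<std::pair<double, Doc>> scored;
  scored.reserve(docs.size());
  for (Doc& d : docs) {
    double dist = l1Distance(target, d.image);
    if (normalized) dist /= std::max(targetDepth, d.parseDepth);
    scored.emplace_back(dist, std::move(d));
  }
  std::sort(scored.begin(), scored.end(),
            [](const auto& x, const auto& y) { return x.first < y.first; });
  std::vector<Doc> ranked;
  ranked.reserve(scored.size());
  for (auto& entry : scored) ranked.push_back(std::move(entry.second));
  return ranked;
}
```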
   For our document-ranking experiments, we vary the size of the target docu-
ment T between 10,000 and 200,000 nodes. For a given target T and tree-edit
distance d , we generate 40 different trees Si at distance d from T (using dif-
ferent runs of our rndEdits() script). We vary the tree-edit distance d in three
distinct ranges (10–50, 100–500, and 1000–3000) and, for a given value of d ,
we measure the observed range of (a) L1 distances $\|V(T) - V(S_i)\|_1$, and (b)
normalized L1 distances $\frac{\|V(T)-V(S_i)\|_1}{\max\{\rho(T),\rho(S_i)\}}$, over the corresponding set of 40 Si trees.
Our experimental results for 50,000-node XMark and SwissProt data trees are
shown in Table I. Note that in almost all cases, the approximate tree-edit dis-
tance ranges provided by our two L1 -distance metrics for the Si sets (1) are

               Table I. Approximate Document-Ranking Results: 50K-Node XMark and
                                      SwissProt Data Trees
                                XMark Data                                SwissProt Data
           d (T, Si )     L1 Distance    Normalized L1 Distance    L1 Distance    Normalized L1 Distance
              10          386–822           22.7–48.3          325–562       20.3–35.1
              20          766–1083           45–63.7           688–883         43–55.2
              30          994–1417          58.4–83.3          901–1212      56.3–75.7
              40         1318–1580          77.5–92.9         1258–1542      78.6–96.4
              50         1499–1915          88.1–112.6        1400–1667      87.5–104.2
              100        2581–3194         151.8–187.8        2461–2792     153.8–174.5
              200        4519–4992         265.8–293.6        4278–4831     267.4–301.9
              300        6181–6571         363.5–386.5        6040–6294     377.5–393.4
              400        7700–8411         452.9–494.7        7437–8126     464.8–507.9
              500        8940–9653         525.8–567.8        8696–9246     543.5–577.9
          1000          15278–16083       898.7–946          14615–15443    913.4–965.2
          1500          20933–21393      1231.3–1258.4       19357–20171   1209.8–1260.7
          2000          25114–25974     1477.29–1527.9       27599–27916   1724.9–1744.7
          2500          29537–30251      1737.4–1779.4       28562–29331   1785.1–1822.2
          3000          33228–34199      1954.6–2011.7       30545–31452     1909–1965.7



completely disjoint and (2) preserve the ranking of the corresponding true edit
distances d (T, Si )—this, of course, implies that our L1 estimates correctly rank
all the Si input trees in most of our test cases. The only situation where our
observed L1 -estimate ranges show some (typically small) overlap is for very
small differences in tree-edit distance, that is, when |d (T, Si ) − d (T, S j )| = 10
in Table I. Thus, for such small edit-distance separations (remember that we
are dealing with 50,000-node trees), it is possible for our L1 estimates to mis-
classify certain input documents; still, a closer examination of our results shows
that, even in these cases, the percentage of misclassifications is always below
17.5% (i.e., at most 7 out of 40 documents).
   It is worth noting that our approximate document-ranking setup is, in fact,
closely related to a simple version of the approximate similarity-join scenar-
ios discussed in Section 6. In a sense, our goal here is to correctly identify the
“closest” approximate duplicates of a target document T in a collection of input
documents Si that are within different tree-edit distances of T —these closest
duplicates essentially represent the subset of Si documents that would join
with T (for an appropriate setting of the similarity threshold to account for the
distance distortion; see Theorem 6.1). Thus, assuming that the L1 /AMS sketch-
ing techniques developed in Section 6 correctly preserve the L1 -distance ranges
of the underlying image vectors, our ranking results provide an indication of
the percentages of false positives/negatives in the approximate similarity-join
operation (based on the overlap between different distance ranges), and the
required “distance separation” between document clusters in the joined XML
streams to suppress such estimation errors.

   7.2.5 Running Time and Space Requirements. Table II depicts the ob-
served running-times and memory footprints for our TREEEMBED embedding

       Table II. TREEEMBED Running Times and Memory Footprints: XMark Data Trees
                                             TREEEMBED             TREEEMBED
         Tree Size      Document Size       Running Time         Memory Footprint
          20K nodes         1.1 MB               2.5 s                 3.0 MB
          50K nodes         2.9 MB               6.3 s                 7.9 MB
         100K nodes         5.7 MB              12.7 s                16.7 MB
         150K nodes         8.9 MB              19.7 s                25.0 MB
         200K nodes        11.7 MB              26.5 s                33.3 MB
         250K nodes        14.5 MB              34.4 s                41.9 MB
         300K nodes        17.4 MB              41.2 s                49.1 MB
         350K nodes        20.5 MB              49.9 s                57.9 MB
         400K nodes        23.5 MB              58.2 s                66.4 MB


algorithm over XMark data trees of various sizes (the results for SwissProt are
very similar and are omitted). We should, of course, note here that our current
TREEEMBED implementation does not employ the small-space optimizations dis-
cussed in Section 5 (that is, we always build the full XML tree in memory);
still, as our numbers show, the memory requirements of our scheme grow only
linearly (with a small constant factor ≤3) in the size of the input document.
Furthermore, our embedding algorithm gives very fast running times; for in-
stance, our TREEEMBED code takes less than 1 min to build the L1 vector image
of a 400,000-node XML tree. Thus, once again, compared to computationally
expensive, exact tree-edit distance calculations, our techniques can provide a
very efficient, approximate alternative.

8. CONCLUSIONS
In this article, we have presented the first algorithmic results on the problem of
effectively correlating (in small space) massive XML data streams based on ap-
proximate tree-edit distance computations. Our solution relies on a novel algo-
rithm for obliviously embedding XML trees as points in an L1 vector space while
guaranteeing a logarithmic worst-case upper bound on the distance distortion.
We have combined our embedding algorithm with pseudorandom sketching
techniques to obtain novel, small-space algorithms for building concise sketch
synopses and approximating similarity joins over streaming XML data. An em-
pirical study with synthetic and real-life data sets has validated our approach,
demonstrating that the behavior of our embedding scheme over realistic XML
trees is much better than what would be predicted based on our worst-case dis-
tortion bounds, and revealing several interesting properties of our algorithms
in practice. Our embedding result also has other important algorithmic ap-
plications, for example, as a tool for very fast, approximate tree-edit distance
computations.

APPENDIX: ANCILLARY LEMMAS FOR THE UPPER-BOUND PROOF
In this section, we complete the upper bound proof for the distortion of our
embedding algorithm. Recall the terminology of Section 4.4. We have defined
sets M i and P i as subsets of T i and (T1 + T2 )i , respectively. We have assumed
by induction that for every x ∈ T i − M i there exists a unique f (x) ∈ (T1 + T 2 )i

such that the trees x(T ) and f (x)(T1 + T2 ) look identical. In other words, if
we forget about the regions M i and P i , the remaining forests are basically
the same. Now, we want to prove this fact for the next stage of parsing. In
order to do this, we defined another region N i which encloses M i and in some
way represents the region which can get influenced by M i in the next stage of
parsing.
So we pick a node w ∈ T i − N i , and let w′ be the corresponding node in (T1 + T2 )i .
So we know that w(T ) and w′ (T1 + T2 ) are the same trees. Now, w gets absorbed
in a node q in T i+1 and w′ in q′ in (T1 + T2 )i+1 . Our goal now is to show that q(T )
and q′ (T1 + T2 ) are identical trees. We will do this by following a natural
procedure—we will show that w and w′ get parsed in exactly the same manner,
that is, if w gets merged with a leaf, then the same thing happens to w′ . But
note that it is not enough to just show this fact. For example, if w and w′ get
merged with leaves l and l′ , we have to show that l (T ) and l′ (T1 + T2 ) are also
identical subtrees.
   Thus, we first need to go through a set of technical lemmas, which show that
the mapping f preserves the neighborhood of w as well. For example, we need
to show facts like: the parent of w and the parent of w′ get associated by f as well. So,
we need to explore the properties of f and first show that it preserves sibling
relations and parent-child relations. Once we are armed with these lemmas,
we just need to prove the following facts:

— If w is a leaf and is merged with its parent, then the same happens to w′ .
— If w is a leaf and is merged with some of its leaf siblings, then w′ is also a
  leaf and gets merged with the corresponding siblings.
— If w has a leaf child which gets merged into it, then w′ also has a correspond-
  ing leaf child which gets merged into it.
— If w is a degree-2 node in a chain, and gets merged with some other such
  nodes, then the same fact applies to w′ .

  Clearly, the above facts will be enough to prove the result we want. As we
mentioned before, we need to analyze some properties of the parsing algorithm
and the function f , so that we can set up a correspondence between neighbor-
hoods of w and w′ . We proceed to do this first.
  We first show the connection between a node w and the associated tree w(T ).
The following fact is easy to see.

  CLAIM A.1. Let x and y be two distinct nodes in T i . x is a parent of y iff
x(T ) contains a node which is the parent of a node in y(T ).

   PROOF. Proof is by induction on i. Let us say the fact is true for i −1. Suppose
x is a parent of y in T i . Let X be the set of nodes in T i−1 which got merged
to form x. Define Y similarly. Then there must be a node in X which is the
parent of a node in Y . The rest follows by induction on these two nodes. The
reverse direction is similar.

   LEMMA A.2. Suppose x ∈ T i is a leaf. Then x(T ) has the following property—
if y ∈ x(T ), then all descendants of y are in x(T ).

   PROOF. Suppose not. Then there exist nodes a, b ∈ T such that a is the
parent of b, and yet a ∈ x(T ), b ∉ x(T ). Suppose b is in a node y in T i . But then
the claim above implies that x is not a leaf, a contradiction.
   Now, we go on to consider the case when x has at least two children. Consider
the nodes in T i−1 which formed x. Only two cases can happen—either x was
present in T i−1 or T i−1 contains a node u with at least two children and x is
obtained by collapsing a leaf child into u. In either case, x corresponds to a
unique node with at least two children in T i−1 . Carrying this argument back
all the way to T 0 , we get the following fact.
   CLAIM A.3. Let x be a node with at least two children in T i . Then x(T ) is
a subtree of T which looks as follows: there is a unique node with at least two
children, call it x0 , such that all nodes in x(T ) are descendants of x0 . Further, if
y ∈ x(T ), y ≠ x0 , then all descendants of y (in T ) are also in x(T ).
   The proof of the fact above is again by induction and using the previous two
claims.
   CLAIM A.4. Let x and y be two nodes in T i . Suppose x is a sibling of y.
Then, x(T ) contains a node which is a sibling of a node in y(T ). Conversely, if
x(T ) contains a node which is a sibling of a node in y(T ), then x and y are
either siblings or one of them is the parent of the other.
   PROOF. Suppose x is a sibling of y. Let w be their common parent. w has
at least two children. By Claim A.3, w(T ) contains a node w0 such that if z ∈
w(T ), z ≠ w0 , then w(T ) contains all descendants of z.
   Claim A.1 implies that there is a node a ∈ x(T ) and a node b ∈ w(T ) such
that a is a child of b. We claim that b = w0 . Indeed, otherwise all descendants
of b, in particular, a, should have been in w(T ). Similarly, there is a node c in
 y(T ) whose parent is w0 . But then a and c are siblings in T .
   Conversely, suppose there is a node in x(T ) which is a sibling of a node in
y(T ). Let the common parent of these two nodes in T be w. Let w′ be the node
in T i containing w. If w′ is x, then x is the parent of y. So, assume w′ is not x
or y. It follows from Claim A.1 that w′ is the parent of x and y. So, x and y are
siblings.
   Recall that we associate a set P i with M i . We already know that M i is a
connected subtree. Of course, we can not say the same for P i because T1 + T2
itself is not connected. But we can prove the following fact.
   LEMMA A.5. $P^i$ restricted to $T_1^i$ or $T_2^i$ is a connected set.
    PROOF. Suppose P i is not connected. Then there exist two nodes x, y in the
same component of (T1 + T2 )i , such that x, y ∈ P i but at least one internal
node in the path between x and y is not in P i . We can in fact assume that all
internal nodes in this path are not in P i (otherwise, we can replace x and y
by two nodes on this path which are in P i but none of the nodes between them
are in P i ). Let this path be x, a1 , . . . , an , y. Let bi ∈ T i − M i be such that
 f (bi ) = ai .

    First observe that bi is adjacent with bi−1 . Indeed, suppose ai is the parent of
ai−1 (the other case will be ai is a child of ai−1 , which is similar). By Claim A.1,
there is a node in ai (T1 + T2 ) which is the parent of a node in ai−1 (T1 + T2 ). Since
ai (T1 + T2 ) = bi (T ) and ai−1 (T1 + T2 ) = bi−1 (T ), we see that there is a node in
bi (T ) which is the parent of a node in bi−1 (T ). Applying Claim A.1, we see that
bi is adjacent with bi−1 . Thus, b1 , . . . , bn is a path.
    Since x is adjacent with a1 , there is a node x0 in x(T1 + T2 ) which is adjacent
with a node c1 in a1 (T1 + T2 ) (again, using Claim A.1). If x0 is not the root of
T1 , then x0 is also a node in T . So, there is a node x ∈ T i , such that x0 ∈ x (T ).
Clearly x ∈ M i ; otherwise, f (x ) must be x (since f (x )(T1 + T2 ) and x(T1 + T2 )
will share the node x0 and so must be the same). Now, x must be adjacent with
b1 (because x0 is adjacent with c1 in T —note that c1 is a node in T as well).
The other case arises when x0 is the root of T1 . So, x0 is the parent of c1 . But
then v is the parent of c1 in T . Let x be the node containing v in the tree T i .
So x ∈ M i and is adjacent with b1 .
    Thus, we get a node x ∈ M i , such that x is adjacent with b1 . In fact, if x
is the parent (child) of a1 , then the same applies to x and b1 (and vice versa).
Similarly, there is a node y ∈ M i adjacent with bn . So if x and y are different
nodes in M i , then this contradicts the fact that M i is connected. So x = y .
To avoid any cycles in T i , it must be the case that b1 = · · · = bn . First observe
that, in this case, x and y are children of a1 . The only other possibility is that
x is the parent of a1 and y a child of a1 —but then x is the parent of b1 and y
a child of b1 and so x , y cannot be the same nodes.
    Thus, we have that x, y are children of a1 and x is a child of b1 . By Claim A.3,
a1 (T1 + T2 ) has a node a which is the parent of a node x0 ∈ x(T1 + T2 ) and a node
 y 0 ∈ y(T1 + T2 ). By definition of x (and y ), x0 , y 0 ∈ x (T ). Further, a ∈ b1 (T ).
Consider the largest integer i such that the nodes containing x0 and y 0 in the
tree T i were different—call these nodes x and y . Let b be the node containing
a . So x and y are children of b . Also i < i.
    When we parse T i , we merge x and y into a single node, x ∈ T i +1 .
However, in (T1 + T2 )i +1 , the nodes containing x0 and y 0 are different. So,
x ∈ M i +1 . So one of the nodes of T i which merged into x must have been in
N i . Since x and y are siblings and we merge them, it must be the case that
they are leaves. Thus, the only nodes in T i which get merged to form x are leaf
children of b . So one of these leaf children is in N i . Since N i is a connected
set and has size at least 2, b ∈ N i . But then b1 ∈ M i , a contradiction.

   We now show the fact that f preserves parent-child and sibling relations.

   LEMMA A.6. Suppose x and y are two nodes in T i − M i . If x is the parent of
y, then f (x) is the parent of f ( y). If x is a sibling of y, and f (x), f ( y) are in
the same component of (T1 + T2 )i , then f (x) is a sibling of f ( y). The converse of
these facts is also true.

   PROOF. Suppose x is the parent of y. By Claim A.1, there is a node x′ ∈ x(T )
and a node y′ ∈ y(T ) such that x′ is the parent of y′ in T . x′ and y′ are nodes in
T1 ∪ T2 as well. Unless x′ = v, x′ is the parent of y′ in T1 ∪ T2 as well. If x′ = v,
x will be in M i , which is not the case.

   Thus, x′ is the parent of y′ in (T1 ∪ T2 ). Since f (x)(T1 + T2 ) = x(T ), f ( y)(T1 +
T2 ) = y(T ), another application of Claim A.1 to T1 ∪ T2 implies that f (x) is a
parent of f ( y).
   The other fact can be proved similarly using Claim A.4. The converse can be
shown similarly.

   LEMMA A.7. Suppose x is a leaf node in T i − M i . Then f (x) is a leaf in
(T1 + T2 )i − P i . The converse is also true.

   PROOF. Suppose f (x) has a child z in (T1 + T2 )i . So there are nodes a ∈
w (T1 + T2 ) and b ∈ z (T1 + T2 ) such that a is the parent of b in T1 ∪ T2 . Let z
be the node in T i which contains b. Since f (x)(T1 + T2 ) = x(T ), x should be the
parent of z, which is a contradiction. So f (x) is a leaf as well. The converse can
be shown similarly.

  LEMMA A.8. Suppose a node u in M i has at least two children. Let x be a
child of u. If there is a node in x(T ) which is an immediate sibling of a node in
u(T ), then x is a corner node.

    PROOF. By Claim A.3, u(T ) has a node u0 such that any other node in u(T )
has all its descendants in u(T ). Now, x(T ) has a node x0 and u(T ) has a node
u1 such that x0 and u1 have a common parent. So this common parent must be
u0 . Consider the highest i′ for which the node in T i′ containing u0 was distinct
from the node in T i′ containing u1 —call these y and z, respectively. Clearly,
i′ < i. While parsing T i′ , we moved z up to its parent y. But then we must
have marked all immediate siblings of z as corner nodes. In particular, the
node containing x0 must have been a corner node. This implies that x must be
a corner node.

  We now state a useful property of the set M i .

   LEMMA A.9. Suppose x is a node in M i which has at least two children and
at least one child of x is not in M i . Then x is either ci or vi . Similarly, if x is a
node in N i which has at least two children such that at least one of them is not
in N i , then x is either vi or the center node in N i .

    PROOF. The proof follows easily by induction on i. When i = 1, M i is simply
vi , so there is nothing to prove. So assume the induction hypothesis is true for
some value of i. The only case when N i − M i will have a node with more than two
children is when the center z of N i is different from ci . If ci = vi , we have
nothing to prove because the new center of M i+1 will be the node containing z.
So assume that ci is not the same as vi . But then, all children of ci must be in M i
(only then we shall move the center for N i to a new node). So the node containing
ci in M i+1 will have all its children in M i+1 . This proves the lemma.

   Recall that w is defined to be a node in T i − N i and w = f (w) ∈ (T1 + T2 )i .
We want to show that w and w are parsed in the same manner, that is, q(T ) =
q (T1 + T2 ) (using the notation in Section 4.4). We now show that the two nodes
will be parsed identically.
Fig. 17. Proof of Lemma A.11 when w ∈ P i .


  LEMMA A.10. Suppose w is a leaf (so w′ is a leaf as well). If w is a lone leaf child of its parent, then so is w′. The converse is also true.

    PROOF. Suppose w is a lone leaf child of its parent, call it u. Let u′ be the parent of w′ in (T1 + T2)^i. Suppose, for the sake of contradiction, that w′ is not a lone child of u′. So it has an immediate, say left, sibling w″, which is also a leaf.
    We first argue that w″ ∈ P^i. Suppose not. Then w″ corresponds to a node x in T^i − M^i, that is, w″ = f(x). So x is also a leaf and a left sibling of w. But w is a lone leaf child of u, so there must be a nonleaf child of u, call it y, between x and w. Observe that y ∈ M^i; otherwise f(y) would lie between w′ and w″. Since this holds for all nodes y between w and x, we should have added w to N^i (according to Rule (i)), a contradiction.
    So it follows that w″ ∈ P^i. Let x be the immediate left sibling of w (if any); x ∉ M^i, otherwise we would have added w to N^i. So f(x) is a left sibling of w′ (if they are in the same component), and w″ lies between f(x) and w′. Since there is no node between x and w in T^i, all nodes in w″(T1 + T2) (we can think of these as nodes of T) must be part of u(T). But then, by Lemma A.8, w should have been a corner node and should have been added to N^i. The converse can be shown similarly.

   LEMMA A.11. Suppose w is the leftmost lone leaf child of its parent u. Then u ∉ M^i. Further, w′ is the leftmost lone leaf child of its parent u′.

    PROOF. Suppose u ∈ M^i. Then we would have added w to N^i. Define u′ = f(u). Then u′ is the parent of w′ (using Claim A.1). We already know that w′ is a lone leaf child of u′ (using the lemma above). Suppose it is not the leftmost such child. Let w″ be a lone leaf child of u′ which is to the left of w′. First we argue that w″ ∉ P^i.
    Suppose w″ ∈ P^i (see Figure 17). Consider the nodes in w″(T1 + T2). One of these nodes must be a child of a node in u′(T1 + T2). Let this node be z0. z0 is also a node in T. Further, z0 is a child of a node in u(T) because u(T) = u′(T1 + T2). Let z be the node in T^i containing z0. We claim that z ∈ M^i. Otherwise, f(z)(T1 + T2) = z(T), so z0 ∈ f(z)(T1 + T2). But z0 ∈ w″(T1 + T2). Then it must be the case that w″ = f(z). But then w″ ∉ P^i, a contradiction. So z ∈ M^i. Note that z is a child of u.
    Let y be a descendant of z in T^i. Suppose y ∉ M^i. Then f(y) is a node in (T1 + T2)^i such that y(T) = f(y)(T1 + T2). Since y is a descendant of u, f(y) is also a descendant of u′. If all nodes between f(y) and u′ were not in P^i, then we would get a path between y and u in T^i such that no internal node is in M^i. But this is not true (since z ∈ M^i). Thus, there is an internal node on this path which is in P^i; call this node z′.
    So we have the situation that u′ has a leaf child w″ and another descendant z′ which are in P^i. But u′ ∉ P^i. This violates the fact that P^i is connected (Lemma A.5). So all descendants of z are in M^i. But then we would have added u to N^i and, consequently, w to N^i.
    Thus, w″ ∉ P^i. So there is a node a ∈ T^i − M^i such that f(a) = w″. So a is a leaf as well. Suppose a ∈ N^i. Since N^i has at least two nodes and N^i is a connected set, it must be the case that u ∈ N^i. Note that we never add a node with at least two children to N^i using Rules (i)–(iv). So it must be the case that u is the center of N^i. But then we would have added w to N^i, a contradiction.
    Thus, it follows that a ∉ N^i. But then, by Lemma A.10, a is a lone leaf child (of u) lying to the left of w, contradicting the fact that w is the leftmost lone leaf child of u.
  The lemma above shows that if w is merged with its parent, then so is w′, and q(T) = q′(T1 + T2).
   LEMMA A.12. Suppose w is a leaf node. Let its immediate siblings on the
right (left) be w0 = w, w1 , . . . , wk , where all nodes except perhaps wk are leaves.
Further, suppose k < . Then w0 , . . . , wk are not in M i . Moreover, the immediate
right (left) siblings of f (w) in (T1 + T2 )i are f (w0 ), f (w1 ), . . . , f (wk ).
   PROOF. Let u be the parent of w. If wj ∈ M^i, then w would be added to N^i. So none of the nodes w0, . . . , wk is in M^i. All we have to show now is that f(wj) is the immediate left sibling of f(wj+1). Suppose not. Let x′ be a sibling between f(wj) and f(wj+1). All nodes in x′(T1 + T2) must be in u(T). Lemma A.8 now implies that wj must be a corner node. But then w should be in N^i, a contradiction again.
  The lemma above shows that, if w was merged with a set of siblings, then w′ will be merged with the same set of siblings.
   LEMMA A.13. Suppose w is a node with at least two children. Then f(w) also has at least two children. Let y be the leftmost lone leaf child of w. Then y ∉ M^i and f(y) is the leftmost lone leaf child of f(w).
    PROOF. Suppose w has a child u such that all descendants of u are in M^i. Then w would be added to N^i, which is not the case.
    So, for any two children u1 and u2 of w, not all descendants of u1 or of u2 are in M^i. Since M^i is a connected set, it can contain at most one of u1 and u2. So suppose u1 ∉ M^i. Then f(u1) is a child of f(w). Further, let x be a descendant of u2 which is not in M^i. Then f(x) is a descendant of f(w), but not a descendant of f(u1).
    So f(w) has at least two children. We claim that y ∉ N^i. Indeed, if y ∈ N^i, then the fact that N^i is a connected set implies that N^i = {y}. So M^i = {y}. But then w ∈ N^i, a contradiction. So y ∉ N^i. But then Lemma A.11 implies that f(y) is the leftmost lone leaf child of f(w) as well.
  The lemma above implies that if w is merged with a leaf child, then w′ is also merged with the same leaf child.
    LEMMA A.14. Suppose w is a degree-2 node. Let (w0 = w, w1 , . . . , wk ) be
an ancestor to descendant path of length at most               − 1, such that all nodes
except perhaps wk are of degree-2. Then w0 , . . . , wk−1 are in T i − M i . Further,
 f (w0 ), . . . , f (wk−1 ) forms a path of degree-2 nodes in (T1 + T2 )i .
    If wk is a degree-2 node, then wk ∉ M^i and f(wk) is adjacent to f(wk−1). If wk has degree at least 3, then the neighbor of f(wk−1) other than f(wk−2) is of degree at least 3 as well.
    PROOF. First consider the case when w0, . . . , wk are nodes of degree 2. None of them can be in M^i; otherwise w ∈ N^i. Thus, f(w0), . . . , f(wk) is also a chain in (T1 + T2)^i. Now we need to argue that these nodes have degree 2 as well. But this is true from the fact that each wj and f(wj) represent the same tree in T.
    So suppose wk has at least two children. If wk ∉ M^i, then f(wk) also has at least two children. Hence assume that wk ∈ M^i. If w is an ancestor of wk, then w will be added to N^i. So assume w is a descendant of wk. Further, if all descendants of wk which are not descendants of wk−1 are in M^i, then w ∈ N^i. So wk has a descendant x such that x ∉ M^i and x is not a descendant of wk−1.
    Now consider the tree wk(T): since wk has at least two children, wk(T) has a unique highest node, z, such that all nodes in wk(T) are descendants of z. There is a node in wk−1(T) which is a child of z. Let wk′ be the node in (T1 + T2)^i containing z. Then wk′ is the parent of f(wk−1). Further, f(x) is a descendant of wk′, but not a descendant of f(wk−1). So wk′ has degree at least 3.
   The lemma above shows that if w is a degree-2 node which is merged with some other degree-2 nodes, then w′ will be merged with the same nodes. One final case remains.
   LEMMA A.15. Suppose w is not merged with any node in T^i. Then w′ is also not merged with any node in (T1 + T2)^i.
    PROOF. First consider the case when w is a leaf. Let the parent of w be u. We know that w′ is also a leaf. Let u′ be its parent. If w′ has siblings which are leaves, then Lemma A.10 implies that w is also not a lone leaf. But then w would be merged with a sibling, which is a contradiction. So w′ must be a lone child of u′.
    Suppose w′ is the leftmost lone leaf child of u′. Let x be the leftmost lone leaf child of u. We know that x ≠ w; otherwise w would merge with its parent. If x ∉ N^i, then Lemma A.11 implies that f(x) is the leftmost lone leaf child of u′, which is not true because f(w) = w′ and f is 1-1. So x ∈ N^i.
    Now we claim that u ∈ N^i. Indeed, if not, the fact that N^i is a connected set implies that N^i = {x}. But M^i is a subset of N^i, and so M^i must also be {x}. But then u will be added to N^i.
    So we can assume u ∈ N^i. Now, Lemma A.9 implies that u is either v^i or the center of N^i. Suppose u is the center z of N^i. Let y be the leftmost lone leaf child of u which is not in M^i (y cannot be the same as w, because y gets added to N^i). But then f(y) is a lone leaf child to the left of w′, a contradiction.
   So now assume u is the same as v^i. Two cases can happen: either w is a removed descendant of v^i or not, and in either case we can argue using Rule (ii) that w should be in N^i, a contradiction. Thus, we have shown that if w is a leaf node, then w′ also does not get merged with any other node.
   Now suppose w′ has a lone leaf child x′. If x′ ∉ P^i, then there is a leaf node x ∉ N^i such that f(x) = x′. But then x is also a lone leaf child of w (Lemma A.10). So w will also merge with one of its leaf children, a contradiction.
    Finally, suppose w′ is a degree-2 node, and either the parent or the child of w′ is of degree 2. First observe that w must also be of degree 2; otherwise it has at least two children. Since f(w) = w′ and w′ has only one child, both these children must be in M^i; but then, due to the connectedness of M^i, w is also in M^i, a contradiction.
    So w also has only one child; call it x. Let the child of w′ be x′. Suppose x′ has only one child. x must have at least two children; otherwise w would also be merged with a node. Now all but at most one child of x will be in M^i. If x has more than two children, then the fact that M^i is connected implies that x is in M^i as well. But then w would be added to N^i. So x has exactly two children: one of these is in M^i, the other is not in M^i. Now x ∈ N^i; otherwise x′ would also have at least two children. So x is the new center node of N^i. But then w would be added to the set N^i, a contradiction.
    Now let the parent of w be y and that of w′ be y′. Suppose y′ has only one child. Then Lemma A.10 implies that y also has only one child. But then w would be merged with one of the nodes, a contradiction. Thus, w′ is not merged with any node either.
   Thus, we have demonstrated the invariant for T^{i+1} and (T1 + T2)^{i+1}.
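   Taken together, Lemmas A.10 through A.15 carry out a case analysis on how a node w ∈ T^i − N^i is parsed, and show that w′ = f(w) is parsed by the same case. The sketch below merely records that case analysis for reference; the TreeNode type, the helper predicates, and the ordering of the checks are illustrative assumptions made here, not the parsing procedure of Section 4.4.

# Illustrative only: the case analysis of Lemmas A.10-A.15, recorded as code.
# TreeNode and the predicates below are hypothetical helpers for this sketch.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TreeNode:
    label: str
    parent: Optional["TreeNode"] = None
    children: List["TreeNode"] = field(default_factory=list)

def is_leaf(n: TreeNode) -> bool:
    return not n.children

def is_lone_leaf_child(n: TreeNode) -> bool:
    # A leaf whose immediate left/right siblings (if any) are not leaves.
    if not is_leaf(n) or n.parent is None:
        return False
    sibs = n.parent.children
    i = next(j for j, s in enumerate(sibs) if s is n)
    left = sibs[i - 1] if i > 0 else None
    right = sibs[i + 1] if i + 1 < len(sibs) else None
    return not any(s is not None and is_leaf(s) for s in (left, right))

def is_degree_two(n: TreeNode) -> bool:
    # One parent edge plus exactly one child edge.
    return n.parent is not None and len(n.children) == 1

def parse_case(w: TreeNode) -> str:
    """Name the case of Lemmas A.10-A.15 that applies to w; the lemmas show that
    w' = f(w) falls into the same case, hence q(T) = q'(T1 + T2)."""
    if is_leaf(w):
        if is_lone_leaf_child(w):
            return ("lone leaf child: merged with its parent when it is the "
                    "leftmost such child (Lemmas A.10, A.11, A.15)")
        return "leaf with adjacent leaf siblings: grouped with them (Lemma A.12)"
    if any(is_lone_leaf_child(c) for c in w.children):
        return "merged with its leftmost lone leaf child (Lemma A.13)"
    if is_degree_two(w) and (is_degree_two(w.parent) or is_degree_two(w.children[0])):
        return "degree-2 chain: contracted with adjacent degree-2 nodes (Lemma A.14)"
    return "not merged in this phase (Lemma A.15)"

For instance, under these assumed predicates, a leaf whose immediate right sibling is also a leaf falls into the Lemma A.12 case, matching the sibling grouping argued in the proof of that lemma.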

ACKNOWLEDGMENTS

Most of this work was done while the second author was with Bell Labs. The
authors thank the anonymous referees for insightful comments on the article,
and Graham Cormode for helpful discussions related to this work.
