                Enhancing Semistructured Data Mediators with Document Type
                                    Definitions
                                    Yannis Papakonstantinou, Pavel Velikhov
                                      Computer Science and Engineering
                                      University of California, San Diego
                                            La Jolla, CA 92093-0114

   This work was supported by the NSF-IRI 9712239 grant and equipment
donations from Intel Corp.

Abstract

   Mediation is an important application of XML. The MIX mediator uses
Document Type Definitions (DTDs) to assist the user in query formulation and
query processors in running queries more efficiently. We provide an algorithm
for inferring the view DTD from the view definition and the source DTDs. We
develop a metric of the quality of the inference algorithm's view DTD by
formalizing the notions of soundness and tightness. Intuitively, tightness is
similar to precision, i.e., it deteriorates when "many" objects described by
the view DTD can never appear as content of the view. In addition we show
that DTDs have some inherent deficiencies that prevent the development of
tight DTDs. We propose "DTDs with specialization" as a way to resolve this
problem.

1. Introduction

   XML is becoming the standard for information exchange. Information
mediation is expected to be one of XML's most important applications.
   The MIX mediator project views XML as a database model (as opposed to a
document model) and uses the mediator concept, as known in the DB area
[Wie92, LRO96, PAGM96], to facilitate the implementation of the above
applications. The MIX mediator provides to the user or to the application an
XML view of the XML data exported by one or more applications or
repositories. The mediator administrator customizes the view to the user
needs, i.e., the view selects, consolidates, and ranks information according
to the user's preferences. The views are customized using the mediator's
query and view definition language, called XMAS (XML Matching And
Structuring).
   Because of the great similarity of XML with semistructured data [PGMW95,
BDHS96a, QRS+95] we started with an architecture that is reminiscent of
TSIMMIS [PAGM96], a mediator for semistructured data. However, unlike OEM
(footnote 1), which is the semistructured data model used by TSIMMIS, and
other semistructured data models, XML data are typically accompanied by a
Document Type Definition (DTD) which describes the content and the structure
of the objects (a.k.a. elements in XML terminology) participating in a
document. In this paper we focus on valid XML documents, i.e., documents that
always have a DTD.

   Footnote 1: OEM stands for Object Exchange Model.

   DTDs are considered to be a kind of schema of a document. However they are
more versatile with respect to how much structure they impose on the
document. At the very structured extreme of the "structuredness" spectrum
they may impose structure comparable to the rigid structure of relational
data. At the other extreme they may allow any object type to contain any
other object type. And in the middle of the spectrum they impose structures
that are less restrictive and permit more variation in the data than
conventional schemas do.
   We briefly discuss the benefits realized by the use of DTDs in an
on-demand mediator. The main technical contribution of this paper is the
development of an algorithm required in order to compute the view DTDs and
hence realize many of the DTD benefits. The algorithm works for a limited
class of XMAS queries (views). Finally we introduce a framework for measuring
the quality of view DTDs. We believe that this framework will be used in the
future by works that will use more complex view definition and query
languages.
   To illustrate the gains obtained by the DTD use we
walk through the operation of the TSIMMIS mediator first (recall, TSIMMIS
does not use DTDs) and the MIX mediator, which does use DTDs, next.

The TSIMMIS mediator and the Disadvantages of Living Without Some Structure.
Wrappers conceptually export the source data translated into the
semistructured model OEM. The mediator exports an integrated view of the
wrapper data, based on a view definition, provided by the mediator
administrator. The view definition is expressed in the Mediator Specification
Language (MSL). During runtime the mediator receives queries, which refer to
the view objects and are expressed in MSL. It first combines the incoming
query and the view into a query which refers directly to the source data and
not to the views anymore. Then the optimizer finds a plan for executing the
latter query by sending queries (also expressed in MSL) to the wrappers and
combining their results in the mediator. The wrappers translate the queries
they receive into queries understood by the sources (footnote 2).

   Footnote 2: Indeed, decomposing and translating queries is further
   complicated because the sources, and consequently the wrappers, have
   limited query processing capabilities. However, this issue is orthogonal
   to the topic of this paper and will not be discussed any further.

   What makes this process challenging and often inefficient is that MSL
specifications can be very "loose" on the amount of information they provide
about the structures they integrate. The ability to work with "loose"
specifications is a valuable feature when dealing with dynamic semistructured
sources. As a contrived example, MSL allows the mediator administrator to
create a view that unions the structures exported by 100 sites, without
having any information about the contents and the structure of the data
exported by these sites.
   There are two weak points in the above scenario. First, the user does not
know the structure of the underlying data and this impedes his efforts to
formulate reasonable queries. This is a serious problem in environments with
dynamic and unknown information. The second problem is that the mediator may
not have complete (or even any) knowledge of the metadata and structure of
each source. This results in a heavy loss of performance. DTDs provide a
solution to the above problems as discussed next.

The MIX mediator and the Advantages of Living with DTD-provided Structure.
The MIX mediator employs DTDs to assist the user in information discovery and
query formulation, and to allow the query processor to derive more efficient
plans. In particular, given the source DTDs and the view, the View DTD
Inference module derives the view DTD. There may be more than one view DTD;
we explain below which one is the "best". The view DTD is passed to the
DTD-based query interface which displays the structure of the view elements
and also provides fill-in windows and menus that allow the user to place
conditions on the elements [BGL+].

   [Figure 1. The MIX Mediator. Components shown: DTD-Based Query Interface,
   View DTD, DTD Inference, Query Simplifier, Query Processor, and three XML
   sources/wrappers, each with its own DTD.]

   Once a query is formulated, with or without using the DTD-Based Query
Interface, it is passed to the query processor. Then the query simplifier may
employ the source DTDs to create a more efficient plan. Finally, note that
mediators can be stacked on top of mediators [Wie92]. In this case it is
important that the lower level mediators can derive and provide their view
DTDs to the higher level ones.

Contributions

1. We develop a view DTD inference algorithm (see Section 4) for a limited
   class of XMAS queries (views). Note that it is easy to compute a loose DTD
   for a view but the query interface and the query processor need the ones
   that describe the view as precisely as possible (footnote 3). These "most
   precise" DTDs are captured by our formal criterion which is outlined next.

   Footnote 3: Furthermore, the view DTD can have other applications as well,
   besides the ones we develop in the XML mediator. For example, it may be
   used by a toolkit for generating XSL style sheets for presentation of the
   view.

2. We introduce and formalize tightness as the criterion for judging the
   precision of a view DTD (see Section 3.1). In particular, we say that a
   DTD d1 is tighter than a DTD d2 if every document described by d1 is also
   described by d2. Given a view and the source DTDs the view inference
   algorithm attempts to derive the tightest DTD that contains all the
   documents that may appear as content of the view. We believe that the
   tightness
   criterion can be a benchmark for other, more powerful, view definition
   languages and view inference algorithms.
3. We provide simple examples where, unfortunately, even the tightest DTD
   describes structures that can never appear as the view's content, i.e.,
   even the tightest DTD is not "tight enough". The view DTD inference
   algorithm derives an extended form of DTDs that typically does not have
   non-tightness problems.

   We start with a mathematical abstraction of the XML model and the XMAS
query language. Section 3 discusses the properties of view DTDs. Section 4
describes the view inference algorithms. We conclude with related work and
future directions.

2. Model and Query Language Framework

   We present next a mathematical abstraction of XML and DTDs. Without loss
of generality on our results on DTDs we focus on XML documents that meet the
following requirements:
  1. Always have DTDs, i.e., we focus on valid documents.
  2. Do not have attributes other than the ID attribute. Consequently, we do
     not include attribute type declarations in the DTDs (footnote 4).
     Furthermore, we make the simplifying assumption that all elements will
     have an ID.
  3. Do not have empty elements. Note that we still allow elements with empty
     content which, confusingly enough, are not the same as empty elements
     [BPSM].
  4. Do not have mixed content elements, i.e., we do not capture elements
     whose content mixes strings with elements.
  5. Neglect physical aspects of XML, i.e., entities.

   Footnote 4: Notice that the IDREF attributes are also excluded from our
   study. However, this exclusion does not significantly limit our DTD
   related results since the DTD does not type the target of an IDREF
   attribute.

   Given these assumptions XML is formalized as follows.

Definition 2.1 (Element) An element $e$ is a triplet consisting of a name,
denoted as $name(e)$, a unique ID attribute, and content, denoted as
$content(e)$. The content is either a sequence of elements or a PCDATA value,
i.e., a character string.

Definition 2.2 (DTD) A DTD is a set $\{\langle n : type(n) \rangle\}_{n \in N}$,
where $N$ is the set of names and $type(n)$ is either a regular expression
over $N$ or PCDATA.

   We denote by $L(r)$ the regular language described by $r$.

Definition 2.3 We say that an element $e$ satisfies a DTD $D$, denoted as
$e \models D$, if the following hold:
  1. $name(e) \in N$, where $N$ is the set of element names.
  2. If $content(e) = e_1 \ldots e_m$ then
     $name(e_1) \ldots name(e_m) \in L(type(name(e)))$ and $e_i \models D$,
     $1 \le i \le m$.
  3. Else, if $content(e)$ is a string then $type(name(e)) =$ PCDATA.

Definition 2.4 (Valid XML Document) A valid XML document consists of a DTD
$D$, a document type $dt$, and a (most probably nested) element $d$ such that
$dt = name(d)$, i.e., $dt$ is the name of the root element of the document,
and $d \models D$.

Remark 1 We have omitted listing "ANY" [BPSM] as another kind of type.
However, ANY is merely a macro for the regular expression
$(n_1 | \ldots | n_k)^*$ where $N = \{n_1, \ldots, n_k\}$.

Regular Expression Notation. Staying in sync with the XML specification we
use the following notations in regular expressions:
  - $r_1, r_2$ stands for the concatenation of $r_1$ and $r_2$.
  - $r_1 | r_2$ stands for the union (occasionally mentioned as disjunction)
    of $r_1$ and $r_2$.
  - $r^*$ stands for the Kleene closure of $r$.
  - $r^+$ stands for $r, r^*$.
  - $r?$ stands for $r | \epsilon$.

2.1. Query Language

   The part of the query (view definition) language XMAS we use in this
paper is a subset of the recently proposed XML-QL [DFF+]. All semistructured
query languages have the functionality described by this subset. The same
language is used for defining queries and views. The only difference between
a query and a view is that a mediated view is assigned a URL through which it
will be accessed by queries.
   Our view inference algorithm works with pick-element XMAS queries, i.e.,
queries whose SELECT clause has a single variable, called pick-variable, that
binds to elements and whose WHERE clause consists of a single condition that
is applied to exactly one source. The only form of negation we allow is the
ability to say that the id's of two elements are different. The elements that
bind to the pick-variable are grouped into the view document whose name
precedes the SELECT clause. The order in which they appear is the same as the
order in which they appear in the source document when we traverse the
elements of the document in a depth-first left-to-right order. We illustrate
the semantics of the query language with the following example.
(Q1) withJournals =
       SELECT P
       WHERE <department> <name>CS</name>
               P: <professor | gradStudent>
                    <publication id=Pub1> <journal></> </>
                    <publication id=Pub2> <journal></> </>
                  </>
             </> AND Pub1 != Pub2

   The variable P binds to all professor or gradStudent elements such that
  1. they are contained in a department element,
  2. the department contains an element name whose content is the string
     "CS", and
  3. P contains two different publication elements that contain journal
     subelements.

   Note that we use the notation V: <element> ... </> instead of XML-QL's
equivalent <element> ... </> ELEMENT_AS V.
   We could as well have variables in the content or element name position.
For example, instead of professor|gradStudent we could have a variable N that
can bind to any name. Our view inference algorithm works for pick-element
queries where in the element name position we may have a constant, or a
disjunction of constants, or a variable that does not appear in other places
in the condition. For simplicity we replace each element name variable with a
disjunction of all names in the source DTDs at a preprocessing stage.

3. View DTD Inference

   In this section we provide algorithms for inferring the DTD of a view from
the source DTDs and the view definition. Before we proceed to describing the
view inference algorithm of our system we provide two formal criteria against
which view DTD inference algorithms should be evaluated. Furthermore,
Section 3.2 shows in a formal way that inherent DTD weaknesses decrease the
"precision" of view DTDs. Our specialized DTDs (Section 3.3) do not suffer
from such non-tightness problems.

3.1. Soundness and Tightness

   We believe that the view DTD must satisfy two properties. The first one,
called soundness, guarantees that every view document will be described by
the view DTD.

Definition 3.1 A view DTD $D_V$ is sound if, given source DTDs
$D_1, D_2, \ldots, D_n$ and a view definition $V$, for every tuple
$(d_1, \ldots, d_n)$ of $n$ documents such that
$d_1 \models D_1, d_2 \models D_2, \ldots, d_n \models D_n$ the view document
$V(d_1, \ldots, d_n)$ satisfies $D_V$.

   From now on we will refer to sound view DTDs simply as view DTDs.
   The second property, called tightness, is motivated by the fact that view
DTDs may describe document structures that cannot appear in a view (see
Example 3.1). We suggest that the view DTD inference algorithm selects the
tightest view DTDs, which, intuitively, are the ones that describe the
"fewest" documents that cannot appear in a view. This intuition is formalized
by the following definitions.

Definition 3.2 A DTD $D$ is tighter than a DTD $D'$ if every document
satisfying $D$ satisfies $D'$.

Definition 3.3 A type $\langle n : r \rangle$ is tighter than a type
$\langle n : r' \rangle$ if $L(r) \subseteq L(r')$, i.e., every sequence of
elements described by $r$ is also described by $r'$.

Definition 3.4 A DTD $D_V$ is a tightest view DTD for given source DTDs
$D_1, D_2, \ldots, D_n$ and a view definition $V$ if there is no view DTD
$D'_V$ such that $D'_V$ is tighter than $D_V$.

   For the class of pick-element queries the view inference algorithm can
"tighten" the view DTD in three ways. First, it includes in the view DTD only
the types for the names that may appear in the view documents. Second, it
tightens the types of the names as illustrated in Examples 3.1 and 3.2.
Finally, the order and cardinality of the output elements is discovered as
illustrated in Example 3.2.

EXAMPLE 3.1 Consider the following subset of the department DTD and the query
that retrieves professors or graduate students with at least two journal
publications.

(D1)
   {<department : name, professor*, gradStudent*, course>
    <professor : firstName, lastName, publication*, teaches>
    <gradStudent : firstName, lastName, publication*>
    <publication : title, author, (journal | conference)>}

(Q2) withJournals =
       SELECT P
       WHERE <department> <name>CS</name>
               P: <professor | gradStudent>
                    <publication id=Pub1> <journal></> </>
                    <publication id=Pub2> <journal></> </>
                  </>
             </> AND Pub1 != Pub2

   A naive view inference algorithm may derive a view DTD by the following
steps: First it adds the type definition
      <withJournals : (professor | gradStudent)*>
in the DTD because P binds to elements named professor or gradStudent. Then
it declares withJournals to be the document type, and eliminates all type
definitions that correspond to names that are not referenced, directly or
indirectly, by withJournals.
   It is easy to see that such a DTD is not as tight as the following DTD
(D2), which is actually the tightest DTD for the query (Q2) and the source
DTD (D1). Notice that the professor and gradStudent types of DTD (D2) have
been refined to reflect the constraint that the corresponding elements have
at least two publications. Then the withJournals type of DTD (D2) shows that
professors appear before gradStudents.

(D2)
   {<withJournals : professor*, gradStudent*>
    <professor : firstName, lastName, publication, publication+, teaches>
    <gradStudent : firstName, lastName, publication, publication+>
    <publication : title, author, (journal | conference)>}

   The above example illustrated how a type can be refined by removing a `*'
and forcing more than one instance of a name. Another very common case of
refinement is disjunction removal, as illustrated by the following example.

EXAMPLE 3.2 Consider the query (Q3) that operates on the source defined by
DTD (D1) and collects all journal publications. It is clear that the
disjunction (journal | conference) can be removed from the type definition of
publication.

(Q3) publist =
       SELECT P
       WHERE <department> <name>CS</name>
               <professor | gradStudent>
                 P: <publication> <journal></> </>
               </>
             </>

The view DTD is then:

(D3)
   {<publist : publication*>
    <publication : title, author, journal>}

   Notice that we could not remove the disjunction (journal | conference)
from the DTD (D2) of Example 3.1 because the query retrieves many
publications and, except for two of them, the other ones may be journals or
proceedings. Hence we have to leave the definition of publication in the view
DTD (D2) as is and lose the information that at least two publications of
each professor/student in the view are journal publications. Such a loss of
structural information is intrinsic in DTDs and is discussed next.

3.2. Structural Tightness

   In many practical cases even the tightest view DTDs describe view document
structures that cannot be produced by the view. For example, the DTD (D2)
loses the information that at least two publications of each
professor/student are in a journal. Consequently DTD (D2) describes documents
with students having conference publications only, though it is clear from
the view definition that a student with conference proceedings only cannot
appear in the view.
   We formalize this information loss phenomenon by introducing the
structural tightness property of view DTDs. We present the sources of
structural non-tightness for the case of pick-element queries and provide the
means to detect non-tightness of inferred DTDs.

Formalization of Structural Tightness. Intuitively a view DTD is non-tight if
it describes document "structures" that cannot be produced by the view
(footnote 5). First we formalize the notion of structural class. Intuitively,
the structural class of a document excludes the string values of the document
and thus abstracts its element name structure.

   Footnote 5: Requiring a tight view DTD to describe view documents
   exclusively is a property that cannot be achieved in any non-trivial case.

Definition 3.5 A structural class of documents is a set of documents such
that for every two documents $d_1$ and $d_2$ in the class there is a mapping
that maps
  1. every string of $d_1$ into a string of $d_2$ and vice versa,
  2. every id of $d_1$ into an id of $d_2$ and vice versa, and
  3. if the mappings are applied to $d_1$, $d_1$ becomes identical to $d_2$
     and vice versa.

Definition 3.6 A structural class of documents satisfies a DTD $D$ if the
documents of the class satisfy $D$.

Notice that if one document of the class satisfies $D$ then all documents of
the class satisfy $D$. So in the above definition we could replace "the
documents" with "a document".

Definition 3.7 Given a set of source DTDs $D_1, \ldots, D_n$ and a view $V$,
a DTD $D_V$ is structurally tight if
  1. it is the tightest DTD of the view given the source DTDs,
  2. for every structural class $S$ that satisfies $D_V$ there is a view
     document $I$ that satisfies $S$ and there are also source documents
     $I_1, \ldots, I_n$, satisfying $D_1, \ldots, D_n$, such that
     $I = V(I_1, \ldots, I_n)$.
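
   The loss of structural information in (D2) can be made concrete with a
small script. The following is a minimal Python sketch, not part of the MIX
system. It checks satisfaction in the sense of Definition 2.3 by translating
each type into a Python regular expression over element names. The tuple
encoding of elements, the helper names (to_regex, satisfies, tighter_up_to)
and the simplified version of (D2) are assumptions made for illustration:
firstName, author and teaches are dropped, leaf elements are assumed to be
PCDATA, and IDs are ignored. The function tighter_up_to only tests
Definition 3.3 on name sequences up to a fixed length, so it is a bounded
approximation rather than a decision procedure.

```python
import re
from itertools import product


def to_regex(type_expr):
    """Translate a type such as 'title, (journal | conference)' into a Python
    regex over words in which every element name is followed by one space."""
    expr = type_expr.replace(" ", "")
    # Wrap each name as a group matching "name "; | * + ? ( ) are kept as is.
    expr = re.sub(r"[A-Za-z_]\w*", lambda m: "(?:" + m.group(0) + " )", expr)
    # ',' is the sequence operator, i.e. plain juxtaposition in a Python regex.
    return expr.replace(",", "")


def word(names):
    return "".join(n + " " for n in names)


def satisfies(element, dtd):
    """Definition 2.3 without IDs. An element is (name, content); content is
    a str (PCDATA) or a list of child elements (hypothetical encoding)."""
    name, content = element
    if name not in dtd:
        return False
    if isinstance(content, str):
        return dtd[name] == "PCDATA"
    if dtd[name] == "PCDATA":
        return False
    names_ok = re.fullmatch(to_regex(dtd[name]), word(c[0] for c in content))
    return names_ok is not None and all(satisfies(c, dtd) for c in content)


def tighter_up_to(r1, r2, names, max_len):
    """Bounded test of Definition 3.3: every name sequence of length at most
    max_len that is in L(r1) is also in L(r2)."""
    p1, p2 = re.compile(to_regex(r1)), re.compile(to_regex(r2))
    for n in range(max_len + 1):
        for seq in product(names, repeat=n):
            w = word(seq)
            if p1.fullmatch(w) and not p2.fullmatch(w):
                return False
    return True


# Simplified, assumed version of the view DTD (D2).
D2 = {
    "withJournals": "professor*, gradStudent*",
    "professor": "lastName, publication, publication+",
    "gradStudent": "lastName, publication, publication+",
    "publication": "title, (journal | conference)",
    "lastName": "PCDATA", "title": "PCDATA",
    "journal": "PCDATA", "conference": "PCDATA",
}


def pub(kind):
    return ("publication", [("title", "t"), (kind, "v")])


def prof(*kinds):
    return ("professor", [("lastName", "x")] + [pub(k) for k in kinds])


two_journals = ("withJournals", [prof("journal", "journal")])
conference_only = ("withJournals", [prof("conference", "conference")])
one_publication = ("withJournals", [prof("journal")])

print(satisfies(two_journals, D2))      # True
print(satisfies(conference_only, D2))   # True, yet (Q2) cannot produce it
print(satisfies(one_publication, D2))   # False, excluded by the refinement
print(tighter_up_to("publication, publication+", "publication*",
                    ["publication"], 5))  # True
```

   The second document satisfies the simplified (D2) although the view (Q2)
could never produce it, since its professor has no journal publication. The
third document is rejected because the refined type requires at least two
publications.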
   Using Definition 3.7 we characterize DTD (D2) as non-tight because there
is a structural class, say the class $S$ of withJournals documents that have
one professor having no journal publications, that does not meet the second
condition of the definition. In particular, $S$ satisfies (D2), yet there is
no possible valid source document $I_1$ such that the view (Q2) when applied
to $I_1$ will result in a document that belongs to the structure $S$.
   On the other hand DTD (D3) is tight according to Definition 3.7
(footnote 6).

3.3. Specialized DTDs

   Non-tightness reduces the "precision" of DTDs and also causes internal
problems to our algorithms. To alleviate the non-tightness problems we
developed the concept of specialized DTDs. Their important property is that
there is a structurally tight specialized DTD for most views and source DTDs.
Indeed, we conjecture that all pick-element views without recursion have a
structurally tight specialized view DTD. For views with recursive paths there
are cases where there is no tight specialized DTD, simply because there is
not even a tightest DTD (see [PV99]).

Definition 3.8 A specialized DTD (s-DTD) is a set
      $\{\langle n^i : type(n^i) \rangle\}_{n^i \in N^+}$
where $N^+ = \{n^i \mid n \in N, i = 0, \ldots, spec(n)\}$ and $spec(n)$ is a
non-negative integer defined for all $n \in N$. The type is a regular
expression over $N^+$ or it is PCDATA. The superscripts attached to the names
are called tags and the regular expression $type(n^i)$ is called a tagged
regular expression.

   We will need to convert the s-DTD to a regular DTD. For this purpose we
define the image:

Definition 3.9 The image of a sequence
$\langle n_1^{i_1} \ldots n_m^{i_m} \rangle$ of members of $N^+$ is the
sequence $\langle n_1 \ldots n_m \rangle$ of members of $N$, i.e., the image
is the sequence after projecting out the superscripts. Similarly the image of
a tagged regular expression $r$ is the regular expression $r'$ derived if we
replace each name $n^i$ of $r$ with $n$.

EXAMPLE 3.3 For instance the image of the tagged type
$\langle title, author^1, author^2 \rangle$ is just
$\langle title, author, author \rangle$.

   Finally we need the ability to check whether an XML object satisfies the
specialized DTD:

Definition 3.10 An element $e$ satisfies an s-DTD $D$ if the following hold:
  - $n \in N$, where $n = name(e)$ (footnote 7),
  - there is an $i$, $0 \le i \le spec(n)$, such that
      - if $content(e)$ is a string then $type(n^i)$ is PCDATA; or
      - if $content(e) = e_1 \ldots e_m$ then
        $name(e_1) \ldots name(e_m) \in image(type(n^i))$, and
        $e_i \models D$, $1 \le i \le m$.

   To avoid cluttering DTDs with the superscript notation from now on we
assume that $n$ is an acceptable shortcut for $n^0$.
   To illustrate the use of specialized DTDs we show how the DTD from
Example 3.1 can be turned into a tight specialized DTD.

EXAMPLE 3.4 Recall that the problem with the view DTD for Query (Q2) was that
every professor or gradStudent retrieved was required to have two journal
publications, but DTDs cannot represent such constraints. With specialized
DTDs we create a new type publication^1 that defines journal papers only.
Then we require each professor/gradStudent to have exactly two publication^1
objects and optionally other publications. The full specialized DTD is:

(D4)
   {<withJournals : professor*, gradStudent*>
    <professor : firstName, lastName, publication*,
                 publication^1, publication*,
                 publication^1, publication*, teaches>
    <gradStudent : firstName, lastName, publication*,
                   publication^1, publication*,
                   publication^1, publication*>
    <publication : title, author, (journal | conference)>
    <publication^1 : title, author, journal>}

4. Algorithms

   In this section we describe how a tight specialized DTD is computed for
pick-element queries without recursive path conditions. First we show how to
infer an s-DTD for the type of elements that bind to a pick-variable X in
queries of the form:

(Q5) SELECT X WHERE X:tree condition

   In Section 4.1 we describe how individual types are refined. In
Section 4.2 we outline the algorithm for computing the type of the elements
that bind to X as well as the types of the sub-elements. Finally in
Section 4.4 we complete the presentation by describing the computation of the
type of the view's top element. The detailed presentation of the algorithm
can be found in [PV99].

4.1. DTD Type Refinement

   The DTD tightening algorithm of Section 4.2 recursively "tightens" each
type of the initial DTD (the DTD of the source before the application of the
query) by means of the type refinement algorithm. We first pro-
   6 Proving tightness for speci c views and DTDs is beyond the       vide a type re nement de nition and examples. We
scope of this paper                                                   assume that no two conditions in the query have the
   7 This condition stayed the same with plain DTDs                   same name.
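
The image operation of Definition 3.9 amounts to dropping tags. The following is a minimal Python sketch of it, not code from the paper. It assumes, purely for illustration, that a tagged name is represented as a `(name, tag)` pair and that a tagged type is written as text in which tags are spelled `name^i`; the function names `image_of_sequence` and `image_of_type` are hypothetical.

```python
import re


def image_of_sequence(tagged_names):
    """Project the tags out of a sequence of (name, tag) pairs."""
    return [name for name, _tag in tagged_names]


# Assumed textual convention: a tag is written as "^" followed by digits.
_TAG = re.compile(r"\^\d+")


def image_of_type(tagged_type):
    """Replace every tagged name n^i in a textual type with plain n."""
    return _TAG.sub("", tagged_type)


# The tagged type of Example 3.3, in both representations.
print(image_of_sequence([("title", 0), ("author", 1), ("author", 2)]))
print(image_of_type("title, author^1, author^2"))
```

Both calls recover the untagged type of Example 3.3, once as a list of names and once as text.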

Definition 4.1 The type refinement $\mathit{refine}(r, n)$ of a regular expression $r$ given a name $n$ is the regular expression $r'$ that describes all strings of $L(r)$ that contain at least one instance of $n$.

The algorithm that computes $\mathit{refine}(r, n)$ uses the special operators $\odot$ and $\parallel$ that extend the regular expression operators `,` and `|`:

\[ r_1 \odot r_2 = \begin{cases} \mathit{fail}, & \text{if } r_1 = \mathit{fail} \text{ or } r_2 = \mathit{fail} \\ r_1, r_2 & \text{otherwise} \end{cases} \]

\[ r_1 \parallel r_2 = \begin{cases} \mathit{fail}, & \text{if } r_1 = \mathit{fail} \text{ and } r_2 = \mathit{fail} \\ r_1, & \text{if } r_1 \neq \mathit{fail} \text{ and } r_2 = \mathit{fail} \\ r_2, & \text{if } r_1 = \mathit{fail} \text{ and } r_2 \neq \mathit{fail} \\ r_1 \mid r_2, & \text{otherwise} \end{cases} \]

Type refinement algorithm for conditions involving different names:

   function $\mathit{refine}(r, n)$
     if $r = n$ then return $n$
     if $r = n'$ where $n'$ is a name and $n' \neq n$ then return $\mathit{fail}$
     if $r = r'?$ then return $\mathit{refine}(r', n) \parallel \mathit{fail}$
     if $r = g^*$ then return $g^* \odot \mathit{refine}(g, n) \odot g^*$
     if $r = r_1, r'$ then return $(\mathit{refine}(r_1, n) \odot r') \parallel (r_1 \odot \mathit{refine}(r', n))$
     if $r = r_1 \mid r'$ then return $\mathit{refine}(r_1, n) \parallel \mathit{refine}(r', n)$

Type Refinement When Conditions on Elements with the Same Name  When a tree condition requires the existence of two or more different elements with the same name the tightening algorithm has to work with specialized DTDs in order to derive the correct result. We extend below the type refinement definition to tagged regular expressions. Recall, the type definitions of specialized DTDs are based on tagged regular expressions.

Definition 4.2 The type refinement $\mathit{refine}(r, n^T)$ of a tagged regular expression $r$ given a tagged name $n^T$ is the tagged regular expression $r'$ that describes all sequences $s$ where
  1. $s$ is of the form $s_1, n^T, s_2$ and
  2. the sequence $\mathit{image}(s_1), n, \mathit{image}(s_2)$ is a member of $L(r)$.

The algorithm for the refinement of tagged regular expressions differs from the algorithm of Section 4.1 in the base case (the first two lines):

   function $\mathit{refine}(r, n^T)$   ($T \neq 0$)
     if $r = n$   (recall $n$ is a shortcut for $n^0$)
        then return $n^T$
     if $r = n'^{T'}$ where $n'^{T'}$ is a tagged name and ($n' \neq n \lor T' \neq 0 \lor T' \neq T$)
        then return $\mathit{fail}$
     (the rest is the same as in the algorithm of Section 4.1)
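
The two refinement functions above (for plain and for tagged regular expressions) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the classes `Name`, `Seq`, `Alt`, `Star`, `Opt`, the helpers `seq_or_fail`, `alt_or_fail` and `show`, and the use of `None` to stand for fail are all names introduced here. The single function `refine` covers both cases by treating tag 0 as "untagged", following the shorthand that n stands for n^0.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Name:
    name: str
    tag: int = 0  # tag 0 means "untagged" (n is shorthand for n^0)


@dataclass(frozen=True)
class Seq:
    left: "Regex"
    right: "Regex"


@dataclass(frozen=True)
class Alt:
    left: "Regex"
    right: "Regex"


@dataclass(frozen=True)
class Star:
    body: "Regex"


@dataclass(frozen=True)
class Opt:
    body: "Regex"


Regex = Union[Name, Seq, Alt, Star, Opt]
FAIL = None  # assumed sentinel standing for the paper's "fail"


def seq_or_fail(r1, r2):
    """Extended sequence operator: fails when either operand fails."""
    return FAIL if r1 is FAIL or r2 is FAIL else Seq(r1, r2)


def alt_or_fail(r1, r2):
    """Extended disjunction operator: keeps only operands that did not fail."""
    if r1 is FAIL:
        return r2  # also covers the case where both operands failed
    if r2 is FAIL:
        return r1
    return Alt(r1, r2)


def refine(r, n, tag=0):
    """Refine r so that it requires one occurrence of n, tagged with `tag`."""
    if isinstance(r, Name):
        # Base case: only an untagged occurrence of n can be refined.
        return Name(n, tag) if (r.name == n and r.tag == 0) else FAIL
    if isinstance(r, Opt):
        return alt_or_fail(refine(r.body, n, tag), FAIL)
    if isinstance(r, Star):
        return seq_or_fail(seq_or_fail(r, refine(r.body, n, tag)), r)
    if isinstance(r, Seq):
        return alt_or_fail(
            seq_or_fail(refine(r.left, n, tag), r.right),
            seq_or_fail(r.left, refine(r.right, n, tag)),
        )
    if isinstance(r, Alt):
        return alt_or_fail(refine(r.left, n, tag), refine(r.right, n, tag))
    raise TypeError(f"not a regular expression: {r!r}")


def show(r):
    """Render an expression in the paper's comma / bar notation."""
    if r is FAIL:
        return "fail"
    if isinstance(r, Name):
        return r.name if r.tag == 0 else f"{r.name}^{r.tag}"
    if isinstance(r, Seq):
        return f"{show(r.left)}, {show(r.right)}"
    if isinstance(r, Alt):
        return f"({show(r.left)} | {show(r.right)})"
    inner = show(r.body)
    if isinstance(r.body, Seq):
        inner = f"({inner})"
    return inner + ("*" if isinstance(r, Star) else "?")


# A professor type of the form name, (journal | conference)*, by first letters.
d6 = Seq(Name("n"), Star(Alt(Name("j"), Name("c"))))
step1 = refine(d6, "j", tag=1)     # require one journal, tagged j^1
step2 = refine(step1, "j", tag=2)  # require a second, distinct journal j^2
print(show(refine(d6, "j")))
print(show(step1))
print(show(step2))
```

The first line printed is `n, (j | c)*, j, (j | c)*`. The result of the second tagged refinement keeps the leading `n` factored out of the disjunction, which describes the same sequences as a disjunction of two full alternatives.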
EXAMPLE 4.1 Consider the DTD (D6) and the query (Q4)

 (D6) {<professor : name, (journal | conference)*>}

 (Q4) answer =
        SELECT X
        WHERE X: professor   journal

The tightening algorithm invokes the refinement algorithm above to enforce that the type definition of professor will make the existence of a journal necessary. The following steps illustrate how the algorithm decomposes the refinement of a sequence, of a loop, or of a disjunction into a composition of the refinements of the constituents of the sequence, the loop or the disjunction. Let us call name, journal, and conference by their first letter.

\[ \begin{aligned}
& \mathit{refine}((n, (j|c)^*), j) \\
&= (\mathit{refine}(n, j) \odot (j|c)^*) \parallel (n \odot \mathit{refine}((j|c)^*, j)) \\
&= \mathit{fail} \parallel (n, \mathit{refine}((j|c)^*, j)) \\
&= n, ((j|c)^* \odot \mathit{refine}(j|c, j) \odot (j|c)^*) \\
&= n, ((j|c)^* \odot (\mathit{refine}(j, j) \parallel \mathit{refine}(c, j)) \odot (j|c)^*) \\
&= n, ((j|c)^* \odot (j \parallel \mathit{fail}) \odot (j|c)^*) \\
&= n, (j|c)^*, j, (j|c)^*
\end{aligned} \]

EXAMPLE 4.2 Consider again the DTD of Example 4.1 but now assume that the query requests the existence of two different journal publications.

 (Q5) answer =
        SELECT X
        WHERE X: professor journal id=J1
                           journal id=J2
              AND J1 != J2

The tightening algorithm will tag the two instances of journal as $journal^1$ and $journal^2$. For brevity let us again use the first letters of the names. First it refines the type $n, (j|c)^*$ (recall, this is a shorthand for $n^0, (j^0|c^0)^*$) with $j^1$ and the result is further refined with $j^2$.

\[ \begin{aligned}
& \mathit{refine}((n, (j|c)^*), j^1) \\
&= n, (j|c)^*, j^1, (j|c)^* \\
& \mathit{refine}((n, (j|c)^*, j^1, (j|c)^*), j^2) \\
&= n, (j|c)^*, j^2, (j|c)^*, j^1, (j|c)^* \;\mid\; n, (j|c)^*, j^1, (j|c)^*, j^2, (j|c)^*
\end{aligned} \]

4.2. Tightening Algorithm

We discuss now how to combine the individual type refinements discussed in Section 4.1 into an algorithm that computes the s-DTD for queries of the form (Q5).
The algorithm starts with an empty s-DTD and adds refined types to it by traversing the tree constraints and refining types of the original DTD. When two different tree constraints refine the same DTD type, we store the union of the content of the refinements. After the algorithm terminates we insert type definitions of types from the original DTD that occur in the content of the tightened s-DTD and were left unrefined. For simplicity, we assume that no two sibling conditions can bind to the same element. The detailed description can be found in [PV99].

Note that the tightening algorithm has a useful side effect. Given the tree condition $c$ and the source DTD $d$ it decides whether the condition is
   • valid, i.e., $c$ will be satisfied by every document that satisfies $d$.
   • satisfiable, i.e., $c$ will be satisfied by some documents that satisfy $d$.
   • unsatisfiable, i.e., there is no document satisfying both $c$ and $d$. In this case the view DTD describes an empty answer.

4.3. Converting s-DTDs to DTDs

Once we have obtained a tightened s-DTD we may need to convert it into a regular DTD. Regular DTDs do not support tagged types, so we need to do the following: we first obtain the images of all types of the s-DTD (see Definition 3.9) and then merge all images that have the same name. We also want to inform the user that a merging has occurred, since merging inadvertently introduces non-tightness.

The algorithm is given below:

Algorithm Merge
INPUT: an s-DTD $d$
OUTPUT: $d'$ - the DTD in which the specialized types of $d$ are merged
    $d' \leftarrow \{\}$
    for each type definition $\langle n^T : \mathit{type}(n^T) \rangle$ of $d$
        if $d'$ contains the type definition $\langle n : \mathit{type}(n) \rangle$
           replace $\langle n : \mathit{type}(n) \rangle$ with $\langle n : \mathit{type}(n) \mid \mathit{image}(\mathit{type}(n^T)) \rangle$
           signal the merge
        else
           insert in $d'$ $\langle n : \mathit{image}(\mathit{type}(n^T)) \rangle$

We illustrate next how the above algorithm can convert an s-DTD into a tightest DTD.

EXAMPLE 4.3 Consider the DTD (D4) from Example 3.4. Merging will collapse the publication and publication^1 definitions into a single definition and remove the tags from all type definitions. (8) At this point the view inference module will inform the user of non-tightness. The regular DTD after the merge is:

 (D7)
 {<withJournals : professor, gradStudent>
  <professor : firstName, lastName, publication*, publication, publication*, publication, publication*, teaches>
  <gradStudent : firstName, lastName, publication*, publication, publication*, publication, publication*>
  <publication : (title, author, (journal | conference)) | (title, author, journal)>}

The resulting DTD can be simplified to the DTD (D2) in Example 3.1.

   8 Following the tightening algorithm step by step we can see that three specializations of publication are introduced. The third one, named publication^2, has essentially the same type as publication^1.

4.4. Result List Type Inference

The tightening algorithm shows us how to compute the type of the elements that bind to the pick-variable of a pick-element query. Recall from Example 3.1 that finding the names of the elements that bind to the pick-variable and their types is not enough. In this section we complete the view inference by presenting the list-type inference algorithm that discovers the type of the top-level element of the view.

The result list type inference algorithm works incrementally on the path ending at the pick variable. It introduces variables at every point in the path preceding the pick-variable and computes the result list type of each one of them by using the type of the previous list type. In particular, assume a query with a tree condition of the following form:

 l_k = SELECT L_k
       WHERE
       L_0:<d_0> L_1:<d_{1,1}> ... L_k:<d_{k,1}> condition_{k,1} </>
                                       <d_{k,2}> condition_{k,2} </>
                                       ...
                                       <d_{k,i_k}> condition_{k,i_k} </>
                                   </>
                     ...
                     <d_{1,2}> condition_{1,2} </>
                     ...
                     <d_{1,i_1}> condition_{1,i_1} </>
                 </>

In the first step the algorithm computes the type of l_0 = SELECT L_0 WHERE ... by invoking the tightening algorithm.
  1. If the tightening algorithm declares that the condition is unsatisfiable with respect to the DTD then the type is $\langle l_0 : \epsilon \rangle$.
  2. If the tightening algorithm declares that the condition is valid with respect to the DTD then the

     type is $\langle l_0 : dt \rangle$, where $dt$ is the document type. Apparently $dt$ must be $d_0$ or one of the names appearing in the disjunction $d_0$.
  3. If the tightening algorithm declares that the condition is satisfiable with respect to the DTD, as is the case in the running example, then the type is $\langle l_0 : dt? \rangle$.

In each of the subsequent steps the algorithm computes the type of $l_{i+1}$, $i = 0, \ldots, k-1$, for the following query, assuming that the type of the document type $dt$ is the one-level extension (see below) of the type of $l_i$ according to the DTD. (9)

 l_{i+1} = SELECT L_{i+1}
           WHERE
           <dt> L_{i+1}:<d_{i+1,1}> ... L_k:<d_{k,1}> condition_{k,1} </>
                                            <d_{k,2}> condition_{k,2} </>
                                            ...
                                            <d_{k,i_k}> condition_{k,i_k} </>
                                        </>
                      ...
                      <d_{i+1,2}> condition_{i+1,2} </>
                      ...
                      <d_{i+1,i_{i+1}}> condition_{i+1,i_{i+1}} </>

Definition 4.3 The one-level extension $x(r)$ of a regular expression $r$ according to a DTD $d$ is the regular expression derived by replacing every name in $r$ with its type.

Then the specialized type of $l_{i+1}$ is computed by projecting on the type of $l_{i+1}$ the condition (or conditions) of the $i+1$ level.

Let us illustrate projection and list inference with the following example. The complete algorithm can be found in [PV99].

EXAMPLE 4.4 Consider the query (Q9) that operates on a source with the DTD (D8) and picks all titles and authors of student publications. We have introduced the variables D and G for the sake of explaining the algorithm.

 (D8)
 {<department : name, professor+, gradStudent+, course>
  <professor : firstName, lastName, publication+, teaches>
  <gradStudent : firstName, lastName, publication*>
  <publication : title, author, (journal | conference)>}

 (Q9) papers = SELECT P
               WHERE D: department
                        G: gradStudent
                           X: publication
                              P: title | author

The algorithm first constructs a query that picks D into a result $l_0$ and computes the type of $l_0$. To do so it calls the specialization algorithm, which declares the condition satisfiable (10), and consequently the type of $l_0$ becomes department?.

In the next step the list inference algorithm works with the dummy query (Q10) and the hypothetical type $\langle dt : x(department?) \rangle$ or equivalently $\langle dt : (name, professor+, gradStudent+, course)? \rangle$.

 (Q10) l_1 = SELECT G
             WHERE dt
                     G: gradStudent
                        X: publication
                           title | author

Projecting the gradStudent condition on the type of $dt$ we get (note we keep only the first letters of the names)

\[ \begin{aligned}
& \mathit{project}((n, p+, g+, c)?, g) \\
&= (\mathit{project}(n, g), \mathit{project}(p, g)+, \mathit{project}(g, g)+, \mathit{project}(c, g))? = g
\end{aligned} \]

Then, by considering the query

 (Q11) l_2 = SELECT P
             WHERE dt
                     X: publication
                        P: title | author

where the type of $dt$ is $x(g) = f, l, p^*$. Doing the projection of publication on this type we get $\langle l_2 : p \rangle$. Finally, we project the disjunction "title or author" on $x(p) = t, a, (j|c)$ and this gives us the correct result.

\[ \begin{aligned}
& \mathit{project}((t, a, (j|c)), (t|a)) \\
&= \mathit{project}(t, t|a), \mathit{project}(a, t|a), (\mathit{project}(j, t|a) \mid \mathit{project}(c, t|a)) = t, a
\end{aligned} \]

   9 Note that the one-level extension step of the algorithm makes it inappropriate for queries with recursive path expressions.
   10 It cannot be valid because publications are optional for graduate students.

5 Related Work

Our work with DTDs is closely related to problems in semistructured databases. In this section we describe the related work.

[GW97] introduces dataguides as OEM objects and studies problems of inference of dataguides from data and their use in query formulation and optimization. The dataguides differ from DTDs in two important aspects: they do not capture constraints on order and cardinality and they do not capture constraints on the

siblings. In this respect they are less powerful than the DTDs. However, dataguides do not require the same type name to define the same type, so in this respect dataguides are similar to s-DTDs.

[BDFS97] defines graph schemas and studies their properties. The graph schemas are similar to dataguides but can include unary formulas on their edges. They discover that graph schemas are closed under application of UnQl queries [BDHS96b]. As in the case of dataguides, graph schemas cannot capture order, cardinality and constraints on the siblings.

[FS98] studies the problem of optimizing path expressions with the aid of graph schemas. They introduce a query language that includes a limited form of tree conditions and paths. For this language they present algorithms for exact optimization of path queries. They define an optimal query as a query that returns a minimal answer, and they present algorithms for rewriting path queries into equivalent queries using state extents. They also present a polynomial approximation to the rewriting algorithm. Some of their results are applicable to our query language and DTDs.

[NUWC97] studies the inference of dataguides from data and approximations to dataguides. They introduce a concept of a representative object that allows one to compute a continuation of an object by a path expression. They then discuss various implementations of representative objects and their approximations and mention the utility of ROs in query optimization. In comparison to DTDs, ROs have the same shortcomings as graph schemas.

References

[BDFS97]  P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstructured data. In Proc. of the International Conference on Database Theory, 1997.
[BDHS96a] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A query language and optimization techniques for unstructured data. In Proc. ACM SIGMOD, 1996.
[BDHS96b] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A query language and optimization techniques for unstructured data. Technical Report 96-06, University of Pennsylvania,
[BGL+]    C. Baru, A. Gupta, B. Ludascher, G. Marciano, Y. Papakonstantinou, P. Velikhov, and A. Yannakopulos. XML-based information mediation with VAMP. Available at http://feast.ucsd.edu/publications/vamp.ps.
[BPSM]    T. Bray, J. Paoli, and C. Sperberg-McQueen. Extensible Markup Language (XML) 1.0. W3C Recommendation. Latest version available at http://www.w3.org/TR/REC-xml.
[DFF+]    A. Deutch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A query language for XML. Submission to W3C. Latest version available at http://www.w3.org/TR/NOTE-xml-ql.
[FS98]    M. Fernandez and D. Suciu. Optimizing regular path expressions using graph schemas. In Proc. of the International Conference on Data Engineering, 1998.
[GW97]    R. Goldman and J. Widom. Dataguides: Enabling query formulation and optimization in semistructured databases. In Proc. VLDB, pages 436-45, 1997.
[LRO96]   A. Levy, A. Rajaraman, and J. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. VLDB, pages 251-262, 1996.
[NUWC97]  S. Nestorov, J. Ullman, J. Wiener, and S. Chawathe. Representative objects: Concise representations of semistructured, hierarchical data. In Proceedings of 13th International Conference on Data Engineering, 1997.
[PAGM96]  Y. Papakonstantinou, S. Abiteboul, and H. Garcia-Molina. Object fusion in mediator systems. In Proc. VLDB Conf., 1996.
[PGMW95]  Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. In Proc. ICDE Conf., pages 251-60, 1995.
[PV99]    Y. Papakonstantinou and P. Velikhov. Enhancing semistructured data mediators with document type definitions (extended version), 1999. Available at http://feast.ucsd.edu/publications/medwithDTDs.ps.
[QRS+95]  D. Quass, A. Rajaraman, S. Sagiv, J. Ullman, and J. Widom. Querying semistructured heterogeneous information. In Proc. DOOD, pages 319-44, 1995.
[Wie92]   G. Wiederhold. Mediators in the architecture of future information systems. IEEE Computer, 25:38-49, 1992.

