					                                               Property of Amikelive.com – Technical Paper Series

        XML Query Processing and Query Languages: A Survey
                              Mikael Fernandus Simalango
                   Graduate School of Information and Communication
            Department of Computer Engineering, Ajou University, South Korea

Abstract-Today's databases must interoperate across different domains and applications, which makes data portability important. The XML format fits this requirement and is increasingly used to serve applications across different domains and purposes. However, querying XML documents effectively and efficiently is still a challenging issue. This paper discusses query processing issues on XML and reviews solutions proposed by various authors for querying XML databases.

Keywords: XML data, query optimization, query language

   I.     INTRODUCTION

XML[11] serves a dual function as a markup language and a data format. It separates presentation from data, thus offering independence and flexibility in how content is associated. Because of this flexibility, two very different systems can exchange data using XML as the data format. The tree-like structure of XML is intuitive, human readable, and easy to understand. With the help of an XML schema or DTD, the type and attributes of each tag usable in a given XML document can be well defined.

An XML query language defines more comprehensible and structured constructs for performing operations on one or more XML documents. To process a query, an XML query engine or processor translates the syntax and executes the operations expressed by the query. Output is returned after processing, and the processing time is expected to be minimal, which implies efficient processing.

However, because XML is a non-binary format, querying XML data that belongs to arbitrary applications is still an open issue. In the past, most effort went into designing query processors that support declarations in query languages. These days, the focus has shifted to relational XML storage and integration with data management systems.

The rest of this paper is arranged as follows. We first review the evolution of XML query languages. Then, we present different approaches to XML query processing by extracting the ideas and comparing the proposals. Finally, we discuss possible directions for future XML databases and sum up our conclusions.

    II.      XML QUERY LANGUAGES

Since 2007, XQuery[9], which is an extension of XPath[10], has been recommended by the W3C as the query language for XML documents. However, before the W3C standard was established, several research efforts had proposed query languages for XML.

D. Maier elaborated the desired characteristics of an XML query language[12]. His criteria were widely used as a reference in the development of several XML query languages. Important criteria in his proposal include XML output of the query, independence from schema, schema exploitation where possible, and optimized query operations. The operations defined in the proposal are selection, which is choosing a document or document element, extraction, which is pulling out elements of a

  Query language   Language type   Input model   Class of query                   Public recognition (year)
  XML-QL           functional      XML           Pattern matching                 1998
  Lorel            declarative     OEM           Path expressions within OQL      1997
  Quilt            functional      XML           Quilt expressions                2000
  XQL              functional      XML           XQL based on path expressions    1999
  XQuery           functional      XML           XQuery                           2007
                            Table 1 XML query languages in comparison

document, reduction, which is removing sub-elements, restructuring, which is constructing a new set of element instances, and combination, which is a merge operation carried out over two or more elements that results in a single element.

XML-QL[14] is an XML query language that supports querying, constructing, transforming, and integrating XML data. The language treats XML as semistructured data with irregular or rapidly evolving structure. XML-QL uses element patterns to match data in an XML document. An extension of XML-QL named Elixir[15] was proposed to support ranked queries based on textual similarity.
Pros: schema aware, nested queries
Cons: heavily pattern based, a priori knowledge of data structure is usually required, cumbersome syntax

Lorel[16] is an early query language for semistructured data. It uses OEM (Object Exchange Model) as the data model for semistructured data. For querying elements, Lorel extends OQL (Object Query Language) by relying on coercion at a number of levels to relax the strong typing of OQL. Lorel also extends OQL with path expressions so that the user can specify patterns that are matched to actual paths in the referenced data.
Pros: easy syntax
Cons: dependent on OQL parser, limited functionalities

Quilt[13] is a functional language in which a query is represented as an expression. There are seven principal forms of Quilt expressions: path expressions, element constructors, FLWR expressions, expressions with operators and functions, conditional expressions, quantifiers, and variable bindings. Besides join operations, Quilt also supports nested expressions, so it effectively supports subqueries within a single query. Significant features of Quilt were used in the development of XQuery.
Pros: robust functionalities, subqueries
Cons: no support for textual similarity

XQL[17] uses path expressions, so its basic constructs correspond directly to the basic structures of XML. Because of this, XQL is closely related to XPath. In XQL, document nodes play a central role. Nodes have identity, and they retain their identity, containment relationships, and sequence in query results. The nodes themselves may come from a variety of sources. However, XQL does not specify how these nodes are brought to the query. XQL also supports joins and some functions.
Pros: shorter expressions
Cons: semantics may not be very intuitive

XQuery[9] had been a moving target for some time before it was established as a W3C recommendation in 2007. A large part of XQuery's semantics is adopted from Quilt. XQuery uses XPath for path expressions and the FLWOR structure for describing the whole query. Because it is a recommended standard, much current research discusses methods of optimizing XQuery translation and processing by a query processor and

integrating XQuery into a full-fledged XML database management system.
Pros: clear semantics, integration with XPath
Cons: overlap with XSL

Important characteristics of the various XML query languages are summarized in Table 1.

    III.    APPROACHES FOR XML
            QUERY PROCESSING

A query processor maps the high-level abstraction of a declarative query to its procedural evaluation as a set of low-level operations[18]. This is analogous to an SQL processor, in which an SQL query is translated at the logical access model and then at the physical access model before the storage model is accessed and results are returned. The levels of abstraction in XML query processing are compared with the SQL abstraction levels in Table 2.

Level of abstraction     XDBS                                    RDBS
Language                 XQuery                                  SQL
Logical access model     XML query algebra                       Relational algebra
Physical access model    Physical XML query algebra              Physical DB-operators
Storage model            XTC, Natix, shredded documents, etc.    Record-oriented DB-interface
    Table 2 XDBS vs. RDBS abstraction levels

In Table 2, XDBS denotes an XML database management system and RDBS a relational database management system. The language model is designed to meet the demands of [12]. These are reflected in the language's ability to perform search and to be aware of document order (document-centric characteristics) and, in addition, in powerful selection and transformation (data-centric characteristics). The semantic processing should then be able to analyze the query and transform it into an internal representation to be used throughout subsequent optimization steps.

The logical access model should implement algebraic and non-algebraic procedures to optimize the internal representation of the query. Non-algebraic optimization minimizes intermediate results by restructuring the query and executing the most selective operations as early as possible. Algebraic optimization transforms the internal expression into a more optimized expression in a semantics-preserving manner.

The physical access model is related to system-specific issues. At this level, each logical algebra operator is decomposed into corresponding physical operators. The goal of this optimization step is a query execution plan (QEP), which consists of the chosen physical operators and their order of execution.

Finally, the storage model affects the speed at which the QEP executes. For optimized query processing, an appropriate storage model should be deployed in order to minimize I/O costs, CPU costs, storage costs for intermediate results, and communication costs. Currently used storage models comprise LOBs (Large Objects), certain XML-to-relational mappings (shredded documents), and native storage formats like Niagara[19] and Timber[20]. The relational XML data model and the native storage model attract more attention, as indicated by the various proposals for query processors built on top of them.

Various XML query processors have been proposed for more optimized query processing. Referring to the abstraction levels, we divide the query processors into three categories based on their storage models: flat-file processing, relational processing, and native storage processing.

Query Processing on Flat File Scheme

In flat file processing, for example when XML is saved as LOBs, a query is executed after all XML data is loaded and scanned by

                             Figure 1 Index creation in index-filter technique

the query processor. This results in poor performance when the file is large and temporary storage in memory is not feasible. However, some algorithms have been devised to improve query processing. N. Bruno et al.[21] studied different techniques for processing XML queries: Y-Filter, Index-Filter, and PathStack. Y-Filter processes queries by augmenting a prefix tree representation of the input queries into an NFA (Non-deterministic Finite Automaton) that outputs all matches of the queries. The Index-Filter technique uses indexes built over certain tags of the input XML document. PathStack, a series of linked stacks, is then created for each query node in a path query in order to track the data nodes. Figure 1 shows how indexes for an XML document are created using this approach.

Query Processing on Relational Structure

In this approach, the XML document or information related to the XML document is stored in a relational database. This step is taken because a relational database provides better indexing than the simple index creation of the previous approach. The RDBMS engine performs the query processing instead, by translating XQuery into SQL, running the SQL query, and serializing the XML result.

Relational storage schemes for XML documents can be classified into three groups: schemes without an XML schema, schemes based on an XML schema, and user-defined schemes. When no schema is provided, the relational schema should be derived from the data. Once a schema exists, a relational schema is created that contains the relationships among the root element and all sub-elements.

In [2], the authors divide relational schemes into schema-oblivious and schema-conscious approaches. The schema-oblivious approach maintains a fixed schema by capturing the tree structure of XML documents. In contrast, the schema-conscious approach first creates a relational schema based on the DTD/schema of the XML and then, based on that schema, sets up primary-key foreign-key joins in the relational database to model parent-child relationships in the XML tree. The authors built SUCXENT++ and observed that the schema-oblivious approach could also outperform the schema-conscious approach.

The authors in [2] also provided comparisons with other schemes like EDGE, XRel, and XParent, which are not discussed here for brevity.

BEA/XQRL[4] is a query processor that implements the relational scheme using XQuery. A query is parsed and optimized by the query compiler. To submit the query, the XDBC interface acts as an interface between the front-end application and the query processor. The compiler then generates a query plan to optimize the query. XML data is represented as a stream and parsed as input by the XML parser. Runtime operators, which contain the function and operator libraries, process the stream and produce output based on the query plan.
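
The schema-oblivious relational storage described earlier in this section can be illustrated with a minimal sketch. It assumes SQLite as a stand-in RDBMS and a generic edge-style table; the table name `edge`, the sample document, and the helper names are hypothetical and do not reproduce the actual schema of SUCXENT++, EDGE, XRel, or XParent. The SQL for the path /lib/book/title is written by hand here, whereas a real processor would generate it from the XQuery or XPath expression.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document used only for this sketch.
DOC = ("<lib><book><title>A</title></book>"
       "<book><title>B</title></book>"
       "<cd><title>C</title></cd></lib>")


def shred(xml_text, conn):
    """Store every element as one row of a fixed edge table.

    The table layout does not depend on the document's DTD or schema,
    which is what makes the mapping schema-oblivious.
    """
    conn.execute(
        "CREATE TABLE edge ("
        "id INTEGER PRIMARY KEY, parent INTEGER, tag TEXT, text TEXT)"
    )
    counter = 0

    def walk(elem, parent_id):
        nonlocal counter
        counter += 1  # pre-order numbering, so ids follow document order
        node_id = counter
        conn.execute(
            "INSERT INTO edge VALUES (?, ?, ?, ?)",
            (node_id, parent_id, elem.tag, (elem.text or "").strip()),
        )
        for child in elem:
            walk(child, node_id)

    walk(ET.fromstring(xml_text), None)


def titles_of_books(conn):
    """Evaluate /lib/book/title with one self-join per path step."""
    rows = conn.execute(
        "SELECT t.text FROM edge AS l "
        "JOIN edge AS b ON b.parent = l.id AND b.tag = 'book' "
        "JOIN edge AS t ON t.parent = b.id AND t.tag = 'title' "
        "WHERE l.parent IS NULL AND l.tag = 'lib' "
        "ORDER BY t.id"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
shred(DOC, conn)
print(titles_of_books(conn))  # ['A', 'B']
```

Each step of the path costs one self-join on the edge table, which hints at why long or recursive paths are the hard case for schema-oblivious mappings.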
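
The idea of treating XML data as a stream, as in the BEA/XQRL description, can be sketched as follows. This is only an illustration under stated assumptions, not the BEA engine: it uses `iterparse` from Python's standard library as the event stream, supports only a plain child path without predicates, and the function name `stream_match` and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document used only for this sketch.
DOC = (b"<lib><book><title>A</title></book>"
       b"<book><title>B</title></book>"
       b"<cd><title>C</title></cd></lib>")


def stream_match(xml_bytes, path):
    """Yield the text of each element whose root-to-node tag path equals `path`.

    The document is consumed as a sequence of start/end events, so a
    match is reported as soon as the element's end tag has been read.
    """
    target = path.strip("/").split("/")
    open_tags = []  # tags of the currently open elements, root first
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            open_tags.append(elem.tag)
        else:
            if open_tags == target:
                yield (elem.text or "").strip()
            open_tags.pop()
            elem.clear()  # discard the content of the finished element


print(list(stream_match(DOC, "/lib/book/title")))  # ['A', 'B']
```

The stack of open tags is the only state kept between events, which is the property that lets a streaming processor start producing output before the whole input has been read.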

Figure 2 depicts an overview of the BEA streaming XQuery engine.

          Figure 2 Overview of BEA

MonetDB/XQuery[5] is another query processor for XQuery, which consists of the Pathfinder compiler on top of the MonetDB RDBMS. It also has an XQuery runtime module that uses the loop-lifted staircase join (a method for evaluating an XPath location step in a single sequential scan) as a physical operator so that query processing can be improved.

Query Processing on Native Storage Scheme

Using this approach, XML elements are assigned labels. The purpose of the labeling is to create unique identifiers that are useful for query processing. There are many labeling schemes, which trade off space occupancy, information content, and suitability for updates. The most frequently used is the region-based labeling scheme. The idea of this scheme is to label elements so that the labels reflect nesting. Figure 3 shows the labeling scheme for simple nesting. The final label denotes the (start, end, level) status of the node.

     Figure 3 Region-based labeling scheme

Another labeling scheme is ORDPATH, which is implemented in MS SQL Server. This scheme labels each node with a sequence of integers. Order, depth, parent, and ancestor-descendant relationships are recorded in this scheme.

The XML document is then stored as persistent trees. If disk is used as the storage medium, XML nodes are split among disk pages. Node representation is optimized based on a fixed page size.

Efficient query processing in native storage is achieved by stack-based algorithms like StackTreeDesc[22] and holistic twig joins[23]. The StackTreeDesc algorithm uses a stack to cache the labels of parent elements. When the path to the destination child node is reached, the information from the stack is combined with the child label and returned as results in descendant order. Subsequently, the stack is emptied for the next operation. Holistic twig joins, on the other hand, try to avoid constructing intermediate results when matching twig (search for predicate or label) patterns.

NaXDB[7] uses the native approach and supports XQuery and XUpdate processing. In NaXDB, the hierarchical tree of linked objects from XML data is stored using the object-oriented extensions of MaxDB from MySQL. The MaxDB system architecture is built on top of three subsystems: a database client that enables users to write queries and receive results, a database server, which is the core subsystem, and a persistent object manager, which is responsible for persistently storing XML data.

   IV.      TOWARD FUTURE XML
            DATABASE
            MANAGEMENT SYSTEMS

Future database management systems are associated with application mash-ups and versatility. They will operate across different platforms and thus have to handle interoperability among data. Data can be static or in the form of a stream, and its flow may vary from a low-density stream to a high-density stream. A database management system should be aware of those

characteristics and be able to perform well by minimizing the costs.

This paper has reviewed progress toward XML database management systems. Current trends incline toward the relational scheme, where a query on XML data is translated into declarative SQL to speed up indexing and node retrieval.

Future research can be targeted at designing better path-finding algorithms as in [3,6], alternative query processing as in [1], and support for transactional XML databases.

    V.      CONCLUSION

Since XQuery is now the de facto standard query language for XML, much effort is being put into achieving more efficient and optimized XML query processing. Current trends incline toward the relational scheme, which consolidates XML with the features of an RDBMS. However, several challenges for the realization of scalable XML database management systems still exist, and future research should address them.

                 REFERENCES

[1] A. Halverson et al. Mixed Mode XML Query
    Processing. In Proceedings of the 29th VLDB
    Conference. 2003
[2] S. Prakash, S. B. Bhowmick, S. Madria.
    Efficient Recursive XML Query Processing
    Using Relational Database Systems. In
    Proceedings of ER. 2004
[3] Y. Chen, G. A. Mihaila, S. B. Davidson, S.
    Padmanabhan. Efficient Path Query Processing
    on Encoded XML. In Proceedings of
    International Workshop on High Performance
    XML Processing. 2004
[4] D. Florescu et al. The BEA/XQRL Streaming
    XQuery Processor. In Proceedings of VLDB
    Conference. 2003
[5] P. Boncz et al. MonetDB/XQuery: A Fast
    XQuery Processor Powered by a Relational
    Engine. In Proceedings of ACM SIGMOD
    International Conference of Management of
    Data. 2006
[6] Y. Chen, S. B. Davidson, Y. Zheng. An Efficient
    XPath Query Processor for XML Streams. In
    Proceedings of 22nd International Conference
    on Data Engineering. 2006
[7] J. Hundling, J. Sievers, M. Weske. NaXDB –
    Realizing Pipelined XQuery Processing in a
    Native XML Database System. In 2nd
    International Workshop on XQuery
    Implementation, Experience and Perspective.
    2005
[8] S. Wang et al. R-SOX: Runtime Semantic
    Query Optimization over XML Streams. In
    Proceedings of 32nd International Conference on
    VLDB. 2006
[9] W3C XML Query Specification, Latest.
    http://www.w3.org/TR/xquery
[10] W3C XML Path Language Specification, Latest.
    http://www.w3.org/TR/xpath
[11] W3C XML1.0 Recommended Specification.
[12] D. Maier. Database Desiderata for XML Query
    tml
[13] D. Chamberlin, J. Robie, D. Florescu. Quilt: An
    XML Query Language for Heterogeneous Data
    Source. In WebDB (Informal Proceedings),
    pages 63-62. 2000
[14] A. Deutsch, M. Fernandez, D. Florescu, A.
    Levy, and D. Suciu. XML-QL: A query
    language for XML. In Proceedings of 8th
    International World Wide Web Conference.
    1999
[15] T. Chinenyanga and N. Kushmerick. An
    Expressive and Efficient Language for XML
    Information Retrieval. In Journal of the
    American Society for Inf. Sci. and Tech., 53(6):
    438-453. 2002
[16] S. Abiteboul et al. The Lorel Query Language
    for Semistructured Data. In International
    Journal on Digital Libraries, 1(1):68-88. 1997
[17] J. Robie et al. XQL (XML Query Language).
    http://www.ibiblio.org/xql/xql-proposal.html.
    August 1999
[18] C. Mathis and T. Harder. A Query Processing
    Approach for XML Database Systems. 2005
[19] J. Naughton et al. The Niagara Internet Query
    System. In IEEE Data Engineering Bulletin vol
    24 issue 2. 2001
[20] H. V. Jagadish et al. A Native XML Database.
    In International Conference of VLDB. 2002
[21] N. Bruno, L. Gravano, N. Koudas, and S.
    Srivastava. Navigation vs. Index-Based XML
    Multi Query Processing. In Proceedings of the
    19th ICDE. 2003
[22] S. Al-Khalifa et al. Structural Joins: A Primitive
    for Efficient XML Query Pattern Matching. In
    International Conference of ICDE. 2002
[23] N. Bruno, N. Koudas, S. Srivastava. Holistic
    Twig Joins: Optimal XML Pattern Matching. In
    SIGMOD. 2002