					XML Databases

                Department of Computer Science
           Cochin University of Science and Technology
                        Kochi – 682022

 Certified that this is a bona fide record of the seminar titled
                        XML Databases
                   Successfully presented by

                                        Sheena S

of first semester Software Engineering in the year 2007 in partial
fulfillment of the requirements for the Degree of Master of
Technology (M-Tech) in Software Engineering of Cochin
University of Science and Technology.

Guided by                                          Dr.Paulose Jacob
Dr. Sumam Mary Idicula                             Director
Reader                                             Dept. of Computer Science

Date 06.11.07

Department of Computer Science, CUSAT                                          1
XML Databases


       An XML database is a database designed specifically for storing XML documents. It
understands the structure of an XML document and can perform queries and retrieve
data by taking advantage of that structure. XML databases provide a logical model for
grouping documents, called ‘collections’, which can be created and managed at one time.
An XML database supports at least one query language and the management of schemas,
and provides mechanisms to add, modify, or delete content. It also allows the physical
storage of the database to be optimized. XML databases use XQL, XPath, or XQuery for
querying, and they preserve document order. An XML database may be inefficient when a
document, or part of one, is required in a form different from the one in which it is stored.

                Native XML databases in particular have succeeded because of their query
languages, the flexibility of the XML data model, and their ability to handle schema-less
data. This does not imply that XML databases will become an instant standard, because
companies have invested a great deal of time, effort, and money in existing relational
database systems. However, lightweight applications in which information is stored as
XML could benefit greatly from XML databases. XQuery is not going to replace SQL
completely as the application query language of choice in the near future, but in ten
years or so an extension of XQuery might replace both.

       Keywords: XML, XQuery, XPath, DTD, XSD, XML Schema, Database, XDR, XSL

I thank GOD almighty for guiding me throughout the seminar. I would like to thank all
those who have contributed to the completion of the seminar and helped me with valuable
suggestions for improvement.

I am extremely grateful to Prof. Dr. K Poulose Jacob, Director, Department of Computer
Science, for providing me with the best facilities and atmosphere for creative work, and
for his guidance and encouragement.

I would like to thank my coordinator, Dr. Sumam Mary Idicula, for all the help and support
extended to me. I thank all the staff members of my college and my friends for extending
their cooperation during my seminar.

Above all, I would like to thank my parents, without whose blessings I would not have
been able to accomplish my goal.


                             TABLE OF CONTENTS
1.     INTRODUCTION
2.     OVERVIEW OF XML DATABASES
       2.1      The need to store XML
       2.2      Structuring XML: DTDs and XML Schemas
                       2.2.1 Document Type Definitions (DTDs)
                       2.2.2 XML Data Reduced (XDR)
                       2.2.3 XML Schema Definitions (XSD)
       2.3      XML Presentation
       2.4      XML Database Requirements
       2.5      Designing an XML Database
4.     CLASSIFICATION OF XML DATABASES
                4.1 XML Enabled Databases
                       4.1.1 Table-Based Mapping
                       4.1.2 Object-Relational Mapping
                4.2 Native XML Databases
                       4.2.1 Text-Based Native XML Databases
                       4.2.2 Model-Based Native XML Databases
                4.3 Hybrid Database
5.     XML QUERYING
                5.1 XPath
                5.2 XQuery
6.     APPLICATIONS OF XML DATABASES
7.     CHALLENGES WITH XML
8.     CONCLUSION
9.     REFERENCES

                                  1. INTRODUCTION

   The eXtensible Markup Language (XML) is a meta-markup language developed by
the World Wide Web Consortium (W3C) to deal with a number of the shortcomings of
HTML. XML is a subset of the Standard Generalized Markup Language (SGML, ISO
8879:1986) and was designed for interoperability with HTML. An SGML Working Group,
also organized by the W3C, assisted with the development of XML. The initial goal of
these groups was to allow SGML to be “served, transmitted and processed” on the
Web as easily as HTML. The W3C published the resulting work, the W3C XML
Recommendation, in 1998.

       XML is not a new version of, or a replacement for, HTML. XML is concerned with the
description and representation of data, rather than with the way the data are displayed.
The two languages differ in two key areas. The first is the syntax and semantics of tags:
in XML the author of a document is free to create tags whose syntax and semantics are
specific to the target application, whereas in HTML the set of tags is fixed. The semantics
of an XML tag are not tied down but depend on the context of the application that
processes the document. The second significant difference is that an XML document must
be well-formed: every start tag needs a matching end tag, and elements must nest properly.
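
The well-formedness requirement can be checked mechanically. The following is a minimal
Python sketch using the standard library parser; the function name is_well_formed and
the resort sample documents are illustrative assumptions, not part of the seminar material.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Author-defined tags are fine as long as every tag is closed and properly nested.
good = "<resort><name>Sample Beach</name><temp month='Feb'>29</temp></resort>"

# The <name> element is never closed, so </resort> arrives while <name> is still open.
bad = "<resort><name>Sample Beach<temp month='Feb'>29</temp></resort>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

An HTML browser would typically tolerate the second document; an XML parser must reject it.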

       An XML document is a database only in the strictest sense of the term: it is a
collection of data. As a "database" format, XML has some advantages: it is
self-describing, it is portable, and it can describe data in tree or graph structures. XML
documents fall into two broad categories: data-centric and document-centric. Data-centric
documents are documents that use XML as a data transport. They are designed for machine
consumption. Examples of data-centric documents are sales orders, flight schedules,
scientific data, and stock quotes. Data-centric documents are characterized by fairly
regular structure, fine-grained data, and little or no mixed content.

       Document-centric documents are (usually) documents that are designed for
human consumption. Examples are books, email, advertisements, and almost any hand-
written XHTML document. They are characterized by less regular or irregular structure,
larger grained data, and lots of mixed content. The basic functions of an XML database
are to store documents, query over documents and handle query results. Of course,
indexes are required to obtain acceptable query performance.

       Large repositories of XML data have begun to emerge.              The effective
management of XML in a database thus becomes a pressing issue. A central challenge in
this regard is the complex and heterogeneous structure of XML data. In spite of these
challenges, XML Databases are popular. There are no requirements for how an XML
Database is expected to physically store XML documents. Some XML Databases are
built on an object database, others might use compressed files with an indexing scheme,
and still others might be built on top of a relational database. On this basis there are
three types of XML databases: XML-enabled databases, native XML databases, and
hybrid XML databases, the most popular being native XML databases. The query
languages for XML include XPath and XQuery, and the programming interfaces for XML
include SAX and DOM.
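
The difference between the two programming interfaces can be seen in a small Python
sketch. It uses the standard library modules xml.sax and xml.dom.minidom; the sample
document and the TitleCounter class are assumptions made for illustration. SAX pushes
parsing events to a handler and keeps nothing in memory, whereas DOM builds the whole
tree first and lets the program navigate it.

```python
import xml.sax
from xml.dom import minidom

DOC = "<bib><book><title>A</title></book><book><title>B</title></book></bib>"

class TitleCounter(xml.sax.ContentHandler):
    """SAX style: react to events as the parser streams through the document."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "title":
            self.count += 1

handler = TitleCounter()
xml.sax.parseString(DOC.encode("utf-8"), handler)

# DOM style: load the complete document into a tree, then query the tree.
dom = minidom.parseString(DOC)
titles = [node.firstChild.data for node in dom.getElementsByTagName("title")]

print(handler.count, titles)
```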

                    2. OVERVIEW OF XML DATABASES

       XML has emerged as the new standard for information representation and
exchange on the web, and XML databases are a new option for persisting data. In most
XML databases, the fundamental unit is the XML document, which roughly corresponds
to a record in a traditional database. An XML-enabled database (XEDB) is a database
that has an XML mapping layer added on to it. Data manipulation in an XEDB happens
with either XML technologies (e.g. XPath) or database technologies (e.g. SQL). The heart
of an XML database, or any database, is how information is stored, indexed, and queried.
These design details may not be obvious from the outside, but are critical to the function,
performance, and scalability of a database system. An XML database exposes a logical
model of storing and retrieving XML documents; however, its internal storage model is
not necessarily equivalent to the document. The storage format, in many ways, drives the
features that can be supported by the database.

       Indexing is a crucial component of any database. Without the ability to
intelligently index stored information, a database system is little better than a file system
for information retrieval. Indexing XML is interesting because there are a number of
options on what, and how to index, some of which are dictated by the storage format.
Query processing builds on both storage format and indexes. The implementation and
performance of query processing are affected not only by storage format and indexes, but
also by the query language(s) available.
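
One of the indexing options mentioned above is to index elements by their path from the
root. The following Python sketch shows the idea under simple assumptions: the function
build_path_index and the sample bib document are hypothetical, and a real XML database
would persist such an index rather than rebuild it in memory.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_path_index(root):
    """Map each root-to-element path (e.g. 'bib/book/title') to the elements found there."""
    index = defaultdict(list)

    def visit(elem, path):
        index[path].append(elem)
        for child in elem:
            visit(child, path + "/" + child.tag)

    visit(root, root.tag)
    return index

root = ET.fromstring(
    "<bib><book><title>XML</title></book><book><title>SQL</title></book></bib>"
)
index = build_path_index(root)

# A path query becomes a dictionary lookup instead of a tree traversal.
titles = [elem.text for elem in index["bib/book/title"]]
print(titles)
```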

       Indexing and querying of XML data have been major research issues in the
database world. XML data can be modeled as an ordered, node-labeled tree. XPath and
XQuery have been proposed for querying XML. An update to an XML document can
change both the structure and the content of the document.

Here are some specific advantages of XML over other data formats:
           •    XML is cross-platform and standards-based.
           •    XML enjoys wide industry acceptance.
           •    XML documents are human-readable.
           •    XML separates content from presentation.
           •    XML is a middle-of-the-road approach.
           •    One of the great strengths of XML is its flexibility in representing many
                different kinds of information from diverse sources.
           •    An XML document has a hierarchical structure, as shown in Figure 2.1.

       Figure 2.1 shows an XML document and its hierarchical representation, with
bib as the root element and book among its child elements. The attributes of the
elements are drawn as children of their elements.

   <title>HTML 4.0</title>

                                        Figure 2.1
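
The hierarchy described for Figure 2.1 can also be printed by a short program. The Python
sketch below is illustrative only: the bib and book element names and the title text come
from the figure, while the year attribute and the outline function are assumptions added
to show how attributes appear as children in the tree view.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in the spirit of Figure 2.1; the year attribute is assumed.
BIB = """<bib>
  <book year="1999">
    <title>HTML 4.0</title>
  </book>
</bib>"""

def outline(elem, depth=0):
    """Return one indented line per element, with attributes listed as '@name' children."""
    lines = ["  " * depth + elem.tag]
    for name in elem.attrib:
        lines.append("  " * (depth + 1) + "@" + name)
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(BIB))
print("\n".join(tree_lines))
```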

2.1 The need to store XML

       XML was not initially developed as a model for storing information. Later,
parallels were noticed between the XML world and databases: storage (XML documents),
schemas (DTDs, XML Schemas, RELAX NG, and so on), query languages (XQuery, XPath,
XQL, XML-QL, QUILT, etc.), and programming interfaces (SAX, DOM, JDOM). Also, once an
organization has decided to adopt XML to underpin a key business process, the need to
store the information (and, of course, to retrieve it later) soon follows. There are
essentially three ways that this can be done:

   •   Store XML as files. In this approach, the XML document in its raw textual form
       is treated as any other text file, and is stored as such, either as a file in an
       operating system file store, or in some kind of general-purpose document
       management system, or perhaps as a “BLOB” or “CLOB” in a relational
       database. Some kind of external index is maintained to enable these files to be
       subsequently retrieved; perhaps the only retrieval mechanism is to give each file a
       hierarchical name.
   •   Extract the data. In this approach, the XML document is parsed and the
       information it contains is extracted into some kind of database, typically a
       relational database. The original XML document is not retained. When the
       information is subsequently required, a new XML document is constructed by
       assembling all the relevant information items from the database.
   •   Use an XML database. With this approach, the XML document is stored in a
       database that understands the structure of the XML document and is able to
       perform queries and retrieve the data taking advantage of knowledge of its XML
       structure.
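
The second approach, extracting the data, can be sketched in a few lines of Python with
the standard sqlite3 module. The employee document, the table layout, and the function
names shred and rebuild are all assumptions chosen for illustration. The sketch also shows
the drawback of this approach: anything that has no column, here an XML comment, is lost
when the document is reconstructed.

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(conn, xml_text):
    """Parse one <employee> document and store its fields as a relational row."""
    emp = ET.fromstring(xml_text)
    conn.execute(
        "INSERT INTO employee VALUES (?, ?)",
        (emp.get("id"), emp.findtext("name")),
    )

def rebuild(conn, emp_id):
    """Construct a new XML document from the stored row."""
    row = conn.execute(
        "SELECT id, name FROM employee WHERE id = ?", (emp_id,)
    ).fetchone()
    emp = ET.Element("employee", id=row[0])
    ET.SubElement(emp, "name").text = row[1]
    return ET.tostring(emp, encoding="unicode")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id TEXT PRIMARY KEY, name TEXT)")

original = "<employee id='e1'><!-- hired 2007 --><name>Asha</name></employee>"
shred(conn, original)
rebuilt = rebuild(conn, "e1")
print(rebuilt)
```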

   Here we concentrate on the third approach. The other two approaches cannot be
completely dismissed, and in fact there are many situations where they might be
appropriate. But they do have considerable disadvantages. The main disadvantage of the
file-based (or BLOB-based) approach is that the structured information inside the XML
document cannot be used for retrieval purposes. Separate indexes need to be created to
locate the required document based on one or more keys such as a date or employee
number. If these keys are not known, it is very hard to locate the information. In addition,
the only thing that can be retrieved is the original document. It isn’t possible, for
example, to do an aggregation query, such as “Which resorts have an average February
temperature above 25ºC?” The documents need to be individually retrieved, and the
information extracted from them, probably by hand. This means that the value of having
the XML markup in the documents is not being exploited, and the information asset is not
being used to its full potential. Extracting the data into a relational database has other
disadvantages. Firstly, it only works where the data fits comfortably into rows and
columns. This doesn’t work well for narrative documents such as news reports, and it
doesn’t work well for complex data structures such as medical records. Secondly, there is
no record of the original document, which might well be needed for archiving purposes
or for legal traceability requirements.
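
When the markup is exploited, the aggregation query quoted above becomes straightforward.
The following Python sketch assumes a hypothetical resort document format with temp
elements carrying month and year attributes; neither the format nor the function
warm_in_february comes from a real system.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Two hypothetical resort documents with February readings from two years.
DOCS = [
    "<resort name='A'>"
    "<temp month='Feb' year='2005'>27</temp>"
    "<temp month='Feb' year='2006'>29</temp>"
    "</resort>",
    "<resort name='B'>"
    "<temp month='Feb' year='2005'>18</temp>"
    "<temp month='Feb' year='2006'>20</temp>"
    "</resort>",
]

def warm_in_february(docs, threshold=25.0):
    """Return the names of resorts whose average February temperature exceeds threshold."""
    result = []
    for text in docs:
        resort = ET.fromstring(text)
        temps = [float(t.text) for t in resort.findall("temp[@month='Feb']")]
        if temps and mean(temps) > threshold:
            result.append(resort.get("name"))
    return result

warm = warm_in_february(DOCS)
print(warm)
```

An XML database would evaluate the same question with a query language such as XQuery,
using indexes instead of parsing every document.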

   In the real world, most information is semi-structured: consider a CV (résumé), an
accident report, an invoice, a software bug report, an insurance claim, a description of a
CD offered for sale on the Internet. Traditionally, in the IT world, we have handled the
structured part of the information and the unstructured part separately, using different
technologies and often using different business processes. The Internet has changed that:
Both parts now need to be handled as a whole, and a major reason for the success of
XML is that it is the first mainstream technology that can handle the whole spectrum of
information from the highly structured to the totally unstructured. Semi-structured data
has become the norm, and the limitations of technologies that can handle only one end of
the spectrum have become painfully apparent. The only reason to store anything is so
that the information can later be retrieved.

2.2 Structuring XML: DTDs and XML Schemas

       Since XML is a way to describe structured data there should be a means to
specify the structure of an XML document. Document Type Definitions (DTDs) and
XML Schemas are different mechanisms that are used to specify valid elements that can
occur in a document, the order in which they can occur and constrain certain aspects of
these elements. An XML document that conforms to a DTD or schema is considered to
be valid. Below is a listing of the different means of constraining the contents of an XML
document. We will illustrate them using the following XML fragment. It shows the
details of a gatech_student whose gtnum is gt000x.


  <gatech_student gtnum="gt000x">
    <name>George Burdell</name>
    <age>21</age>
  </gatech_student>

2.2.1 Document Type Definitions (DTDs)

   DTDs were the original means of specifying the structure of an XML document and are a
holdover from XML's roots as a subset of the Standard Generalized Markup Language
(SGML). DTDs act as a schema for XML documents. A DTD specifies the structure of an
XML element by specifying the names of its sub-elements and attributes. Sub-element
structure is specified using the operators * (set with zero or more elements),
+ (set with one or more elements), ? (optional), and | (or). All values are assumed to be
string values unless the type is ANY, in which case the value can be an arbitrary XML
fragment. There is no concept of a root of a DTD: an XML document conforming to a
DTD can be rooted at any element specified in the DTD. DTDs have a different syntax
from XML and are used to specify the order and occurrence of elements in an XML
document. Below is a DTD for the above XML fragment.

     <!ELEMENT gatech_student (name, age)>
     <!ATTLIST gatech_student gtnum CDATA #REQUIRED>
     <!ELEMENT name (#PCDATA)>
     <!ELEMENT age (#PCDATA)>

         The DTD specifies that the gatech_student element has two child elements, name
and age, that contain character data, as well as a gtnum attribute that contains character
data. The purpose of a DTD is to define the legal building blocks of an XML document:
it defines the document structure with a list of legal elements. The problem with DTDs is
that they have their own peculiar syntax, which has nothing to do with XML.
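
The constraints that this DTD expresses can be spelled out by hand. The Python standard
library does not include a validating parser, so the sketch below is a hand-written check
and not DTD validation proper. It assumes that the document is rooted at gatech_student
and that gtnum is a required attribute; the function name conforms and the sample age
value are likewise assumptions.

```python
import xml.etree.ElementTree as ET

def conforms(xml_text):
    """Check the structure the DTD describes: (name, age) in order, plus a gtnum attribute."""
    root = ET.fromstring(xml_text)
    return (
        root.tag == "gatech_student"
        and [child.tag for child in root] == ["name", "age"]  # content model (name, age)
        and all(len(child) == 0 for child in root)             # #PCDATA only, no sub-elements
        and "gtnum" in root.attrib                             # assumed to be required
    )

ok = '<gatech_student gtnum="gt000x"><name>George Burdell</name><age>30</age></gatech_student>'
swapped = '<gatech_student gtnum="gt000x"><age>30</age><name>George Burdell</name></gatech_student>'

print(conforms(ok), conforms(swapped))
```

Note that the age content is not checked at all: a DTD treats it as character data, which
is the lack of data typing that later schema languages address.
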
2.2.2 XML Data Reduced (XDR)
       DTDs proved to be inadequate for the needs of XML users for a number of
reasons. The main criticisms of DTDs were that they use a different syntax from XML and
that they have no support for data types. XDR, a proposal for XML schemas, was
submitted to the W3C by the Microsoft Corporation as a potential XML schema standard
but was eventually rejected. XDR tackled some of the problems of DTDs by being XML-based
and by supporting a number of data types analogous to those used in relational database
management systems and popular programming languages. Below is an XML schema,
using XDR, for the above XML fragment.

<Schema name="myschema" xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">
     <ElementType name="age" dt:type="ui1" />
     <ElementType name="name" dt:type="string" />
     <AttributeType name="gtnum" dt:type="string" />
     <ElementType name="gatech_student" order="seq">
          <element type="name" minOccurs="1" maxOccurs="1"/>
          <element type="age" minOccurs="1" maxOccurs="1"/>
          <attribute type="gtnum" />
     </ElementType>
</Schema>

         The above schema specifies types for a name element that contains a string as
its content, an age element that contains an unsigned integer value of size one byte (i.e.
between 0 and 255), and a gtnum attribute that is a string value. It also specifies a
gatech_student element that has one occurrence each of a name and an age element in
sequence, as well as a gtnum attribute.

2.2.3 XML Schema Definitions (XSD)

       The W3C XML Schema recommendation provides a sophisticated means of
describing the structure of, and constraints on, the content model of XML documents.
W3C XML Schema supports more data types than XDR, allows for the creation of custom
data types, and supports object-oriented programming concepts such as inheritance and
polymorphism. Currently XDR is used more widely than W3C XML Schema, but this is
primarily because the XML Schema recommendation is fairly new and will thus take
time to become accepted by the software industry.

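    XSD for the sample XML fragment
     <schema xmlns="http://www.w3.org/2001/XMLSchema">
      <element name="gatech_student">
       <complexType>
        <sequence>
         <element name="name" type="string"/>
         <element name="age" type="unsignedInt"/>
        </sequence>
        <attribute name="gtnum">
         <simpleType><restriction base="string">
          <pattern value="gt\d{3}[A-Za-z]{1}"/>
         </restriction></simpleType>
        </attribute>
       </complexType>
      </element>
     </schema>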

       The above schema specifies a gatech_student complex type (meaning it can have
elements as children) that contains a name and an age element in sequence as well as a
gtnum attribute. The name element must have a string as content, the age element must
have an unsigned integer value, and the gtnum attribute must match a regular
expression that matches the letters "gt" followed by three digits and a letter. The above
examples show that DTDs give the least control over how one can constrain and structure
data within an XML document, while W3C XML schemas give the most.
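
The pattern constraint on gtnum can be tried out directly. The Python sketch below reuses
the regular expression from the schema; the function name valid_gtnum is an assumption.
XML Schema patterns must match the whole value, so re.fullmatch is used rather than
re.search.

```python
import re

# Same pattern as in the schema: "gt", three digits, one letter.
GTNUM = re.compile(r"gt\d{3}[A-Za-z]{1}")

def valid_gtnum(value: str) -> bool:
    """Return True if the whole value matches the gtnum pattern."""
    return GTNUM.fullmatch(value) is not None

print(valid_gtnum("gt000x"))   # the value used in the sample fragment
print(valid_gtnum("gt00x"))    # only two digits
print(valid_gtnum("xgt000x"))  # extra leading character
```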

2.3 XML Presentation

       One of the main benefits of XML is that it separates data structure from its
presentation and processing. By separating data and presentation, you are able to present
the same data in different ways, which is similar to having views in SQL.

       The Extensible Stylesheet Language (XSL) specification provides the mechanism
to display XML data: it defines the rules by which XML data are formatted and displayed.
The XSL specification is divided into two parts: Extensible Stylesheet Language
Transformations (XSLT) and XSL style sheets.

   •   Extensible Stylesheet Language (XSL)

   XSL is a language for expressing style sheets. It consists of a language for
transforming XML documents and an XML vocabulary for specifying formatting semantics.

   •   Extensible Stylesheet Language Transformations (XSLT)

   XSLT is a language for transforming XML documents into other XML documents. It is
designed for use both as part of XSL and independently of XSL, and is not intended as a
completely general-purpose XML transformation language.

                                                 Figure 2.2

   Components shown in Figure 2.2:
   •   XML document: the input.
   •   XSL Transformation (extract, convert): XSLT can be used to transform one XML
       document into another, new XML document.
   •   XSL style sheets: formatting rules applied to XML elements.
   •   HTML: the process can render different web pages for different purposes, such as
       one page for a web browser and another for a mobile device.

       Figure 2.2 illustrates the framework used by the different components to
translate XML documents into viewable Web pages, an XML document, or some other
document.
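
The separation of data from presentation can be demonstrated without an XSLT processor,
which the Python standard library does not provide. In the sketch below, two ordinary
Python functions stand in for two style sheets; this is not XSLT, only an illustration of
the same idea. The function names and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET

BIB = "<bib><book><title>HTML 4.0</title></book></bib>"

def to_html_list(root):
    """Presentation 1: render every title as an HTML list item."""
    ul = ET.Element("ul")
    for title in root.iter("title"):
        ET.SubElement(ul, "li").text = title.text
    return ET.tostring(ul, encoding="unicode")

def to_plain_text(root):
    """Presentation 2: render every title as one line of text."""
    return "\n".join(title.text for title in root.iter("title"))

root = ET.fromstring(BIB)
html_view = to_html_list(root)
text_view = to_plain_text(root)
print(html_view)
print(text_view)
```

The source document is untouched by either function, so further presentations can be
added without changing the stored data.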

2.4 XML Database Requirements

     As well as supporting the kinds of query that people expect to perform on stored
XML data, an XML database system must meet many other requirements. Many of these
are not specific to XML. For example, a database must:
   •   Support the management of schemas to define the data structure, and the
       validation of input according to those schemas.
   •   Provide mechanisms to add, modify or delete content.
   •   Offer facilities for multi-user access, transaction-based isolation and deadlock
       detection, backup, recovery and replication.
   •   Provide bulk data import and export capabilities.
   •   Allow the physical storage of the database to be optimized, for example by
       allowing user control over the creation of indexes and the allocation of disk
       space and other resources.
   •   Optimize queries to provide the most efficient possible execution, using the
       indexes and other access paths that have been made available.
   •   Support at least one form of querying syntax.
   •   Provide a logical model for grouping documents. Many XML databases call these
       groups ‘collections’, which can be created and managed at one time.

       Because the development of an industrial-strength database system is a major
undertaking, many XML databases are built on top of databases that were originally
designed for a different purpose, typically a relational database or an object database.
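
A few of these requirements (collections, adding and deleting content, and one form of
querying) can be mocked up in memory. The Collection class below is a hypothetical toy in
Python, not the interface of any real XML database; it uses the limited XPath subset of
xml.etree.ElementTree as its query syntax and ignores concurrency, persistence, schemas,
and indexing entirely.

```python
import xml.etree.ElementTree as ET

class Collection:
    """A named group of XML documents, keyed by a document identifier."""

    def __init__(self, name):
        self.name = name
        self._docs = {}

    def put(self, key, xml_text):
        """Add or replace a document; ill-formed input raises ET.ParseError."""
        self._docs[key] = ET.fromstring(xml_text)

    def delete(self, key):
        """Remove a document from the collection."""
        del self._docs[key]

    def query(self, path):
        """Evaluate a path against every document and return the matching text values."""
        return [e.text for doc in self._docs.values() for e in doc.findall(path)]

books = Collection("books")
books.put("b1", "<book><title>XML</title></book>")
books.put("b2", "<book><title>SQL</title></book>")
before = books.query("title")
books.delete("b1")
after = books.query("title")
print(before, after)
```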

2.5 Designing an XML Database

       There is a wealth of experience, captured in books, methods, and tools, for
designing relational databases. This experience has evolved over thirty years. XML
databases are a much more recent phenomenon, and the database designer therefore has
to work much more from first principles. The traditional approach to database design
follows the steps:

   1. Collect information about the application domain, by interviewing users,
         collecting process descriptions and studying existing applications.
   2. Build a model of the objects, attributes and relationships in the domain, using a
         notation such as UML.
   3. Translate this into a relational schema by applying the principles of normalization.
   4. Refine this design as necessary to ensure that the performance requirements of the
         applications are met.

         In principle, a similar approach can be followed for an XML database, with the
exception of step three, where the model is translated into XML elements and attributes
instead of relational tables and columns. But there is one big difference. In many cases,
the XML documents are designed primarily for information interchange, not for holding
persistent data. The designer must therefore decide whether the XML database is to hold
the documents in the form they arrived (essentially, a repository of messages) or whether
to refactor the document contents for query purposes. The decision depends very much
on the particular application requirements.

         Documents in an XML database should ideally relate to individual things or
events that are familiar to people in the user community: A product, an inspection report,
a job application. Sometimes it makes sense to assemble related information into a single
document, for example, to collect together all the information relating to one medical
episode, even though it may arrive in small pieces over a period of time. A good guide is:
What is the packaging of information that is most often going to be requested in a single

         Another factor that requires some thought is the modeling of relationships.
Unlike the relational model, which essentially only offers one way to represent
relationships (the primary key / foreign key combination), XML offers a bewildering
variety of techniques. These range from the use of hierarchic nesting, to ID/IDREF
pointers, and URLs and XPointer hyperlinks to reference one document to another. And
of course relational-style foreign keys are also available as an option. This richness
derives from the variety of mechanisms used in written documents, but some of these
techniques are much more amenable to database queries than others.
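
Two of these techniques, hierarchic nesting and ID/IDREF-style pointers, can be compared
in a short Python sketch. The order and customer documents are hypothetical. Because
xml.etree.ElementTree does not read DTDs, the id attribute is not a typed ID here; the
sketch resolves the reference with an ordinary dictionary, which is an assumption about
how an application might do it.

```python
import xml.etree.ElementTree as ET

# Relationship by nesting: the customer lives inside the order.
NESTED = "<order><customer><name>Asha</name></customer><item>Book</item></order>"

# Relationship by reference: the order points at a customer by identifier.
LINKED = """<shop>
  <customer id="c1"><name>Asha</name></customer>
  <order customer="c1"><item>Book</item></order>
</shop>"""

# Nesting needs only a path expression.
nested_name = ET.fromstring(NESTED).findtext("customer/name")

# A reference needs a lookup step, much like following a foreign key.
shop = ET.fromstring(LINKED)
by_id = {c.get("id"): c for c in shop.findall("customer")}
order = shop.find("order")
linked_name = by_id[order.get("customer")].findtext("name")

print(nested_name, linked_name)
```

Nesting is the easier form to query, while the reference form avoids repeating the
customer in every order.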


The key features of XML that have led to its widespread adoption are:
   •   It is simple enough that parsers became available very quickly and were
       distributed in most cases as free or open-source software. This meant that the
       decision to use XML could be made in many organizations without a lengthy
       investment review.
   •   It largely gets rid of low-level character encoding problems by adopting Unicode
       as its single character repertoire, allowing worldwide deployment.
   •   It is sufficiently flexible to handle both narrative documents intended for a human
       readership (for example, web pages) and arbitrary hierarchical data structures
       intended to be processed by applications, as well as combinations of the two.
   •   The syntax itself is human-readable and self-describing, allowing simple
       documents to be created or read using nothing more than a standard text editor.
       Like HTML, an XML document contains tags that indicate what each type of data
       is. With good document design, it should be reasonably simple for a person to
       look at an XML document and say, “this contains customers, orders and prices”.
   •   Changes to your document won’t break the parser. Assuming that the XML you
       write is syntactically correct, you can add elements to your data structure without
       breaking backward compatibility with earlier versions of your applications.
   •   XML documents can be hierarchical. It is easy to add related data to a node in an
       XML document without making the document unwieldy.
   •   There is no turf war: XML is supported by the entire IT industry, and the products
       offered by different vendors are highly interoperable. There is nothing about
       XML that ties it to any particular operating system or underlying technology. You
       don’t have to ask anyone’s permission or pay anyone money to use XML. If the
       computer you are working on has a text editor, you can use it to create an XML
       document. Several types of XML parsers exist for virtually every operating
       system in use today (even really weird ones).
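
The backward-compatibility point in the list above can be shown concretely. In the Python
sketch below, a reader written for an older document layout keeps working after a new
element is added; the customer documents and the function read_customer_v1 are assumptions
made for illustration.

```python
import xml.etree.ElementTree as ET

def read_customer_v1(xml_text):
    """An 'old' application that only knows about the <name> element."""
    customer = ET.fromstring(xml_text)
    return customer.findtext("name")

V1 = "<customer><name>Asha</name></customer>"
# A later version of the data adds an <email> element the old reader has never seen.
V2 = "<customer><name>Asha</name><email>asha@example.com</email></customer>"

print(read_customer_v1(V1))
print(read_customer_v1(V2))
```

This holds because the old reader selects elements by name and ignores the rest; a reader
that depended on child positions or validated against a strict schema could still break.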

   These factors mean that today XML is widely deployed not only in the original
application area of web content management, but in a large variety of other applications.
In many of these applications the information needs to be stored somewhere, and this is
where the requirement for XML databases comes from.

   •   A relational database has a row (of a relation) as its storage unit, while an XML
       database uses an XML document as its storage unit.
   •   Relational databases use SQL for querying, as against the XQL, XPath or XQuery
       used by XML databases.
   •   Document order is not preserved by a relational database, as opposed to an XML
       database, which preserves it.
   •   Relational databases are generally inefficient if the entire document is needed, as it
       may be split across tables. XML databases may be inefficient if the document or a
       part of it is required in a form different from that in which it is stored. In this
       respect XML databases are similar to hierarchical databases.
   •   Relational databases use SQL for updates, while XML databases use XUpdate
       (from the XML:DB initiative). However, updates may cause fragmentation in XML
       database storage.
   •   Relational databases provide concurrency control at row, table or schema level. XML
       databases usually provide it at document level, although some
       implementations may provide it at the level of a document node.

                Relational Model                       XML Model

       •   Tabular representation of your     •   Hierarchical representation of
           data.                                  your data.

       •   Strongly structured.               •   Semi-structured.

       •   Static schema definition.          •   Flexible schema definition.

       •   The same schema applies to         •   An XML Schema may or may
           every row of a table.                  not exist for some or all of your
                                                  XML documents. XML
                                                  Schemas are easily extended.

       •   All relationships are defined by   •   A document contains both data
           primary keys and foreign keys.         and relationship information
                                                  that describes how the data is
                                                  related.

       •   Order is unimportant.              •   Order is significant.
           Information is organized in sets       Information is organized in
           which are unordered by                 sequences which are ordered by
           definition.                            definition.

       •   Strongly typed. Each column        •   Optionally typed. Types might
           has exactly one data type.             or might not be defined for
                                                  some or all elements and
                                                  attributes in an XML Schema.

       •   ANSI/ISO standardization.          •   W3C standardization.

       •   Three-valued logic: true, false,   •   Two-valued logic: true, false.
           unknown (NULL).

    There are large architectural differences among XML databases; here we
classify them according to the way the database engines store XML.

4.1 XML Enabled Databases

        We call a database an XML-enabled database if its core storage and processing
model is not the XML data model. In many cases its core is the relational model and a
mapping between the XML data model and the relational data model is required. All
major relational database systems can be considered XML-enabled databases because
they support this mapping to manage XML. XML-enabled databases are useful when
publishing existing data as XML or importing data from an XML document into an
existing database. However, they are not a good way to store complete XML documents.
The reason is that they store data and hierarchy but discard everything else: document
identity, sibling order, comments, processing instructions, and so on. In addition, because
they require design-time mapping of schemas, they cannot store documents whose
schema is not known at design time.

        Mappings between document schemas and database schemas are performed on
element types, attributes, and text. Generally, the physical structure of the document is
lost because some parts (like white spaces, entities, CDATA sections, and encoding
information) of the document are ignored. The XML to relational database mapping
could be one of the following:

4.1.1 Table-Based Mapping

        The table-based mapping is used by many of the middleware products that
transfer data between an XML document and a relational database. It models XML
documents as a single table or set of tables. That is, the XML document must follow a
fixed structure of nested <database>, <table>, <row> and column elements, where the
<database> element and additional <table> elements do not exist in the single-table case.

         Depending on the product, column data may be stored as child elements or as
attributes, and the names used for each element or attribute may be configurable. In
addition, products that use table-based mappings often optionally include table and
column metadata, either at the start of the document or as attributes of each table or
column element. The table-based mapping is useful for serializing relational data, such
as when transferring data between two relational databases. Its obvious drawback is that
it cannot be used for any XML documents that do not match the above format.

         The obvious advantage of this mapping is that it is simple and easy to understand.
Because it matches the structure of tables and result sets in a relational database, it is easy
to write code based on this mapping. This code is fast and scales well, and is quite useful
for certain applications, such as transferring data between databases one table at a time.
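
         As a minimal sketch of the table-based mapping, the following Python code
serializes one relational table as nested database, table, row and column elements. The
orders table, its columns and the helper name table_to_xml are assumptions made for this
illustration; sqlite3 and xml.etree.ElementTree are from the Python standard library. It
is not the implementation of any particular middleware product.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn: sqlite3.Connection, table: str) -> ET.Element:
    """Serialize one table as <database><table><row><column>..</column></row>...</table></database>."""
    # The table name is interpolated directly, so it must come from trusted code.
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    database = ET.Element("database")
    table_el = ET.SubElement(database, "table", name=table)
    for row in cur:
        row_el = ET.SubElement(table_el, "row")
        for col, value in zip(columns, row):
            # Each column value becomes a child element named after the column.
            ET.SubElement(row_el, col).text = str(value)
    return database


# Hypothetical data used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])

doc = table_to_xml(conn, "orders")
print(ET.tostring(doc, encoding="unicode"))
```

Because the code only walks a result set, it stays simple and fast, but it can produce
and consume only documents of this one shape, which is the limitation described next.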

         The mapping has several disadvantages, the most notable of which is that it only
works with a very small subset of XML documents. In addition, it does not preserve
physical structure (such as character and entity references, CDATA sections, character
encodings, or the standalone declaration), document information (such as the document
type or DTD), comments, or processing instructions.

        Despite its limitations, the table-based mapping is widely implemented, such as in
middleware used to transfer data between XML documents and relational databases, in
Web application servers and XML-enabled databases to return result sets as XML, and in
many home-grown applications. An important use is in implementing XQuery over
relational databases, in which each table is treated as a virtual XML document (according
to the table-based mapping) and then queried through XQuery.

4.1.2 Object-Relational Mapping

   Because table-based mappings only work with a limited subset of XML documents,
some middleware, most XML-enabled relational databases, and most XML-enabled
object servers use a more sophisticated mapping, called an object-relational mapping.
This models the XML document as a tree of objects that are specific to the data in the
document. The model is then mapped to relational databases using traditional object-
relational mapping techniques. That is, classes are mapped to tables, scalar properties are
mapped to columns, and object-valued properties are mapped to primary key / foreign
key pairs. Note that the object model obtained here is different from the DOM: it
differs from one XML schema to another, whereas the DOM is the same for all XML
documents.
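
   The following sketch shows the idea under stated assumptions: a hypothetical order
document is parsed into data-specific classes (Order and Item, invented for this example),
and those classes are then stored in two tables joined by a primary key / foreign key
pair. It uses only the Python standard library and is not taken from any product.

```python
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical document; element and attribute names are illustrative only.
ORDER_XML = """<order number="1001">
  <customer>Asha</customer>
  <item><part>bolt</part><quantity>12</quantity></item>
  <item><part>nut</part><quantity>30</quantity></item>
</order>"""


@dataclass
class Item:                      # class -> table "items"
    part: str                    # scalar property -> column
    quantity: int


@dataclass
class Order:                     # class -> table "orders"
    number: int
    customer: str
    items: list = field(default_factory=list)  # object-valued -> foreign key


def parse_order(xml_text: str) -> Order:
    """Build the data-specific object tree from the XML document."""
    root = ET.fromstring(xml_text)
    order = Order(int(root.get("number")), root.findtext("customer"))
    for item in root.findall("item"):
        order.items.append(Item(item.findtext("part"), int(item.findtext("quantity"))))
    return order


def store_order(conn: sqlite3.Connection, order: Order) -> None:
    """Map the objects to tables using a primary key / foreign key pair."""
    conn.execute("CREATE TABLE orders (number INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute("CREATE TABLE items (order_number INTEGER REFERENCES orders(number), "
                 "part TEXT, quantity INTEGER)")
    conn.execute("INSERT INTO orders VALUES (?, ?)", (order.number, order.customer))
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                     [(order.number, i.part, i.quantity) for i in order.items])


conn = sqlite3.connect(":memory:")
order = parse_order(ORDER_XML)
store_order(conn, order)
total = conn.execute("SELECT SUM(quantity) FROM items WHERE order_number = ?",
                     (order.number,)).fetchone()[0]
print(total)
```

Unlike a DOM tree, the Order and Item classes exist only for this schema; a different
schema would need different classes and tables.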

4.2 Native XML Databases

        Native XML databases are databases designed especially to store XML
documents. They can store complete documents and can store any document, regardless of
schema. Like other databases, they support features like transactions, security, multi-user
access, programmatic APIs, query languages, and so on. The only difference from other
databases is that their internal model is based on XML and not something else, such as
the relational model. Native XML databases are most commonly used to store
document-centric documents. The main reason for this is their support of XML query
languages. Native XML databases are also commonly used to integrate data. They
also handle schema changes more easily than relational databases and can handle
schema-less data as well. The third major use case for native XML databases is
semi-structured data.

A native XML Database has the following three characteristics:

   1. Defines a (logical) model for an XML document. Data is stored and retrieved
       according to that model. The model must include elements, attributes, PCDATA,
       and document order.
   2. An XML document is the fundamental unit of (logical) storage, just as a relational
       database has a row in a table as its fundamental unit of (logical) storage.
   3. Is not required to have any particular underlying physical storage model.
       For example, it can be built on a relational, hierarchical, or object-oriented
       database, or use a proprietary storage format such as indexed, compressed files.

A native XML database falls into one of the following categories, classified by the way
the XML document is typically stored.

4.2.1 Text-Based Native XML Databases

       A text-based native XML database is one that stores XML as text. This might be a
file in a file system or a BLOB in a relational database. Common to all text-based native
XML databases are indexes, which allow the query engine to easily jump to any point in any
XML document. This gives such databases a tremendous speed advantage when
retrieving entire documents or document fragments. This is because the database can
perform a single index lookup, position the disk head once, and, assuming that the
necessary fragment is stored in contiguous bytes on the disk, retrieve the entire document
or fragment in a single read.

       In this sense, a text-based native XML database is similar to a hierarchical
database, in that both can outperform a relational database when retrieving and returning
data according to a predefined hierarchy. Also like a hierarchical database, text-based
native XML databases are likely to encounter performance problems when retrieving and
returning data in any other form, such as inverting the hierarchy or portions of it.
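
       The index-then-single-read behaviour can be sketched as follows. The byte string
STORE stands in for a file on disk, the doc elements and their id attributes are invented
for the example, and a regular expression is used in place of a real XML parser purely to
keep the sketch short. It illustrates the idea only and is not how any particular product
builds its indexes.

```python
import re

# Hypothetical store: several documents kept as plain text in one byte string,
# standing in for a file on disk.
STORE = (b'<doc id="a"><title>First</title></doc>'
         b'<doc id="b"><title>Second</title></doc>')


def build_index(store: bytes) -> dict:
    """Map each document id to the (start, end) byte offsets of its text."""
    index = {}
    for m in re.finditer(rb'<doc id="([^"]+)">.*?</doc>', store, re.DOTALL):
        index[m.group(1).decode()] = (m.start(), m.end())
    return index


def fetch(store: bytes, index: dict, doc_id: str) -> bytes:
    """One index lookup, then one contiguous slice (a single read on disk)."""
    start, end = index[doc_id]
    return store[start:end]


INDEX = build_index(STORE)
fragment = fetch(STORE, INDEX, "b")
print(fragment.decode())
```

Returning the stored text unchanged is cheap; returning the same data in a different
shape would require parsing and rebuilding it, which is where the performance problems
mentioned above come from.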

4.2.2 Model-Based Native XML Databases

       The second category of native XML databases is model-based native XML
databases. Rather than storing the XML document as text, they build an internal object
model from the document and store this model. How the model is stored depends on the
database. Some databases store the model in a relational or object-oriented database. For
example, storing the DOM in a relational database might result in tables such as
Elements, Attributes, PCDATA, Entities, and Entity References. Other databases use a
proprietary storage format optimized for their model.

       Model-based native XML databases that use a proprietary storage format are
likely to have performance similar to text-based native XML databases when retrieving
data in the order in which it is stored. This is because most such databases use physical
pointers between nodes, which should provide performance similar to retrieving text.
(Which is faster also depends on the output format. Text-based systems are obviously
faster at returning documents as text, while model-based systems are obviously faster at
returning documents as DOM trees, assuming their model maps easily to the DOM.)
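
       As a rough sketch of the first storage option, the code below shreds a document
into Elements, Attributes and PCDATA tables in SQLite. The table layouts, the shred
function and the sample book document are assumptions made for illustration; entities and
text that follows a child element (mixed content) are deliberately ignored to keep the
example short.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(conn: sqlite3.Connection, xml_text: str) -> None:
    """Store a document's object model as rows in three tables."""
    conn.executescript("""
        CREATE TABLE elements (id INTEGER PRIMARY KEY, parent INTEGER,
                               position INTEGER, name TEXT);
        CREATE TABLE attributes (element INTEGER, name TEXT, value TEXT);
        CREATE TABLE pcdata (element INTEGER, value TEXT);
    """)
    next_id = 0

    def visit(node: ET.Element, parent, position: int) -> None:
        nonlocal next_id
        next_id += 1
        node_id = next_id
        # The position column preserves document order among siblings.
        conn.execute("INSERT INTO elements VALUES (?, ?, ?, ?)",
                     (node_id, parent, position, node.tag))
        for name, value in node.attrib.items():
            conn.execute("INSERT INTO attributes VALUES (?, ?, ?)", (node_id, name, value))
        if node.text and node.text.strip():
            conn.execute("INSERT INTO pcdata VALUES (?, ?)", (node_id, node.text.strip()))
        for pos, child in enumerate(node):
            visit(child, node_id, pos)

    visit(ET.fromstring(xml_text), None, 0)


conn = sqlite3.connect(":memory:")
shred(conn, '<book year="1998"><title>XML</title><author>Lee</author></book>')
names = [r[0] for r in conn.execute(
    "SELECT name FROM elements WHERE parent = 1 ORDER BY position")]
print(names)
```

Reading the children of a node back in document order needs the explicit position
column, since rows in a relational table have no inherent order.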

4.3 Hybrid Database

       A hybrid database is a relational database that is XML-enabled, but also offers
native XML capabilities as defined above.       It is a database that supports both the
relational data model and the XML data model in all its processing and storage
mechanisms. The planned release of DB2 Universal Database is a hybrid database.

                  5. XML QUERYING: XPATH AND XQUERY

   As more and more information is either stored in XML, exchanged in XML, or
presented as XML through various interfaces, the ability to intelligently query our XML
data sources becomes increasingly important.      It is sometimes necessary to extract
subsets of the data stored within an XML document.

   Most of the requirements for accessing XML information can be conveniently
classified under one of three headings:
   •   Get me the documents
   •   Give me the facts
   •   Tell me about X
   By ‘Get me the documents’ we mean queries whose aim is to locate one or more
documents. The documents that are returned are identical to documents that were stored
in the database at some time in the past. The information used to retrieve the document
may be something as simple as a unique reference, or it may be some combination of
properties that the document must possess. For example: “Give me the most recent
appraisal for employee E12345.”

       By ‘Give me the facts’ we mean queries that extract factual information from
documents. The required information may be all in one document, or it may be spread
across many documents. For example: “When was the last time employee E12345 was
recommended for promotion?” Or: “How many claims were made last year by policy-
holders in Durham, and what was their average value?”

   By ‘Tell me about X’ we mean information retrieval queries of the kind that people
submit to Web search engines. However much we take care to add markup to narrative
documents, there will always be cases where the only way to find relevant documents is
to search the text. For example, the best way to find an employee with connections in
Peru might simply be to search for the word Peru, appearing anywhere in any part of the
document. Searches that analyze the textual content and also take advantage of contextual
information based on the XML markup can be especially useful. A characteristic of this
kind of query is that there is no right answer. It’s up to the search engine to use as much
intelligence as it can to find the documents that are most relevant to the user’s enquiry.

       Traditionally these three kinds of queries have been handled by different kinds of
storage software. ‘Get me the documents’ can be done using file-based storage with simple
keyword indexing, so long as the attributes that will be used for retrieval are known in
advance (WebDAV is a protocol that implements this idea, and is used by many content
management applications). ‘Give me the facts’ is the traditional domain of relational
databases: facts are extracted from the source documents and stored separately, so that
they can be searched and aggregated. ‘Tell me about X’ is the domain of free-text retrieval
packages and Internet search engines. Sometimes one of these patterns of enquiry
dominates, in which case it makes sense to choose software that specializes in that kind
of enquiry. But because XML is semi-structured, a general-purpose XML database needs
to be good at dealing with all three kinds. XQuery, a query language for accessing XML
databases, is designed to handle all three kinds of enquiry given above.

       In particular, the heart of any database product is the query execution engine,
which constructs, optimizes, and then executes a query execution plan to deliver the
results of a user query. The design of the primitive operators that make up the query
execution plan reflects the operational algebra that underpins the query language, whose
formal semantics are in turn based on the invariants of a particular data model.

       In a relational engine, these operators are the well-known relational primitives
such as restriction, projection and join. An XML engine has a different set of operators,
which are better suited to the recursive tree traversals required for efficient execution of
path expressions. The different set of operators is needed because the basic data model
for XML (that is, a recursive tree structure) is fundamentally different from the rows-and-
columns data model of SQL.

       Of course, XML queries can be mapped into operations in the relational algebra,
just as SQL queries can be mapped to operations in a tree-based algebra. But
the result is very unlikely to be optimal. (This is particularly true when mapping an XML
query language to relational operators, because of the difficulty in handling recursive
queries in SQL. This problem, which is sometimes referred to as the parts explosion
problem, has been known since the 1970s, and reflects a fundamental limitation in the
mathematical power of the first-order predicate calculus on which the relational model is
based.)
          A number of languages have been created for querying XML documents,
including Lorel, Quilt, UnQL, XDuce, XML-QL, XPath, XQL, XQuery and YaTL. Since
XPath and XQuery are both W3C Recommendations (XQuery 1.0 became one in January
2007), the focus of this section will be on these two languages. Both can be used to
retrieve and manipulate data from an XML document.

5.1 XML Path Language (XPath):

          XPath is a language for addressing parts of an XML document. Its
syntax resembles the hierarchical paths used to address parts of a filesystem or URL.
XPath also supports the use of functions for interacting with the data selected from the
document. It provides functions for accessing information about document nodes as
well as for manipulating strings, numbers and booleans. XPath is extensible with
regard to functions: developers can add functions that manipulate the data
retrieved by an XPath query to the library of functions available by default. XPath uses a
compact, non-XML syntax in order to facilitate the use of XPath within URIs and XML
attribute values (this is important for other W3C recommendations like XML Schema and
XSLT that use XPath within attributes).

          With XPath, an object-relational mapping must be used to query across more
than one table. This is because XPath does not support joins across documents. Thus, if
the table-based mapping were used, it would be possible to query only one table at a time.
XPath is a set-based query syntax for extracting data from an XML document.
          The idea behind XPath is that you should be able to extract data from an XML
document using a compact expression, ideally on a single line of code. Using XPath is
generally a more concise way to extract information buried deep within an XML
document. The three most common XPath scenarios include:
               •   Retrieving a subset of nodes that match a certain value (for example, all of
                   the orders associated with a customer)
               •   Retrieving one or more nodes based on the value of an attribute
               •   Retrieving all the parent and child nodes where an attribute of a child
                   node matches a certain value
As said earlier, the goal of XPath is to define a language that addresses parts of XML
documents. To accomplish this goal, XPath defines two main
components: an expression syntax that allows the description of paths to parts of the
XML document and, in support of these expressions, a basic set of functions, such as
count(), known as the XPath core library.

       XPath models an XML document as a tree of nodes. There are different types of
nodes, including element nodes, attribute nodes and text nodes. XPath operates on an
XML document as a tree as shown in figure 5.1.

       A location path is the most common type of expression in XPath that refers to a
node or group of nodes. Location paths are constructed by concatenating steps separated
by ‘/’. They are very similar to normal directory paths in file systems. The following are
some of the XPath queries that use location paths.

Sample XPath Queries Against Sample XML Fragment

   •   /gatech_student/name

       Selects all name elements that are children of the root element gatech_student.

   •   //age

       Selects all age elements in the document.

   •   /gatech_student/*

       Selects all child elements of the root element gatech_student.

   •   /gatech_student[@gtnum]

       Selects the gatech_student root element if it has a gtnum attribute. (To select the
attribute itself, use /gatech_student/@gtnum.)

   •   //*[name()='age']

       Selects all elements that are named "age".

   •   /gatech_student/age/ancestor::*

       Selects all ancestors of all the age elements that are children of the gatech_student
element (which should select the gatech_student element).
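
       Several of these location paths can be tried with Python's standard
xml.etree.ElementTree module, which implements a limited subset of XPath (it has no
ancestor axis and no name() function, so those two queries are not reproduced). The
fragment below is a hypothetical stand-in, since the sample XML fragment itself is not
reproduced in this report, and all paths are written relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the attribute and element values are invented.
SAMPLE = """<gatech_student gtnum="gt00000">
  <name>A. Student</name>
  <age>21</age>
</gatech_student>"""

root = ET.fromstring(SAMPLE)       # root is the gatech_student element

names = root.findall("name")       # corresponds to /gatech_student/name
ages = root.findall(".//age")      # corresponds to //age
children = root.findall("*")       # corresponds to /gatech_student/*
gtnum = root.get("gtnum")          # value selected by /gatech_student/@gtnum

print([e.text for e in names], [e.text for e in ages],
      [e.tag for e in children], gtnum)
```

A full XPath processor evaluates the absolute paths exactly as written in the list above;
the relative forms here are an adaptation to ElementTree.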

5.2 XML Query Language (XQuery):

       XQuery is a functional language in which each query is an expression. XQuery
expressions fall into seven broad types: path expressions, element constructors, FLWR
expressions, expressions involving operators and functions, conditional expressions,
quantified expressions, and expressions that test or modify datatypes. The syntax and
semantics of the different kinds of XQuery expressions vary significantly, which is a
testament to the numerous influences on the design of XQuery.

5.2.1 Sample XQuery Queries and Expressions

Path expressions: XQuery supports path expressions that are a superset of those
currently being proposed for the next version of XPath.

   •   //emp[name="Fred"]/salary * 12
       From a document that contains employees and their monthly salaries, extract the
       annual salary of the employee named "Fred".

   •   doc("zoo.xml")//chapter[position() = (2 to 5)]//figure
       Find all the figures in chapters 2 through 5 of the document named "zoo.xml."

Element constructors: In some situations, it is necessary for a query to create or
generate elements. Such elements can be embedded directly into a query in an
expression called an element constructor.

   •   <emp empid="{$id}">{$name}{$job}</emp>
       Generate an <emp> element that has an "empid" attribute. The value of the attribute
       and the content of the element are specified by variables that are bound in other
       parts of the query.

FLWR expressions: A FLWR (pronounced "flower") expression is a query
construct composed of for, let, where, and return clauses; XQuery 1.0 adds an order by
clause and spells the name FLWOR. A for clause is an
iteration construct that binds a variable to each item in a sequence returned by a query
(typically a path expression). A let clause similarly binds variables to values, but instead
of a series of bindings only one occurs, similar to an assignment statement in a
programming language. A where clause contains one or more predicates that filter
the bindings produced by the preceding for and let clauses. The return clause generates
the output of the FLWR expression, which may be any sequence of nodes or primitive
values. The return clause is executed once for each binding produced by the for and
let clauses that passes the where clause. The results of these multiple executions are
concatenated and returned as the result of the expression.

   •     for $b in doc("bib.xml")//book
                           where $b/publisher = "Morgan Kaufmann"
                           and $b/year = "1998"
                           return $b/title

                   List the titles of books published by Morgan Kaufmann in 1998.

   •     <big_publishers>
                   {
                   for $p in distinct-values(doc("bib.xml")//publisher)
                   let $b := doc("bib.xml")//book[publisher = $p]
                   where count($b) > 100
                   return $p
                   }
         </big_publishers>

List the publishers who have published more than 100 books.
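
For readers more familiar with general-purpose languages, the two queries above can be
approximated in Python. This is only an analogy under stated assumptions: the bib data is
invented, the threshold is lowered from 100 to 1 so the tiny sample returns a result, and
list comprehensions stand in for the for, let, where and return clauses.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography; titles and the second publisher are invented.
BIB = """<bib>
  <book><title>Book A</title><publisher>Morgan Kaufmann</publisher><year>1998</year></book>
  <book><title>Book B</title><publisher>Morgan Kaufmann</publisher><year>2001</year></book>
  <book><title>Book C</title><publisher>Example Press</publisher><year>1998</year></book>
</bib>"""

books = ET.fromstring(BIB).findall("book")            # for: iterate over //book

titles = [
    b.findtext("title")                                # return: $b/title
    for b in books
    if b.findtext("publisher") == "Morgan Kaufmann"    # where: first predicate
    and b.findtext("year") == "1998"                   # where: second predicate
]

THRESHOLD = 1  # the query above uses 100; lowered for the small sample
publishers = sorted({b.findtext("publisher") for b in books})  # distinct values
big_publishers = []
for p in publishers:                                   # for: each distinct publisher
    published = [b for b in books if b.findtext("publisher") == p]  # let: one binding
    if len(published) > THRESHOLD:                     # where: count($b) > threshold
        big_publishers.append(p)                       # return: $p

print(titles, big_publishers)
```

The second loop makes the difference between the two binding clauses visible: the for
clause iterates, while the let clause binds the whole list of matching books once per
iteration.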

Conditional expressions: A conditional expression evaluates a test expression
and then returns one of two result expressions. If the value of the test expression
is true, the value of the first result expression is returned; otherwise, the value of
the second result expression is returned.

    •    for $h in //holding
         order by $h/title
         return
           if ($h/@type = "Journal")
           then $h/editor
           else $h/author
Make a list of holdings, ordered by title. For journals, include the editor, and for all
other holdings, include the author.

Quantified expressions:

    XQuery has constructs that are equivalent to quantifiers used in mathematics and
logic. The SOME clause is an existential quantifier used for testing to see if a series of
values contains at least one node that satisfies a predicate. The EVERY clause is a
universal quantifier used to test to see if all nodes in a series of values satisfy a predicate.
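
    The two quantifiers behave much like Python's built-in any() and all() functions. The
sketch below uses an invented library document to show both, including the case of a book
with no paragraphs at all: a universal quantifier over an empty sequence is true, in
XQuery as in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; titles and paragraph text are invented.
LIB = """<library>
  <book><title>Sea Sports</title>
    <para>sailing and windsurfing compared</para>
    <para>sailing knots</para></book>
  <book><title>Hill Walks</title>
    <para>walking routes</para></book>
  <book><title>Blank Notebook</title></book>
</library>"""

books = ET.fromstring(LIB).findall("book")


def paras(book: ET.Element) -> list:
    """Text of every para element below the book (like $b//para)."""
    return [p.text or "" for p in book.iter("para")]


# Existential quantifier: at least one paragraph satisfies the predicate.
some_both = [b.findtext("title") for b in books
             if any("sailing" in p and "windsurfing" in p for p in paras(b))]

# Universal quantifier: every paragraph satisfies the predicate.
every_sailing = [b.findtext("title") for b in books
                 if all("sailing" in p for p in paras(b))]

print(some_both, every_sailing)
```

The book without paragraphs appears in the second result but not in the first, which is
worth keeping in mind when a universal quantifier is used as a filter.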

    •    for $b in //book
             where some $p in $b//para satisfies
             (contains($p, "sailing") and contains($p, "windsurfing"))
             return $b/title

         Find titles of books in which both sailing and windsurfing are mentioned in the
same paragraph.

    •    for $b in //book
             where every $p in $b//para satisfies
             contains($p, "sailing")
             return $b/title

Find titles of books in which sailing is mentioned in every paragraph.

Expressions involving user defined functions:

   Besides providing a core library of functions similar to those in XPath, XQuery also
allows user defined functions to be used to extend the core function library.

   •   declare function local:depth($e as node()) as xs:integer
       {
         (: An empty element has depth 1 :)
         (: Otherwise, add 1 to max depth of children :)
         if (empty($e/*)) then 1
         else max(for $c in $e/* return local:depth($c)) + 1
       };
       local:depth(doc("partlist.xml"))

Find the maximum depth of the document named "partlist.xml."
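
The same recursion can be written in Python as a cross-check of the logic. The PARTS
string is an invented stand-in for "partlist.xml", and the function is applied to the root
element rather than to a document node.

```python
import xml.etree.ElementTree as ET


def depth(element: ET.Element) -> int:
    """An element with no child elements has depth 1; otherwise add 1 to the deepest child."""
    children = list(element)
    if not children:
        return 1
    return max(depth(child) for child in children) + 1


# Hypothetical stand-in for the contents of "partlist.xml".
PARTS = "<partlist><part><part><part/></part></part><part/></partlist>"
max_depth = depth(ET.fromstring(PARTS))
print(max_depth)
```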

       Like XML, XQuery is a case-sensitive language. Keywords in XQuery use
lower-case characters and are not reserved; that is, names in XQuery expressions are
allowed to be the same as language keywords, except for certain unprefixed function
names. A query returns an ordered sequence of nodes or primitive values.

       ‘Web Services’ is a new industry buzzword that refers to an XML representation
of objects, programs and messages available over the Internet for application-to-application
communication. Web services technology will get a lot of attention in the
coming year because of its promise of a data-independent means of coupling
disparate systems to support e-services for better productivity. Domain- or
application-specific XML developments include FpML (Financial Products Markup
Language), ebXML (Electronic Business XML) and PMML (Predictive Model Markup
Language) in data mining.

       An XML document can itself serve as a database in simple cases, for example as a
replacement for an .ini file, that is, a file that contains application
configuration information. Examples of more sophisticated data sets for which an XML
document might be suitable as a database are personal contact lists (names, phone
numbers, addresses, etc.) and browser bookmarks.

       Berkeley DB XML is an open source, embedded XML database created by
Sleepycat Software. It is built on top of Berkeley DB, a key-value database which
provides record storage and transaction management. Unlike relational databases, which
store data in relational tables, Berkeley DB XML is designed to store arbitrary trees of
XML data. These can then be matched and retrieved, either as complete documents or as
fragments, via the XML query language XPath. Berkeley DB XML is written in C++.
APIs for Berkeley DB XML exist for C/C++, Java, Perl, Python and Tcl.

       X-Hive/DB is a powerful native XML database designed for software developers
who require advanced XML data processing and storage functionality within their
applications. X-Hive/DB supports all major XML standards including XQuery, XPath,
XPointer, XLink, XSLT, XSL-FO and XUpdate. To ensure smooth and rapid
integration with existing applications and systems, X-Hive/DB provides an interface to
relational databases, bridges to XML editors and full-text search engines, and support for
J2EE, WebDAV and FTP.

                           7. CHALLENGES WITH XML
        However, in spite of the domain- or application-specific XML developments (e.g.
FpML, the Financial Products Markup Language; ebXML, Electronic Business XML; and
PMML, the Predictive Model Markup Language in data mining), XML is plagued
with basic difficulties. Efficient storage, indexing and compression
mechanisms are not yet available, and the data model and its navigational languages
like XQuery are reminiscent of the hierarchical data model (of IMS), which was
comparatively much simpler, more efficient, and easier to navigate. Native XML systems
like Tamino (from Software AG) are yet to make an impact in the market. Although
Microsoft wants to convince the world that XML is going to replace everything as a
standard for data modeling, querying and interoperability, the fact is that it has a long
way to go. The excessive complexity and proliferation of concepts and namespaces associated
with XML as defined by the W3C (World Wide Web Consortium) does not make it ready for

        If your XML document contains an element that is missing a closing tag,
the document won’t parse. This is a common source of frustration among developers
who use XML. Another catch is that, unlike in HTML, tag names in XML are
case sensitive. This means that <ORDERS> and <orders> are considered to be two
different and distinct tags. XML is a bit more rigid than HTML; a bracket out of place or
a mismatched closing tag will cause the entire document to be unparsable. Constructing a
native XML database system with all the features provided by a traditional RDBMS is
a difficult task.
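
        The strictness described above is easy to observe with any conforming parser. The
sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed
and the sample strings are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text, False if it raises a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


ok = is_well_formed("<orders><order/></orders>")
missing_close = is_well_formed("<orders><order></orders>")   # <order> is never closed
case_mismatch = is_well_formed("<ORDERS></orders>")          # tag names differ in case

print(ok, missing_close, case_mismatch)
```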
        XML is not perfect, and it faces several challenges:
    •   XML documents are bulky.
    •   XML requires marshalling.
    •   XML standards are still evolving.
    •   XML business standards will prove elusive.
    •   It is verbose, and access to the data is slow due to parsing and text conversion.
    •   It lacks many of the things found in real databases: efficient storage, indexes,
        security, transactions and data integrity, multi-user access, triggers, queries across
        multiple documents, and so on.

                                  7. CONCLUSION

       Over the last few years, XML has become popular as an exchange format for
large data collections, including scientific databases like the Georgetown Protein
Structure Database and UniProt, as well as bibliographic databases such as Medline and
DBLP. We refer to such documents as XML databases.

       XML databases, particularly native XML databases, have succeeded
because of their query languages, the flexibility of the XML data model, and their ability
to handle schema-less data. This does not imply that XML databases will become an
instant standard, because companies have invested a lot of time, effort, and money in
existing relational database systems. However, lightweight applications where
information is stored using XML could greatly benefit from the creation of XML
databases. XQuery is not going to replace SQL completely as the application query
language of choice in the near future. But in ten years or so an extension of XQuery might
replace both.

       Due to the high redundancy of XML's text representation, compression is clearly
needed for storing and transmitting XML databases efficiently. Although there is no
obstacle to using general-purpose text compressors to compress XML, several
researchers have proposed XML-specific compression techniques which take advantage
of the structure of XML to improve compression.
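
       How much redundancy the markup carries can be seen by applying a general-purpose
compressor to a repetitive document. The sketch below uses zlib from the Python standard
library on an invented bibliographic record repeated many times; it shows only the
general-purpose baseline and does not implement any XML-specific technique.

```python
import zlib

# Hypothetical record; repeating it imitates the repeated tags of a large collection.
record = ("<entry><author>A. Author</author><title>Some Title</title>"
          "<year>2007</year></entry>")
document = ("<dblp>" + record * 1000 + "</dblp>").encode("utf-8")

compressed = zlib.compress(document, 9)   # 9 is the highest compression level
ratio = len(compressed) / len(document)

print(len(document), len(compressed), round(ratio, 4))
```

Real collections are less uniform than this sample, so the ratio here overstates what a
general-purpose compressor achieves in practice.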

                                  8. REFERENCES

       Ronald Bourret. XML and Databases. Website, January 2003.

       Dare Obasanjo. An Exploration of XML in Database Management Systems.
