          The future of XML storage: Native XML Databases vs. Relational Databases

                               Oleksiy Prokhorov               Felix Annan

                                             Drexel University
                                            3141 Chestnut Street
                                           Philadelphia, PA 19104
                                         {oop23, foa24}

ABSTRACT

Extensible Markup Language (XML) is the de facto standard for information exchange between business applications. The current methods of storing XML documents largely involve the use of XML enabled databases. This storage format relies heavily on the ability to transform XML data into a form that can be stored in a relational database using well-formed XML Schemas. This procedure incurs high overhead processing costs which contribute to an appallingly low performance of the entire relational database system. Native XML databases have been proposed as a solution to this problem. This paper explains why dedicated processing and storage set aside for XML formatted documents will be a better solution to the XML processing crisis that most companies are currently facing.

1.   INTRODUCTION

Extensible Markup Language (XML) is an open method of marking up (describing) data. It utilizes tags called elements to describe the data enclosed within the elements. With the ability to create custom tags to suit one's own purposes, its use has quickly spread to the description of all kinds of documents ranging from government documents to financial data. The burden of dealing with data stored in a proprietary format has largely been lifted off the shoulders of many organizations since they can now exchange data in a non-proprietary way. This allows institutions to create their own programs to work with shared data. The hierarchical nature of XML makes it the format of choice for certain applications like taxonomic data storage in the life sciences and documentation storage. XML data is usually described as semi-structured. A specific element could have multiple sub-elements of the same type. This semi-structured nature of XML makes it suitable to describe data that contains many variations. With this explosive use has come the stockpiling of thousands of XML documents and hence the need to be able to actively store and query them.

There are two general types of XML documents, data-centric documents and document-centric documents [9]. Data-centric documents are usually well structured and strictly follow a relatively fixed XML Schema whilst document-centric documents are loosely structured and may not conform strictly to an XML Schema.

The first solution that most companies came up with for storing XML documents was to store them in what are now called XML enabled databases. These are relational databases with extra processing facilities for XML. With some database vendors realizing the need for change, there has been the development over the last few years of a different kind of database for XML storage, the native XML database. This type of database is built from the ground up with dedicated facilities specially suited for storing XML data. The following sections discuss these two implementations for storing XML. We then compare the various issues and conclude with a preference for native XML databases.

2.   XML ENABLED DATABASES

XML enabled databases are legacy relational databases with XML processing built on top. Relational database systems store data in a format where the unit of storage is the table row. The structure of a relational database is generally decided once, after which it hardly ever changes. The best data to be stored in a relational system is well structured data.

Fig. 1, derived from [2], gives an overview of the process that XML documents/queries go through during processing.

At the XQuery interface, an XML Query is received and validated for correctness. After the query is validated, the document for insertion is validated against an XML Schema (depends on the method of storage) and parsed into the various sections of a hierarchical structure.
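
As a rough illustration of the parsing step described above, the sketch below checks that a document is well formed and walks it into its hierarchical sections. It uses Python's standard xml.etree.ElementTree module rather than any of the database systems discussed in this paper. The sample document, its element names and the helper parse_hierarchy are invented for illustration, and XML Schema validation is left out because the Python standard library does not provide it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are invented for illustration.
SAMPLE = """<order id="17">
  <customer>Acme</customer>
  <items>
    <item sku="A1">bolt</item>
    <item sku="B2">nut</item>
  </items>
</order>"""


def parse_hierarchy(xml_text):
    """Check well-formedness and return (depth, tag, text) tuples in document order."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Only well-formedness is checked here, not validity against a schema.
        raise ValueError(f"document is not well-formed XML: {exc}") from exc

    rows = []

    def walk(element, depth):
        # Record the element, then descend into its children in sibling order.
        text = (element.text or "").strip()
        rows.append((depth, element.tag, text))
        for child in element:
            walk(child, depth + 1)

    walk(root, 0)
    return rows


if __name__ == "__main__":
    for depth, tag, text in parse_hierarchy(SAMPLE):
        print("  " * depth + tag + (f": {text}" if text else ""))
```

The nested (depth, tag, text) tuples are only one possible in-memory view of the hierarchy. A real XML enabled database would go on to map these sections onto tables or a LOB column.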

[Figure: processing stages, top to bottom: Communication Entry Point, XQuery Interface, XML Schema Validation, XQuery/SQL Interface, SQL Interface, Database Engine, Relational Storage]
Fig. 1 XML process for a typical XML Enabled Database

There are generally two tools available for parsing an XML document. These are the Document Object Model (DOM) [3] parser and the Simple API for XML (SAX) [3] parser.

The XQuery/SQL interface exists to convert the initial XQueries and the parsed documents into a set of SQL statements that can be applied to the database. There are generally two methods by which XML data can be stored in a relational database. In the first, the entire XML document is dumped into the database as an entry into a column of a row whose data type is specified as a Large Object (LOB) [4]. This scenario works well when all actions are performed on the document as a whole. It also eliminates the need for shredding but comes with its own set of problems. The second method of storing data in an XML enabled database lies in mapping the content of the document to a relational database structure. The process of conversion is commonly called shredding. For this process to be successful the document must be validated against an XML Schema, which is usually created manually. This is done to ensure that the document is valid for the relational database it will be used against, thus ensuring that the XML elements will map onto the appropriate columns and tables. Adaptive XML shredding [5] can be used to automatically generate an XML Schema and derive a mapping between the various elements in the input XML document and a generated set of tables in the database.

Generally this entire processing of XML documents is set up as a layer invisible to the relational database.

Similar processing is required in the reverse direction shown on the diagram when requesting information from the database, a process called XML publishing [7]. SQL statements decomposed from XQuery request statements typically contain a lot of JOINs. When the data to be joined is stored in multiple tables due to many branches occurring within the XML Schema, the query engine repeatedly needs to create multiple views which represent intermediary stages of the query process. These various views are then merged and processed by the SQL/XML interface to produce the required document or document fragment. The returned XML document is validated and then exposed as the result of the query. To improve the performance of the JOIN statements, some researchers have proposed using a denormalized database [6], which tends to increase redundancy whilst increasing speed.

3.   NATIVE XML DATABASES

There are a number of definitions for a native XML database. It is a system that has an XML document as its fundamental unit of (logical) storage. A typical process for native XML document storage is as shown in Fig. 2 (derived from [2]). Any document presented at the XQuery Interface is parsed using either a SAX or DOM parser and then validated against an XML Schema. The validation stage is optional but is a good practice to ensure that multiple documents conform to one another. The schema also serves as documentation of what elements exist in the database. Each document can be stored with its own schema. When a schema is set up for a set of documents, the schema can evolve with the documents it describes. This loose content model allows a great amount of flexibility in data storage in that when new elements need to be stored this can be done with minimal additional effort. This is further supported by the fact that the XML document model naturally supports document/element
order, sibling order and the storage of comments and processing instructions [7], which is important especially for generating document-centric XML documents.

[Figure: processing stages, top to bottom: Communication Entry Point, XQuery Interface, XML Schema Validation (Optional), Database Engine, XML hierarchical storage]
Fig. 2 XML Process for a Native XML Database

The information retrieved from an XML document is stored directly into a hierarchical tree structure. This structure reflects the default nature of XML documents. Element names are replaced by numerical IDs [2] in the hierarchical structure and this prevents the repetition of element names throughout the structure. A reference maintains the association between element names and numerical IDs. Any XQuery executed does not require translation. Results retrieved from the XML storage are generated directly into an XML document.

Native XML databases support full text searches of documents without the need for the entire document to be loaded in memory. They can perform these searches based on specified criteria provided the different documents use the same elements in the same way.

4.   COMPARISON

It is clear from Fig. 1 and Fig. 2 that there are far more processing stages in XML enabled databases than there are in native XML databases. This difference translates into performance. Valuable processing time has to be spent on mapping XML Schemas to database structures, translating XQueries into SQL [2] and also in parsing and reparsing XML documents stored in LOBs. LOBs are inefficient when there is the need to search for specific information between specific elements of an XML document. Each document that is retrieved in a multi-row operation will have to be parsed and the information extracted [4]. Considering that XML document parsing is very processor-intensive, the document must be heavily indexed to increase performance. These processes make work more difficult for database vendors who have to implement and test various algorithms for implementing them. Native XML databases are able to insert and export data in XML more efficiently than XML enabled databases whilst preserving more information about the document. Apart from using inefficient LOBs, relational databases have no facilities for storing document order, sibling order, comments and processing instructions for documents. Relational databases are unable to perform full text searches without loading the entire document into memory. By using the XML-aware system available in native XML databases, XML documents can be searched without loading most of the document into memory. Furthermore, XML databases can answer a question like “Find maintenance procedures for a specific part of a specific airplane” from a set of airplane manuals faster than relational databases ever will. Though relational databases are very efficient at storing data that has a specific structure, the maintenance of the system becomes more difficult when the structure needs to change. Denormalized database structures can be more tolerant to change and reduce the number of joins required to retrieve data. They however end up with a lot of redundant data with relational design anomalies like non-atomic values, functional dependencies and multi-valued dependencies which make them more difficult to manage as more changes occur. Adaptive XML shredding adds processing overhead to XML-schema based decomposition and does not provide methods for automatic schema and database augmentation should a schema change. Native XML databases support data with a changing structure. As the data evolves, the XML schema evolves, making data management easier and providing a more efficient storage process. Performance tests show that native XML databases have a higher performance when dealing with large XML documents and large
volumes of XML data [1]. Even in the domain of well structured data, native XML database processing speeds are comparable to those of relational databases processing and returning tabular data when the data is indexed [8].

5.   INDUSTRIAL IMPLEMENTATION

The move to native XML databases is a slow one. Irrespective of their implementations one thing common to all is the existence of a “native” XML data type and an XML subsystem with special functions built in for the data type.

Oracle users have the ability to select from two general implementations of the Oracle XMLType data type: storage as a Character Large Object or as a shredded set of database objects. These storage methods exist within the XMLType data type and the database has special functionality to work with them.

Microsoft SQL Server 2000 started off with the transformation of XML data into relational tables. Due to problems in scalability including issues with preserving document order [1], Microsoft SQL Server 2005 implements a “native” XML data type which stores XML in a Binary Large Object (BLOB). The data stored in the BLOB is heavily indexed to enable fast regeneration of the XML document.

IBM DB2 v9 also implements a special data type for storage of XML [2]. The internal structure of the data type utilizes the hierarchical structure of a tree and is so far the closest implementation of a true native database system of the three vendors mentioned here. Though the implementations of Oracle and Microsoft are not actually native, they are indicative of the fact that industry realizes the importance of developing dedicated native XML data types and subsystems to better handle XML storage.

6.   CONCLUSION

Though relational databases have been around for a long time, the processing of XML documents, especially semi-structured XML documents, which account for the majority, is definitely an area best served by native XML databases. With enough research and development of native XML systems they have the potential to become the dominant database system in the future. The changeover to native XML systems is slow due to the fact that database vendors want to squeeze all the functionality and profits they can out of pre-existing relational systems. Technical professionals have also become very comfortable with the relational systems available and are reluctant to change because of the amount of work it could involve. Because quite a few organizations need to store data both in a relational and XML form, one can definitely expect to find bundled (hybrid) systems where both the relational and XML systems coexist but have native implementations within the system. The industrial implementations are examples of this trend.

7.   REFERENCES

[1] Akmal B. Chaudhri, Awais Rashid, Roberto Zicari, “XML Data Management: Native XML and XML-Enabled Database Systems”, Addison-Wesley, 2003
[2] Matthias Nicola, Bert van der Linden, “Native XML Support in DB2 Universal Database”, ACM Digital Library, 2005
[3] W3 Schools Online Web Tutorial, “XML Tutorial”
[4] Fiebig, T., “Anatomy of a Native XML Base Management System”, VLDB Journal 11(4), December 2002
[5] Juliana Freire and Jérôme Siméon, “Adaptive XML Shredding: Architecture, Implementation and Challenges”, University of Toronto, January 2003
[6] Balmin Andre, Yannis Papakonstantinou, “Storing and querying XML data using denormalized relational databases”, VLDB Journal, 2005
[7] Michael Rys, Don Chamberlin, Daniela Florescu, “XML and Relational Database Management Systems: the Inside Story”, ACM Digital Library, 2005
[8] Atakan Kurtl, Mustafa Atay, “An Experimental Study on Query Processing Efficiency of Native-XML and XML-enabled Relational Database Systems”, ACM Digital Library, 2002
[9] Ronald Bourret, “XML and Databases”,
    dDatabases.htm

