The future of XML storage: Native XML Databases vs. Relational Databases
Oleksiy Prokhorov Felix Annan
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104
{oop23, foa24}@drexel.edu
ABSTRACT

Extensible Markup Language (XML) is the de facto standard for information exchange between business applications. The current methods of storing XML documents largely involve the use of XML-enabled databases. This storage format relies heavily on the ability to transform XML data into a form that can be stored in a relational database using well-formed XML Schemas. This procedure incurs high processing overhead, which contributes to very poor performance of the entire relational database system. Native XML databases have been proposed as a solution to this problem. This paper explains why dedicated processing and storage set aside for XML-formatted documents will be a better solution to the XML processing crisis that most companies are currently facing.

1. INTRODUCTION

Extensible Markup Language (XML) is an open method of marking up (describing) data. It utilizes tags called elements to describe the data enclosed within the elements. With the ability to create custom tags to suit one's own purpose, its use has quickly spread to the description of all kinds of documents, ranging from government documents to financial data. The burden of dealing with data stored in a proprietary format has largely been lifted off the shoulders of many organizations, since they can now exchange data in a non-proprietary way. This allows institutions to create their own programs to work with shared data. The hierarchical nature of XML makes it the format of choice for certain applications, such as taxonomic data storage in the life sciences and documentation storage. XML data is usually described as semi-structured: a specific element could have multiple sub-elements of the same type. This semi-structured nature of XML makes it suitable for describing data that contains many variations. With this explosive use has come the stockpiling of thousands of XML documents and hence the need to be able to actively store and query them.

There are two general types of XML documents: data-centric documents and document-centric documents [9]. Data-centric documents are usually well structured and strictly follow a relatively fixed XML Schema, whilst document-centric documents are loosely structured and may not conform strictly to an XML Schema.

The first solution that most companies came up with for storing XML documents was to store them in what are now called XML-enabled databases. These are relational databases with extra processing facilities for XML. With some database vendors realizing the need for change, a different kind of database for XML storage has been developed over the last few years: the native XML database. This type of database is built from the ground up with dedicated facilities specially suited for storing XML data. The following sections discuss these two implementations for storing XML. We then compare the various issues and conclude with a preference for native XML databases.

2. XML ENABLED DATABASES

XML-enabled databases are legacy relational databases with XML processing built on top. Relational database systems store data in a format where the unit of storage is the table row. The structure of a relational database is generally decided once, after which it hardly ever changes. The data best suited to storage in a relational system is well-structured data.

Fig. 1, derived from [2], gives an overview of the process that XML documents and queries go through during processing.

At the XQuery interface, an XML query is received and validated for correctness. After the query is validated, the document for insertion is validated against an XML Schema (depending on the method of storage) and parsed into the various sections of a hierarchical structure.
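The parsing step described above, in which a document is turned into a hierarchical structure, can be sketched in a few lines. This is only an illustration, not the mechanism of any particular database: the sample document, its element names, and the helper `walk` are hypothetical, and Python's standard `xml.etree.ElementTree` module stands in for whatever parser a real system would use. Note how the `item` element repeats, which is the semi-structured trait discussed in the introduction.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are illustrative only.
SAMPLE = """\
<order id="1001">
  <customer>Acme Corp</customer>
  <item sku="A-1">Widget</item>
  <item sku="B-7">Gasket</item>
  <note>Deliver <b>before</b> Friday.</note>
</order>
"""


def walk(element, path=""):
    """Yield (path, leading text) for every element, in document order."""
    current = f"{path}/{element.tag}"
    # element.text holds only the text before the first child element.
    yield current, (element.text or "").strip()
    for child in element:
        yield from walk(child, current)


root = ET.fromstring(SAMPLE)
nodes = list(walk(root))
for node_path, text in nodes:
    print(node_path, repr(text))
```

The repeated `/order/item` path and the mixed content inside `note` are the kinds of structure that fit naturally in a tree but need extra mapping work before they fit in table rows.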
[Figure: processing stages, in order: Communication Entry Point; XQuery Interface; XML Schema Validation; XQuery/SQL Interface; SQL Interface; Database Engine; Relational Storage]
Fig. 1 XML process for a typical XML Enabled Database

There are generally two tools available for parsing an XML document. These are the Document Object Model (DOM) [3] parser and the Simple API for XML (SAX) [3] parser.

The XQuery/SQL interface exists to convert the initial XQuery expressions and the parsed documents into a set of SQL statements that can be applied to the database. There are generally two methods by which XML data can be stored in a relational database. In the first, the entire XML document is placed into the database as an entry in a column of a row whose data type is specified as a Large Object (LOB) [4]. This scenario works well when all actions are performed on the document as a whole. It also eliminates the need for shredding but comes with its own set of problems. The second method of storing data in an XML-enabled database lies in mapping the content of the document to a relational database structure. This process of conversion is commonly called shredding. For this process to be successful, the document must be validated against an XML Schema, which is usually created manually. This is done to ensure that the document is valid for the relational database it will be used against, thus ensuring that the XML elements will map onto the appropriate columns and tables. Adaptive XML shredding [5] can be used to automatically generate an XML Schema and derive a mapping between the various elements in the input XML document and a generated set of tables in the database.

Generally, this entire processing of XML documents is set up as a layer invisible to the relational database.

Similar processing is required in the reverse direction shown on the diagram when requesting information from the database, a process called XML publishing [7]. SQL statements decomposed from XQuery request statements typically contain a lot of JOINs. When the data to be joined is stored in multiple tables because of many branches occurring within the XML Schema, the query engine repeatedly needs to create multiple views which represent intermediary stages of the query process. These views are then merged and processed by the XQuery/SQL interface to produce the required document or document fragment. The returned XML document is validated and then exposed as the result of the query. To improve the performance of the JOIN statements, some researchers have proposed using a denormalized database [6], which tends to increase redundancy whilst increasing speed.

3. NATIVE XML DATABASES

There are a number of definitions for a native XML database. It is a system that has an XML document as its fundamental unit of (logical) storage. A typical process for native XML document storage is shown in Fig. 2 (derived from [2]). Any document presented at the XQuery Interface is parsed using either a SAX or DOM parser and then validated against an XML Schema. The validation stage is optional but is good practice to ensure that multiple documents conform to one another. The schema also serves as documentation of what elements exist in the database. Each document can be stored with its own schema. When a schema is set up for a set of documents, the schema can evolve with the documents it describes. This loose content model allows a great amount of flexibility in data storage, in that when new elements need to be stored this can be done with minimal additional effort. This is further supported by the fact that the XML document model naturally supports document/element
order, sibling order and the storage of comments and processing instructions [7], which is especially important for generating document-centric XML documents.

[Figure: processing stages, in order: Communication Entry Point; XQuery Interface; XML Schema Validation (Optional); Database Engine; XML hierarchical storage]
Fig. 2 XML Process for a Native XML Database

The information retrieved from an XML document is stored directly into a hierarchical tree structure. This structure reflects the default nature of XML documents. Elements are replaced by numerical IDs [2] in the hierarchical structure, which prevents the repetition of element names throughout the structure. A reference maintains the association between elements and numerical IDs. Any XQuery executed does not require translation. Results retrieved from the XML storage are generated directly into an XML document.

Native XML databases support full-text searches of documents without the need for the entire document to be loaded into memory. They can perform these searches based on specified criteria, provided the different documents use the same elements in the same way.

4. COMPARISON

It is clear from Fig. 1 and Fig. 2 that there are far more processing stages in XML-enabled databases than there are in native XML databases. This difference translates into a difference in performance. Valuable processing time has to be spent on mapping XML Schemas to database structures, translating XQueries into SQL [2], and parsing and reparsing XML documents stored in LOBs. LOBs are inefficient when there is a need to search for specific information between specific elements of an XML document. Each document that is retrieved in a multi-row operation will have to be parsed and the information extracted [4]. Considering that XML document parsing is very processor-intensive, the document must be heavily indexed to increase performance. These processes make work more difficult for database vendors, who have to implement and test various algorithms for them. Native XML databases are able to insert and export data in XML more efficiently than XML-enabled databases whilst preserving more information about the document. Apart from using inefficient LOBs, relational databases have no facilities for storing document order, sibling order, comments and processing instructions for documents.

Relational databases are unable to perform full-text searches without loading the entire document into memory. By using the XML-aware system available in native XML databases, XML documents can be searched without loading most of the document into memory. Furthermore, native XML databases can answer a question like “Find maintenance procedures for a specific part of a specific airplane” from a set of airplane manuals faster than relational databases can. Though relational databases are very efficient at storing data that has a specific structure, the maintenance of the system becomes more difficult when the structure needs to change. Denormalized database structures can be more tolerant of change and reduce the number of joins required to retrieve data. They do, however, end up with a lot of redundant data and with relational design anomalies like non-atomic values, functional dependencies and multi-valued dependencies, which make the database more difficult to manage as more changes occur. Adaptive XML shredding adds processing overhead to XML-schema-based decomposition and does not provide methods for automatic schema and database augmentation should a schema change.

Native XML databases support data with a changing structure. As the data evolves, the XML schema evolves, making data management easier and providing a more efficient storage process. Performance tests show that native XML databases have a higher performance when dealing with large XML documents and large
volumes of XML data [1]. Even in the domain of well-structured data, native XML database processing speeds are comparable to those of a relational database processing and returning tabular data when the data is indexed [8].

5. INDUSTRIAL IMPLEMENTATION

The move to native XML databases is a slow one. Irrespective of their implementations, one thing common to all the vendors discussed here is the existence of a “native” XML data type and an XML subsystem with special functions built in for the data type.

Oracle users have the ability to select from two general implementations of the Oracle XMLType data type: storage as a Character Large Object or as a shredded set of database objects. These storage methods exist within the XMLType data type, and the database has special functionality to work with them.

Microsoft SQL Server 2000 started off with the transformation of XML data into relational tables. Due to problems in scalability, including issues with preserving document order [1], Microsoft SQL Server 2005 implements a “native” XML data type which stores XML in a Binary Large Object (BLOB). The data stored in the BLOB is heavily indexed to enable fast regeneration of the XML document.

IBM DB2 v9 also implements a special data type for storage of XML [2]. The internal structure of the data type utilizes the hierarchical structure of a tree and is so far the closest implementation of a true native database system of the three vendors mentioned here. Though the implementations of Oracle and Microsoft are not actually native, they indicate that industry realizes the importance of developing dedicated native XML data types and subsystems to better handle XML storage.

6. CONCLUSION

Though relational databases have been around for a long time, the processing of XML documents, especially semi-structured XML documents, which account for the majority, is definitely an area best served by native XML databases. With enough research and development, native XML systems have the potential to become the dominant database system in the future. The changeover to native XML systems is slow because database vendors want to squeeze all the functionality and profits they can out of pre-existing relational systems. Technical professionals have also become very comfortable with the relational systems available and are reluctant to change because of the amount of work it could involve. Because quite a few organizations need to store data in both relational and XML form, one can expect to find bundled (hybrid) systems where the relational and XML systems coexist, each with a native implementation within the system. The industrial implementations are examples of this trend.

7. REFERENCES

[1] Akmal B. Chaudhri, Awais Rashid, Roberto Zicari, “XML Data Management: Native XML and XML-Enabled Database Systems”, Addison-Wesley, 2003
[2] Matthias Nicola, Bert van der Linden, “Native XML Support in DB2 Universal Database”, ACM Digital Library, 2005
[3] W3 Schools Online Web Tutorial, “XML Tutorial”, http://www.w3schools.com
[4] Fiebig, T. et al., “Anatomy of a Native XML Base Management System”, VLDB Journal 11(4), December 2002
[5] Juliana Freire and Jérôme Siméon, “Adaptive XML Shredding: Architecture, Implementation and Challenges”, University of Toronto, January 2003
[6] Andrey Balmin, Yannis Papakonstantinou, “Storing and querying XML data using denormalized relational databases”, VLDB Journal, 2005
[7] Michael Rys, Don Chamberlin, Daniela Florescu, “XML and Relational Database Management Systems: the Inside Story”, ACM Digital Library, 2005
[8] Atakan Kurt, Mustafa Atay, “An Experimental Study on Query Processing Efficiency of Native-XML and XML-enabled Relational Database Systems”, ACM Digital Library, 2002
[9] Ronald Bourret, “XML and Databases”, http://www.rpbourret.com/xml/XMLAndDatabases.htm