Syntactical Compression of XML Data
Vojtěch Toman
Department of Software Engineering, Faculty of Mathematics and Physics
Charles University, Prague
vtoman@ksi.ms.mff.cuni.cz
Abstract
One of the most palpable drawbacks of XML can be seen in its excessive
storage requirements. In this paper, we address this problem by proposing a
syntactical XML compression scheme which makes use of probabilistic modeling
of XML structure. Our compression scheme works sequentially and makes
on-line processing of the data possible. We describe the current state of
development of the prototype compressor and present some preliminary
performance evaluation results. The compressor is designed to be extensible,
and intended to serve as a platform for further research in the field of
syntactical XML data compression.
1 Introduction
The Extensible Markup Language (XML) [11] is rapidly becoming a standard
format for electronic data structuring, storage and exchange. However, due to
its inherent verbosity, the storage requirements of XML are often substantially
larger than those of other equivalent data formats. The problem can be
addressed if suitable data compression is applied. Because XML is text-based,
the most common approach is to use existing text compressors (such as Gzip)
and to compress XML documents as ordinary text files. Unfortunately, these
tools are not able to discover and utilize the redundancy present in the
structure of XML, and therefore often yield only suboptimal results. As a
consequence, the compressed XML documents can still remain larger than
equivalent text or binary formats.
Another significant drawback of traditional text compressors is that the
data has to be decompressed before it can be processed. It is neither
possible to use the standard XML processing APIs (footnote 1) to access the
compressed data, nor to run any sort of queries on it. This can be a serious
problem that has to be resolved in applications that use XML as a data
exchange format, or in modern XML-enabled database environments.
Footnote 1: Application Programming Interfaces.
Recently, a number of XML-conscious compressors have emerged that improve
on the traditional text compressors. Because they are designed to take
advantage of the information contained in the structure of XML, the results
they achieve are much more satisfactory.
Various compression techniques have been adopted in these tools. In XMill
[7], for example, so-called container-based compression is employed: the data
is partitioned into containers depending on the element names, and the
containers are then compressed using Gzip. The structure is encoded as a
sequence of numeric tokens that represent both the XML markup (start-tags,
end-tags, etc.) and the references to data containers. XMill achieves better
compression ratios than Gzip and runs at about the same speed. The compression
performance of XMill can be further improved by additional user assistance.
A possible disadvantage of XMill is that it is an off-line compressor, since
the data is scattered in the compressed document.
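The container idea can be illustrated with a minimal Python sketch. It is not XMill code: the class and function names are hypothetical, attributes and the structure token stream are left out, and zlib (DEFLATE, the algorithm used by Gzip) stands in for Gzip itself.

```python
import xml.sax
import zlib
from collections import defaultdict


class ContainerHandler(xml.sax.ContentHandler):
    """Collect character data into one container per enclosing element name."""

    def __init__(self):
        super().__init__()
        self.open_elements = []              # stack of currently open element names
        self.containers = defaultdict(list)  # element name -> list of text chunks

    def startElement(self, name, attrs):
        self.open_elements.append(name)

    def endElement(self, name):
        self.open_elements.pop()

    def characters(self, content):
        # Whitespace-only text between tags is ignored in this sketch.
        if content.strip() and self.open_elements:
            self.containers[self.open_elements[-1]].append(content)


def build_containers(xml_text):
    """Return {element name: concatenated text} for an XML string.

    Values are simply concatenated; a real compressor would also have to
    record where each value ends.
    """
    handler = ContainerHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return {name: "".join(chunks) for name, chunks in handler.containers.items()}


def compress_containers(containers):
    """Compress every container separately with DEFLATE."""
    return {name: zlib.compress(text.encode("utf-8"))
            for name, text in containers.items()}


doc = ("<lib><book><author>Knuth</author><title>TAOCP</title></book>"
       "<book><author>Witten</author><title>MG</title></book></lib>")
containers = build_containers(doc)
packed = compress_containers(containers)
```

Grouping similar values together is what lets a general-purpose compressor find more redundancy, and it is also why the values end up scattered across the output.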
Another compressor that we are aware of is XMLPPM [1], which is based on
a modification of the PPM (footnote 2) compression scheme. For the modeling of
the XML data, XMLPPM uses several PPM models and switches among them depending
on the context supplied by the SAX parser. The compressor operates on-line
and compresses better than XMill, but runs considerably slower because of the
use of arithmetic coding.
None of the above-mentioned compressors makes it possible to query
compressed data. One of the first XML compressors to address this problem
is XGrind [10]. It encodes the data as a mixture of numeric tokens representing
the structure, and compressed character data. The structure of the original
document is preserved in the output. To make querying of the compressed
data possible, non-adaptive Huffman coding is used. Thanks to that, it is
possible to locate occurrences of a given string in the compressed document
without decompressing it. The non-adaptive coding is also the main limitation
of XGrind, since two passes over the data are required during the compression.
There are several other XML-conscious compressors available. XMLZip [14]
breaks the structural tree of the document at a specified depth, and compresses
the resulting components using traditional dictionary compression techniques.
For the compression and streaming of small XML files, there is Millau [3],
which uses an on-line compression scheme based on the WBXML [12] format.
In WBXML, the output data is a stream of tokens and uncompressed character
data. In this stream, the structure of the original XML document is preserved.
Millau improves on the WBXML scheme by making it possible to also compress
the character data. The commercially available XML-Xpress [4] is a schema-aware
compressor that can make use of the schema information contained in the
DTD (footnote 3) or XML Schema [13]. When the schema is known to XML-Xpress,
the structure can be encoded very efficiently. However, in the absence of a
schema, XML-Xpress relies on general-purpose compressors, and its outstanding
compression performance is lost.
Footnote 2: PPM [2] is a powerful coding method for compressing textual data.
Table 1 summarizes the features of the above-mentioned XML compressors.
It can be seen that only XMLPPM, Millau, and XML-Xpress can operate
on-line, which is crucial for compressed XML data streaming and exchange.
Three of the compressors (XGrind, Millau, and XML-Xpress) can make use of
the associated schema in some way during the compression. And finally, only
XGrind makes it possible to query compressed data.
            Gzip  XMill  XMLPPM  XMLZip  XGrind  Millau  XML-Xpress
Off-line    Y     Y      Y       Y       Y       Y       Y
On-line     -     -      Y       -       -       Y       Y
Schema      -     -      -       -       Y       Y       Y
Queries     -     -      -       -       Y       -       -
Table 1: Overview of existing XML compressors
Despite the variety of existing XML compressors, we believe that there
are still many paths to be explored in the field of XML data compression. In
this paper, we present a novel syntactical compression scheme that is based on
probabilistic modeling of XML structure. It does not need the DTD (footnote 4),
since it infers all necessary information directly from the input XML data.
Moreover, it operates incrementally, thus making on-line compression and
decompression possible. During the decompression, transparent parsing of the
compressed data using the SAX interface is possible. We believe that the
stream-oriented nature of the compressor makes it well-suited for use in XML
streaming environments. The system is implemented in the form of a C++ library
which was designed to be easy to use in custom applications.
The paper is organized as follows. In Section 2, we introduce our compression
scheme and discuss its main principles. Some details on the prototype
implementation are unveiled in Section 3. In Section 4, preliminary performance
evaluation results are summarized. Finally, we draw some conclusions
and sketch directions for further research in Section 5.
Footnote 3: DTD (Document Type Definition) is a standard way to describe the
grammar of XML documents.
Footnote 4: However, we plan to add schema support to our compressor in the
near future.
2 Proposed XML compression scheme
2.1 Syntactical compression
The fact that XML has a context-free nature motivated us to take a more
syntax-oriented approach to XML compression. The structure of XML
documents, as defined in the DTD, can be described by a fairly restrictive
context-free grammar. This led us to consider employing some kind of
grammar-inferring technique to compress XML data.
The syntactical compression techniques that attracted our attention were
the Sequitur algorithm by Nevill-Manning and Witten [9], and grammar-based
codes proposed recently by Kieffer and Yang [5]. Both schemes work in a
similar manner: they parse the input data and infer a deterministic
context-free grammar that uniquely represents it. To compress the data, the
grammar is encoded, and if the grammar is compact enough, good compression can
be achieved.
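To make the notion of a grammar that uniquely represents the data concrete, here is a minimal Python sketch. It is neither Sequitur nor the Kieffer-Yang construction: as a simplifying assumption it uses plain off-line replacement of the most frequent adjacent pair, and the function names are hypothetical. What it shares with both schemes is the result, a context-free grammar that generates exactly one string.

```python
from collections import Counter


def infer_grammar(text):
    """Repeatedly replace the most frequent adjacent pair by a new rule.

    Returns (start, rules): 'start' is a list of symbols, and 'rules' maps
    each nonterminal (an int) to the pair of symbols it expands to.
    Terminals are single-character strings.
    """
    seq = list(text)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                      # no pair repeats any more
        nonterminal = len(rules)
        rules[nonterminal] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbols, rules):
    """Expand a sequence of symbols back into the string it generates."""
    parts = []
    for symbol in symbols:
        if isinstance(symbol, int):
            parts.append(expand(rules[symbol], rules))
        else:
            parts.append(symbol)
    return "".join(parts)


text = "<a><b/></a><a><b/></a>"
start, rules = infer_grammar(text)
```

On repetitive input such as XML markup, the start rule becomes much shorter than the original string, and it is the grammar, not the string, that would then be encoded.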
The difference between Sequitur and the grammar-based codes can be seen
in the way the grammar is constructed and encoded. In both schemes, the
grammar is constructed incrementally during the processing of the data.
However, the grammar-based coding scheme allows us to parse the data and to
encode the grammar simultaneously, whereas in Sequitur, the complete grammar
has to be formed before it can be encoded. Another difference is that in the
grammar-based codes, the grammar is encoded using adaptive arithmetic coding,
while Sequitur uses its own encoding method called implicit rule encoding,
which resembles LZ77 [6] in some aspects. Moreover, because the constraints
on the generated grammar are more restrictive in the case of grammar-based
codes, a universal code is guaranteed for a wide variety of sources, while
Sequitur yields a code that is not guaranteed to be universal at all [5].
We have experimented with the implementation of Sequitur that was designed
to be used on text data, and have found its results on XML to be very
promising. On many occasions, it greatly outperformed other general-purpose
text compressors such as Gzip, demonstrating its ability to identify the
hierarchical structure within the data.
However, instead of simply using the existing Sequitur algorithm, we decided
to employ the grammar-based coding in our scheme, although we were
not aware of any previous implementation of it. This step into terra incognita
was motivated by the results observed with the Sequitur algorithm, and by
our belief in the potential of the grammar-based codes. Moreover, it was also
an interesting test of how well this new and promising technique performed
in practice.
2.2 Probabilistic modeling of XML structure
There is a lot of redundant information present in the XML structure. In
our compression scheme, we use numeric tokens for representing the structure,
much in the fashion of XMill. The input XML data is transformed into a
sequence of characters and numeric tokens, and compressed by the grammar-based
coder. The numeric codes represent the units of the XML structure that
contain a lot of redundant information, such as repeated occurrences of known
tags and attributes, or the end-of-element tags. Furthermore, some tokens are
reserved to encode events such as comments, processing instructions, entity
declarations, etc. Both the transformation and the compression are adaptive
and run incrementally.
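The transformation into a mixed stream of characters and numeric tokens can be sketched in Python as follows. This is an illustration under stated assumptions, not Exalt's actual format: the token values (END_ELEMENT, NEW_NAME, FIRST_NAME) and the class name are hypothetical, and attributes, comments and processing instructions are ignored.

```python
import xml.sax

END_ELEMENT = 0  # hypothetical reserved token: end of the current element
NEW_NAME = 1     # hypothetical reserved token: an unseen element name follows
FIRST_NAME = 2   # first token value handed out to element names


class StructureTokenizer(xml.sax.ContentHandler):
    """Turn SAX events into a stream of ints (tokens) and strings (text)."""

    def __init__(self):
        super().__init__()
        self.name_tokens = {}  # element name -> numeric token, learned adaptively
        self.stream = []

    def startElement(self, name, attrs):
        token = self.name_tokens.get(name)
        if token is None:
            # First occurrence: spell the name out once and remember a token.
            self.name_tokens[name] = FIRST_NAME + len(self.name_tokens)
            self.stream.extend([NEW_NAME, name])
        else:
            # Known name: a single numeric token is enough.
            self.stream.append(token)

    def endElement(self, name):
        self.stream.append(END_ELEMENT)

    def characters(self, content):
        if content.strip():
            self.stream.append(content)


def tokenize(xml_text):
    handler = StructureTokenizer()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.stream


tokens = tokenize("<a><b>x</b><b>y</b></a>")
```

Because names are learned in document order, a decoder that sees the same stream can rebuild the same name table without it being transmitted separately.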
While processing the input document, we try to learn as much as possible
about its structure. We have observed that elements in XML documents often
have a fairly regular structure, and once we discover this structure, we can
attempt to predict their future structural “behavior”. When a prediction is
successful, the amount of data that has to be compressed can be reduced.
We assign each element in the document an acyclic finite state automaton
which describes the element structure. We call this automaton the model
of the element. The states in the automaton represent the nested elements and
contained character data, while the transitions characterize their arrangement
within the element. Each transition maintains a frequency count which
indicates how many times the transition has been used. From these counts, the
probabilities of the transitions are calculated. For each state, we denote the
most probable outgoing transition as the prediction transition.
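[Figure 1 (diagram not reproduced): automaton with the states book, author,
title and / (end of element); the transition counts shown are 20, 15 and 5.]
Figure 1: Model of element book. Thick arrows denote the prediction
transitions.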
To illustrate how the prediction of the structure works, refer to the sample
model of element book in Figure 1. The element has been seen 20 times so far.
In most cases (15 times) it contained one author subelement followed by one
title subelement, while in 5 other cases it contained two author subelements
and one title subelement. Suppose that we find ourselves in the first author
state in the model. In this state, the prediction transition ends in the title
state. In other words, element title is expected to occur. Now, if the XML
parser really encounters this element, the only thing we have to do is to move
along the prediction transition to the title state, increase the frequency count
of the transition, and to enter the model for the title element. The point is
that no information needs to be encoded.
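The counts described for Figure 1 can be replayed in a small Python sketch. The trie-shaped State class below is an assumption made for illustration (Exalt's actual data structures are not described here); "/" stands for the end of the element.

```python
class State:
    """One state of an element model, with counted outgoing transitions."""

    def __init__(self, label):
        self.label = label
        self.next = {}   # child label -> State
        self.count = {}  # child label -> how often the transition was used

    def follow(self, label):
        """Use (or create) the transition to `label` and bump its count."""
        if label not in self.next:
            self.next[label] = State(label)
            self.count[label] = 0
        self.count[label] += 1
        return self.next[label]

    def prediction(self):
        """Label of the most frequently used outgoing transition, if any."""
        if not self.count:
            return None
        return max(self.count, key=self.count.get)

    def probability(self, label):
        total = sum(self.count.values())
        return self.count.get(label, 0) / total if total else 0.0


def observe(root, children):
    """Walk one element instance through the model, ending with '/'."""
    state = root
    for label in list(children) + ["/"]:
        state = state.follow(label)


book = State("book")
for _ in range(15):
    observe(book, ["author", "title"])
for _ in range(5):
    observe(book, ["author", "author", "title"])

first_author = book.next["author"]
```

With these counts, the first author state predicts title with probability 15/20 = 0.75, while a second author has probability 5/20 = 0.25.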
However, the situation is not always that simple. The most important
problem is that we do not have any element models in the beginning. The
models are being created while processing the data. Each time a new element
is encountered, a simple initial model is assigned to it. This model gives
no predictions at all; it consists of only one initial state and no transitions.
During the processing of the element, new states and transitions are added to
the model to reflect the structure of the element. When the element is seen
again in the document, its structure is compared to that predicted by its model.
If the model does not describe the present structure accurately, it is updated
correspondingly, and appropriate information is sent to the decompressor. In
other words, the models are adaptive, and are constantly being refined to
reflect the structure of the elements.
Since the modeling is adaptive, the decompressor must be able to maintain
the same models as the compressor, and to make the same predictions. As
noted before, the prediction of the structure is realized by the movement along
prediction transitions in the element models. Thanks to these predictions,
the compressor can often operate without sending any data to the decompressor.
The decompressor therefore has to be intelligent enough to be able to
“simulate” the movement of the compressor in the models.
Another problem is that the models may not always give good predictions.
If the element has an irregular and varying structure, the model
may easily be mistaken. It may happen, for example, that a certain element is
expected, but a different one is actually encountered by the parser. Therefore,
a rather elaborate escape mechanism is required. Each time the model gives a
wrong prediction, the compressor sends an escape event to the decompressor
to inform it where the model failed, and what should be done to recover (select
a different transition, create a new state, etc.).
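The interplay of prediction and escaping can be sketched in Python. This is a simplified illustration, not Exalt's protocol: the model is assumed to be a trie of counted transitions, an escape is assumed to be a pair (position, actual label), and the boundaries between element instances are assumed to be known from elsewhere in the stream. All names are hypothetical.

```python
END = "/"  # marks the end of an element


def predict(node):
    """Label of the most frequently used transition, or None if there is none."""
    if not node:
        return None
    return max(node, key=lambda label: node[label][0])


def step(node, label):
    """Follow (creating if needed) the transition `label`; return the child node."""
    entry = node.setdefault(label, [0, {}])  # [count, child node]
    entry[0] += 1
    return entry[1]


def encode_element(root, children):
    """Update the model and return the escapes needed for one element instance."""
    escapes = []
    node = root
    for pos, label in enumerate(list(children) + [END]):
        if predict(node) != label:
            escapes.append((pos, label))  # the model failed at position `pos`
        node = step(node, label)
    return escapes


def decode_element(root, escapes):
    """Rebuild the child sequence from the model plus the received escapes."""
    pending = dict(escapes)
    children = []
    node = root
    pos = 0
    while True:
        # Without an escape, the decoder trusts its own prediction.
        label = pending.get(pos, predict(node))
        node = step(node, label)
        if label == END:
            return children
        children.append(label)
        pos += 1


instances = [["author", "title"], ["author", "title"],
             ["author", "author", "title"], ["author", "title"]]
encoder_model, decoder_model = {}, {}
sent = [encode_element(encoder_model, kids) for kids in instances]
received = [decode_element(decoder_model, escapes) for escapes in sent]
```

Both sides apply identical updates in the same order, so their models stay in step; correctly predicted instances (the second and fourth here) need no escapes at all.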
Besides the nested elements, the elements may also contain character data.
To deal with this, we use special-purpose character data states in the element
models. These states indicate the presence of character data at certain
positions within the elements, and inform the decompressor that a sequence of
character data follows in the compressed stream.
3 Prototype implementation
We have implemented a prototype of both the compressor and
the decompressor in the form of a C++ library named Exalt (An Experimental
XML Archiving Library/Toolkit). The library was designed to be
component-based, and easy to use in other C++ applications. The architecture
of both the compressor and the decompressor is sketched in Figure 2.
[Figure 2 (diagram not reproduced), processing stages from top to bottom:
a) XML data -> SAX parser -> XML structure modeling -> Grammar-based coder ->
Arithmetic coder -> Compressed stream
b) Compressed stream -> Arithmetic decoder -> Grammar-based decoder ->
XML structure modeling -> SAX event emitter -> Client application]
Figure 2: Architecture of the compressor (a) and the decompressor (b)
For the XML parsing tasks, we rely on the Expat parser (footnote 5), an
open-source SAX parser. The implementation of arithmetic coding that we have
chosen originates from Moffat and Witten [8].
The core components of the library are the structure modeling module
and grammar-based coding module. The structure modeling module deals
with the tasks related to the modeling of the document structure, and the
grammar-based coding module implements the algorithms for the construction
and encoding of the grammar.
During the compression, the input XML document is processed by the SAX
parser, and the structure modeling module transforms the stream of incoming
SAX events into a form that is suitable for compression. The grammar-based
coding module reads the data supplied by the modeling component, and
represents it as a deterministic context-free grammar. The grammar is encoded
incrementally during its construction using a zero-order arithmetic code with
a dynamic alphabet.
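What a zero-order adaptive model with a dynamic alphabet does can be sketched in Python by computing the ideal code length it assigns to a symbol sequence. This is only an estimate under stated assumptions, not the Moffat and Witten coder: an arithmetic coder spends close to -log2(p) bits on a symbol of probability p, new symbols are assumed to be announced through a hypothetical escape symbol, and the cost of spelling out the new symbol itself is ignored.

```python
import math

ESCAPE = object()  # hypothetical escape symbol announcing a new alphabet entry


def adaptive_code_length(symbols):
    """Ideal code length in bits under a zero-order adaptive frequency model."""
    counts = {ESCAPE: 1}  # the alphabet starts with the escape symbol only
    total = 1
    bits = 0.0
    for symbol in symbols:
        if symbol in counts:
            bits += -math.log2(counts[symbol] / total)
        else:
            # Unknown symbol: pay for the escape, then extend the alphabet.
            bits += -math.log2(counts[ESCAPE] / total)
            counts[symbol] = 0
        counts[symbol] += 1
        total += 1
    return bits


repetitive = adaptive_code_length("aaaaaaaa")
varied = adaptive_code_length("abcdefgh")
```

"Zero-order" means that each probability depends only on the symbol counts so far, not on the preceding symbols; the alphabet is dynamic because it grows whenever a new symbol (for example a new grammar variable) appears.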
The decompressor reads the compressed stream, and reconstructs the encoded
grammar. In parallel with that, the structure modeling module processes
the data represented by the grammar, restores the original SAX events and
sends them to the client application. Thanks to this approach, the application
can process the data on-line as if it were uncompressed.
Footnote 5: http://expat.sourceforge.net
4 Experimental results
To see how our prototype implementation performed in practice, we compared
its present compression performance to other XML-conscious compressors
(XMill and XMLPPM) and to Gzip on a wide variety of XML documents.
In our tests, we used the XML documents that came with the XMLPPM
compressor, and added several others to make the corpus richer.
We decided to divide the documents into three categories depending on their
type.
Textual documents have a rather simple structure (a small number of elements
and attributes) and relatively long character data content. We used four
Shakespearean plays and two computer tutorials. While the Shakespearean
plays are fairly structured (they contain several types of elements), the
tutorials were constructed from non-XML text; they contain only one element
with long character data content, and were included in the corpus to evaluate
the performance of grammar-based codes on textual data.
Regular documents consist mainly of structured and regular mark-up, and
short character data content. Often, the documents have a repetitive structure.
This category contained a periodic table of elements in XML, a TPC
(footnote 6) benchmark database, some baseball player statistics, and an
Apache web log transformed to XML.
Irregular documents have complex and irregular structure. Some of the
documents contain quite a lot of character data, while the others (often
computer-generated and not very legible for humans) do not contain any
character data at all. We used a variety of XML documents, including parsed
English sentences from the Wall Street Journal, a representation of the
protein structure (footnote 7), a set of formal proofs converted to XML, and a
W3C specification of the XML language in XML format.
Tables 2, 3, and 4 summarize the compression results of the individual
compressors involved in our tests on textual, regular, and irregular data,
respectively. The compression results are reported in bits per original XML
character. For each document, the best compression is in bold, and the worst
compression is in italic. At the bottom of each table, the average bit rate is
listed.
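The measure used in the tables, bits per original XML character, can be written down as a short Python sketch. The function name and the sample document are hypothetical, and zlib (DEFLATE) is used only as a convenient stand-in for Gzip.

```python
import zlib


def bits_per_character(original_text, compressed_bytes):
    """Compressed size in bits divided by the number of original characters."""
    return 8 * len(compressed_bytes) / len(original_text)


# A made-up, highly regular document, for illustration only.
sample = "<row><name>x</name><value>1</value></row>" * 200
rate = bits_per_character(sample, zlib.compress(sample.encode("utf-8")))
```

An uncompressed ASCII document corresponds to 8 bits per character, so a value of 2.0 means the output is a quarter of the original size; lower is better.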
On textual data, XMLPPM achieved the best overall results, followed by
XMill (footnote 8). On Shakespearean plays, Exalt performed consistently about
5-10% better than XMill, and about 15% better than Gzip. This was a rather
encouraging result for us, since it indicated that the grammar-based
compression might be a rival to the dictionary-based techniques on textual
data. From this point of view, we were particularly interested in the results
on the tutorial documents. They show that our compressor can approach the
compression rates of the dictionary-based compressors on these documents.
However, in comparison to XMLPPM, both the syntactical and the
dictionary-based compressors still lagged about 25% behind.
File        Gzip   XMill  XMLPPM  Exalt
antony      2.160  2.039  1.489   1.898
errors      2.141  2.058  1.566   2.018
hamlet      2.267  2.153  1.590   1.977
much ado    2.099  1.961  1.481   1.875
tut emacs   2.755  2.759  2.262   2.920
tut python  2.658  2.659  2.121   2.702
Average     2.347  2.272  1.752   2.232
Table 2: Performance on textual data
Footnote 6: http://www.tpc.org
Footnote 7: http://www.expasy.ch/sprot
Footnote 8: It soon turned out that XMLPPM would be extremely difficult to beat
for the other compressors involved in the tests, which can be seen as another
proof of the power of the PPM compression model.
File      Gzip   XMill  XMLPPM  Exalt
periodic  0.603  0.421  0.370   0.487
stats1    0.798  0.425  0.314   0.422
stats2    0.750  0.403  0.288   0.384
tpc       1.475  1.144  1.092   1.313
weblog    0.337  0.217  0.190   0.251
Average   0.793  0.522  0.451   0.571
Table 3: Performance on regular data
While the compression rates ranged from 1.5 to 2.8 bits per character in
the case of textual documents, the compressors performed considerably better
on the regular documents, often achieving compression rates below 0.5 bits per
character. Again, XMLPPM performed the best. XMill yielded considerably
better results than on textual data. Exalt outperformed XMill on two
occasions, but performed about 9% worse on average. This was something of a
let-down for us, mainly because we expected that the impact of the structure
modeling would be more visible on the regular documents. Nevertheless, we
believe that these results are not final since there is still a lot to improve
in both the modeling and the grammar-coding components of our compressor.
On irregular documents, XMLPPM performed the best on average again.
The compression rates tended to be surprisingly small, often below 0.3 bits
per character. We believe that this is caused by the fact that most of the
documents contained only a small amount of character data content, if any.
File      Gzip   XMill  XMLPPM  Exalt
pcc1      0.361  0.214  0.186   0.317
pcc2      0.311  0.164  0.168   0.269
qual2003  0.650  0.437  0.393   0.544
tal1      0.312  0.164  0.139   0.238
tal2      0.321  0.183  0.147   0.271
tal3      0.328  0.200  0.175   0.322
treebank  1.782  1.249  1.171   1.932
w3c       2.139  2.273  1.592   2.226
Average   0.776  0.610  0.497   0.765
Table 4: Performance on irregular data
From our point of view, we were interested in how the irregularities in these
documents impacted the performance of the modeling engine of our compressor.
As expected, since it was difficult to predict the structure of the
documents, the escape mechanism was used frequently during the compression.
And because the escaping carries a penalty in coding effectiveness, the overall
compression performance deteriorated a little as compared to other types of
documents.
As to the compression times, Exalt turned out to be the slowest of the
compressors. On average, it compressed about 5 times slower than XMLPPM
(which itself ran substantially slower than Gzip or XMill). While the loss
was not so significant on regularly structured documents, the compression
times increased substantially on long documents and on documents with long
character data content. Despite that, we believe that by optimizing the
modeling component of our compressor, and most notably the grammar-based
coding engine, we could decrease the running times substantially. It should
be noted that an efficient implementation (in terms of both time and memory)
of the grammar-based coding scheme is an interesting task in
itself, and requires many practical problems to be solved. We think of our
present implementation more as a framework for future research than
as a fully-fledged solution.
5 Conclusions and further work
We have proposed an on-line syntactical compression scheme which makes use
of probabilistic modeling of XML structure. We have implemented a functional
prototype XML compressor called Exalt which can be seen as one of the
first compressors practically implementing the novel grammar-based coding
technique as introduced in [5]. The compressor is not a “black box” type of
application. Rather, it was intended to provide a platform for experiments and
further research in the field of syntactical compression of XML. The on-line
nature of Exalt makes it well suited for such tasks as XML data streaming or
exchange.
The present implementation of Exalt is by no means complete. At present,
we are working intensively on the redesign of the critical algorithms and data
structures used in the system. Our primary objective is to optimize the
grammar-based coding engine in order to improve the overall performance (in
terms of both compression effectiveness and running speed) of the compressor
and thus to improve its practical usability. Since the grammar-based codes
represent a rather novel topic in lossless data compression, there is still a
lot of work to be done.
Our compressor is not schema-aware at present; it infers the element
models from the structure of the processed XML document. However, if we look
closely at the element models used in our modeling strategy, it is obvious that
they describe the same kind of structural information as the DTD does. If we
used the schema to construct the element models, the compression performance
could improve substantially since there would be no “learning penalty” in the
compression. Thanks to the design of Exalt, the schema support can be added
in a natural way, and we plan to do so in our future work.
From a database point of view, our compression scheme contains no support
for querying of the compressed data: the data can be accessed only via
the sequential SAX interface of the decompressor. The reason is that Exalt is
primarily a stream-oriented compressor, and the structure of its output is not
well suited for direct querying. Instead of modifying the current compressor
(which would be a rather laborious task), we are considering developing a new
compression scheme in our future work. Although it will be different from Exalt
in many aspects (for example, it will probably operate off-line), we believe we
can utilize some useful concepts, most notably the idea of element models
which, when used in a clever way, might serve as indexing structures for
efficient evaluation of XML path queries. With the continuing adoption of XML
in the database world, we think that query-friendly compression of XML is
definitely where we want to direct our attention in future research.
References
[1] J. Cheney. Compressing XML with Multiplexed Hierarchical PPM Models.
In Proc. of IEEE Data Compression Conference, pages 163–172, 2001.
[2] J. G. Cleary and I. H. Witten. Data Compression Using Adaptive Coding
and Partial String Matching. IEEE Transactions on Communications,
32:396–402, 1984.
[3] M. Girardot and N. Sundaresan. Millau: An Encoding Format for Efficient
Representation and Exchange of XML Documents over the WWW.
In Proceedings of the 9th international World Wide Web conference on
Computer networks, pages 747–765, 2000.
[4] Intelligent Compression Technologies. XML-Xpress. URL:
http://www.ictcompress.com.
[5] J. C. Kieffer and E.-H. Yang. Grammar Based Codes: A New Class
of Universal Lossless Source Codes. IEEE Transactions on Information
Theory, 46:737–754, 2000.
[6] D. A. Lelewer and D. S. Hirschberg. Data Compression. ACM Computing
Surveys, 19(3), 1987.
[7] H. Liefke and D. Suciu. XMill: an Efficient Compressor for XML Data.
In Proceedings of 2000 ACM SIGMOD Conference, pages 153–164, 2000.
[8] A. Moffat and I. H. Witten. Arithmetic Coding Revisited. ACM
Transactions on Information Systems, 16:256–294, 1998.
[9] C. G. Nevill-Manning and I. H. Witten. Compression and Explanation
Using Hierarchical Grammars. Computer Journal, 40:103–116, 1997.
[10] P. Tolani and J. R. Haritsa. XGrind: A Query-friendly XML Compressor.
In Proceedings of 18th IEEE International Conference on Data Engineering,
pages 225–234, 2002.
[11] World Wide Web Consortium. Extensible Markup Language (XML) 1.0.
URL: http://www.w3.org/TR/2000/REC-xml-20001006.
[12] World Wide Web Consortium. WAP Binary XML Content Format. URL:
http://www.w3.org/TR/wbxml/.
[13] World Wide Web Consortium. XML Schema Part 0: Primer. URL:
http://www.w3.org/TR/2001/REC-xmlschema-0-20010502/.
[14] XML Solutions. XMLZip. URL: http://www.xmls.com.