XP: A Simple XQuery Processor Implemented in Java
CS764 – Advanced Databases Uche Okpara (firstname.lastname@example.org) Sean McIlwain (email@example.com)
With the emergence of XML as a solution to the problem of standardizing data across the Internet, many tools have been proposed for handling, transforming, interpreting, and processing XML files. XQuery, still in its preliminary stages as a standardized language, could very well become for XML what SQL is for relational databases. An XQuery processor, in effect, takes one or more XML documents, extracts information from them, and produces a result in a user-specified XML format. Advanced functionality such as nested queries and descriptive constraint definition has also been added to the language. This paper takes a subset of XQuery and presents an implementation, written in Java, that processes simple queries on a single XML document and writes the result in XML format. Though the supported language is limited, we hope that our design will allow further functionality to be added later with little difficulty.
Three main classes, XQueryTokenizer, XQueryAnalyzer, and XPathProcessor, make up our implementation of an XQuery processor, which we have called XP (see Figure 1). XQueryTokenizer takes an XQuery file as input and splits it into a token list. This token list is reviewed by the XQueryAnalyzer, which parses it (syntactically and semantically), using a separate method for each production rule of the context-free grammar (see Appendix A). The XQueryAnalyzer makes corresponding calls to the static methods of the XPathProcessor class to obtain the resulting output XML file. We have separated the processing of the XML file (mainly handled by XPathProcessor) from that of the XQuery file (mainly handled by XQueryTokenizer) to make future modifications or improvements easier.
Figure 1: XP’s interaction with input and output files
For parsing XML files, two dominant models exist: DOM (Document Object Model) and SAX (Simple API for XML). SAX is an event-driven API that calls event methods while parsing the document, i.e. it provides methods that can react to data in an XML document at the moment that data is read. This works well if we are interested in only a few parts of a document and know how to locate those parts within the stream of SAX events, or if we only need to read the data in sequence. DOM, on the other hand, parses the entire document and creates a corresponding Document object that can be browsed using appropriate method calls. Since DOM builds this object from the whole XML file, it requires enough memory to hold the file, which may be a disadvantage if the file is very large and available memory is limited. In effect, DOM provides programmatic access to the entire document in non-linear order. Though the DOM model is easier to use, the SAX model allows faster parsing and requires less memory.
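The contrast between the two models can be sketched with the JAXP parsers that ship with current Java releases. This is only an illustration, not XP's code: the class name SaxVersusDom, the sample XML string, and the choice of the name element are assumptions, and the syntax is that of modern Java rather than the 1.3 release the project targeted.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxVersusDom {
    // Hypothetical sample data, loosely modeled on a nutrition file.
    static final String XML =
        "<nutrition><food><name>Bagel</name></food>"
      + "<food><name>Chili</name></food></nutrition>";

    // SAX: react to events as the document streams past; only what we
    // choose to store (here, the text of each <name>) stays in memory.
    static List<String> namesWithSax(String xml) {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inName = false;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("name")) { inName = true; text.setLength(0); }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inName) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("name")) { names.add(text.toString()); inName = false; }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    // DOM: build the whole tree first, then visit nodes in any order
    // (here deliberately last-to-first to show non-linear access).
    static List<String> namesWithDom(String xml) {
        List<String> names = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("name");
            for (int i = nodes.getLength() - 1; i >= 0; i--) {
                names.add(nodes.item(i).getTextContent());
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("SAX, document order: " + namesWithSax(XML));
        System.out.println("DOM, reverse order:  " + namesWithDom(XML));
    }
}
```

The SAX version never holds more than one name's text at a time, while the DOM version can walk the node list backwards only because the whole tree is already in memory.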
For this project, JDOM was selected because it attempts to provide a seamless XML parser and builder in a single API. It is a hybrid of SAX and DOM, maximizing the benefits of both and minimizing their drawbacks. Another advantage is that it handles loading XML both from the Internet (via http) and from ordinary files. It is freely downloadable and has been extensively documented (Ahmed et al., 2000).
XQueryTokenizer
From the chosen subset of the XQuery language, a list of acceptable tokens is generated (see Table 1). A subclass of the XQueryToken object (which holds basic information about each token generated from the XQuery file, e.g. token line number, token column number, token name, and token type) is written for each token type. Any substring of the XQuery file that is not recognized by the tokenizer is assigned an UnknownToken object, which is passed to the XQueryAnalyzer to handle in its error reporting. After this list is generated, the tokens may be accessed in sequence via the provided accessor methods (getFirst(), getNext(), getPrevious(), getCurrent()).
Token Class Name    Pattern (1)
ForToken            FOR or for
InToken             IN or in
VariableToken       $(a-z)(a-z)*
DocToken            Document(“path”) or Doc(“path”) (2)
PathToken           ((//|/)(a-z)*)((/)(a-z)*)*
TagToken            <(a-z)(a-z)*> | </(a-z)(a-z)*>
RelationToken       < | > | <= | >= | =
LabelToken          “(a-z,0-9)(a-z,0-9)*”
WhereToken          where | WHERE
ReturnToken         return | RETURN
UnknownToken        Any unidentified string

UnknownToken signals the analyzer that an illegal string has been detected. A further token, returned from getNext(), signals the analyzer that the end of the token list has been reached.

1 - The * has the same meaning as in regular expressions.
2 - path is a file path or an Internet path (http).
Table 1 – Tokens generated by the XQueryTokenizer
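The patterns in Table 1 map fairly directly onto regular expressions. The sketch below is one possible way to drive such a tokenizer, not XP's actual XQueryTokenizer: the class TokenizerSketch, its Token record, and the rule ordering are assumptions, and the regular expressions are a loose transcription of the table (for instance, the document keyword is matched case-insensitively because the sample queries write it in lowercase).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizerSketch {
    /** Hypothetical stand-in for XQueryToken: type, text, and source position. */
    static final class Token {
        final String type;
        final String text;
        final int line;
        final int column;

        Token(String type, String text, int line, int column) {
            this.type = type; this.text = text; this.line = line; this.column = column;
        }

        @Override
        public String toString() {
            return type + "(" + text + ")@" + line + ":" + column;
        }
    }

    // Rules are tried in insertion order, so keywords and tags come before
    // the more general patterns, and UnknownToken is the catch-all.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ForToken", Pattern.compile("(FOR|for)\\b"));
        RULES.put("InToken", Pattern.compile("(IN|in)\\b"));
        RULES.put("WhereToken", Pattern.compile("(WHERE|where)\\b"));
        RULES.put("ReturnToken", Pattern.compile("(RETURN|return)\\b"));
        RULES.put("VariableToken", Pattern.compile("\\$[a-z]+"));
        RULES.put("DocToken", Pattern.compile("(?i)(document|doc)\\(\"[^\"]*\"\\)"));
        RULES.put("TagToken", Pattern.compile("</?[a-z]+>"));
        RULES.put("RelationToken", Pattern.compile("<=|>=|<|>|="));
        RULES.put("LabelToken", Pattern.compile("\"[a-z0-9]+\""));
        RULES.put("PathToken", Pattern.compile("(//|/)[a-z]*(/[a-z]*)*"));
        RULES.put("UnknownToken", Pattern.compile("\\S+"));
    }

    static List<Token> tokenize(String query) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0, line = 1, column = 1;
        while (pos < query.length()) {
            char c = query.charAt(pos);
            if (c == '\n') { line++; column = 1; pos++; continue; }
            if (Character.isWhitespace(c)) { column++; pos++; continue; }
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                // lookingAt() anchors the match at the start of the region.
                Matcher m = rule.getValue().matcher(query).region(pos, query.length());
                if (m.lookingAt()) {
                    tokens.add(new Token(rule.getKey(), m.group(), line, column));
                    column += m.end() - pos;
                    pos = m.end();
                    break; // UnknownToken always matches, so progress is guaranteed
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String query = "FOR $p IN document(\"nutrition.xml\")//food\nWHERE $p/sodium = \"210\"";
        for (Token t : tokenize(query)) {
            System.out.println(t);
        }
    }
}
```

Because unrecognized text becomes an UnknownToken instead of raising an exception, the decision about how to report it is left to the analyzer, as described above.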
XQueryAnalyzer
The analyzer performs two roles in the system. Firstly, it uses the token list returned by XQueryTokenizer in conjunction with the production rules for the context-free grammar (see Appendix A) to validate the syntax of the XQuery file. Secondly, XQueryAnalyzer has methods (corresponding to the production rules) that make calls to the XPathProcessor to obtain JDOM Elements based on the paths in the FOR clause and WHERE clause (converting the CFG to a context-sensitive language). The results are then iterated through the parsed RETURN statements to generate the resulting XML file. This class is the most complex of our implementation, since it ties the XQueryTokenizer, the XPathProcessor, and JDOM together to parse the XQuery file, process the XML files, and generate the resulting XML file.
Another complexity added to the analyzer is the handling and reporting of XQuery parsing and XML processing errors encountered during execution. We wanted to give the XQuery user meaningful information that would enable them to detect and fix problems within the XQuery and/or XML files. While the XQuery and XML files are parsed and processed, a vector is used to keep track of the location and type of each error that occurs.
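The one-method-per-production-rule structure and the error list described above can be sketched as a small recursive-descent parser. Appendix A is not reproduced here, so the two grammar rules in the comments are assumptions, as are the class AnalyzerSketch, its Token type, and the wording of the error messages; the RETURN clause and the calls into XPathProcessor are left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

class AnalyzerSketch {
    /** Minimal stand-in for XQueryToken: a type name plus a source position. */
    static final class Token {
        final String type;
        final int line;
        final int column;

        Token(String type, int line, int column) {
            this.type = type; this.line = line; this.column = column;
        }
    }

    private final List<Token> tokens;
    private int pos = 0;
    /** Plays the role of the error vector: one message per problem found. */
    final List<String> errors = new ArrayList<>();

    AnalyzerSketch(List<Token> tokens) {
        this.tokens = tokens;
    }

    // Consume the next token if it has the expected type; otherwise record
    // where the mismatch occurred and leave the position unchanged.
    private boolean expect(String type) {
        if (pos < tokens.size() && tokens.get(pos).type.equals(type)) {
            pos++;
            return true;
        }
        if (pos < tokens.size()) {
            Token t = tokens.get(pos);
            errors.add("line " + t.line + ", column " + t.column
                + ": expected " + type + " but found " + t.type);
        } else {
            errors.add("end of query: expected " + type);
        }
        return false;
    }

    // ForClause -> FOR Variable IN Doc Path          (assumed rule)
    private boolean parseForClause() {
        return expect("ForToken") && expect("VariableToken") && expect("InToken")
            && expect("DocToken") && expect("PathToken");
    }

    // WhereClause -> WHERE Variable Path Relation Label   (assumed rule)
    private boolean parseWhereClause() {
        return expect("WhereToken") && expect("VariableToken") && expect("PathToken")
            && expect("RelationToken") && expect("LabelToken");
    }

    // Query -> ForClause WhereClause   (assumed rule; RETURN omitted)
    boolean parseQuery() {
        boolean ok = parseForClause() && parseWhereClause();
        if (ok && pos < tokens.size()) {
            Token t = tokens.get(pos);
            errors.add("line " + t.line + ", column " + t.column + ": unexpected " + t.type);
            ok = false;
        }
        return ok;
    }

    // Test helper: puts token i on line 1 at column i + 1 (made-up positions).
    static List<Token> onLineOne(String... types) {
        List<Token> result = new ArrayList<>();
        for (int i = 0; i < types.length; i++) {
            result.add(new Token(types[i], 1, i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        AnalyzerSketch bad = new AnalyzerSketch(
            onLineOne("ForToken", "VariableToken", "DocToken", "PathToken"));
        System.out.println("valid: " + bad.parseQuery());
        System.out.println(bad.errors);
    }
}
```

This sketch stops at the first mismatch in a clause; an analyzer that wants to report several errors per run would need some form of error recovery, for example skipping ahead to the next clause keyword.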
XPathProcessor
This class handles the processing of all path tokens provided by the XQueryAnalyzer. To illustrate, consider the sample query below, which contains a path in each of its FOR, WHERE, and RETURN clauses:

FOR $p IN document("./xml_files/nutrition.xml")//food
WHERE $p/sodium = "210"
RETURN <name> $p/name/text(), $p/fiber/text(), $p/protein/text() </name>
The XPathProcessor class comprises three methods for handling the three different types of paths in an XQuery file. These are:
processForPath(): Returns the elements in the path specified in the FOR clause.
processWherePath(): Returns the subset of elements returned by processForPath() satisfying the condition in the WHERE clause of the XQuery file.
processReturnPath(): Returns the elements (usually descendants of the elements returned by processWherePath()) specified by the RETURN clause.
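The way the three methods feed into one another can be sketched as follows. XP itself works on JDOM Elements; this illustration instead uses the DOM and XPath classes bundled with current Java releases so that it runs without extra libraries. The method names come from the description above, but their signatures, the class PathProcessingSketch, the sample data, and the restriction to the = relation are all assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class PathProcessingSketch {
    private static final XPath XPATH = XPathFactory.newInstance().newXPath();

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluate a path relative to a context node and collect the matches.
    private static List<Node> select(Node context, String path) {
        try {
            NodeList nodes = (NodeList) XPATH.evaluate(path, context, XPathConstants.NODESET);
            List<Node> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // FOR clause: every node reached by the path from the document root.
    static List<Node> processForPath(Document doc, String forPath) {
        return select(doc, forPath);
    }

    // WHERE clause (equality only): keep candidates with at least one node
    // under relativePath whose text equals the expected label.
    static List<Node> processWherePath(List<Node> candidates, String relativePath, String expected) {
        List<Node> kept = new ArrayList<>();
        for (Node candidate : candidates) {
            for (Node value : select(candidate, relativePath)) {
                if (value.getTextContent().trim().equals(expected)) {
                    kept.add(candidate);
                    break;
                }
            }
        }
        return kept;
    }

    // RETURN clause: text of the nodes under relativePath for each kept node.
    static List<String> processReturnPath(List<Node> selected, String relativePath) {
        List<String> values = new ArrayList<>();
        for (Node node : selected) {
            for (Node value : select(node, relativePath)) {
                values.add(value.getTextContent());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical data shaped like the nutrition example.
        Document doc = parse(
            "<nutrition><food><name>Bagel</name><sodium>510</sodium></food>"
          + "<food><name>Chili</name><sodium>210</sodium></food></nutrition>");
        List<Node> foods = processForPath(doc, "//food");
        List<Node> matching = processWherePath(foods, "sodium", "210");
        System.out.println(processReturnPath(matching, "name/text()"));
    }
}
```

In this sketch a query path such as $p/sodium is passed as the relative path sodium, with the node bound to $p supplied as the evaluation context.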
The XP.jar file, obtainable from http://www.cs.wisc.edu/~mcilwain/cs764, contains all the classes needed to run the XP program. Sample queries tested during development are also provided (Appendix C). Once the CLASSPATH environment variable is set correctly, the command java XP <XQuery file> <output file> invokes the processor. The source code for all of the classes is attached to this document (Appendix B). The JDOM and Xerces (www.apache.org) libraries used by the developers are included to allow seamless execution. Java 1.3 or later is required; it is available from www.java.sun.com for many different platforms.
Suggested Future Additions
XP in its current form has useful but limited functionality with regard to the complexity of the queries it handles. One useful future addition is support for conjunctive and disjunctive clauses in the WHERE statement of the XQuery language. Conjunctive clauses would be rather easy to implement by making multiple calls to XPathProcessor’s processWherePath() method, narrowing the current results to a subset for each conjunct. Disjunctive clauses would require a bit more work, because they involve merging two result subsets into one for each disjunction; the interesting aspect would be identifying and eliminating duplicates. Handling nested and multiple-file queries would be a powerful addition to XP, allowing multiple files to be joined as in relational table join queries in SQL. The current implementation of XP could be extended to do this quite easily using compiler scope tables and a nested-loop implementation. Finally, the SORT BY operator is another useful feature that could be added to the processor.
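The conjunctive and disjunctive extensions proposed above might look roughly like the following sketch. It is written generically over any element type instead of JDOM Elements, and the class WhereCombinationSketch and its method names are hypothetical. Duplicate elimination here relies on equals() and hashCode(); whether that is the right notion of identity for XML elements is an assumption that would need checking against the element class actually used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

class WhereCombinationSketch {
    // Stand-in for one processWherePath() call: keep items passing the condition.
    static <T> List<T> filter(List<T> items, Predicate<T> condition) {
        List<T> kept = new ArrayList<>();
        for (T item : items) {
            if (condition.test(item)) {
                kept.add(item);
            }
        }
        return kept;
    }

    // Conjunction: each condition narrows the result of the previous one.
    static <T> List<T> conjunction(List<T> items, List<Predicate<T>> conditions) {
        List<T> current = new ArrayList<>(items);
        for (Predicate<T> condition : conditions) {
            current = filter(current, condition);
        }
        return current;
    }

    // Disjunction: filter the full input once per condition, then merge the
    // subsets; the LinkedHashSet drops duplicates and keeps first-seen order.
    static <T> List<T> disjunction(List<T> items, List<Predicate<T>> conditions) {
        Set<T> merged = new LinkedHashSet<>();
        for (Predicate<T> condition : conditions) {
            merged.addAll(filter(items, condition));
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<Integer> sodium = Arrays.asList(1, 2, 3, 4, 5, 6);
        Predicate<Integer> even = n -> n % 2 == 0;
        Predicate<Integer> aboveThree = n -> n > 3;
        List<Predicate<Integer>> conditions = new ArrayList<>();
        conditions.add(even);
        conditions.add(aboveThree);
        System.out.println("AND: " + conjunction(sodium, conditions));
        System.out.println("OR:  " + disjunction(sodium, conditions));
    }
}
```

Note that the merged disjunction lists results in the order the subsets were produced, not in document order; a single pass that keeps an item when any condition holds would preserve document order and avoid duplicates altogether.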
Support for DTD and XML Schema validation would be useful in deciding how to treat different tags under relation statements in the WHERE clause; for example, comparisons of numbers and comparisons of strings should be handled differently. Knowing this additional information would help make XP more robust in type checking (XML Schema) and XML format validation (DTD and/or XML Schema). Another enhancement would be to provide full support for XPath in the XPathProcessor code, e.g. supporting a statement with the distinct keyword and more complex paths:
FOR $p IN DISTINCT document(www.bookstore.com/books.xml)/Booklist/Book[Published = 1991]/Author
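Returning to the typed WHERE comparisons mentioned above, the difference between numeric and string handling can be shown with a short sketch. The ValueType enum is a hypothetical stand-in for whatever type information a DTD or XML Schema would supply, and the class TypedComparisonSketch and its methods are likewise illustrative only.

```java
import java.math.BigDecimal;

class TypedComparisonSketch {
    /** Hypothetical: the declared type a schema would report for an element. */
    enum ValueType { STRING, NUMBER }

    // Compare two element values according to their declared type.
    static int compare(String left, String right, ValueType type) {
        if (type == ValueType.NUMBER) {
            // compareTo ignores scale, so 210 and 210.0 compare as equal.
            return new BigDecimal(left.trim()).compareTo(new BigDecimal(right.trim()));
        }
        return left.compareTo(right); // lexicographic string comparison
    }

    // Apply one of the relation tokens (<, >, <=, >=, =) to two values.
    static boolean satisfies(String left, String relation, String right, ValueType type) {
        int c = compare(left, right, type);
        switch (relation) {
            case "=":  return c == 0;
            case "<":  return c < 0;
            case ">":  return c > 0;
            case "<=": return c <= 0;
            case ">=": return c >= 0;
            default:
                throw new IllegalArgumentException("unknown relation: " + relation);
        }
    }

    public static void main(String[] args) {
        System.out.println(satisfies("9", "<", "10", ValueType.NUMBER));
        System.out.println(satisfies("9", "<", "10", ValueType.STRING));
    }
}
```

The two calls in main disagree: 9 is less than 10 as a number, but the string "9" sorts after "10", which is the kind of discrepancy schema information would let XP avoid.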
A major change in the code would be to implement XP using the SAX model in order to streamline the processing of the XML files. The event-driven model of SAX, as stated above, allows for faster parsing with smaller memory requirements. This would enable XP to process larger XML files at a faster rate.
Overall, XP is a good start toward an XQuery processor for XML, but it has a long way to go before providing a complete implementation of XQuery. The XQuery language itself is still in its development and standardization phases, so a truly complete implementation may not be possible until the standard is finalized.
1. Ahmed, K. Ancha, S. Cioroianu, A. Cousins, J. Crosbie, J. Davies, J. Gabhart, K. Gould, S. Laddad, R. Li, S. Macmillan, B. Rivers-Moore, D. Skubal, J. Watson, K. Williams, S. Hart, J. Java XML. Wrox, Birmingham, UK (2002).
2. Hunter, J. McLaughlin, K. JDOM. http://www.jdom.org.
3. Louden, K. C. Compiler Construction. PWS Publishing Co., Boston, MA (1997).
4. Ramakrishnan, R. Gehrke, J. Database Management Systems. McGraw-Hill (1998).
5. Xerces: XML parsers in Java and C++. The Apache Software Foundation. http://xml.apache.org.
6. XPath 1.0 Specifications. W3C Consortium. http://www.w3.org/TR/xpath.
7. XQuery 1.0 Specifications. W3C Consortium. http://www.w3.org/TR/xquery/.