A Parallel Approach to XML Parsing

Wei Lu #1, Kenneth Chiu *2, Yinfei Pan *3
# Computer Science Department, Indiana University
150 S. Woodlawn Ave., Bloomington, IN 47405, US
1 email@example.com
* Department of Computer Science, State University of New York - Binghamton
P.O. Box 6000, Binghamton, NY 13902, US
2 firstname.lastname@example.org
3 email@example.com

Abstract— A language for semi-structured documents, XML has emerged as the core of the web services architecture, and is playing crucial roles in messaging systems, databases, and document processing. However, the processing of XML documents has a reputation for poor performance, and a number of optimizations have been developed to address this performance problem from different perspectives, none of which has been entirely satisfactory. In this paper, we present a seemingly quixotic, but novel approach: parallel XML parsing. Parallel XML parsing leverages the growing prevalence of multicore architectures in all sectors of the computer market, and yields significant performance improvements. This paper presents our design and implementation of parallel XML parsing. Our design consists of an initial preparsing phase to determine the structure of the XML document, followed by a full, parallel parse. The results of the preparsing phase are used to help partition the XML document for data-parallel processing. Our parallel parsing phase is a modification of the libxml2 XML parser, which shows that our approach applies to real-world, production-quality parsers. Our empirical study shows that our parallel XML parsing algorithm can improve XML parsing performance significantly and scales well.

I. INTRODUCTION

XML's emergence as the de facto standard for encoding tree-oriented, semi-structured data has brought significant interoperability and standardization benefits to grid computing. Performance, however, is still a lingering concern for some applications of XML. A number of approaches have been used to address these performance concerns, ranging from binary XML to schema-specific parsing to hardware acceleration.

As manufacturers have encountered difficulties in furthering exponential increases in clock speeds, they are increasingly utilizing the march of Moore's law to provide multiple cores on a single chip. Tomorrow's computers will have more cores rather than exponentially faster clock speeds, and software will increasingly have to rely on parallelism to take advantage of this trend.

In this paper, we investigate the seemingly quixotic idea of parsing XML in parallel on a shared-memory computer, and develop an approach that scales reasonably well to four cores.

Concurrency could be used in a number of ways to improve XML parsing performance. One approach would be to use pipelining. In this approach, XML parsing could be divided into a number of stages, each executed by a different thread. This approach may provide speedup, but software pipelining is often hard to implement well, due to synchronization, load-balance, and memory access costs.

More promising is a data-parallel approach. Here, the XML document would be divided into some number of chunks, and each thread would work on its chunks independently. As the chunks are parsed, the results are merged.

To divide the XML document into chunks, we could simply treat it as a sequence of characters, and then divide the document into equal-sized chunks, assigning one chunk to each thread. This requires that each thread begin parsing from an arbitrary point in the XML document, however, which is problematic. Since an XML document is the serialization of a tree-structured data model (called the XML Infoset) traversed in left-to-right, depth-first order, such a division will create chunks corresponding to arbitrary parts of the tree, and thus the parsing results will be difficult to merge back into a single tree. Correctly reconstructing namespace scopes and references will also be challenging. Furthermore, most chunks will begin in the middle of some string whose grammatical role is unknown. It could be a tag name, an attribute name, an attribute value, element content, etc. This could be resolved by extensive backtracking and communication, but that would incur overhead that may negate the advantages of parallel parsing. Clearly, rather than an equal-sized physical decomposition, the ability to decompose the XML document based on its logical structure is the key to efficient parallel XML parsing.

The results of parsing XML can vary from a DOM-style data structure representing the XML document, to a sequence of events manifest as callbacks, as in SAX-style parsing. Our parallel approach in this paper focuses on DOM-style parsing, where a tree data structure that represents the document is created in memory. Our targeted application area is scientific computing, but we believe our approach is broadly applicable. Our implementation is based on the production-quality libxml2 parser, which shows that our work applies to real-world parsers, not just research implementations.

Current programming models for multicore architectures provide access to multiple cores via threads. Thus, in the rest of the paper, we use the term thread rather than core. To avoid scheduling issues that are outside the scope of this paper, we assume that each thread is executing on a separate core.

The rest of the paper is organized as follows. Section II describes the general architecture of our approach, PXP. Then, in Sections III and IV, we present the algorithm design and implementation details. We present performance results in Section V. Related work is discussed in Section VI.

Fig. 1. The PXP architecture first uses a preparser to generate a skeleton of the XML document. This is then used to guide the partitioning of the document into chunks, which are then parsed in parallel.

Fig. 2. The top diagram shows the XML Infoset model of a simple XML document. The bottom diagram shows the skeleton of the same document.

II. PXP

Any kind of parsing is based on some kind of machine abstraction. The problems of an arbitrary division scheme arise from a lack of information about the state of the parsing machine at the beginning of each chunk. Without this state, the machine does not know how to start parsing the chunk. Unfortunately, the full state of the parser after the Nth character cannot be provided without first considering each of the preceding N-1 characters.

This leads us to the PXP (Parallel XML Parsing) approach presented in this paper. We first use an initial pass to determine the logical tree structure of an XML document. This structure is then used to divide the XML document such that the divisions between the chunks occur at well-defined points in the XML grammar. This provides enough context so that each chunk can be parsed starting from an unambiguous state.

This seems counterproductive at first glance, since the primary purpose of XML parsing is to build a tree-structured data model (i.e., the XML Infoset) from the XML document. However, the tree structure needed to guide the parallel parsing can be significantly smaller and simpler than that ultimately generated by a normal XML parser, and does not need to include all the information in the XML Infoset data model. We call this simple tree structure, specifically designed for XML data decomposition, the skeleton of the XML document.

To distinguish it from the actual XML parsing, the procedure that parses the XML document and generates the skeleton is called preparsing. Once the preparsing is complete and we know the logical tree structure of the XML document, we are able to divide the document into balanced chunks and then launch multiple threads to parse the chunks in parallel. Consequently, this parallelism can significantly improve performance. Our overall architecture is shown in Figure 1.

For simplicity and performance, PXP currently maps the entire document into memory with the mmap() system call. Nothing precludes our general approach from working on streamed documents, or documents too large to fit into memory, but the design and implementation would be significantly more complex.

III. PREPARSING

The goal of preparsing is to determine the tree structure of the XML document so that it can be used to guide the data-parallel, full parsing.

A. Skeleton

Conceptually, the XML Infoset represents the tree structure of the XML document. However, since only internal nodes (i.e., element items) determine the topology of the tree, which is what is meaningful for XML data decomposition, the leaf nodes in the XML Infoset, such as attribute information items, comment information items, and even character information items, can be ignored by the skeleton. Further, the element tag names are also ignored by the skeleton, since they do not affect the topology of the tree at all. So, as shown in Figure 2, the skeleton is essentially a tree of unnamed nodes, isomorphic to the original XML document, and constructed from all start-tag/end-tag pairs. To facilitate the XML data decomposition, our skeleton records the location of the start tag and end tag of each element, the parent-child relationships, and the number of children of every element.

B. Implementation

Well-formed XML is not a regular language, and it cannot be parsed by a finite-state automaton, but rather requires at least a push-down automaton. So even determining the fundamental structure of the XML document, just for preparsing, requires executing a push-down automaton. However, since preparsing is an additional processing step for parallel parsing, it is an additional overhead not normally incurred during XML parsing. Furthermore, since it is sequential, it fundamentally limits the parallel parsing performance. Hence, a fundamental premise of our work is that preparsing can build the skeleton at minimal cost.

According to the XML specification, a non-validating¹ XML parser must determine whether or not an XML document is well-formed. An XML document is considered well-formed if it satisfies both requirements below:
1) It conforms to the syntax production rules defined in the XML specification.
2) It meets all the well-formedness constraints given in the specification.

However, since preparsing will be followed by a full-fledged XML parsing stage, the preparsing itself can ignore many errors. That is, for a well-formed XML document, the preparser must generate the correct result, but for an ill-formed XML document, the preparser does not need to detect any errors. Thus, our preparser only detects weak conformance to the XML specification, and hence is simpler to implement and optimize.

Fig. 3. This automaton accepts the syntax needed by preparsing. (To emphasize the major states, we omit the states for the PI, Comment, and CDATA productions by enclosing them in the dashed line box.)
As the skeleton only contains the locations of the element nodes in the XML document, preparsing only needs to consider the element tag pairs, and can ignore other syntactic units and production rules, such as those for comments, character data, and attributes. Consequently, the preparsing has a much simpler set of production rules than standard XML. For example, the production rules for a start tag in XML 1.0 are defined as:

    STag ::= '<' Name (S Attribute)* S? '>'
    Attribute ::= Name Eq AttValue
    Name ::= (Letter | '_' | ':') (NameChar)*
    AttValue ::= '"' ([^<&"] | Reference)* '"'
              |  "'" ([^<&'] | Reference)* "'"

Because preparsing can ignore Attribute and AttValue, and even the entire Name production rule, the syntax could seemingly be simplified to just:

    STag ::= '<' ([^>])* '>'

However, the above simplified production rule is incorrect due to ambiguity, because AttValue allows the > character by its production rule, which, if it appears, will cause the preparser to misidentify the location of the actual right angle bracket of the tag. Therefore, the correct rules are:

    STag ::= '<' ([^'"])* AttValue* '>'
    AttValue ::= '"' ([^'"])* '"' | "'" ([^'"])* "'"

With the same concern about possible ambiguity, the PI, Comment, and CDATA productions must be preserved in the preparsing rule set, because these constructs are allowed to contain any string, including the < character, which would otherwise cause the preparser to misidentify the location of the end tag. The rest of the production rules of standard XML are ignored by the preparsing.

The simplified preparsing syntax results in a much simpler parsing automaton (Figure 3), requiring only six major states, than the one needed by complete XML parsing. Predictably, the preparsing automaton runs much faster than the general XML parsing automaton.

In addition to the simplified syntax, preparsing also benefits from omitting other well-formedness constraints. Usually, in order to check the well-formedness constraints, a general XML parser will perform a number of additional comparisons, transformations, sorting, and buffering, all of which can result in significant performance bottlenecks. For instance, the fundamental well-formedness constraint is that the name in the end tag of an element must match the name in the start tag. To check this constraint, a general XML parser might push the start tag name onto a stack whenever a start tag is encountered, and pop the stack to match the name against the end tag. The preparser, however, treats the XML document as a sequence of unnamed open and close tag pairs. Therefore, it can merely increment the top pointer of the stack for any start tag, and decrement it for any end tag. Finally, if the top pointer points to the bottom of the stack, the preparser considers the XML document to be correct, without any expensive string comparisons.

Another well-formedness constraint example is that an attribute name must not appear more than once in the same start tag. To verify this, a full XML parser must perform an expensive uniqueness test, which is not required for preparsing. Finally, preparsing obviously does not need to resolve namespace prefixes, since it completely ignores the tag names, whereas a full XML parser supporting namespaces requires expensive lookup operations to resolve namespace prefixes. The only constraint the preparsing checks is that each open tag is paired with a close tag. A simple stack is adopted for this check, and the skeleton nodes are generated as the result of the pushing and popping of the stack.

Another important source of the performance advantage of preparsing over full parsing is that the skeleton is much lighter-weight than the DOM structure. Thus, preparsing is able to generate the skeleton substantially faster than full XML parsing is able to generate the DOM. When compared to SAX, the preparser benefits from avoiding callbacks.

IV. PARALLEL PARSING

During the parallel parsing phase, we use the structural information in the skeleton to divide the document into chunks, each of which contains a forest of subtrees of the XML document. Each chunk is parsed by a thread. For any data-parallel technique to be effective, load-balancing must be used to prevent idle threads. Ideally, we could divide the document into chunks such that there is one chunk for each thread and such that each chunk takes exactly the same amount of time to parse. Depending on when and how the partitioning is performed, we have two strategies: static partitioning and dynamic partitioning.

A. Static Partitioning

Naturally, we can statically partition a tree into several equally-sized subparts by using a graph partitioning tool (e.g., Metis), which can divide the graph/tree into N equally-sized parts. The advantage of static partitioning is that it can generate a very well-balanced load for every thread, thus leading to good parallelism.

However, since the static partitioning occurs before the actual XML parsing, it knows little about the parsing context (e.g., namespace declarations). In other words, cuts made by the static partitioning will create the following problems:
1) The characters of the XML document corresponding to a subgraph may no longer be contiguous. Metis will create connected subgraphs, but a connected subgraph of the logical tree structure does not necessarily correspond to a contiguous sequence of characters in the XML document. In order to parse the resulting characters, we must either reconstruct a contiguous sequence by memory copying, or modify the XML parser to handle non-contiguous character sequences, which may be challenging.
2) The namespace scope may be split between subgraphs, which means a namespace prefix may be used in one subgraph, but defined in another. These inter-chunk references will create strong memory and synchronization dependencies between threads, which will degrade performance.

The static partitioning strategy also suffers because the static partitioning algorithm must be executed sequentially before the parallel parsing; thus the performance gained by the parallelism will very easily be offset by the cost of the static partitioning algorithm, which usually is not trivial.

In summary, the static partitioning strategy is not really practical for XML documents with irregular tree structures, due to strong dependencies between the different processing steps. However, for those XML documents containing an array, which are responsible for the bulk of most large XML documents, static partitioning is able to provide the best parallelism. For XML documents representing an array structure, such as

    <data>
      <item>....</item>
      ...
      <item>....</item>
    </data>

a linear array can easily be divided into equal-sized ranges (i.e., subgraphs) without an expensive graph-partitioning step. The division is based on the left-to-right order, so every range is contiguous in the XML document.

We have developed the static PXP algorithm, a simple static partitioning and parallel parsing algorithm capable of parsing XML documents with array structures. This serves to provide a baseline against which we can compare more realistic techniques: it provides an upper bound on the performance gain of parallel parsing, and is useful as a guideline for evaluating other parallel parsing approaches. Conveniently, we are able to leverage a function from libxml2, a widely-used and efficient XML parsing library written in C, to perform the parsing:

    xmlParseInNodeContext(xmlNodePtr node,
                          const char *data, int datalen,
                          int options, xmlNodePtr *lst)

This function can parse a "well-balanced chunk" of an XML document within the context (DTD, namespaces, etc.) of the given node. A well-balanced chunk is defined as any valid content allowed by the XML grammar. Since the regions generated by our static partitioning are well-balanced (obviously, any element range generated by static array partitioning is a well-balanced chunk), we can use this function to parse each chunk. The static PXP algorithm then consists of the following steps:
1) Construct a faked XML document in memory containing just an empty root element, by copying the open/close tag pair of the root element from the original XML document. Since we assume that the size of the root element is much smaller than the whole document, the cost of any memory operations used by this step is acceptable.
2) Call the libxml2 function xmlParseMemory() to parse the faked XML document, thus obtaining the root XML node. This node contains the namespace declarations required by its children, and will be treated as the context for the following parse of the ranges of the array.
3) The number of elements in each chunk is calculated by simply dividing the total number of elements in the array, which was computed during the preparsing stage, by the number of available threads, so that every thread has a balanced workload. The start position and data length of each chunk can be inferred from the location information of its first and last elements.
4) Create a thread to parse each chunk in parallel. Each thread invokes xmlParseInNodeContext() to parse and build the DOM structure.
5) Finally, the parsed results of each thread are spliced back under the root node.

B. Dynamic Partitioning

In contrast with static partitioning, the dynamic partitioning strategy partitions the XML document and generates the subtasks during the actual XML parsing. After the preparser generates the skeleton, the tree structure is traversed in parallel to complete the parsing. Whenever a node is visited by a thread, its corresponding serialization (start tag) is parsed and the related DOM node is built. The parallel tree traversal is equivalent to a complete, parallel depth-first search (DFS) (in which the desired node is not found), which partitions the tree dynamically and searches for a specific goal in parallel using multiple threads.

Following Rao, dynamic partitioning consists of two phases:
• Task partitioning
• Subtask distribution

Task partitioning refers to how a thread splits its current task into subtasks when another thread needs work. A common strategy is node splitting, in which each of the n nodes spawned by a node in the tree is itself given away as an individual subtask. However, for parallel XML parsing, node splitting may generate too many small tasks, since most nodes represent a single leaf element in the XML document, thus increasing the communication cost. Since XML is a depth-first, left-to-right serialization of a tree, a sequence of sibling element nodes in the skeleton corresponds to a contiguous chunk of the XML document. Therefore, if each parsing task covers a sequence of sibling element nodes, this will maximize the size of each workload, with little communication cost. In dynamic partitioning, we adopt a simple but effective policy of splitting the workload in half, as shown in Figure 4. That is, whenever partitioning is requested, the running thread splits the unparsed siblings of the current element node into two halves in left-to-right order.

Subtask distribution refers to how and when subtasks are distributed from the donator thread to the requester thread. If work splitting is performed only when an idle processor requests work, it is called requester-initiated subtask distribution. In contrast, if the generation of subtasks is independent of the work requests from idle processors, the scheme is referred to as donator-initiated subtask distribution. For parallel XML parsing, we desire that a parsing thread parse as much XML data as possible without interruption, unless other threads are idle and asking for tasks, so as to achieve better performance. Also, any thread can be the donator or the requester. We therefore adopt requester-initiated subtask distribution as the partition strategy in PXP.

Fig. 4. The left diagram illustrates the general node splitting strategy. Each node becomes a subtask. The right diagram illustrates the split-in-half strategy. The nodes of the current parsing task are split in half, with the first half given to the requesting task, while the current task finishes the second half.

To implement parallel parsing with dynamic partitioning, we again use libxml2. However, since dynamic partitioning requires that the parser perform the task partitioning and subtask generation during the parsing, we cannot simply apply the libxml2 xmlParseInNodeContext() function as in the static partitioning scheme. Instead, we need to change the xmlParseInNodeContext()² source code to integrate the dynamic partitioning and subtask generation logic into the original parsing code. The modified algorithm is called dynamic PXP, and its basic steps are:
1) Create multiple threads, and assign the root node of the skeleton as the initial parsing task to the first thread. The other threads are idle.
2) When a thread is idle, it posts its request on a request queue, and waits for the request to be filled by some donator thread.
3) Every thread, once it begins parsing, parses normally as libxml2 does, except when an open tag is being parsed. At that time, it checks the request queue for threads that need work. If such a requester thread exists, the thread splits its current workload (i.e., the unparsed sibling nodes) into two regions. The first half is donated to the requester thread, and the thread resumes parsing at the beginning of the second half. Since every skeleton node records the number of its child elements, as well as its location information, it is easy to determine the begin location and data length of the subtask. Also, to avoid excessively small tasks, the user can set a threshold that prevents task partitioning when the remaining work is less than the threshold.
4) Once the requester thread obtains the parsing task, it begins parsing at the beginning location of the donated subtask. Due to the dynamic nature, the donator is able to pass its current parsing context (e.g., the namespace declarations) to the requester as the requester's initial parsing context; the requester in turn makes a clone of the parsing context for itself before parsing, to avoid synchronization costs. Also, the donator creates a dummy node as a "placeholder" for the parsing task; the subtrees generated by the requester are inserted under the placeholder, and once the parsing task is completed, the placeholder is spliced into the entire DOM tree.
5) This process continues until all threads are idle.

In summary, dynamic partitioning load-balances during the parsing, and it can be applied to any irregular tree structure without the need for an extra partitioning algorithm. However, its dynamic nature incurs a synchronization and communication cost among the threads, which is not needed by the static partitioning scheme.

¹ DTD and validating XML parsing are not supported by our current system, for simplicity. Also, DTD is being replaced by XML Schema validation, which is usually a separate process after the XML parsing.
² In fact, the actual modified function is xmlParseContent(), which is invoked by xmlParseInNodeContext() to parse the XML content.
V. MEASUREMENT

We first performed experiments to measure the performance of the preparsing, and then performed experiments to measure the performance improvement and the scalability of the parallel XML parsing algorithms (static and dynamic partitioning) over different XML documents. The experiments were run on a Linux 2.6.9 machine with two dual-core AMD Opteron processors and 4 GB of RAM. Every test is run five times to get the average time, and the measurement of the first run is discarded, so as to measure performance with the file data already cached, rather than being read from disk. The programs are compiled with g++ 3.4.5 with the option -O3, and the libxml2 library we are using is version 2.6.16.

During our initial experiments, we noticed poor speedup during a number of tests that should have performed well. We attributed this to lock contention in malloc(). To avoid this, we wrote a simple, thread-optimized allocator around malloc(). This allocator maintains a separate pool of memory for each thread. Thus, as long as an allocation request can be satisfied from this pool, no locks need to be acquired. To fill the pool initially, we simply run the test once, then free all memory, returning it to each pool.

Our allocator is intended simply to avoid lock contention. A production allocator would use other techniques to reduce lock contention. One possibility is to simply use a two-stage technique, where large chunks of memory are obtained from a global pool, and then managed individually for each thread in a thread-local pool.

A. Preparsing Performance Measurement

Preparsing generates the skeleton, which is necessary for PXP. However, this is an additional step compared to normal XML parsing, which, unfortunately, also needs to be performed sequentially before the actual parallel parsing. Thus, to help determine whether or not this cost is acceptable, and to understand the overall PXP performance, we measured preparsing time and compared it to full libxml2 parsing.

Since preparsing linearly traverses the XML document without backtracking or other bookkeeping, its time complexity is linear in the size of the document, and independent of the structural complexity of the document. We thus designed the preparsing test to maximize the performance of a full sequential parser, and used a simple array of elements which varied in size. The test document is shown in the Appendix. First, we varied the number of elements in the array to increase the document size. Then, for comparison, we measured the costs of two widely-used parsing methods: building a DOM with libxml2, and parsing with the SAX implementation in libxml2. For the libxml2 SAX implementation, we used empty callback routines; thus, libxml2 SAX is expected to be extremely fast. The results are shown in Figure 5.

Fig. 5. Performance comparison of preparsing.

According to Figure 5, we see that preparsing is nearly 12 times faster than sequential parsing with libxml2 to build a DOM. Even compared to libxml2 SAX parsing, preparsing is over 6 times faster. Even though the preparser builds a tree, the tree is simple and does not require expensive memory management. These results show that the preparsing does not occupy much time, and the time left for the actual parallel parsing is enough to result in significant speedup.

B. Parallel XML Parsing Performance Measurement

Speedup measures how well a parallel algorithm scales, and is important for evaluating the efficiency of parallel algorithms. It is calculated by dividing the sequential time by the parallel time. For our experiments, the sequential time refers to the time needed by libxml2 xmlParseInNodeContext() to parse the whole XML document. To be consistent, static PXP, dynamic PXP, and the sequential program are all configured to use the thread-optimized memory allocator. Each program is run five times, and the timing result of the first run is discarded to warm the cache.

We first measure the upper bound of the speedup that the PXP algorithms could achieve. To do that, we select a big XML document used in the previous preparsing experiment as the parsing test document. The array in the XML document has around 50,000 elements, every element includes up to 28 attributes, and the size of the file is 35 MB. Since the test document just contains a simple array structure, we are able
The test document is shown in the Appendix. to apply both static PXP and dynamic PXP algorithms on it. First, we varied elements in the array to increase document Figure 6 shows how the static/dynamic PXP algorithms scales size. Then for the comparison, we measured the costs of two with the number of threads when parsing this test document. widely-used parsing methods: building DOM with libxml2, The diagonal dashed line shows the theoretical ideal speedup. and parsing with the SAX implementation in libxml2. In From the graph we can see that when the threads number is addition, for the libxml2 SAX implementation, we used empty one or two the speedups of the PXP is sublinear, but if we callback routines. Thus, libxml2 SAX is expected to be subtract the preparsing time from the total time the speedups extremely fast. The results are shown in Figure 5. of static PXP is close to linear. This indicates the preparsing 4 4 static pxp w/o including preparsing non−array w/o including preparsing 3.5 dynamic pxp w/o including preparsing 3.5 array w/o including preparsing static pxp non−array dynamic pxp array 3 3 linear 2.5 2.5 Speedup Speedup 2 2 1.5 1.5 1 1 0.5 0.5 0 0 0 0.5 1 1.5 2 2.5 3 3.5 4 0 0.5 1 1.5 2 2.5 3 3.5 4 Threads Threads Fig. 6. This graph shows the upper bound of the speedup of the PXP Fig. 7. This graph shows the speedup of the dynamic PXP for up to four algorithms for up to four threads, when used to parse a big XML document threads, when used to parse two same-size XML documents, one with irregular which only contains an array structure. tree shape and one with regular array shape. dominates the overhead, and the static PXP presents the upper contents. In a typed parsing scenario, where schema or other bound of the parallel performance. information can be used to interpret the element content, we would obtain even better scalability. 
The speedups of dynamic PXP are slightly lower than those of static PXP, which indicates that the cost of communication and synchronization starts to be a factor, but remains relatively minor. As the number of threads increases, the speedup of PXP (dynamic or static) falls off, because as the workload of each thread decreases, the overhead of the preparsing becomes more significant than before. Dynamic PXP also obtains less speedup than static PXP due to the increasing communication cost. Furthermore, even the speedup of static PXP with the preparsing cost omitted starts to drop away from the theoretical limit; we speculate that shared memory or cache conflicts are playing a role here.

Unlike static PXP, dynamic PXP is able to parse XML documents of any tree shape. To further study the performance of dynamic PXP, we therefore modified the previous XML document with the big array structure to have an irregular tree shape: it consists of five top-level elements under the root, each with a randomly chosen number of children. Each of these children is an element from the array of the first test, so the total number of child elements in the modified document is the same as in the original document.

We compared dynamic PXP on this modified XML document against dynamic PXP on the original array XML document. This comparison shows how dynamic PXP scales for XML documents with irregular versus regular shape. The results shown in Figure 7 reveal little difference between the two documents, which implies that dynamic PXP (and our task partitioning of dividing the remaining work in half) is able to handle large XML files with irregular shape effectively.

These tests did not actually parse the element contents further. In a typed parsing scenario, where schema or other information can be used to interpret the element content, we would obtain even better scalability. For example, if we were parsing a large array of doubles, including the ASCII-to-double conversion, each thread would have an increased workload relative to the preparsing stage and other overheads, and thus the speedup would be improved.

VI. RELATED WORK

As mentioned earlier, parallel XML parsing can essentially be viewed as a particular application of graph partitioning and parallel graph search algorithms. But document parsing and DOM building introduce some new issues, such as preparsing and namespace references, which are not addressed by those general parallel algorithms.

There are a number of approaches that try to address the performance bottleneck of XML parsing. Typical software solutions include pull-based parsing, lazy parsing, and schema-specific parsing. Pull-based XML parsing is driven by the user, and thus provides flexible performance by allowing the user to build only the parts of the data model that are actually needed by the application. Schema-specific parsing leverages XML schema information, from which a specific parser (an automaton) is built to accelerate XML parsing. For XML documents conforming to the schema, schema-specific parsing runs very quickly, whereas for other documents an extra penalty is paid. Most closely related to the work in this paper is lazy parsing, because it also needs a skeleton-like structure of the XML document for its lazy evaluation: first a skeleton is built from the XML document to indicate the basic tree structure; thereafter, based on the user's access requirements, the corresponding piece of the XML document is located by looking up the skeleton and fully parsed. However, the purposes of lazy parsing and parallel parsing are completely different, so the structure and the use of the skeleton in the two approaches differ fundamentally.
Hardware-based solutions are also promising, particularly in the industrial arena. But to the best of our knowledge, there is no other work leveraging the data-parallelism model as PXP does.

VII. CONCLUSION AND FUTURE WORK

In this paper, we have described our approach to parallel XML parsing, and shown that it performs well for up to four cores. An efficient parallel XML parsing scheme needs an effective data decomposition method, which implies a better understanding of the tree structure of the XML document. Preparsing is designed to extract the minimal tree structure (i.e., the skeleton) from the XML document as quickly as possible. The key to the high performance of the preparsing is its highly simplified syntax, as well as the obviation of full well-formedness constraint checking. Aided by the skeleton, the algorithm can partition the XML document into chunks and parse them in parallel. Depending upon when the document is partitioned, we have the static PXP and dynamic PXP algorithms. The former applies only to XML documents with array structures and gives the best-case benefit of parallelism, while the latter is applicable to any structure, but with some communication and synchronization cost. Our experiments show that the preparsing is much faster than full XML parsing (either SAX or DOM), and that, based on it, the parallel parsing algorithms can speed up parsing and DOM building significantly and scale well. Since the preparsing becomes the bottleneck as the number of threads increases, our future work will investigate the feasibility of parallelism between the preparsing and the real parsing. New approaches for very large XML documents will also be studied under the shared memory model.

ACKNOWLEDGMENT

We would like to thank Professor Randall Bramley for his insightful suggestions and help with graph partitioning and Metis. We also thank Zongde Liu and Srinath Perera for their useful comments and discussion.

REFERENCES

[1] D. Veillard, "Libxml2 project web page," http://xmlsoft.org/, 2004.
[2] H. Sutter, "The free lunch is over: A fundamental turn toward concurrency in software," Dr. Dobb's Journal, vol. 30, 2005.
[3] W3C, "XML information set (second edition)," http://www.w3.org/TR/xml-infoset/, 2003.
[4] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation. Addison Wesley, 2000.
[5] W3C, "Extensible Markup Language (XML) 1.0 (Third Edition)," http://www.w3.org/TR/2004/REC-xml-20040204/, 2004.
[6] G. Karypis and V. Kumar, "Parallel multilevel k-way partitioning scheme for irregular graphs," in Supercomputing, 1996.
[7] V. N. Rao and V. Kumar, "Parallel depth first search. Part I: Implementation," Int. J. Parallel Program., vol. 16, no. 6, pp. 479–499, 1987.
[8] V. Kumar and V. N. Rao, "Parallel depth first search. Part II: Analysis," Int. J. Parallel Program., vol. 16, no. 6, pp. 501–519, 1987.
[9] A. Slominski, "XML pull parsing," http://www.xmlpull.org/, 2004.
[10] M. L. Noga, S. Schott, and W. Lowe, "Lazy XML processing," in DocEng '02: Proceedings of the 2002 ACM Symposium on Document Engineering, 2002.
[11] K. Chiu and W. Lu, "A compiler-based approach to schema-specific XML parsing," in The First International Workshop on High Performance XML Processing, 2004.
[12] W. M. Lowe, M. L. Noga, and T. S. Gaul, "Foundations of fast communication via XML," Ann. Softw. Eng., vol. 13, no. 1-4, 2002.
[13] R. van Engelen, "Constructing finite state automata for high performance XML web services," in Proceedings of the International Symposium on Web Services (ISWS), 2004.
[14] J. van Lunteren, J. Bostian, B. Carey, T. Engbersen, and C. Larsson, "XML accelerator engine," in The First International Workshop on High Performance XML Processing, 2004.
[15] "Datapower," http://www.datapower.com/.

APPENDIX

Structure of the XML document ns att test.xml:

<xml xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema' xmlns:tb0='table0' xmlns:tb1='table1'
     xmlns:tb2='table2' xmlns:tb3='table3'>
  <z:row tb1:PRODUCT=... tb0:CCIDATE=...
         tb0:CLASS=... tb2:ADNUMBER=...
         tb0:PRODUCTIONCATEGORYID_FK=...
         tb3:ADVERTISERACCOUNT=...
         tb1:YPOSITION=... tb2:CHEIGHT=...
         tb2:CWIDTH=... tb2:MHEIGHT=...
         tb2:MWIDTH=... tb2:BHEIGHT=...
         tb2:BWIDTH=... tb3:SALESPERSONNUMBER=...
         tb3:SALESPERSONNAME=...
         tb1:PAGENAME=... tb1:PAGENUMBER=...
         tb2:BOOKEDCOLOURINFO=... tb1:EDITION=...
         tb1:MOUNTINGCOMMENT=... tb1:TSNLSALESSYSTEM=...
         tb1:TSNLCLASSID_FK=... tb1:TSNLSUBCLASS=...
         tb1:TSNLACTUALDEPTH=... tb1:XPOSITION=...
         tb0:TSNLCEESRECORDTYPEID_FK=...
         tb0:PRODUCTZONE=... ROWID=.../>
  <z:row ... />
  <z:row ... />
  ...
</xml>