					APWeb 2004 Hangzhou, China




Labeling and Querying Dynamic XML Trees



                  Jiaheng Lu and Tok Wang Ling

                       School of Computing
                  National University of Singapore






Contents
Introduction
    Introduction to structural join
    Introduction to labeling scheme
Our Methods
    Preliminary definition
    Group based prefix labeling scheme
    Group based join algorithm
Our Experiments






Introduction to Structural Join
XML employs a tree-structured model for representing data.

An XML query can be decomposed into a set of basic
structural (parent-child or ancestor-descendant)
relationships between pairs of nodes.




a) XML source

      <book title="XML">
         <allauthors>
             <author>John</author>
             <author>Tom</author>
         </allauthors>
         <year>2003</year>
         <chapter>
             <head>….</head>
             <section>…</section>
         </chapter>
      </book>

b) XML tree

      book
         title
            XML
         allauthors
            author
               John
            author
               Tom
         year
            2003
         chapter
            head
               ….
            section
               ….

Any node in the XML tree may be an element, an attribute, or a value from the XML source.

c) XPath tree pattern (edges are parent-child or ancestor-descendant)

      book
         Title
            XML
         author
            John

d) Basic structural relationships

      book - Title
      Title - XML
      book - Author
      Author - John




Labeling scheme
To perform a structural join, each node in an XML tree
is assigned a unique label.
We can determine the ancestor-descendant (or parent-child)
relationship between any two nodes from their labels.
The method of assigning the labels is called a labeling
scheme.








Range labeling scheme
In the range labeling scheme, the label of a node v is
a pair of numbers <a_v, b_v>: a_v is called the
start position and b_v the end position. A node v (<a_v, b_v>)
is an ancestor of u (<a_u, b_u>) iff a_v ≤ a_u ≤ b_u ≤ b_v. In other
words, range <a_u, b_u> is contained in range <a_v, b_v>.
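
The containment test can be written in a few lines. This is a minimal sketch, assuming labels are stored as integer pairs; the names `RangeLabel` and `is_ancestor` are illustrative and do not come from the slides. The extra `v != u` check is an assumption that keeps a node from counting as its own ancestor.

```python
from typing import NamedTuple


class RangeLabel(NamedTuple):
    """A range label <start, end> (hypothetical container type)."""
    start: int
    end: int


def is_ancestor(v: RangeLabel, u: RangeLabel) -> bool:
    """True if v's range contains u's range and the labels differ."""
    return v != u and v.start <= u.start and u.end <= v.end


# Labels taken from the range-labeling example tree in these slides.
book = RangeLabel(1, 12)
author = RangeLabel(3, 4)
year = RangeLabel(7, 8)
```

Each test needs only a constant number of integer comparisons, which is why the scheme answers ancestor-descendant questions in constant time.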








Range labeling scheme
Book <1,12>
   Title <1,2>
      XML <1,1>
   Allauthors <3,6>
      Author <3,4>
         John <3,3>
      Author <5,6>
         Tom <5,5>
   Year <7,8>
      2003 <7,7>
   Chapter <9,11>
      Head <9,9>
      Section <10,10>




Range labeling scheme
Pros: Ancestor-descendant relationships can be decided in
  constant time.

Cons: This method lacks flexibility. Inserting nodes causes a
  renumbering problem. To get around this problem,
  some papers propose leaving "gaps" between the
  numbers of the leaves. However, if one part of the document
  is heavily updated, the available numbers may still run
  out and the tree must be renumbered. So this approach
  cannot ultimately solve the problem.
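
To see why gaps only postpone renumbering, the sketch below counts how many insertions fit between two neighboring integer labels when every new node lands next to the same neighbor. The midpoint policy and the function name are assumptions chosen for illustration; the slides do not prescribe how gap numbers are picked.

```python
def insertions_until_renumber(low: int, high: int) -> int:
    """Count insertions between labels low and high before no integer is left.

    Assumes each new node takes the midpoint of the remaining gap and that
    the next insertion always falls between `low` and the newest node.
    """
    count = 0
    while high - low > 1:          # at least one unused integer remains
        high = (low + high) // 2   # the new node takes the midpoint label
        count += 1
    return count
```

Under these assumptions a gap of 1024 absorbs only 10 insertions at the same spot, so a heavily updated region exhausts its gap quickly.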






Prefix labeling scheme

Edith Cohen et al. (PODS 2002) (1) propose a simple prefix labeling scheme
   that avoids renumbering in all cases.

For any new node v, the label is L(v) = L(u) · 1^i 0, that is, L(u) followed by
   i ones and a single zero,

where u is the parent of v and i is the number of already labeled children of u.
   The root node is labeled with the empty string.

(1) Edith Cohen, Haim Kaplan, Tova Milo. Labeling Dynamic XML
    Trees. ACM PODS 2002.






How to generate simple prefix label
L(v) = L(u) · 1^i 0   (L(u) followed by i ones and a single zero)

u is the parent of v and i is the number of labeled children of u.

Book ""
   Title "0"
   Authors "10"
      Author "100"
      Author "1010"
      Author "10110"
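
The labeling rule is small enough to sketch directly. The code below is an illustration under stated assumptions: the `Node` class and its method names are hypothetical, and the ancestor test relies on the usual reading of a prefix scheme, where one node is an ancestor of another exactly when its label is a proper prefix of the other's label.

```python
class Node:
    """Tree node carrying a simple prefix label (hypothetical helper class)."""

    def __init__(self, name: str, label: str = ""):
        self.name = name
        self.label = label          # the root keeps the empty string
        self.children = []

    def add_child(self, name: str) -> "Node":
        i = len(self.children)      # number of already labeled children
        child = Node(name, self.label + "1" * i + "0")
        self.children.append(child)
        return child


def is_ancestor(a: Node, d: Node) -> bool:
    """a is an ancestor of d iff a's label is a proper prefix of d's label."""
    return a.label != d.label and d.label.startswith(a.label)


book = Node("Book")
title = book.add_child("Title")
authors = book.add_child("Authors")
a1 = authors.add_child("Author")
a2 = authors.add_child("Author")
a3 = authors.add_child("Author")
```

Because a new child only appends to its parent's label, no existing label changes on insertion.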




Simple Prefix labeling scheme
Pros: Compared with other labeling schemes, such as the range
  labeling scheme, the simple prefix scheme does not
  need renumbering for any insertion sequence.
Cons: The index size is too large. The tight bound on the
  size is O(N^2) in the worst case, where N is the
  number of nodes in an XML tree.
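
One way to see the quadratic growth is a flat tree: a root with N-1 children, where the i-th child receives a label of i ones and one zero. The sketch below adds up the label lengths for that shape. The function name and the choice of this particular tree are assumptions made for illustration.

```python
def star_tree_label_bits(n: int) -> int:
    """Total label length, in bits, for a root with n - 1 children."""
    labels = [""]                        # the root has the empty label
    for i in range(n - 1):               # i = number of earlier siblings
        labels.append("1" * i + "0")
    return sum(len(label) for label in labels)
```

For this shape the total is 1 + 2 + ... + (N-1) = N(N-1)/2 bits, which grows as N^2.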







                               Group
Definition: Given an XML tree T, a group is a set of subtrees whose
  root nodes all share a common parent node in T.


[Figure: an example XML tree with nodes A to J. One marked region is a single subtree; another marked region contains two subtrees that together form a group.]




                             Group

   Property 1: Given an XML tree T and a group S, for any
      node n ∈ T with n ∉ S, one of the following two conditions
      must be satisfied: (1) n is an ancestor of all nodes in S;
      (2) n is not an ancestor of any node in S.
   In other words, n cannot be an ancestor of only some
      of the nodes in S.
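
Property 1 can be checked mechanically on a small tree. The sketch below does so for a made-up tree stored as a child-to-parent map; the tree shape, node names and helper functions are all assumptions for illustration and are not meant to reproduce the figure from the slides.

```python
# Hypothetical tree: each key is a node, each value is its parent.
PARENT = {"B": "A", "F": "A", "C": "B", "D": "B", "E": "D",
          "G": "F", "H": "F", "I": "G", "J": "H"}


def ancestors(n):
    """All proper ancestors of n."""
    out = set()
    while n in PARENT:
        n = PARENT[n]
        out.add(n)
    return out


def subtree(root):
    """The root together with all of its descendants."""
    return {root} | {n for n in PARENT if root in ancestors(n)}


def satisfies_property_1(group_roots):
    """Check that every node outside the group is an ancestor of all or none."""
    # By definition, the subtree roots of a group share one parent.
    assert len({PARENT[r] for r in group_roots}) == 1
    members = set().union(*(subtree(r) for r in group_roots))
    all_nodes = set(PARENT) | set(PARENT.values())
    for n in all_nodes - members:
        covered = {m for m in members if n in ancestors(m)}
        if covered and covered != members:
            return False
    return True
```

The property holds because any outside ancestor of one group member must pass through the group's common parent, and is therefore an ancestor of every member.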








GRP Labeling Scheme

  The Group based Prefix (GRP) labeling scheme associates
  each node n in the XML tree with a pair
  <groupID, prefix-label>, where groupID is a non-negative
  integer and prefix-label is a binary string. All nodes in
  the same group have the same groupID and are
  distinguished by their prefix-labels.
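
A GRP label maps naturally onto a two-field record. The following is a minimal sketch, assuming that inside one group the ancestor test is the usual proper-prefix test on the prefix-label; the type name `GrpLabel` and the function name are hypothetical. Relationships between nodes in different groups are deliberately not decided here.

```python
from typing import NamedTuple


class GrpLabel(NamedTuple):
    """<groupID, prefix-label> pair (hypothetical container type)."""
    group_id: int
    prefix: str


def is_intra_group_ancestor(a: GrpLabel, d: GrpLabel) -> bool:
    """True if a and d share a group and a's prefix properly prefixes d's."""
    return (a.group_id == d.group_id
            and a.prefix != d.prefix
            and d.prefix.startswith(a.prefix))


root = GrpLabel(1, "0")
b = GrpLabel(1, "010")
c = GrpLabel(2, "0")
```

Here `root` and `c` carry the same prefix-label but different group IDs, so the pair as a whole still identifies each node uniquely.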








GRP label example:

Root <1,"0">
   A <1,"00">
   B <1,"010">
      C <2,"0">
      D <2,"10">
      E <2,"110">

   In this example, the maximum number of nodes in each group is three.







   Group based Structural Join (GRJ) Algorithm

   The main idea of GRJ is to divide the join operations
   into two classes: intra-group join and inter-group join.
    - Intra-group join: the join takes place among
      elements in the same group.
    - Inter-group join: the join takes place among
      elements in different groups.







Intra-group Join
Intra-group join is easy to understand. There are two
   alternative methods to perform this join:
(i) Simply compare the prefix labels of every pair of elements to
   identify their relationship, like a nested-loop join in an RDBMS.
(ii) A cleverer method is to first sort the prefix labels, then use a
   stack to cache the potential ancestors so that the join
   data is scanned only once, like a sort-merge join in an RDBMS.
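
Method (ii) can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes prefix labels are plain binary strings built as parent label plus `1^i 0`, so that lexicographic order matches document order and an ancestor is a proper prefix. The function name is hypothetical.

```python
def intra_group_join(anc_prefixes, desc_prefixes):
    """Return (ancestor, descendant) prefix-label pairs within one group."""
    # Merge both lists in sorted order; on ties, ancestors (tag 0) come first.
    events = sorted([(p, 0) for p in anc_prefixes] +
                    [(p, 1) for p in desc_prefixes])
    stack = []    # candidate ancestors; each one is a prefix of the next
    pairs = []
    for prefix, is_desc in events:
        # Drop candidates that are no longer on the path to this node.
        while stack and not prefix.startswith(stack[-1]):
            stack.pop()
        if is_desc:
            # Everything left on the stack is an ancestor (skip the node itself).
            pairs.extend((a, prefix) for a in stack if a != prefix)
        else:
            stack.append(prefix)
    return pairs
```

After the initial sort, each list is scanned once and every element is pushed and popped at most once, which is the sort-merge behaviour described above.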








Inter-group Join algorithm

The key point of inter-group join is to use a hash table to
cache the ancestor nodes of each group.
Each key of the hash table is the group ID of a group in the descendant set, and
the corresponding value is the parent element of that group.




Algorithm. GRJ algorithm
Input: A is the ancestor list and D is the descendant list
Output: Pairs of ancestor-descendant elements
1. Scan the A and D lists once to assign every element to its respective group bucket;
2. Initialize DgroupHash as a hash table, where keys are group IDs of Dlist, and each value is
     initialized as an empty set.
     /* DgroupHash will cache the ancestors of any group in Dlist. */
3. For i := 2 to max group do
     /* since group 1 only contains the root node, begin from group 2 */
4.     Output each element in set DgroupHash(i) as an ancestor of each element in Dlist of
       group i;
5.     Delete key i from DgroupHash;
6.     Perform intra-group join for group i;
7.     Perform inter-group join for group i (the join result is stored in the hash table DgroupHash).
8. End For
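
The overall flow can be sketched in Python. This is a rough illustration under several assumptions that the slides do not spell out: labels are `(group_id, prefix)` tuples; a hypothetical `group_parent` map gives, for each non-root group, the label of the common parent of its subtree roots; that parent always lies in a lower-numbered group; and inside a group, an ancestor is a proper prefix. Unlike the pseudocode, the loop also visits group 1 and keeps the cached sets instead of deleting them, which keeps the sketch short.

```python
from collections import defaultdict


def intra_group_join(anc, desc):
    """Sort-merge style join on prefix labels inside one group."""
    events = sorted([(p, 0) for p in anc] + [(p, 1) for p in desc])
    stack, pairs = [], []
    for prefix, is_desc in events:
        while stack and not prefix.startswith(stack[-1]):
            stack.pop()
        if is_desc:
            pairs.extend((a, prefix) for a in stack if a != prefix)
        else:
            stack.append(prefix)
    return pairs


def grj(a_list, d_list, group_parent):
    """Return (ancestor_label, descendant_label) pairs; see assumptions above."""
    # Step 1: assign every element to its group bucket.
    a_bucket, d_bucket = defaultdict(list), defaultdict(list)
    for g, p in a_list:
        a_bucket[g].append(p)
    for g, p in d_list:
        d_bucket[g].append(p)

    d_group_hash = {}   # group id -> ancestors from A shared by the whole group
    result = []
    for g in sorted(set(group_parent) | set(a_bucket) | set(d_bucket)):
        shared = []
        if g in group_parent:
            pg, pprefix = group_parent[g]
            # Ancestors of the parent's group, plus every A element in that
            # group that is the parent itself or one of its ancestors.
            shared = d_group_hash.get(pg, []) + [
                (pg, a) for a in a_bucket[pg] if pprefix.startswith(a)]
        d_group_hash[g] = shared
        # Inter-group result: by Property 1, shared ancestors cover the group.
        result += [(a, (g, d)) for a in shared for d in d_bucket[g]]
        # Intra-group result.
        result += [((g, a), (g, d))
                   for a, d in intra_group_join(a_bucket[g], d_bucket[g])]
    return result


# Labels from the GRP example, assuming C, D and E are children of B.
A = [(1, "0"), (1, "010")]
D = [(1, "00"), (2, "0"), (2, "110")]
pairs = grj(A, D, {2: (1, "010")})
```

With these inputs the sketch pairs Root with A inside group 1, and pairs both Root and B with C and E across groups.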




 Experiment setup
Comprehensive experiments were conducted to study the
effectiveness and efficiency of the GRJ algorithm.
We use synthetic and real-life data, including XMARK,
the IBM XML generator and DBLP.








Query performance
  For the GRP scheme, we use the GRJ algorithm.
  For the simple prefix (SP) scheme, we first use the block nested loop (BNL)
  algorithm. When node labels are assigned directly
  in insertion order, the element lists are usually
  unsorted, so a more efficient algorithm cannot be used.
  Experiment result: GRJ is much more efficient than BNL.








[Figure: elapsed time (sec, axis from 0 to 30) of the GRJ algorithm and the BNL algorithm against the number of nodes (K): 3, 10, 20, 25, 30.]




  Query Performance
  When special effort is taken to guarantee that the element
  lists are sorted, we may use a more efficient algorithm for the
  SP labeling scheme, called Stack-Tree-Desc (2), to perform the
  structural join. Stack-Tree-Desc is like a sort-merge join in
  an RDBMS.
  Since the original Stack-Tree-Desc algorithm is based on the
  range labeling scheme, we first modify it to use the SP
  labeling scheme (the main idea is the same).

(2) D. Srivastava, S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel
   and Yuqing Wu. Structural Joins: A Primitive for Efficient XML Query
   Pattern Matching. In ICDE 2002.




[Figure: elapsed time (axis from 0 to 350) of SP-stack-tree-desc and GRJ against the number of nodes in the join set (K): 50, 100, 150, 250.]




Query Performance
Interestingly, although the GRJ algorithm needs to
scan the element lists twice and the Stack-Tree-Desc algorithm
scans them only once, GRJ still performs better
than Stack-Tree-Desc on large data.








Query Performance
 This can be explained as follows: the Stack-Tree-Desc algorithm
 here is based on the SP labeling scheme, while GRJ is based on the
 GRP scheme. Since SP labels are much larger than
 GRP labels, accessing the GRP labels twice may
 still take less time than accessing the SP labels once. As a result,
 the GRJ algorithm outperforms the Stack-Tree-Desc algorithm.
 This result shows the importance of label size.








           ------ End-------

Thank you !
Question and Answer



