The Storage and Benchmarking of XML

Presenter: Kevin See (see@ca.ibm.com)
IBM Toronto Lab, DB2 SQL/Catalog Development
Date: Nov 2, 2001
Nov 2, 2001    The Storage and Benchmarking of XML            1
  Outline
      Introduction
      Text file / OODBMS / native DB approaches
      Relational DB approaches
             Categories
             2 latest proposals
      XML benchmarks
      Conclusion

  Introduction
      XML is emerging as the standard for data exchange
      Demand for storage and management of XML documents is growing
      There are a few ways to manage XML documents


  Text Approach
      File system
      A separate query engine needs to be implemented
      Parsing not possible
      Index strategies: (parent_offset, tag), (child_offset, parent_offset), (tagname, value), (attribute_name, attribute_value)
        Not good for updates

  Object-oriented Database
  Management System
      Michael R. Olson and Byung S. Lee (1997)
      The OO model fits XML well
      Immature technology: hard to scale
      The experiment concluded without any great success


  Native Database Approach
      Prototypes: Lore (Stanford University),
       Xyleme (INRIA, France).
      Immature technology
      No optimization capabilities




  Relational Database
  Technology
      Very mature technology
      Query optimization techniques and processing mechanisms in relational databases have been studied for a quarter of a century
      A very large percentage of data is currently stored in RDBMSs

  Storage and Retrieval of XML
  Using Relational Database
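      [Figure: XML documents are stored in relational tables through an XML-to-relational mapping. Queries in an XML query language such as XPath, XQuery, Quilt, or XQL are translated from XML query to SQL and run against the tables. Results are turned back into XML through a relational-to-XML conversion.]
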
  Classifications of Various
  Mapping Methods
      Structure-mapping approach
             The XML document's logical structures (or DTDs if available) are represented by the database schemas
             1 DTD : 1 set of generated schemas
      Model-mapping approach
             Constructs of the XML model are represented by the database schemas
             1 set of generated schemas for all/any DTD

  Relational Schema Prototype
  Tree Mapping Method
      M. Yan and A. Fu @ The Chinese
       University of Hong Kong (2001)
      Structure-mapping
      Global Schema Extraction Algorithm
      DTD Splitting Schema Extraction
       Algorithm


  Relational Schema Prototype
  Tree Mapping Method (Cont’d)
         Relational Databases for Querying XML Documents:
          Limitations and Opportunities (J.
          Shanmugasundaram)
         Basic steps:
        1. Simplify DTD
        2. Construct schema prototype tree
        3. Generate relational schema prototypes
        4. Detect functional dependencies and candidate
           keys
        5. Normalize the relational schema prototypes

  DTD Splitting Schema
  Extraction Algorithm
     Step 1: Simplify DTD
   p | p'  ->  p, p'
   p+  ->  p*
   (p, p')  ->  p, p'
   ..., p, ..., p*, ...  ->  p*
   p?  ->  p
   (p, p')*  ->  p*, p'*
   ..., p, ..., p, ...  ->  p*
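
The rewrite rules above can be tried out on flat content models. The following is a minimal sketch, not the authors' implementation: the function name simplify_content_model is made up for illustration, and it assumes a content model without parentheses, so the two group rules are not covered.

```python
import re


def simplify_content_model(model: str) -> str:
    """Simplify a flat DTD content model (no parentheses).

    Applies four of the rules: p|p' -> p, p' ; p+ -> p* ; p? -> p ;
    and any element that occurs more than once collapses to p*.
    """
    counts = {}       # element name -> number of occurrences
    starred = set()   # names that carried '*' or '+'
    order = []        # names in first-seen order
    for token in re.split(r"[,|\s]+", model):
        if not token:
            continue
        name = token.rstrip("?*+")
        quantifier = token[len(name):]
        if name not in counts:
            counts[name] = 0
            order.append(name)
        counts[name] += 1
        if "*" in quantifier or "+" in quantifier:
            starred.add(name)
    return ", ".join(
        name + ("*" if name in starred or counts[name] > 1 else "")
        for name in order
    )


book = simplify_content_model("booktitle, price?, author, authority*")
editor = simplify_content_model("monograph+")
```

Applied to the book and editor declarations of the example DTD, this yields the content models of the simplified DTD: price? becomes price and monograph+ becomes monograph*.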
A Book DTD
    <!ENTITY % txt "(#PCDATA)">
    <!ELEMENT book (booktitle, price?, author, authority*)>
    <!ELEMENT authority (authname, country)>
    <!ELEMENT authname %txt;>
    <!ELEMENT country %txt;>
    <!ELEMENT booktitle %txt;>
    <!ELEMENT price %txt;>
    <!ELEMENT monograph (title, author, editor)>
    <!ELEMENT title %txt;>
    <!ELEMENT editor (monograph+)>
    <!ATTLIST editor name CDATA #REQUIRED>
    <!ELEMENT author (name, address)>
    <!ATTLIST author id ID>
    <!ELEMENT name (firstname, lastname)>
    <!ELEMENT firstname %txt;>
    <!ELEMENT lastname %txt;>
    <!ELEMENT address %txt;>

Transformed / Simplified DTD
    <!ELEMENT book (booktitle, price, author, authority*)>
    <!ELEMENT authority (authname, country)>
    <!ELEMENT authname (#PCDATA)>
    <!ELEMENT country (#PCDATA)>
    <!ELEMENT booktitle (#PCDATA)>
    <!ELEMENT price (#PCDATA)>
    <!ELEMENT monograph (title, author, editor)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT editor (monograph*)>
    <!ATTLIST editor name CDATA>
    <!ELEMENT author (name, address)>
    <!ATTLIST author id ID>
    <!ELEMENT name (firstname, lastname)>
    <!ELEMENT firstname (#PCDATA)>
    <!ELEMENT lastname (#PCDATA)>
    <!ELEMENT address (#PCDATA)>
  Step 2: Construct Schema Prototype Trees
   1.     Only an element can become a root
   2.     An element that is not nested inside other elements can become the root
   3.     A non-#PCDATA element that is nested in more than one other element becomes a root
   4.     If a non-#PCDATA element B is not the only subelement of A and B appears only in A with a "*", B becomes a root
   5.     If recursion occurs in the DTD, one of the elements in the recursion is selected as a root
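
Rules 2 to 4 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' algorithm: the dictionary encoding of the simplified DTD and the function name select_roots are made up here, only non-#PCDATA elements are examined, and rule 5 (choosing one element of a recursion) is left out.

```python
# Simplified example DTD: element -> list of (subelement, quantifier).
# Elements that do not appear as keys are #PCDATA leaves.
DTD = {
    "book": [("booktitle", ""), ("price", ""), ("author", ""), ("authority", "*")],
    "authority": [("authname", ""), ("country", "")],
    "monograph": [("title", ""), ("author", ""), ("editor", "")],
    "editor": [("monograph", "*")],
    "author": [("name", ""), ("address", "")],
    "name": [("firstname", ""), ("lastname", "")],
}


def select_roots(dtd):
    """Return {element: rule number} for roots chosen by rules 2, 3 and 4."""
    parents = {}  # child -> list of (parent, quantifier)
    for parent, children in dtd.items():
        for child, quantifier in children:
            parents.setdefault(child, []).append((parent, quantifier))

    roots = {}
    for element in dtd:  # keys are the non-#PCDATA elements
        nested_in = parents.get(element, [])
        if not nested_in:
            roots[element] = 2  # not nested inside any other element
        elif len({parent for parent, _ in nested_in}) > 1:
            roots[element] = 3  # nested in more than one other element
        elif (all(q == "*" for _, q in nested_in)
              and len(dtd[nested_in[0][0]]) > 1):
            roots[element] = 4  # starred, and not the only subelement of its parent
    return roots


roots = select_roots(DTD)
```

On the example DTD this selects book (rule 2), author (rule 3) and authority (rule 4). monograph is not selected here: it is the only subelement of editor, so rule 4 does not apply, and it becomes a root only through the recursion rule that this sketch omits.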

        Roots for the Example DTD
      Element book is selected as root (rule 2)
      Element author is selected as root (rule 3)
      Element authority is selected as root (rule 4)
      Element monograph is selected as root (rule 5)
      [Shown next to the simplified DTD from the previous slide]

  Step 2: Construct Schema Prototype Trees (Cont'd)
      Tree construction:
             Depth-first scan of the DTD for each selected root, starting from the subelements of the root
             A new node is created for each visited element and attribute
             A mixed element (an element containing both #PCDATA and other subelements) is marked with a "#" in the tree
             Recursion: a new leaf node with the label <node name>.A

  Schema Prototype Trees




  Step 3: Generate Relational
  Schema Prototype
      All necessary descendants are inlined
       starting from the root except key nodes
       or foreign key nodes.




  Relational Schema Prototype
 Book (booktitle, price)
 Authority (country, authname)
 Author (address, id, firstname, lastname)
 Monograph (title, name)




  Step 4: Discover FDs and
  Candidate Keys
      Functional dependencies (FDs) and candidate keys are discovered by analyzing the XML data
      TANE algorithm (http://www.cs.helsinki.fi/research/fdk/datamining/tane/)
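
To make the idea of discovering candidate keys from data concrete, here is a brute-force sketch. It is not TANE, which prunes the search far more cleverly; it simply tests every attribute combination for uniqueness. The function name candidate_keys and the sample Author rows are hypothetical.

```python
from itertools import combinations


def candidate_keys(rows, attributes):
    """Return the minimal attribute sets whose values are unique across rows."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            # Skip supersets of a key that was already found: they are not minimal.
            if any(set(key) <= set(combo) for key in keys):
                continue
            projected = [tuple(row[a] for a in combo) for row in rows]
            if len(set(projected)) == len(rows):
                keys.append(combo)
    return keys


# Hypothetical Author rows, made up for illustration only.
author_rows = [
    {"id": "a1", "firstname": "Ann", "lastname": "Zhou", "address": "Waterloo"},
    {"id": "a2", "firstname": "Ann", "lastname": "Tompa", "address": "Waterloo"},
    {"id": "a3", "firstname": "Ann", "lastname": "Zhou", "address": "Toronto"},
]

author_keys = candidate_keys(author_rows, ["id", "firstname", "lastname", "address"])
```

For these sample rows the result is {id} and {lastname, address}. A key found this way holds only for the data that was analyzed, so it is a candidate that further documents may invalidate.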



  Candidate Keys

        Book {booktitle}
        Authority {country, authname}
        Monograph {title}
        Author {id}, {lastname, address}




    Relational Schema Prototype
    With Candidate Keys
  Book (booktitle, price, author.id)
  Authority (country, authname, assigned, book.booktitle)
  Author (address, id, firstname, lastname)
  Monograph (title, name, author.id, monograph.title)

Book (booktitle, price)
Authority (country, authname)
Author (address, id, firstname, lastname)
Monograph (title, name)

  Step 5: Normalize the
  Relational Schema Prototypes
      The last step
      Normalize the schema to 3NF (third normal form) if possible
      Structure-mapping methods do not handle document order; they leave it to metadata or to the user


   XRel

      Masatoshi Yoshikawa, Toshiyuki Amagasa,
       Takeyuki Shimura and Shunsuke Uemura @
       Nara Institute of Science and Technology,
       Japan (2001)
      Model-mapping
      Data model: XPath (root node, element
       nodes, attribute nodes, and text nodes)
      The concept of region

  Definition of Region
      The region of:
             An element node or a text node is a pair of
              numbers representing the start and end
              positions of the node in the XML document
             An attribute node is a pair of identical
              numbers equal to the start position of the
              parent element node plus one



      Simple Path Expressions




    Path: a unit of decomposition of XML trees
    Store the simple path expression (denoted by SimplePathExpr) from the root node. Why?
         Paths appear frequently in XML queries

  Why Is "#" Added?
         Look for family descendants of issue.
   1.     WHERE p1.pathexp LIKE '/issue%/family'
   2.     WHERE p1.pathexp LIKE '#/issue#%/family'
         /issuelist/family (a wrong match) satisfies the first pattern but not the second.
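
The difference between the two patterns can be checked with SQLite from Python. This is a small sketch: the table names plain_path and hash_path and the intermediate element editor are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plain_path (pathexp TEXT)")  # paths stored without '#'
conn.execute("CREATE TABLE hash_path (pathexp TEXT)")   # paths stored with '#' before each '/'

conn.executemany(
    "INSERT INTO plain_path VALUES (?)",
    [("/issue/editor/family",), ("/issuelist/family",)],
)
conn.executemany(
    "INSERT INTO hash_path VALUES (?)",
    [("#/issue#/editor#/family",), ("#/issuelist#/family",)],
)

# Without '#', the '%' wildcard also swallows the rest of the tag name 'issuelist'.
plain_matches = [row[0] for row in conn.execute(
    "SELECT pathexp FROM plain_path "
    "WHERE pathexp LIKE '/issue%/family' ORDER BY pathexp"
)]

# With '#', the tag name 'issue' must end before the wildcard starts.
hash_matches = [row[0] for row in conn.execute(
    "SELECT pathexp FROM hash_path "
    "WHERE pathexp LIKE '#/issue#%/family' ORDER BY pathexp"
)]
```

Here plain_matches contains both paths, including the unwanted /issuelist/family, while hash_matches contains only the path under issue. The "#" acts as an end-of-tag-name marker that the LIKE pattern can anchor on.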

       Example XML Document
<Paper Title = "The Suffix-Signature Method for Searching Phrases in Text">
        <Authors>
                 <FN> Mei </FN>
                 <LN> Zhou </LN>
                 <Affiliation> Open Text Corporation </Affiliation>
        </Authors>
        <Authors>
                 <FN> Frank </FN>
                 <LN> Tompa </LN>
                 <Affiliation> University of Waterloo </Affiliation>
        </Authors>
</Paper>


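   XML Tree
      [Figure: tree of the example document. Legend: root, element, attribute, text, string-value.]
      Node 1: root
      Node 2: element Paper
      Node 3: attribute Title, string-value "The Suffix-Signature Method for Searching Phrases in Text"
      Nodes 4 and 11: elements Authors
      Nodes 5, 7, 9: elements FN, LN, Affiliation under node 4, with text nodes 6 (Mei), 8 (Zhou), 10 (Open Text Corporation)
      Nodes 12, 14, 16: elements FN, LN, Affiliation under node 11, with text nodes 13 (Frank), 15 (Tompa), 17 (University of Waterloo)
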
  Simple Path Expressions / Regions
      [Figure: the XML tree from the previous slide, annotated with simple path expressions and regions]
      Node 3: #/Paper#/@Title, region (1, 1)
      Node 9: #/Paper#/Authors#/Affiliation, region (99, 145)

  Mapping Idea
       One relational table per node type
       Simple path expressions are normalized into a separate Path table
       docID is introduced to identify the document
       Basic XRel schema:
    Element (docID, pathID, start, end, index, reindex)
    Attribute (docID, pathID, start, end, value)
    Text (docID, pathID, start, end, value)
    Path (pathID, pathexp)


   Table - Element
docID          pathID     start          end             index   reindex
1              1          0              257             1       1
1              3          66             155             1       2
1              4          75             86              1       2
1              5          87             99              1       2
1              6          100            145             1       2
1              3          156            249             2       1
1              4          165            178             2       1
1              5          179            192             2       1
1              6          193            239             2       1
      Table - Attribute

docID      pathID     start      end        value
1          2          1          1          The Suffix-Signature Method for Searching Phrases in Text




  Table - Text
   docID      pathID     start      end        value
   1          4          79         81         Mei
   1          5          91         94         Zhou
   1          6          113        131        Open Text Corporation
   1          4          169        173        Frank
   1          5          183        187        Tompa
   1          6          206        225        University of Waterloo
  Table - Path
   pathID     pathexp
   1          #/Paper
   2          #/Paper#/@Title
   3          #/Paper#/Authors
   4          #/Paper#/Authors#/FN
   5          #/Paper#/Authors#/LN
   6          #/Paper#/Authors#/Affiliation
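
The four tables can be loaded into SQLite to see how a path condition and a region containment test combine in one query. This is a sketch under stated assumptions rather than the XRel implementation: the table name text_node and the column names start_pos, end_pos, idx and reidx are renamed here because TEXT, START, END and INDEX clash with SQL keywords or type names, and the Attribute table is omitted because the query does not need it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE path      (pathID INTEGER PRIMARY KEY, pathexp TEXT);
CREATE TABLE element   (docID INTEGER, pathID INTEGER, start_pos INTEGER,
                        end_pos INTEGER, idx INTEGER, reidx INTEGER);
CREATE TABLE text_node (docID INTEGER, pathID INTEGER, start_pos INTEGER,
                        end_pos INTEGER, value TEXT);
""")

# Rows copied from the Path, Element and Text tables on the slides.
conn.executemany("INSERT INTO path VALUES (?, ?)", [
    (1, "#/Paper"), (2, "#/Paper#/@Title"), (3, "#/Paper#/Authors"),
    (4, "#/Paper#/Authors#/FN"), (5, "#/Paper#/Authors#/LN"),
    (6, "#/Paper#/Authors#/Affiliation"),
])
conn.executemany("INSERT INTO element VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 0, 257, 1, 1),
    (1, 3, 66, 155, 1, 2), (1, 4, 75, 86, 1, 2),
    (1, 5, 87, 99, 1, 2), (1, 6, 100, 145, 1, 2),
    (1, 3, 156, 249, 2, 1), (1, 4, 165, 178, 2, 1),
    (1, 5, 179, 192, 2, 1), (1, 6, 193, 239, 2, 1),
])
conn.executemany("INSERT INTO text_node VALUES (?, ?, ?, ?, ?)", [
    (1, 4, 79, 81, "Mei"), (1, 5, 91, 94, "Zhou"),
    (1, 6, 113, 131, "Open Text Corporation"),
    (1, 4, 169, 173, "Frank"), (1, 5, 183, 187, "Tompa"),
    (1, 6, 206, 225, "University of Waterloo"),
])

# Last name of every author whose first name is 'Frank'.
# A text node belongs to an Authors element when its region lies inside
# the element's region.
last_names = [row[0] for row in conn.execute("""
    SELECT ln_t.value
    FROM element AS au
    JOIN path AS p_au ON p_au.pathID = au.pathID
    JOIN text_node AS fn_t ON fn_t.docID = au.docID
         AND fn_t.start_pos > au.start_pos AND fn_t.end_pos < au.end_pos
    JOIN path AS p_fn ON p_fn.pathID = fn_t.pathID
    JOIN text_node AS ln_t ON ln_t.docID = au.docID
         AND ln_t.start_pos > au.start_pos AND ln_t.end_pos < au.end_pos
    JOIN path AS p_ln ON p_ln.pathID = ln_t.pathID
    WHERE p_au.pathexp LIKE '#/Paper#/Authors'
      AND p_fn.pathexp LIKE '#%/FN'
      AND p_ln.pathexp LIKE '#%/LN'
      AND fn_t.value = 'Frank'
""")]
```

The query returns Tompa: the text node Frank (169, 173) and the text node Tompa (183, 187) both fall inside the second Authors region (156, 249). No parent or child pointers are stored; structure is recovered from path expressions and region comparisons alone.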

  XML Benchmarking Desiderata
      Bulk loading
      Reconstruction
      Path traversals
      Casting
      Missing elements



  XML Benchmarking Desiderata
  (continued)
      Ordered access
      References
      Joins
      Construction of large results
      Containment, full-text search



  XML Benchmarks
      3 XML benchmark proposals.
             XMach-1.
                  University of Leipzig, Germany.
             XML benchmark project.
                  CWI, the Netherlands.
             Kanda et al. proposal (unpublished).
                  University of Michigan, IBM Toronto Lab Centre for Advanced Studies.


  Conclusion

      The mapping approaches and experiments point to a few places where enhancements to relational databases would help in coping with XML model differences:
             Support for sets.
             Flexible comparison operators.
             Multi-predicate merge join.

  Questions & Answers




Appendix A

        Enhancing Structural Mappings
             Based on Statistics



  Optimal Hybrid Database
  Algorithm
      M. Klettke and H. Meyer
      "XML and object-relational database systems - enhancing structural mappings based on statistics" (2000)
      An algorithm that finds a type of optimal mapping based on statistics and the DTD

  Optimal Hybrid Database
  Algorithm
   1. Build a graph representing the
      hierarchy of the elements and
      attributes of the DTD.
   2. For every element/attribute of the
      graph, a measure of significance, w, is
      determined.
   3. Derive the resulting database design
      from the graph.

  Graph for an Example DTD




    Calculate the Weight (Step 2)
   W = 1/6 (SQ + SA + SH) + 1/4 (DA/DG) + 1/4 (QA/QG), where
       SQ - exploitation of quantifiers
       SA - exploitation of alternatives
       SH - position in the hierarchy
       DA - number of documents containing the element/attribute
       DG - absolute number of XML documents
       QA - number of queries containing the element/attribute
       QG - absolute number of queries
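
As a worked example, the weight formula can be written as a small function. The function name weight and the input values are hypothetical; how SQ, SA and SH are derived from the DTD is not covered on this slide, so they are passed in as given numbers.

```python
def weight(sq, sa, sh, da, dg, qa, qg):
    """W = 1/6 (SQ + SA + SH) + 1/4 (DA/DG) + 1/4 (QA/QG)."""
    structure_part = (sq + sa + sh) / 6   # derived from the DTD
    document_part = 0.25 * (da / dg)      # share of documents containing the node
    query_part = 0.25 * (qa / qg)         # share of queries containing the node
    return structure_part + document_part + query_part


# Hypothetical node: present in 50 of 100 documents and in 10 of 40 queries.
w = weight(sq=1.0, sa=0.5, sh=0.0, da=50, dg=100, qa=10, qg=40)
```

With these inputs the three parts are 0.25, 0.125 and 0.0625, so w is 0.4375. When SQ, SA, SH and both ratios are all 1, the three coefficients add up to W = 1.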

  The Graph With the Colored
  Weight




  Step 3 - Deriving Hybrid
  Databases From the Graph
      First, specify a limit on the weight that determines which elements and/or attributes are represented as database attributes and which are represented as XML attributes




  Step 3 - Deriving Hybrid
  Databases From the Graph
      Then, search for all nodes of the graph
       that satisfy the following conditions:
      The node is not a leaf of the graph
      The node and all its descendants are
       below the limit given
      No predecessor that satisfies the first
       two conditions exists.
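
The three conditions above amount to a top-down search that stops at the first qualifying node on each branch. The sketch below assumes the graph is a tree and reads "below the limit" as a strict comparison; the Node class, the function names and all weights are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    weight: float
    children: list = field(default_factory=list)


def all_below(node, limit):
    """True if the node and all of its descendants have a weight below the limit."""
    return node.weight < limit and all(all_below(c, limit) for c in node.children)


def select_subgraph_roots(node, limit):
    """Names of the topmost non-leaf nodes whose whole subtree is below the limit."""
    if node.children and all_below(node, limit):
        return [node.name]  # stop here: descendants have a qualifying predecessor
    selected = []
    for child in node.children:
        selected.extend(select_subgraph_roots(child, limit))
    return selected


# Hypothetical weights on part of the book DTD.
tree = Node("book", 0.9, [
    Node("booktitle", 0.8),
    Node("price", 0.7),
    Node("author", 0.2, [
        Node("name", 0.1, [Node("firstname", 0.1), Node("lastname", 0.15)]),
        Node("address", 0.05),
    ]),
    Node("authority", 0.3, [Node("authname", 0.6), Node("country", 0.1)]),
])

selected = select_subgraph_roots(tree, limit=0.25)
```

With a limit of 0.25 only author is selected. name also meets the first two conditions, but it has a predecessor (author) that meets them too, so the third condition rules it out.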

  Step 3 - Deriving Hybrid
  Databases From the Graph
      Each selected node and its descendants (the whole subgraph) are replaced by a single XML attribute (a BLOB-like attribute)
      All other elements and attributes are mapped to the relational database using the structural mapping

  Resulting XML Attributes for
  the Example DTD




  References
http://www.geocities.com/ysee.geo/ml.html
http://db-www.aist-nara.ac.jp/members/Yoshikawa/paper/TOIT2001.pdf
http://citeseer.nj.nec.com/454884.html
http://dol.uni-leipzig.de/pub/2001-1/en
http://www.cs.wisc.edu/niagara/papers/xmlstore.pdf
http://www.cwi.nl/htbin/ins1/publications?request=papers