Store XML Data in a Relational Database by ems33260


									       Store XML Data in a Relational Database

 XML shredding
 Selective mapping from XML to relations
 Propagating constraints from XML to relations

                            QSX (LN 4)            1
                       Storing XML data

 Flat streams: store XML data as is in text files
    – fast for storing and retrieving whole documents
     query support: limited; concurrency control: no
 Native XML Databases: designed specifically for XML
    – XML document stored as is
    – Efficient support for XML queries
    – Many techniques need to be re-developed
 Colonial Strategies: Re-use existing storage systems
    – Leverage mature systems (DBMS)
    – Simple integration with legacy data
    – Map XML document into underlying structures
        E.g., shred document into flat tables
                              QSX (LN 4)                 2
         Why transform XML data to relations?

 Native XML databases need to build new systems from the
   ground for
    – storing XML data, indexing,
    – query processing/optimization
    – concurrency control
    – updates
    – access control, . . .
    Nontrivial: the study of these issues is still in its infancy –
      incomplete support for general data management tasks

 Sophisticated query processing, storage and concurrency
   control techniques have been developed for relational DBMS

Thus, why not take advantage of available DBMS techniques?
                               QSX (LN 4)                             3
                   From XML to relations
Store and query XML data using traditional DBMS
 Derive a relational schema from an XML DTD (schema)
 Shred XML data into relational tuples – store XML data
 Translate XML queries to SQL queries
 Convert query results back to XML

                       query   answer

                            XML                      query


                            QSX (LN 4)                          4
             Architecture: XML Shredding
                    query   answer


                                                query translation

Relational schema
generator            XML query parsing     XML tagger
                     query rewriting       Tagging results

XML document

                        QSX (LN 4)                                  5
                         Nontrivial issues

 mapping DTDs to relational schemas: data model mismatch
    – DTD: recursive, regular expressions
    – relational schema: tables
 query translation: query language mismatch
    – XQuery, XSLT: Turing-complete
    – SQL: first-order
 Information preservation:
    – lossless: there should be an effective method to reconstruct
      the original XML document from its relational storage – the
      inverse of the storage mapping should exist
    – propagation/preservation of integrity constraints – will be
      seen shortly
                              QSX (LN 4)                             6
       Derivation of relational schema from DTD

 Lossless: the original document can be effectively reconstructed
   from its relational representation
 Query support: XML queries should be able to be rewritten to
   efficient relational queries
 Normal form: BCNF, 3NF have proved extremely useful for
   integrity maintenance of updates -- will be addressed later

                Relational schema generator

                XML document shredder

                                  QSX (LN 4)                     7
         Running example – a book document

  <!ELEMENT    db (book*)>
  <!ELEMENT     book (title, authors*, chapter*, ref*)>
  <!ELEMENT     chapter (text | section)*>
  <!ELEMENT     ref book>
  <!ELEMENT title #PCDATA>
  <!ELEMENT author #PCDATA>
  <!ELEMENT section #PCDATA>

 Recursive
 Complex regular expressions

                            QSX (LN 4)                    8
              Graph representation of the DTD
 Each element type/attribute is represented
         by a unique node                              db
 Edges represent the subelement
         (and attribute) relations                      *
 *: 0 or more occurrences of                         book
 Cycles indicate recursion          title     *               *         *
    e.g., book
                                             author          chapter     ref
 Simplification: e.g., (text | section)*
     text* | section* -- ignore order                 *           *
 XML document conforming to the DTD
                                                      text     section
    are those trees that unfold the graph
    (special treatment: *  empty at leaf)
                                QSX (LN 4)                                     9
                    Canonical representation
Store an XML document as a graph (tree)
 Node relation:     node( nodeId,         tag,   type)
    e.g., node(02, book, element), node(03, author, element)
 Edge relation:     edge( sid,     did)
    sid, did: source and destination nodes; e.g., edge(02, 03)

Pros and cons
 Lossless: the original document can be reconstructed; order
   preserving: one can add “order” information to the edge relation
 Ignore topological structure
 Querying: Requires multi-table joins or self joins for element
A simple query /db/book[author=“Bush”]/title requires 4 joins of the
   edge relation!
                                  QSX (LN 4)                       10
     Schema-driven: the shred inlining algorithm

Require the availability of the document DTD
 Represent the DTD as a graph (simplifying          db
    regular expressions)
 Traverse the DTD graph depth-first                  *
    and create relations for the nodes              book

    – the root
    – each * node
                                   title       *             *         *
    – each recursive node                  author          chapter     ref
    – each node of in-degree > 1
 Inlining: nodes with in-degree of 1
                                                     *           *
  are inlined – no relation is created              text     section
                              QSX (LN 4)                                     11
                         Relational schema
To preserve the semantics                                 db

 ID: each relation has an artificial ID (key)
 parentID: foreign key coding edge relation
 Column naming: path in the DTD graph

db(dbID)                                title     *               *         *
book(bookID, parentID, code, title: string)
                                                author          chapter     ref
author(authorID, bookID, author: string)
chapter(chapterID, bookID)                                *           *
ref(refID, bookID)
                                                         text     section
text(textID, chapterID, text: string)
section(sectionID, chapterID, section: string)

Note: title is inlined
                                QSX (LN 4)                                        12
                        Keys and foreign keys
book(bookID, parentID, code, title: string)               *
author(authorID, bookID, author: string)
chapter(chapterID, bookID)
ref(refID, bookID)                      title    *               *         *
text(textID, chapterID, text: string)
section(sectionID, chapterID, section: string) author          chapter     ref

 Keys: book.bookID, author.authorID, …                  *           *
 book.parentID  db.dbID if code = 1
                                                        text     section
   book.parentID  ref.refID if code = 0
 Foreign keys: referring to parent node
   author.bookID  book.bookID, similarly for chapter, ref
   text.chapterID  chapter.chapterID, similarly for section
                                    QSX (LN 4)                                   13
               Schema-Driven: Summary

   Use DTD/XML Schema to decompose document
   Shred inlining:
    –   Rule of thumb: Inline as much as possible to minimize
        number of joins
    –   Shared: do not inline if shared, set-valued, recursive
    –   Hybrid: also inline if shared but not set-valued or recursive
   Querying: It supports a large class of XML queries commonly
    +   Fast lookup & reconstruction of inlined elements
    -   Reconstruction may require multi-table joins and unions

                              QSX (LN 4)                            14
                    Schema-Driven: Summary
Instance mapping can be easily derived from schema mapping.
 Is it lossless?
   The order information is lost (simplification of regular
   expressions defining element types)

 Is there anything missing?
    – “core dumping” the entire document to a new database
    In practice one often wants to
        • select relevant data from the document
        • store the selected data in an existing database of a
          predefined schema
    – XML Schema: type + constraints
    What happens to XML constraints? Can we achieve normal
     forms (BNCF, 3NF) for the relational storage?
                               QSX (LN 4)                        15
       Store XML Data in a Relational Database

 XML shredding
 Selective mapping from XML to relations
 Propagating constraints from XML to relations

                            QSX (LN 4)            16
            Example: selectively storing XML data

book document                    existing DB         db
                   SQL inserts
      XML                                             *

 Existing relational database R : title        *             *         *
            book (id, title)               author          chapter     ref
            ref (id1, id2)
                                                     *           *
 Select from XML and store in R
                                                    text     section
  books with title containing “WMD”, and
  books cited, directly or indirectly
  select only part of the data from an input document
  store the data in an existing database with a fixed schema
                               QSX (LN 4)                                    17
      Mapping specification: XML2DB mappings

XML2DB Mapping:
 Input: an XML document T of a DTD D, and an existing
  database schema R
 Output: a list of SQL inserts R, incrementing the database of R
An extension of Attribute Grammars:
    – treat the DTD D as an ECFG (extended context-free)
    – associate semantic attributes and actions with each
      production of the grammar
        • attributes: passing data top-down
        • actions: generate SQL inserts R
    – Evaluation: generate SQL inserts in parallel with XML

                             QSX (LN 4)                          18
                     XML2DB mappings

 DTD: normalized; element type definitions e  
     ::=   PCDATA |  | e1, …, en | e1 + … + en | e*

 Relation variables: for each relation schema Ri, define a
   variable Ri, which holds tuples to be inserted into Ri
 Attributes: $e associated with each element type e
    $e: tuple-valued, to pass data value top-down
 Rules: associated with each e  ; conditional statements
    – for each e’ in , define $e’ using the parent attribute $e
    – Increment relation variables: Ri := Ri  {tuple}

                              QSX (LN 4)                           19
                   Example: XML2DB mapping

 db        book*
        $book := top      /* indicating the children of the root

book        book     ...       book            book        db
$book      $book              $book            $book        *

                                      title           *            *         *
                                                 author          chapter     ref

                                                           *           *
                                  QSX (LN 4)              text     section         20
                              Semantic actions
 book             title, author*, chapter*, ref*
   if text(title) contains “WMD” or ($book <> top and $book <> bot)
   then id := gen_id( );                 /* generate a new id */
            book := book  { (id, text(title)) }; /* increment book */
            if $book <> top              /* cited by another book */
           then ref := ref  { ($book, id) };      /* increment ref */
           $ref := id;         /* passing information downward */
   else $ref := bot
recall relation schema: book (id, title), ref (id1, id2)
 gen_id( ): a function generating a fresh unique id
 conditional: either has “WMD” or is referenced by a book of WMD
       db                book $book

            title        chapter   chapter         ref   ...    ref

                                      QSX (LN 4)
                                                         $ref     $ref     21
            Implementing XML2DB mappings

XML2DB                     SAX parsing:
                           SQL inserts
         parsing                                            R
                           generation     SQL inserts

 SAX parsing extended with corresponding semantic actions
  startDocument( ), endDocument( );
  startElement(A, eventNo), endElement(A);
  text(s)

 SQL updates:
       insert into book
       select *
       from     book       QSX (LN 4)                          22
       Store XML Data in a Relational Database

 XML shredding
 Selective mapping from XML to relations
 Propagating constraints from XML to relations

                            QSX (LN 4)            23
                                          XML constraints
 Recall: an XML schema consists of both types and constraints
  (//book,           {isbn})      -- isbn is an (absolute) key of book
  (//book,        (chapter, {number}) -- number is a key of chapter
        relative to book
  (//book, (title, { })) -- each book has a unique title


                           book               book              book             book

 isbn     title      chapter                   chapter        isbn     title    chapter    chapter

“XML”      number section         title    number section            “XML” number title    number

    “1”           number text     DTD         QSX (LN 4)
                                            “6” number                         “1” XPath    “10”
           Mapping from XML to a predefined relation
 One wants to store certain information from the XML document in:
 RDB:             chapter(bookTitle, chapterNum, chapterTitle)
  Mapping: for each book, extract its title, and the numbers and
        titles of all its chapters
  Predefined relational key: (bookTitle, chapterNum)
 Can the XML document be mapped to the relation without violating
   the key?                       db

                           book              book            book             book

 isbn     title      chapter                     chapter   isbn     title    chapter   chapter

“XML”      number section         title   number section          “XML” number title number

    “1”           number text     DTD        QSX (LN 4)
                                           “6” number                       “1” XPath “10”
                                           A safe mapping
 Now change the relational schema to
 RDB:             chapter(isbn, chapterNum, chapterTitle)
 The relation can be populated without any violation. Why?
 The relational key (isbn, chapterNum) for chapter is implied
   (entailed) by the keys on the original XML data:
 (//book,         {isbn})
 (//book,         (chapter, {number})                        db
 (//book, (title, { }))
                            book               book                 book             book

 isbn     title       chapter                      chapter        isbn     title    chapter   chapter

“XML”      number section          title    number section               “XML” number title number

    “1”           number text      DTD         QSX (LN 4)
                                             “6” number                            “1” XPath “10”
            Why do we care about constraints?
 Constraints are a fundamental part of the semantics of the data
   – mapping from XML to relations should not lose the information
 Relational constraints are important for query optimization, data
   cleaning, and consistency/integrity maintenance, . . .
 Constraints help us determine whether a relational schema for
   storing XML data makes sense or not

Problem statement: Constraint Propagation
 Input: a set K of XML keys, a predefined relational schema S,
   a mapping f from XML to S, and a functional dependency FD
   over S
 Output:   is FD implied by K via f? I.e., does FD hold over f(T)
   for any XML document T that satisfies K?
Note: XML schema/DTD is not required – K is the only semantics
                          QSX (LN 4)                                 27
                        Constraints can do even better
 One wants to find a “good” relational schema to store:
   chapter(isbn, bookTitle, author, chapterNum, chapterTitle)
 What is a good schema? In normal form: BCNF, 3NF, …
  Prevent update anomaly (the relational theory)
  Efficient storage, …
 But how to find a normalized design?

                           book              book                 book             book

 isbn     title      chapter                     chapter        isbn     title    chapter   chapter

“XML”      number section         title   number section               “XML” number title number

    “1”           number text     DTD        QSX (LN 4)
                                           “6” number                            “1” XPath “10”
        Constraint propagation and normalization

From given XML keys:
 (//book,   {isbn}),   (//book, (chapter, {number}),   (//book, (title, { }))
one can derive functional dependencies:
 isbn  bookTitle,           isbn, chapterNum  chapterTitle

Normalize the relation by using these functional dependencies:
chapter(isbn, bookTitle, author, chapterNum, chapterTitle)

                book(isbn, bookTitle),
                chapter(isbn, chapterNum, chapterTitle),
                author(isbn, author)
The new schema is in BCNF!
                                  QSX (LN 4)                                    29
    Computing minimum cover of propagated FDs
Problem statement: Minimum Cover of Propagated FDs
 Input: a set K of XML keys, and a mapping f from XML to S
 Output: a minimum cover F of all the functional dependencies
   (FDs) implied by the XML keys K via f
    – F is a cover (a set of FDs): any FD implied by K via f is also
      implied by F
    – F is minimum: F contains no redundant FDs, i.e., any FD in
      F is not entailed by other FDs in F.
 Why not compute the complete set of FDs implied by K via f?
Exponentially large!
 Why care about minimum cover?
Normalize relational schema for storing/querying XML data!

Remark: again XML schema/DTD is not required – K is the only
  semantic specification QSX (LN 4)                                30
                        Research issues

What we know: for XML keys and a restricted transformation
  language (from XML to relations), there exist efficient algorithms
 Constraint propagation: polynomial time
 Minimum cover of propagated FDs: polynomial time
For general constraints/transformation languages: beyond reach
 Undecidable if the transformation language is relationally
   complete (selection, projection, join, union, difference), even for
   XML keys
 Undecidable if both XML keys and foreign keys are considered,
   even for the identity “transformation”
Open: identify
 practical transformation languages
 practical XML constraints
that allow efficient constraint propagation
                               QSX (LN 4)                            31
       Commercial System: MS SQL Server 2005

 CLOB (character large objects), XML data type
 Annotated schema (XSD): fixed tree templates
    – nonrecursive schema
    – associate elements and attributes with table and column
    core-dumping entire elements or documents
 XQuery: query(), value(), exist(), nodes(); binding relational data
    – Combine INSERT and node( ), value( ), XPath
    – OPENXML: access to XML data as a relational rowset
    selective shredding, limited recursion, can’t store the entire
      document in a single pass

                               QSX (LN 4)                            32
               MS SQL Server 2005 (cont.)

 Query support:
   – SQL
   – SQL extensions for publishing results as XML (FOR-XML
 Summary:
   – Support both entire-document and selective shredding
   – Able to increment existing tables
   – Nonrecursive schema
   – Does not support context-dependent tuple construction: to
     construct (A, B), one cannot extract attribute B based on the
     extracted value of A (inf. passing)
   – Does not support entire document shredding and selective
     shredding in a uniform QSX (LN 4)                           33
     Commercial System: IBM DB2 XML Extender

 XML Columns: CLOBs + side tables for indexing individual
 User-defined mapping through DAD (Document Access
  Definition): a fixed XML tree template (nonrecusive)
   – SQL mapping: template-based language to publish relational
     data as XML (to be discussed in the next lecture)
   – RDB node mapping: for both publishing and storing XML
     (associating element and attributes with tables and columns)
 XML Collections: Declarative decomposition of XML into
  multiple tables
   – Data loading: follows DAD mapping

                             QSX (LN 4)                         34
       Commercial System: Oracle 10g XML DB

 Store XML data in CLOB (character large objects) or tables
 Canonical mapping into object-relational tables
    – tag names are mapped to column names
    – elements with text-only map to scalar columns
    – elements with sub-elements map to object types
    – list of elements maps to collections
    – Indexing: standard relational
    Core-dumping entire elements
 Annotated schema: recursive

                             QSX (LN 4)                        36
                    Oracle 10g (cont.)

 Query support:
   – CLOBs: SQL + Oracle Text; XPath
   – Canonical: SQL
 Summary:
   – Support both entire-document and selective shredding
   – Support recursive schema
   – cannot increment existing tables (DB2)
   – does not support context-dependent tuple construction

                           QSX (LN 4)                        37
                     Update Support

 XQuery does not support updates (yet…)
 How to update?
   – Flat streams: overwrite document
   – Colonial: SQL
   – Native: DOM, proprietary APIs
 But how do you know you have not violated schema?
   – Flat streams: re-parse document
   – Colonial: need to understand the mapping and maintain
     integrity constraints
   – Native: supported in some systems (e.g., eXcelon)

                           QSX (LN 4)                        39
                       Summary and Review

 Why store and query XML data via relational DBMS?
 Why study constraint propagation from XML to relations?
 Understand the shred inlining algorithm
 How to selectively map XML data to relations? Tuple construction via
   information passing? Combining shredding and parsing
 Is it possible to translate all XQuery queries to SQL? If not, is there a
   general approach to querying XML data stored in relations?
 Can one update XML data by making use of relational DBMS? How?

 Consider projects related to the topic – start the project early!
 Survey projects: individual, commercial products (Microsoft SQL Server
   2005, Oracle 10g, IBM DB2 XML Extender)

                                  QSX (LN 4)                                  40

To top