Binary XML Storage and Query Processing in Oracle

					                           VLDB 2009








Binary XML Storage and Query Processing in
Oracle

Sam Idicula, Oracle XML DB Development Team
    Outline

•   Motivation
•   Binary XML Overview
•   Storage Format Details
•   Query Processing
•   Performance Evaluation
•   Conclusion
 Previous Oracle XML Storage Models

• CLOB Storage
  • Text representation preserves exact form of original
    document (including white spaces)
  • Very good performance for insert & full retrieval
  • Size bloat (including tags, string representation of dates,
    numbers etc)
  • Need to parse the document for all XML processing
     • Query & DML processing are not efficient
  • Memory overhead with DOM
  • Mid-tier does not take advantage of parsing and validation
    already done on DB tier (and vice-versa)
 Previous Oracle XML Storage Models

• Object Relational Storage (OR)
  • XML Schema-based mapping to object-relational tables
  • Preserves DOM fidelity (more than traditional
    shredding)
  • Simple XPaths translate to table/column access
  • Very good query performance for highly structured use
    cases
  • Flexibility is limited due to schema dependency
  • Insert and full-retrieval performance are poor
 Motivation/Goals for Binary XML

• Bridge the gap between two extremes
  • Structure-unaware text representation: Full flexibility, poor
    query performance
  • Object-relational mapping: Heavily dependent on rigid
    structure
  • Several customer use cases fall in between these extremes
• Native format that can:
  • Handle the full spectrum of XML database use cases
  • Be optimized for semi-structured use cases
  • Provide good performance for a wide variety of operations
• Retain flexibility advantage of XML data model while
  providing good performance
       Customer use cases
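
[Figure: use cases plotted by structure (horizontal axis, Unstructured to Structured) and flexibility (vertical axis, Low Flexibility to High Flexibility), with a region labeled "Majority of semi-structured customer use cases".]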
 Motivation/Goals for Binary XML

• XML Schema usage
  • Needs to be efficient for query processing on schemaless data &
    loosely structured schemas
  • Ability to use schema constraints for more efficient processing
• Provide good performance for a wide range of
  operations
  •   Query
  •   DML: Insert/Load, Partial (piecewise) update
  •   Full-document & fragment retrieval
  •   Schema Validation & Evolution
• Mid-tier integration
 Oracle Binary XML Overview

• Compact Schema-aware XML Format
  • Pre-parsed tokenized binary representation
  • Addresses space-bloat associated with XML 1.x serialization
• Intended for use in all tiers of Oracle stack
  • Oracle XML DB
  • Oracle iAS / XDK Java
• Exploits XML Schema information if available
  • Also supports non-schema-based encoding
• Preserves Infoset or Data Model fidelity – Not bytes
• Can create an XML Index for query optimization
 Oracle Binary XML

[Figure: Oracle Binary XML is exchanged between adjacent tiers: Database, App Server, Web Cache, Client.]


• Mid-tier Processing: Oracle XDK Java support
• Binary XML allows direct access to fragments/sub-trees
• XML processing optimization: Scalable mid-tier DOM
    Format Details

•   Opcodes roughly corresponding to SAX events
•   Each opcode has a fixed number of operands
•   Document-ordered serialization of opcodes
•   Stored as a BLOB
•   Tag names are tokenized into qname IDs
    • Central repository (or)
    • Inlined definitions
• Optimized opcodes for simple elements, repeating
  elements etc.
• Uses native data-types in the presence of XML
  schema
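
The encoding ideas above can be illustrated with a toy encoder. This is a minimal sketch, not Oracle's actual on-disk format: the opcode numbers, the `TokenTable` class, and the `SIMPLE_ELEM` shortcut are all hypothetical names chosen for illustration. It shows tag names tokenized into qname IDs, opcodes emitted in document order, and one optimized opcode for an element that contains only text.

```python
from dataclasses import dataclass, field

# Hypothetical opcodes loosely mirroring SAX events (not Oracle's real opcode set).
START_ELEM, END_ELEM, TEXT, SIMPLE_ELEM = 1, 2, 3, 4


@dataclass
class TokenTable:
    """Maps tag names to small integer qname IDs (stand-in for a central repository)."""
    ids: dict = field(default_factory=dict)

    def qname_id(self, name: str) -> int:
        # Reuse the existing ID, or assign the next free one.
        return self.ids.setdefault(name, len(self.ids) + 1)


def encode(events, tokens):
    """Serialize SAX-like (kind, value) events as opcode tuples in document order."""
    out = []
    i = 0
    while i < len(events):
        kind, value = events[i]
        # Optimized case: <tag>text</tag> collapses into a single opcode.
        if (kind == "start" and i + 2 < len(events)
                and events[i + 1][0] == "text"
                and events[i + 2] == ("end", value)):
            out.append((SIMPLE_ELEM, tokens.qname_id(value), events[i + 1][1]))
            i += 3
            continue
        if kind == "start":
            out.append((START_ELEM, tokens.qname_id(value)))
        elif kind == "end":
            out.append((END_ELEM, tokens.qname_id(value)))
        else:
            out.append((TEXT, value))
        i += 1
    return out
```

In this sketch each opcode kind always carries the same number of operands (one, or two for `SIMPLE_ELEM`), which is what lets a decoder step through the stream without a separate length field.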
 Streaming Capabilities

• Streaming XPath evaluation
  • XPathTable with NFA: Multiple XPaths evaluated in a single
    pass
  • Forward axes
• Streaming partial updates
  • Most common update scenarios handled in streaming manner
      e.g. updateXML('/purchaseOrder/Reference/text()', 'XXXX')
  • Can be directly applied on disk avoiding expensive DOM
    construction
  • Takes advantage of the Oracle SecureFile LOB storage to
    perform delta update
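
The idea of a streaming partial update can be sketched with Python's standard SAX filter classes. This is an illustration under assumptions: it rewrites a text serialization rather than a Binary XML stream on disk, it handles only a simple child-axis path ending in `text()`, and the `TextUpdater` and `update_text` names are hypothetical. The point it shows is that the document is copied event by event and only the targeted text node is replaced, so no DOM is ever built.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class TextUpdater(XMLFilterBase):
    """Pass events through unchanged, replacing the text of one element path."""

    def __init__(self, parent, path, new_text):
        super().__init__(parent)
        steps = [s for s in path.split("/") if s]
        if steps and steps[-1] == "text()":
            steps.pop()  # the path addresses the element's text node
        self._target = steps
        self._new_text = new_text
        self._stack = []        # names of currently open elements
        self._replaced = False  # replacement already emitted for this text run?

    def startElement(self, name, attrs):
        self._stack.append(name)
        self._replaced = False
        super().startElement(name, attrs)

    def endElement(self, name):
        super().endElement(name)
        self._stack.pop()

    def characters(self, content):
        if self._stack != self._target:
            super().characters(content)          # untouched text is copied as-is
        elif not self._replaced:
            super().characters(self._new_text)   # emit the new value once
            self._replaced = True                # drop remaining old chunks


def update_text(xml_text, path, new_text):
    """Return xml_text with the text at `path` replaced, in one streaming pass."""
    out = io.StringIO()
    reader = TextUpdater(xml.sax.make_parser(), path, new_text)
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()
```

Memory use in this sketch depends on the nesting depth of the document rather than on its size, which is the property that makes the streaming approach attractive for large documents.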
Query Processing Architecture

[Figure: layered architecture, top to bottom:
  SQL/XML and XQuery
  DB XQuery Rewrite
  XMLIndex (Path-based XMLIndex, Table-based XMLIndex) alongside Functional Evaluation (Streaming XPath)
  Binary XML]
 Document-level Summary

• Long-term goal: Efficient tree-oriented navigation
• Important for query execution
  • Pure streaming is too costly over large documents
• Current Implementation
  • Start & end offsets for large subtrees
  • Threshold for “large” can be adjusted
  • Used for skipping to end of subtree
• Working on significant enhancements
  • Handling all axes
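
A minimal sketch of how start and end offsets for large subtrees might be used to skip ahead. The stream layout (a list of `(opcode, name)` pairs), the `build_summary` and `find_children_of_root` helpers, and the way the threshold is measured in opcodes are all assumptions made for illustration, not the actual summary structure.

```python
START, END = "start", "end"


def build_summary(stream, threshold):
    """Map start offset -> end offset for subtrees longer than `threshold` opcodes."""
    summary, stack = {}, []
    for pos, (op, _) in enumerate(stream):
        if op == START:
            stack.append(pos)
        elif op == END:
            begin = stack.pop()
            if pos - begin + 1 > threshold:
                summary[begin] = pos
    return summary


def find_children_of_root(stream, summary, name):
    """Return (offsets of root children called `name`, number of opcodes visited)."""
    hits, visited, depth, pos = [], 0, 0, 0
    while pos < len(stream):
        op, qname = stream[pos]
        visited += 1
        if op == START:
            if depth == 1:  # a child of the root element
                if qname == name:
                    hits.append(pos)
                if pos in summary:
                    # Jump past the matching END without decoding the subtree.
                    pos = summary[pos] + 1
                    continue
            depth += 1
        elif op == END:
            depth -= 1
        pos += 1
    return hits, visited
```

Raising the threshold keeps the summary small at the cost of decoding more small subtrees linearly; lowering it does the opposite.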
 Search-based Decoder

• Goal: Search for a simple XPath or XPath location
  step in a Binary XML stream
• Main search params are (axis, qname ID)
  • Supports wild cards
  • OR of multiple qnameIDs allowed
• Return only when there’s a result or search is done
• Skip irrelevant subtrees
  • Using summary if possible
• Schema-aware search
  • Can search for kidnum or child-position instead of qname ID
  • Can terminate search earlier based on schema
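
The search interface described above might look roughly like the following generator. This is a sketch under assumptions: the stream is a list of `(opcode, qname_id)` pairs, the `search` function and its parameters are hypothetical, and the schema-aware variants (kidnum search, early termination) are left out. It shows the (axis, qname ID) parameters, wildcard and OR matching, returning only on a result, and skipping subtrees via an optional summary.

```python
START, END = 1, 2
WILDCARD = None  # matches any qname ID


def search(stream, context, axis, qname_ids=WILDCARD, summary=None):
    """Yield offsets of matching START opcodes below the element at `context`.

    axis      -- "child" or "descendant"
    qname_ids -- a set of qname IDs (matched as an OR), or WILDCARD
    summary   -- optional {start_offset: end_offset} map for large subtrees
    """
    summary = summary or {}
    depth, pos = 0, context + 1  # depth 0 means "directly inside the context"
    while pos < len(stream):
        op, qid = stream[pos]
        if op == START:
            on_axis = axis == "descendant" or depth == 0
            if on_axis and (qname_ids is WILDCARD or qid in qname_ids):
                yield pos  # control returns to the caller only on a result
            if axis == "child" and pos in summary:
                pos = summary[pos] + 1  # nothing inside can match: skip it
                continue
            depth += 1
        elif op == END:
            if depth == 0:
                return  # reached the end of the context element: search done
            depth -= 1
        pos += 1
```

Because the function is a generator, a caller that needs only the first match can stop consuming it, and the rest of the stream is never decoded.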
 Schema-aware NFA

• Goal: Evaluate multiple XPaths in single pass over
  document
• Uses a YFilter-like approach to build the NFA
• Works in conjunction with search-based decoder
  • Translates transitions to searches when possible
  • Pushes unbranched linear state-transition paths into the
    search-based decoder
• Uses XML schema when available
  • Use of kidnum instead of qname ID
  • Sequence & Occurrence constraints
  • Derives a “strict sequential” constraint
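
Evaluating several XPaths in one pass can be sketched by merging them into a single prefix-sharing automaton. This is a simplified illustration: it runs over a text document through `xml.etree.ElementTree.iterparse` rather than a Binary XML stream, supports only child steps and the `*` wildcard, and does not use schema information. The `build_nfa` and `match_all` names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def build_nfa(xpaths):
    """Merge simple child-axis XPaths into one trie-shaped NFA with shared prefixes."""
    root = {"next": {}, "accept": set()}
    for xp in xpaths:
        state = root
        for step in xp.strip("/").split("/"):
            state = state["next"].setdefault(step, {"next": {}, "accept": set()})
        state["accept"].add(xp)
    return root


def match_all(xml_bytes, xpaths):
    """Return {xpath: number of matching elements} using a single document pass."""
    nfa = build_nfa(xpaths)
    counts = {xp: 0 for xp in xpaths}
    stack = [[nfa]]  # one set of active NFA states per open element
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            active = []
            for state in stack[-1]:
                # Follow both the exact-name transition and the wildcard transition.
                for label in (elem.tag, "*"):
                    nxt = state["next"].get(label)
                    if nxt is not None:
                        active.append(nxt)
                        for xp in nxt["accept"]:
                            counts[xp] += 1
            stack.append(active)
        else:
            stack.pop()   # leaving the element restores the parent's state set
            elem.clear()  # release content that is no longer needed
    return counts
```

Paths with a common prefix, such as `/site/people/person` and `/site/people/person/name`, share states in this automaton, so adding more XPaths does not require additional passes over the document.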
  Performance: Query - XMark

[Bar chart: ratio of elapsed time (vertical axis, 0 to 3) for Binary XML NSB, Binary XML SB, and O-R Storage.]

• Ratio of elapsed time geometric mean for 100M XMark doc
• SB – Schema-based
• NSB – Non-schema-based
• CLOB is 144x
• No indexes
      Performance: Insert

[Bar chart: ratio of elapsed time (vertical axis, 0 to 3.5) for CLOB, Binary XML SB, and O-R.]

• Ratio of elapsed time for XMark 10M doc
• SB – Schema-based
      Performance: Full Retrieval

[Bar chart: ratio of elapsed time (vertical axis, 0 to 1) for CLOB, Binary XML SB, and O-R.]

• Ratio of elapsed time for XMark 10M doc
• SB – Schema-based
      Performance: Compression


[Bar chart: vertical axis 0 to 1.4; O-R, CLOB, and Binary compared for datasets D1, D2, and D3.]

• D1 – Structured
• D2 – Semi-structured
• D3 – Document-centric
• Based on actual customer datasets; mix of XML document sizes
• Further compression possible via SecureFile LOB compression
 Summary

• Binary XML
  • Native XML storage format
  • Handles the full spectrum of XML use cases
  • Schema-aware
• Query Processing Optimizations
  • Search-based Decoder
  • Document-level Summary
• Performance Results
 For more information

• Contact:
      Sam.Idicula@Oracle.com


• Downloads, technical documentation:
      http://www.oracle.com/technology/tech/xml/xmldb/index.html