Managing Uncertainty of XML Schema Matching by usr10478


MANAGING UNCERTAINTY OF XML SCHEMA MATCHING
--jim
Supervisors: Prof Cheung and Dr Cheng
To appear in ICDE’2010
THE DATA INTEGRATION PROBLEM
   Querying multiple data sources from a uniform interface




             Query interface            Mediated schema




                                                               Schema
                                                               mapping


     Data         Data          Data
    source       source        source          Source schema
       I           II            III
SCHEMA MAPPING AND UNCERTAINTY
   The mapping between schemas can be uncertain


Example: Purchase Order schemas     Uncertain mappings
                                     M1: Order-ORDER, …, BCN-ICN, …
                                     M2: Order-ORDER, …, RCN-ICN, …
                                     …




                                     Which one is correct?
QUERY EVALUATION AND UNCERTAINTY
   The uncertainty in mappings affects query answers


Example: a source document              Uncertain mappings
                                         M1: Order-ORDER, …, BCN-ICN, …
                                         M2: Order-ORDER, …, RCN-ICN, …
                                         …




                                 Target query
                                  Q: //ICN

                                 which finds all ICNs (contact names
                                 of invoice parties) in the purchase order

Returned by M1
                 Returned by M2
DATA INTEGRATION RELOADED
   Managing uncertainty of XML schema matching
        Issues: mapping generation and storage, query evaluation, etc.



             Query interface             Mediated schema




                                                             Uncertain
                                                          schema mapping


     Data          Data          Data
    source        source        source          Source schema
       I            II            III
OUTLINE
 Background
 Problem
     Data model
     Query model

 Techniques
 Results

 Conclusion




DATA MODEL
   XML schema and document
       Node-labeled tree
       Document node may carry text values
   Schema mapping                            Uncertain mappings
                                               M1: Order-ORDER, …, BCN-ICN, …
       One-to-one mapping                     M2: Order-ORDER, …, RCN-ICN, …
                                               …




                               Schema

        Schema
                                                     Document
QUERY MODEL
   Twig query on XML document
       Return all instances of the twig pattern



    Twig query:




    Answer:




QUERY MODEL
   Twig query through a target schema
     Step 1: rewrite target query into source query, based
      on schema mapping
     Step 2: evaluate source query on source document

                                  M1: Order-ORDER, BP-IP, BCN-ICN, …




QUERY MODEL
   Query evaluation with uncertain mappings
       Mappings: pM = {(M1,Pr(M1)), …, (Mh,Pr(Mh))}


                                       QS1
                           M1,Pr(M1)           R1,Pr(M1)


    Target query QT          …                      …

                                       QSh
                           Mh,Pr(Mh)           Rh,Pr(Mh)


               Rewriting               Evaluation
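
A minimal sketch of the rewrite-then-evaluate pipeline above, using Python's
standard xml.etree.ElementTree. The source document, the RP element and the
probabilities 0.6 and 0.4 are made up for illustration; the mappings follow
the M1/M2 example (ICN matched to BCN or to RCN). Each mapping Mi yields one
answer Ri that carries the probability Pr(Mi).

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; only BCN and RCN come from the slides.
SOURCE_DOC = ET.fromstring(
    "<Order>"
    "<BP><BCN>Alice</BCN></BP>"
    "<RP><RCN>Bob</RCN></RP>"
    "</Order>"
)

# Each entry: (target element -> source element, probability).
# The probabilities are illustrative assumptions.
MAPPINGS = [
    ({"ORDER": "Order", "ICN": "BCN"}, 0.6),  # M1
    ({"ORDER": "Order", "ICN": "RCN"}, 0.4),  # M2
]


def rewrite(target_element, mapping):
    """Rewrite the target query //X into a source query under one mapping."""
    return ".//" + mapping[target_element]


def evaluate(target_element, doc, mappings):
    """Return one (answer, probability) pair per mapping."""
    answers = []
    for mapping, probability in mappings:
        source_query = rewrite(target_element, mapping)
        result = [node.text for node in doc.findall(source_query)]
        answers.append((result, probability))
    return answers


ANSWERS = evaluate("ICN", SOURCE_DOC, MAPPINGS)
```

Here the target query //ICN returns a different contact name under each
mapping, which is exactly why the answers have to be reported together with
the mapping probabilities.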

OUTLINE
 Background
 Problem

 Techniques
     Block tree
     Query evaluation
     Mapping generation

 Results
 Conclusion




  OBSERVATION
     Sharing among uncertain mappings




                                         Uncertain mappings




Overlapping:
 “Order~ORDER” shared by m1-m5
 “BP~IP” shared by m1, m2, m4, m5
 “BCN~ICN” shared by m1, m2
 …
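
The overlap listed above can be computed with an inverted index from each
correspondence to the mappings that contain it. This is only a sketch: the
five mappings below are partial, restricted to the correspondences that are
readable on the slides, and the function name shared_by is hypothetical.

```python
from collections import defaultdict

# Partial mappings as sets of (source element, target element) pairs.
MAPPINGS = {
    "m1": {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN")},
    "m2": {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN")},
    "m3": {("Order", "ORDER"), ("SP", "IP"), ("RCN", "ICN")},
    "m4": {("Order", "ORDER"), ("BP", "IP"), ("RCN", "ICN")},
    "m5": {("Order", "ORDER"), ("BP", "IP"), ("OCN", "ICN")},
}


def shared_by(mappings):
    """Map every correspondence to the set of mappings that contain it."""
    index = defaultdict(set)
    for name, correspondences in mappings.items():
        for correspondence in correspondences:
            index[correspondence].add(name)
    return dict(index)


SHARED = shared_by(MAPPINGS)
```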
THE BLOCK
   Each block in the block tree consists of:
       C: A set of correspondences
       M: A set of mappings
   Semantics: all mappings in M share the correspondences in C



                 Block        Block    Block




THE BLOCK
   Blocks which contain multiple correspondences




                    Block   Block   Block    Block




    Drawback:
    Exponential number of
    blocks to handle



THE C-BLOCK
   A c-block (constrained block) is a block which:
       Contains a correspondence for every element in its sub-tree (so that
        it’s more useful for query evaluation)
       Is shared by at least a threshold fraction of the mappings (otherwise
        it’s not worth storing)




                        c-block




     |pM| = 5
     Threshold = 0.4
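
The two c-block conditions can be checked as follows, with |pM| = 5 and a
threshold of 0.4 as on this slide. This is a sketch under assumptions: the
function name is hypothetical, and the threshold test is taken to be
inclusive, since the slides show c-blocks shared by 2 of 5 mappings.

```python
def is_c_block(correspondences, mappings, subtree_elements,
               total_mappings, threshold):
    """Check the two c-block conditions for one candidate block.

    correspondences:  set of (source element, target element) pairs
    mappings:         set of mappings sharing those correspondences
    subtree_elements: target elements in the sub-tree rooted at the block
    """
    covered = {target for _source, target in correspondences}
    covers_subtree = set(subtree_elements) <= covered
    # Assumption: "at least the threshold", i.e. an inclusive comparison.
    shared_enough = len(mappings) / total_mappings >= threshold
    return covers_subtree and shared_enough


# Candidate blocks rooted at the target element IP (sub-tree: IP, ICN).
FULL = is_c_block({("BP", "IP"), ("BCN", "ICN")}, {"m1", "m2"},
                  ["IP", "ICN"], 5, 0.4)
PARTIAL = is_c_block({("BP", "IP")}, {"m1", "m2", "m4", "m5"},
                     ["IP", "ICN"], 5, 0.4)
RARE = is_c_block({("BP", "IP"), ("OCN", "ICN")}, {"m5"},
                  ["IP", "ICN"], 5, 0.4)
```

FULL is a c-block; PARTIAL fails because ICN has no correspondence; RARE
fails because only 1 of 5 mappings shares it.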
THE BLOCK TREE
   Creation of the block tree
        A bottom-up method
        Parameter: MAX_B



Lemma 1: (informal)
The c-blocks for a schema element t can be
created from the c-blocks of t’s children.
(detail)


Lemma 2: (informal)
If a schema element t has no c-block, then
t’s parent (if any) has no c-block.
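
One way to read the two lemmas as a bottom-up step is sketched below. It is
an illustration, not the construction from the paper: the function name, the
data layout and the inclusive threshold test are assumptions, and MAX_B is
not modelled. A candidate block for t combines one correspondence for t with
one c-block per child; its mapping set is the intersection of theirs.

```python
from itertools import product


def c_blocks_for(own_candidates, child_block_lists, total_mappings, threshold):
    """Build the c-blocks of element t from the c-blocks of its children.

    own_candidates:    list of (correspondence, mappings) for t itself
    child_block_lists: one list of (correspondences, mappings) per child of t
    """
    # Lemma 2 read as a pruning rule: a child without c-blocks rules out t.
    if any(not blocks for blocks in child_block_lists):
        return []
    result = []
    for own, *children in product(own_candidates, *child_block_lists):
        correspondence, mappings = own
        correspondences = {correspondence}
        for child_correspondences, child_mappings in children:
            correspondences |= child_correspondences
            mappings = mappings & child_mappings  # mappings sharing everything
        if len(mappings) / total_mappings >= threshold:
            result.append((frozenset(correspondences), mappings))
    return result


# c-blocks of ICN (b1, b2) and the candidate correspondences for IP.
ICN_BLOCKS = [
    (frozenset({("BCN", "ICN")}), {"m1", "m2"}),
    (frozenset({("RCN", "ICN")}), {"m3", "m4"}),
]
IP_CANDIDATES = [
    (("BP", "IP"), {"m1", "m2", "m4", "m5"}),
    (("SP", "IP"), {"m3"}),
]
IP_BLOCKS = c_blocks_for(IP_CANDIDATES, [ICN_BLOCKS], 5, 0.4)
```

With these inputs only the combination BP~IP plus BCN~ICN survives, shared
by m1 and m2, which matches block b5 in the block tree example.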

THE BLOCK TREE
   Reducing the storage cost of uncertain mappings

    Mappings, with shared parts replaced by links to blocks:
        m1: Order~ORDER, b5.C, RCN~SCN, ...
        m2: Order~ORDER, b5.C, OCN~SCN, ...
        m3: Order~ORDER, SP~IP, b2.C, b3.C, BP~SP, ...
        m4: Order~ORDER, BP~IP, b2.C, b4.C, ...
        m5: Order~ORDER, BP~IP, OCN~ICN, b4.C, ...

    Block tree (blocks listed under their target element):
        ORDER
            g3   C: Order~ORDER         M: m1, m2, m3, m4, m5
            IP
                g1   C: BP~IP               M: m1, m2, m4, m5
                b5   C: BP~IP, BCN~ICN      M: m1, m2
                ICN
                    b1   C: BCN~ICN         M: m1, m2
                    b2   C: RCN~ICN         M: m3, m4
            SP
                g2   ...
                SCN
                    b3   C: OCN~SCN         M: m2, m3
                    b4   C: BCN~SCN         M: m4, m5

    If part of a mapping is in the block tree,
    then replace it with a link
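
The link replacement can be sketched as follows. The function name, the data
layout and the larger-block-first order are assumptions made for the
example; only the blocks b1 to b5 and the mappings m1 and m4 are taken from
the slide.

```python
def compress(mapping_name, correspondences, blocks):
    """Replace the parts of a mapping covered by blocks with block links.

    blocks: dict block name -> (frozenset of correspondences, member mappings)
    Returns (list of linked block names, correspondences stored inline).
    """
    remaining = set(correspondences)
    links = []
    # Assumption: try larger blocks first so one link covers as much as possible.
    by_size = sorted(blocks.items(), key=lambda item: (-len(item[1][0]), item[0]))
    for block_name, (block_correspondences, members) in by_size:
        if mapping_name in members and block_correspondences <= remaining:
            links.append(block_name)
            remaining -= block_correspondences
    return links, remaining


BLOCKS = {
    "b1": (frozenset({("BCN", "ICN")}), {"m1", "m2"}),
    "b2": (frozenset({("RCN", "ICN")}), {"m3", "m4"}),
    "b3": (frozenset({("OCN", "SCN")}), {"m2", "m3"}),
    "b4": (frozenset({("BCN", "SCN")}), {"m4", "m5"}),
    "b5": (frozenset({("BP", "IP"), ("BCN", "ICN")}), {"m1", "m2"}),
}
M1 = {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN"), ("RCN", "SCN")}
M4 = {("Order", "ORDER"), ("BP", "IP"), ("RCN", "ICN"), ("BCN", "SCN")}

M1_STORED = compress("m1", M1, BLOCKS)
M4_STORED = compress("m4", M4, BLOCKS)
```

m1 is stored as a link to b5 plus Order~ORDER and RCN~SCN; m4 as links to b2
and b4 plus Order~ORDER and BP~IP.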
THE BLOCK TREE
   Fast locating c-blocks with a given path


    Use a hash-table to index all distinct
    paths of the c-blocks

              Path (each entry points to a node of the block tree)
              ORDER.IP
              ORDER.IP.ICN
              ORDER.SP.SCN

    Block tree (blocks listed under their target element):
        ORDER
            g3   C: Order~ORDER         M: m1, m2, m3, m4, m5
            IP
                g1   C: BP~IP               M: m1, m2, m4, m5
                b5   C: BP~IP, BCN~ICN      M: m1, m2
                ICN
                    b1   C: BCN~ICN         M: m1, m2
                    b2   C: RCN~ICN         M: m3, m4
            SP
                g2   ...
                SCN
                    b3   C: OCN~SCN         M: m2, m3
                    b4   C: BCN~SCN         M: m4, m5
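
A minimal sketch of such a path index, using a Python dict as the hash
table. The class name and methods are hypothetical. Which c-blocks hang
under each path is read off the block tree (b5 under IP, b1 and b2 under
ICN, b3 and b4 under SCN) and is an assumption, since the Node column of the
slide is a drawing.

```python
from collections import defaultdict


class PathIndex:
    """Hash index from a target-schema path to the c-blocks at that node."""

    def __init__(self):
        self._blocks_by_path = defaultdict(list)

    def add(self, path, block_name):
        self._blocks_by_path[path].append(block_name)

    def lookup(self, path):
        # Expected constant-time lookup; unknown paths give an empty list.
        return list(self._blocks_by_path.get(path, []))


INDEX = PathIndex()
INDEX.add("ORDER.IP", "b5")
INDEX.add("ORDER.IP.ICN", "b1")
INDEX.add("ORDER.IP.ICN", "b2")
INDEX.add("ORDER.SP.SCN", "b3")
INDEX.add("ORDER.SP.SCN", "b4")
```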
OUTLINE
 Background
 Problem

 Techniques
     Block tree
     Query evaluation
     Mapping generation

 Results
 Conclusion




THE BASELINE APPROACH
   Evaluate QT with each mapping in pM separately
   Drawback
        When the mappings Mi are large, or h is large, the computation
         cost is high

                                             QS1
                               M1,Pr(M1)              R1,Pr(M1)
                                             DS

    Target query QT               …                       …

                                             QSh
                               Mh,Pr(Mh)              Rh,Pr(Mh)
                                             DS

                Rewriting                    Evaluation

QUERY EVALUATION WITH BLOCK TREE
   Consider the root of a query
       Case 1): if the root is found in the block tree, use the blocks to
        evaluate the whole query
       Case 2): if the root is not found, decompose the query (if possible),
        recurse on the parts, and join the partial answers




QUERY EVALUATION WITH BLOCK TREE
   Case 1): if the root is found in the block tree, use the
    blocks to evaluate the whole query
       Only one mapping in each block is used
       Deal with the remaining mappings
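
A sketch of Case 1 under simplifying assumptions: a query is reduced to the
tuple of target elements it mentions, and run_query stands in for real twig
evaluation on the source document. All names are hypothetical. The point is
that a block covering the query is evaluated once for all its mappings, and
only the remaining mappings are evaluated one by one.

```python
def evaluate_with_blocks(query_elements, blocks, mappings, run_query):
    """Return {mapping name: result}, evaluating each covering block once.

    blocks:   list of (correspondences, members); correspondences is a
              dict target element -> source element
    mappings: dict name -> dict target element -> source element
    """
    answers = {}
    for correspondences, members in blocks:
        if not all(e in correspondences for e in query_elements):
            continue  # the block does not cover the whole query
        pending = [name for name in sorted(members) if name not in answers]
        if not pending:
            continue
        source_query = tuple(correspondences[e] for e in query_elements)
        result = run_query(source_query)  # one evaluation for the whole block
        for name in pending:
            answers[name] = result
    # Remainder: mappings not covered by any block are evaluated separately.
    for name, mapping in mappings.items():
        if name not in answers:
            answers[name] = run_query(tuple(mapping[e] for e in query_elements))
    return answers


CALLS = []


def run_query(source_query):
    # Stand-in for twig evaluation; records how often it is invoked.
    CALLS.append(source_query)
    return "result of " + "/".join(source_query)


MAPPINGS = {
    "m1": {"IP": "BP", "ICN": "BCN"},
    "m2": {"IP": "BP", "ICN": "BCN"},
    "m3": {"IP": "SP", "ICN": "RCN"},
    "m4": {"IP": "BP", "ICN": "RCN"},
    "m5": {"IP": "BP", "ICN": "OCN"},
}
BLOCKS = [({"IP": "BP", "ICN": "BCN"}, {"m1", "m2"})]  # block b5
ANSWERS = evaluate_with_blocks(("IP", "ICN"), BLOCKS, MAPPINGS, run_query)
```

For the IP/ICN twig this takes four evaluations instead of the five that the
baseline needs.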

          IP



         ICN




QUERY EVALUATION WITH BLOCK TREE
   Case 2): if the root is not found, decompose the query (if
    possible), recurse on the parts, and join the partial answers

               ORDER


             IP           SP



             ICN




                          IP
     ORDER         +               +   SP

                          ICN

    Direct             Recursion       Direct
    query                              query
TOP-K PROBABILISTIC TWIG QUERY
   Only the k answer tuples (Ri,Pr(Ri)) with the highest
    probabilities are returned
       Notice that Pr(Ri) = Pr(Mi)
       Equivalently, only consider the k mappings with the highest
        probabilities


   Still use the previous algorithm, but with a filtered
    set of mappings
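
Because Pr(Ri) = Pr(Mi), the filtering step is a plain top-k selection over
the mapping probabilities. A sketch with the standard heapq module; the
probabilities are made-up values.

```python
import heapq


def top_k_mappings(mappings, k):
    """Keep the k (name, probability) pairs with the highest probability."""
    return heapq.nlargest(k, mappings, key=lambda pair: pair[1])


# Hypothetical probabilities for five mappings.
P_MAPPINGS = [("m1", 0.3), ("m2", 0.25), ("m3", 0.2), ("m4", 0.15), ("m5", 0.1)]
FILTERED = top_k_mappings(P_MAPPINGS, 2)
```

The filtered list is then handed to the same query evaluation as before.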




OUTLINE
 Background
 Problem
     Data model
     Query model

   Techniques
       Block tree
       Query evaluation
       Mapping generation
 Results
 Conclusion

MAPPING GENERATION
   A mapping m between a schema S and another schema T
    contains a set of correspondences (es,et)
       et may be EMPTY, i.e., es matches no element in T
       Each element in S occurs exactly once in m
       Each element in T occurs at most once in m
       m’s score is the sum of the similarities of its correspondences


   Problem definition
       Given: two schemas S and T, a set of correspondences (es,et) with
        similarities (which are schema matching results)
       Return: the h mappings m1, …, mh with the highest scores
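
To make the problem definition concrete, the sketch below enumerates every
valid mapping for a tiny input and keeps the h best. It is a brute-force
illustration only (exponential in the schema size) and not the method used
in this work; the similarity values and function names are assumptions. An
EMPTY match is represented by None and contributes 0 to the score.

```python
import heapq


def enumerate_mappings(source_elements, similarities):
    """Yield (score, mapping) for every valid mapping; toy sizes only."""

    def extend(i, used, mapping, score):
        if i == len(source_elements):
            yield score, dict(mapping)
            return
        es = source_elements[i]
        # Option 1: es matches no element in T (EMPTY).
        mapping[es] = None
        yield from extend(i + 1, used, mapping, score)
        # Option 2: es takes a target element that is still unused.
        for (s, et), similarity in similarities.items():
            if s == es and et not in used:
                mapping[es] = et
                used.add(et)
                yield from extend(i + 1, used, mapping, score + similarity)
                used.remove(et)
        del mapping[es]

    yield from extend(0, set(), {}, 0.0)


def top_h(source_elements, similarities, h):
    """Return the h highest-scoring (score, mapping) pairs."""
    candidates = enumerate_mappings(source_elements, similarities)
    return heapq.nlargest(h, candidates, key=lambda pair: pair[0])


# Hypothetical matcher output: (es, et) -> similarity.
SIMILARITIES = {
    ("BCN", "ICN"): 0.9,
    ("RCN", "ICN"): 0.8,
    ("RCN", "SCN"): 0.7,
    ("BCN", "SCN"): 0.3,
}
TOP = top_h(["BCN", "RCN"], SIMILARITIES, 2)
```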


MAPPING GENERATION
   Baseline solution
       Finding the h best (maximum-weight) bipartite matchings (Min-Cost Flow)
       Polynomial in the size of the bipartite graph




MAPPING GENERATION
   Observation: XML schema matching is usually sparse
   Improvement: a divide-and-conquer approach
       Derive partitions (Maximal Connected Sub-Graphs) of the
        bipartite graph
       Find the top-h partial mappings from each partition
       Merge
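
The partition and merge steps can be sketched as below; finding the top-h
partial mappings inside one partition is left out. The function names, the
union-find partitioning and the pairwise merge are assumptions about one
reasonable implementation. The merge works because scores are sums and the
partitions share no elements, so the best h combinations can only use the
best h partial mappings of each side.

```python
import heapq
from itertools import product


def partitions(correspondences):
    """Group (es, et) pairs into connected components of the bipartite graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for es, et in correspondences:
        # Tag the two sides so equal names in S and T stay distinct nodes.
        parent[find(("S", es))] = find(("T", et))
    groups = {}
    for es, et in correspondences:
        groups.setdefault(find(("S", es)), []).append((es, et))
    return list(groups.values())


def merge_top_h(left, right, h):
    """Combine two top-h lists of (score, partial mapping) into one."""
    combined = (
        (left_score + right_score, {**left_map, **right_map})
        for (left_score, left_map), (right_score, right_map) in product(left, right)
    )
    return heapq.nlargest(h, combined, key=lambda pair: pair[0])


PARTS = partitions([("BCN", "ICN"), ("RCN", "ICN"), ("OCN", "SCN")])

# Hypothetical top-2 partial mappings of the two partitions.
LEFT = [(0.75, {"BCN": "ICN", "RCN": None}), (0.5, {"BCN": None, "RCN": "ICN"})]
RIGHT = [(0.5, {"OCN": "SCN"}), (0.0, {"OCN": None})]
MERGED = merge_top_h(LEFT, RIGHT, 2)
```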




DATASET AND RESULTS
   XML schemas and documents
       7 schemas for purchase orders, obtained from various E-Commerce
        standards (e.g., XCBL, OpenTrans)
       Accompanying sample XML documents


   Schema matching
       Tool: COMA++, with different schema matching methods
       10 datasets: (source-schema, target-schema, matching-method)


   Target query
       10 hand-written queries



RESULTS
   Uncertain mappings, do they really overlap?




RESULTS
   How much space does the block tree save for storing
    uncertain mappings? And why?




RESULTS
   Is the block tree effective?
       Intuitively, larger blocks tend to be more useful




RESULTS
   The block tree can be efficiently created
       Fast, and controllable




RESULTS
   Can the block tree really help to improve query
    performance?
       Varying the total number of mappings




RESULTS
   Can it scale?
       Probabilistic twig query and top-k query




RESULTS
   Top-h mapping generation
       Performance gain of partitioning




CONCLUSION
   We study the problem of handling uncertainty in XML
    schema matching
   Observation
       Overlapping mappings, sparse bipartite graphs, etc.
   Approach
       The block tree
       Query evaluation with the block tree
       Generating uncertain mappings more efficiently
   Future work
       Other types of queries, probabilistic document, index update,
        relational scenario, etc



THANKS!
    Q&A

    Happy Mid-Autumn Day!




QUERY REWRITING
   Given
     A target twig query QT
     A schema mapping m between S and T, which is a set
      of correspondences (es,et)
   Mapping semantics
       For each sub-tree in source document DS which
        contains a set of source elements in m, there exists a
        sub-tree in target document DT which contains the
        corresponding target elements
   Procedure
       For each element in QT, replace it with the corresponding
        source element
       Connect all the source elements
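
The procedure can be sketched on a twig represented as nested
(label, children) tuples. The representation, the function names and the
choice to connect the source elements with descendant edges are assumptions
for illustration; the mapping is M1 from the earlier slide (Order-ORDER,
BP-IP, BCN-ICN).

```python
def rewrite_twig(twig, mapping):
    """Replace every target element of the twig with its source element.

    twig:    (label, [child twigs])
    mapping: dict target element -> source element
    Returns None if some element has no counterpart under this mapping.
    """
    label, children = twig
    source = mapping.get(label)
    if source is None:
        return None
    rewritten_children = []
    for child in children:
        rewritten = rewrite_twig(child, mapping)
        if rewritten is None:
            return None
        rewritten_children.append(rewritten)
    return (source, rewritten_children)


def to_path(twig):
    """Render a twig as an XPath-like string with descendant edges."""
    label, children = twig
    return label + "".join("[.//" + to_path(child) + "]" for child in children)


M1 = {"ORDER": "Order", "IP": "BP", "ICN": "BCN"}
TARGET_TWIG = ("ORDER", [("IP", [("ICN", [])])])
SOURCE_TWIG = rewrite_twig(TARGET_TWIG, M1)
SOURCE_PATH = to_path(SOURCE_TWIG)
```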
LEMMA 1
   An example
                                                           Order



                                   InvoiceTo                       DeliverTo                    ...



                    Contact                                         Contact
                    27|24|25|24            Address                 27|24|25|24       Address


        name              email          street city country    name email         street city country
        52|48             50|50           100   51|49   49|51      52|48   53|47    100 51|49     49|51

     b1.M: 1-52        b3.M: 1,3,5,…
     b2.M: 53-100      b4.M: 2,4,6,...


                                                 Lemma 1: (conceptually)
                                                 The c-blocks for a schema element t can be
                                                 created from the c-blocks of t’s children.
                                                 (detail)
RESULTS
   What kind of queries did we use?





								