TDD Research Topics in Distributed Databases by oga20203


									   TDD: Research Topics in Distributed Databases


 Schema mapping, data exchange and data integration


 Schema mapping and data exchange

 Schema matching

 Data integration: an introduction




        Schema mapping and schema matching


Schema mapping
 Input: a source schema S1 and a target schema S2
 Output: a mapping from instances of S1 to instances of S2

Important for data exchange, migration, and answering queries using
   views


Structural similarity of two schemas




         Graph similarity – traditional approach


 Represent schemas S1 and S2 as node-labeled, rooted graphs,
   also denoted S1, S2
 Similarity: Given two node-labeled graphs G1, G2, G1 is similar
   to G2 if there is a binary relation ≼ on their nodes, s.t.,
    – if r1 and r2 are the roots of G1, G2, then r1 ≼ r2;
    – for any nodes x in G1 and y in G2, if x ≼ y then
        • x and y have the same node label (or their labels match
          by some attribute/tag matching function)
        • for any child x′ of x in G1, there exists a child y′ of y in
          G2 such that x′ ≼ y′.
 Complexity: O(|G1| · |G2|)
 Schema S1 matches schema S2 if S1 is similar to S2
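
The similarity test above can be sketched as a fixpoint computation. This is a minimal illustration, not the algorithm from the slides: it starts from all label-compatible node pairs and removes pairs that violate the child condition. The function name is_similar and the dictionary encoding of graphs are assumptions made for the example, and this naive loop is not tuned to meet the stated complexity bound.

```python
from itertools import product

def is_similar(g1, g2, root1, root2, labels1, labels2):
    """Return True if g1 is similar to g2.

    g1, g2: dict mapping each node to the list of its children.
    labels1, labels2: dict mapping each node to its label.
    """
    # Start from all pairs with equal labels, then prune to a fixpoint.
    rel = {(x, y) for x, y in product(g1, g2) if labels1[x] == labels2[y]}
    changed = True
    while changed:
        changed = False
        for x, y in list(rel):
            # Every child of x needs some related child of y.
            ok = all(any((cx, cy) in rel for cy in g2[y]) for cx in g1[x])
            if not ok:
                rel.discard((x, y))
                changed = True
    return (root1, root2) in rel

# Hypothetical schemas: S1 has A -> B, S2 has A -> B, C.
s1 = {"A": ["B"], "B": []}
s2 = {"A": ["B", "C"], "B": [], "C": []}
lab = {n: n for n in "ABC"}
forward = is_similar(s1, s2, "A", "A", lab, lab)
backward = is_similar(s2, s1, "A", "A", lab, lab)
```

In this toy case S1 is similar to S2 but not the other way around, because the C child in S2 has no counterpart in S1.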
                       Example: graph similarity

Is S1 similar to S2?
      S1       A       S2           A


               B            B               C


        S1         A       S2       A


           B           C    B               C

       S1          A        S2          A
                       *
             B         C        B               C

Edge labels should be taken into account
      Example: Similarity-based schema mapping
There is a match from D1 to D2 given the attribute mapping:
   books → publications, book → publication
 A mild extension of graph similarity: edge labels also count
 D1 is similar to D2, but the converse is not true
                                                                 D2
 D1                                          publications
            books
                                                      *
                    *
            book                              publication


   title     year       authors     title    year     type    authors
                              *                                       *
                        author              textB     other      author
         Example: XML mapping – source DTD


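Source schema S1:
 db      → class*
 class   → cno, title, type
 type    → ( regular + project )
 regular → prereq
 prereq  → class*

(Figure: graph of S1. db has a starred edge to class; class has children
 cno, title and type; type has children regular and project; regular has
 child prereq; prereq has a starred edge back to class.)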




            Example: XML mapping – target DTD

                                        school
target schema S2:
                          courses                students

                history             current
                                                        *
                                                    student
                      *   course    *
                                                     ssn      name gpa   taking
               basic       category
                   *
                                                                                  *
   cno credit semester mandatory advanced

    title   year term regular lab seminar project

                                   required
                          prereq              gpa
                  *
                   Limitations of graph similarity

  S1                                            S2             school
          db
                                                  courses               students
               *
         class                          history             current      ...
                                             *    course   *
 cno     title      type                basic       category
                                            *
                            cno credit semester mandatory advanced
         regular project
                            title   year term        regular lab seminar project
         prereq
                                                            required
   *
                                         *           prereq             gpa
Graph similarity:
 Yes: attribute renaming
 NO: restructuring
                               Data integration

   S1’     db                                            S1          db
             *                                                            *
         student
                                                                    class
   ssn     name taking
                   *                                          cno   title     type
                   cno

                                                                    regular project
                          S2         school

                         courses                                    prereq
                                              students
                                                               *
             history               current
                                                   *
                                              student
                               ...
Multiple sources are to be mapped to a single target: the target
  schema must have a larger information capacity than any single
  source, so it cannot be similar to the sources
          Information preserving XML mapping


                        XML
                        mapping


XML tree T of S1                                 XML tree of S2

Objective: Find a mapping σd: I(S1) → I(S2) such that

 Type safety: the target tree σd(T) must conform to S2.
   For any XML tree T of S1, σd(T) is an XML document that is an
   instance of (conforms to) S2
 Information preservation (lossless): no information of the source
   is lost in the transformation
         Information preserving XML mapping


 Invertibility: there exists an inverse σd⁻¹: I(S2) → I(S1) such
   that for any XML tree T of S1, T = σd⁻¹(σd(T)).

The source T can be recovered from the target σd(T)


 Query preservation: for an XML query language L, there is a
   function F: L → L such that for any Q in L and any instance T
   of S1, Q(T) = F(Q)(σd (T)).
  All queries in L on the source can be answered on the target



    Separation: invertibility vs. query preservation


Fundamental properties of information-preserving XML mappings
For relational data w.r.t. relational calculus (L), invertibility (calculus
   dominance) and query preservation (dominance) coincide [Hull 84]

However,
(a) There is an invertible XML mapping that is NOT query preserving
   w.r.t. XPath.
(b) There is an XML mapping that is query preserving w.r.t. XPath
   without position( ) but it is NOT invertible.




   Equivalence: invertibility vs. query preservation


 Sufficient conditions:

Theorem: For any XML query language L and XML mapping σd
    – If the identity mapping is definable in L and σd is query
       preserving w.r.t. L, then σd is invertible.
    – If L is composable, σd is invertible and σd⁻¹ is definable in
       L, then σd is query preserving w.r.t. L.
 Regular XPath: query preservation is a stronger property.




                        Regular XPath

 Regular XPath:

 Q ::= ε | A | Q/text() | Q/Q | Q ∪ Q | Q* | Q[q]

 q ::= Q | Q/text() = ‘c’ | position() = k | q ∧ q | q ∨ q | not q
 The child axis, Kleene closure, union, and position()

 An XPath fragment: Q//Q instead of Q*

Example: Find the “left-most” prerequisites of TDD

   class[cno/text() = ‘TDD’] /
     (type/regular/prereq/class[position() = 1])*

(Figure: graph of source schema S1, with db, class, cno, title, type,
 regular, project and prereq.)
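
To make the example query concrete, here is a small sketch that evaluates it over an in-memory tree. It is an illustration under assumptions: the Node class, the helpers step and star, and the sample course data are all hypothetical, and only the child axis, position() and Kleene closure are covered.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    text: str = ""
    children: list = field(default_factory=list)

def step(nodes, label, position=None):
    """Child axis to `label` children; optionally keep only the k-th one."""
    out = []
    for n in nodes:
        matches = [c for c in n.children if c.label == label]
        if position is not None:
            matches = matches[position - 1:position]
        out.extend(matches)
    return out

def star(nodes, body):
    """Kleene closure: apply `body` zero or more times and collect results."""
    seen, frontier = list(nodes), list(nodes)
    while frontier:
        frontier = [n for n in body(frontier) if not any(n is s for s in seen)]
        seen.extend(frontier)
    return seen

def cls(cno, prereqs=None):
    """Hypothetical helper: a class element, regular if it has a prereq list."""
    kind = (Node("project") if prereqs is None
            else Node("regular", children=[Node("prereq", children=prereqs)]))
    return Node("class", children=[Node("cno", cno), Node("title", cno),
                                   Node("type", children=[kind])])

# TDD requires DB (left-most) and ALG; DB requires CS101.
db = Node("db", children=[cls("TDD", [cls("DB", [cls("CS101")]), cls("ALG")])])

start = [c for c in step([db], "class")
         if any(k.text == "TDD" for k in step([c], "cno"))]
body = lambda ns: step(step(step(step(ns, "type"), "regular"), "prereq"),
                       "class", position=1)
cnos = [step([c], "cno")[0].text for c in star(start, body)]
```

Because the closure includes zero iterations, the result contains TDD itself followed by its chain of left-most prerequisites.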
   Equivalence: invertibility vs. query preservation


Regular XPath: query preservation is a stronger property.
(a) If an XML mapping is query preserving w.r.t. regular XPath,
   then it is invertible.
(b) There is an invertible XML mapping that is NOT query
   preserving w.r.t. regular XPath.




  Complexity: determining information preservation


It is undecidable to determine, for any XML mapping defined in any
     language subsuming FO, whether or not the mapping is
(a) invertible, or

(b) query preserving w.r.t. any query language including projection
   queries.

For XML mappings defined in XQuery/XSLT, it is beyond reach
 to determine whether or not they are information preserving, or
 to have an effective function that automatically computes
   information preserving XML mappings.


                Schema mapping for XML

Objective: Given a source schema S1 and a target schema S2, find
   an information-preserving mapping   σd: I(S1) → I(S2) if there
   is any.
Approach:
 Find a mapping at the schema level:   σ: S1 → S2 with certain
   properties – schema embedding
 Derive an instance-level mapping σd: I(S1) → I(S2) from σ
   that is guaranteed to be information-preserving

A systematic way to compute XML mappings that
 automatically guarantee information preservation,
 cover important mappings commonly found in practice, and
 accommodate integration (multiple sources)
                      Schema embedding


Given
 source DTD S1 = (E1, P1, r1), target DTD S2 = (E2, P2, r2);
 similarity matrix att( ) on element type names: att(A, B) in [0, 1]
   indicates how close A ∈ E1 is to B ∈ E2
Schema embedding: σ = (λ, path)
 λ: E1 → E2, type mapping: λ(r1) = r2 and att(A, λ(A)) > 0
 path(A, B) maps an edge (A, B) in S1 to a unique path from λ(A)
   to λ(B) in S2:
    – Path: A1[position( ) = k1] / … / An[position( ) = kn]
    – The types of edge and path match: Information capacity
    – prefix-free: Type safety
                         Schema embedding
Schema embedding: σ = (λ, path)
 λ: E1 → E2, type mapping: λ(r1) = r2 and att(A, λ(A)) > 0
 path(A, B) maps an edge (A, B) in S1 to a unique path from λ(A)
     to λ(B) in S2
      – path type: AND (OR, STAR) edge to AND (OR, STAR) path
        (solid/star edges, solid + at least 1 dashed, solid edges + *)
          Information capacity
      – prefix-free: if P1(A) = A1, …, An, path(A, Ai) is NOT a prefix
        of any path(A, Aj) for j ≠ i; similarly for P1(A) = A1+ … + An.
           Type safety – valid mapping
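
The prefix-free condition can be checked mechanically. The sketch below assumes each path is encoded as a tuple of steps; the function name is_prefix_free and that encoding are illustrative choices, not part of the original definition.

```python
def is_prefix_free(paths):
    """True if no path in `paths` is a prefix of another path in the list.

    Each path is a tuple of steps, e.g. ("B[1]",) or ("B", "C").
    """
    for i, p in enumerate(paths):
        for j, q in enumerate(paths):
            if i != j and q[:len(p)] == p:
                return False
    return True

# Paths for the children of one source type A.
by_position = is_prefix_free([("B[position()=1]",), ("B[position()=2]",)])
nested = is_prefix_free([("B",), ("B", "C")])
unfolded = is_prefix_free([("A", "B"), ("B", "C")])
```

Two paths that differ only in position() are fine, while mapping one edge to B and a sibling edge to B/C is rejected because the first path is a prefix of the second.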

Is there a schema embedding for the following?
S1    A       S2     A           S1       A       S2   A
                                      *
 B        C    B         C            B       C    B       C
               Example: Schema embedding

      A            A
 S1           S2                  λ(A) = A, λ(B) = B, λ(C) = B

 B        C
               1       2        path(A, B) = B[position( ) = 1]
                   B            path(A, C) = B[position( ) = 2]



                       *
S1    A       S2   A             λ(A) = A, λ(B) = B, λ(C) = C
                                       path(A, B) = A/B
 B        C        B                   path(A, C) = B/C
                           Unfolding: the prefix-free condition.
                   C

                  Schema embedding: example

 λ(db) = school, λ(class) = course
  path(db, class) = courses/current/course
      – mapping edge to path
      – STAR edge to STAR path
      – Graph similarity? NO               school

 S1
         db             S2
              *              courses                students

        class                                               *
                   history             current      student

                        *              *
                             course                   ssn       name gpa   taking
                  Schema embedding: example

 λ(type) = category, λ(A) = A
  path(class, cno) = basic/cno
   path(class, title) = basic/semester/title
   path(class, type) = category
 AND (STAR) edges to AND (STAR) paths
                                                          course
 Relative path
 S1       class                    S2          basic               category

                                                    *
  cno     title     type
                           cno credit          semester


                           title        year    term
                Schema embedding: example
 λ(X) = X
  path(type, regular) = mandatory/regular
  path(type, project) = advanced/project
                                                        OR edges to OR paths
                               category
 S1      type      S2
                    mandatory            advanced
 regular project
                    regular        lab       seminar   project
 λ(X) = X
                                                                          course
  path(regular, prereq)       S1             class     S2                     .
                                                                              .
        = required/prereq                      .                          regular
                                               .
  path(prereq, class)                    regular
                                                                         required
        = course                             prereq
                                         *                           prereq          gpa
                                                                 *
            Deriving instance-level mapping


Each schema embedding σ: S1 → S2 determines an XML

   mapping σd: I(S1) → I(S2).

Given an XML tree T1 of S1, σd (T1) constructs an instance T2 of
   S2, top-down by mapping A-elements of T1 to λ(A)-nodes in T2
 the root of T2 is mapped from the root of T1;
 for each λ(A)-element in T2 mapped from an A-element of T1,
   generate path(A, B) in T2 for each B-child of the A-element;
 when all the elements in T2 mapped from nodes in T1 are fully
   expanded, add necessary “default” elements to T2 such that T2
   satisfies S2.
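
The top-down construction can be sketched as follows. This is a simplified illustration under stated assumptions: paths are lists of (label, is_star) steps, a star step always creates a fresh node while other steps reuse an existing child, position() qualifiers and OR edges are ignored, and the final padding with default elements is omitted. The names Node, materialize and apply_embedding are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    text: str = ""
    children: list = field(default_factory=list)

def materialize(parent, path):
    """Create `path` below `parent` and return the node at its end."""
    node = parent
    for label, is_star in path:
        # Star steps get a fresh node; other steps share an existing child.
        found = None if is_star else next(
            (c for c in node.children if c.label == label), None)
        if found is None:
            found = Node(label)
            node.children.append(found)
        node = found
    return node

def apply_embedding(src, lam, path, image=None):
    """Build the target tree for source tree `src` from (lam, path)."""
    if image is None:
        image = Node(lam[src.label])  # the root is mapped from the root
    image.text = src.text
    for child in src.children:
        target = materialize(image, path[(src.label, child.label)])
        assert target.label == lam[child.label]  # path must end at lam(B)
        apply_embedding(child, lam, path, target)
    return image

lam = {"db": "school", "class": "course", "cno": "cno", "title": "title"}
path = {
    ("db", "class"): [("courses", False), ("current", False), ("course", True)],
    ("class", "cno"): [("basic", False), ("cno", False)],
    ("class", "title"): [("basic", False), ("semester", False), ("title", False)],
}
t1 = Node("db", children=[
    Node("class", children=[Node("cno", "TDD"), Node("title", "Topics")]),
    Node("class", children=[Node("cno", "DB"), Node("title", "Databases")]),
])
t2 = apply_embedding(t1, lam, path)
```

In the resulting tree both classes become course elements under the same courses/current node, and each course shares one basic node for its cno and semester/title paths.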
             Properties of schema embedding


Theorem: The XML mapping σd: I(S1) → I(S2) derived from a
   schema embedding σ: S1 → S2 is
 well defined (type safety)
 invertible (with a quadratic-time inverse), and
 query preserving w.r.t. regular XPath (query rewriting: linear-time
   data complexity, quadratic-time combined complexity if the
   rewritten regular XPath expression is represented as an
   automaton)




                    Example: query rewriting


Example: rewriting a regular XPath query on S1 into an equivalent query on S2

Q1: class[cno/text() = ‘CS331’] /
       (type/regular/prereq/class)*

Q2: courses/current/course[basic/cno/text() = ‘CS331’] /
       (category/mandatory/regular/required/prereq/course)*
           db                                                  school
                *                                 courses               students
           class
                                        history             current      ...
   cno     title    type                     *    course   *
                                        basic              category
                            cno credit semester mandatory advanced
          regular project
                            title   year term      regular lab seminar project
           prereq                                         required
     *                                   *          prereq         gpa
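
The core of the rewriting is a substitution of each source edge by its path in the target. The sketch below covers only plain child-axis steps; the function name rewrite and the string encoding of paths are assumptions, and qualifiers, union and Kleene closure would be rewritten recursively in the same way.

```python
def rewrite(start_type, steps, path):
    """Rewrite a child-axis path over S1 into a path over S2.

    start_type: the S1 type the path starts from.
    steps: the S1 element types visited, in order.
    path: dict mapping each S1 edge (A, B) to its S2 path as a string.
    """
    out, current = [], start_type
    for nxt in steps:
        out.append(path[(current, nxt)])
        current = nxt
    return "/".join(out)

# Edge-to-path mapping taken from the schema embedding examples.
path = {
    ("db", "class"): "courses/current/course",
    ("class", "cno"): "basic/cno",
    ("class", "type"): "category",
    ("type", "regular"): "mandatory/regular",
    ("regular", "prereq"): "required/prereq",
    ("prereq", "class"): "course",
}
head = rewrite("db", ["class"], path)
qualifier = rewrite("class", ["cno"], path)
loop = rewrite("class", ["type", "regular", "prereq", "class"], path)
```

The three results correspond to the three parts of Q2: the leading path, the path inside the qualifier, and the body of the closure.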
                   Integration: multiple sources

   S1’     db                                              S1                db
             *                                                                    *
         student                                                             class

   ssn     name taking                                          cno          title    type
                                      S2
                   *
                   cno               school
                                                                            regular project

                         courses              students                       prereq
                                                     *              *
             history               current    student
                         ...
                       cno                     ssn       name gpa       taking
                                      *
λ(db) = school, λ(X) = X
path(db, student) = students/student
path(taking, cno) = cno

 S2: larger information capacity than S1 and S1’
 pairwise disjoint path mappings from S1, S1’ to S2
            Schema embedding vs. graph similarity
 Schema embedding
    –   Definition: mapping edges to paths
    –   Graphs: capturing various DTD constructs
    –   Global restructuring: source and target with different structures
    –   Information preservation: automatically guaranteed
 Graph simulation:
   – Definition: mapping edges to edges
   – Graphs: do not distinguish different types of edges
   – Local restructuring: identify certain edges from the same node
   – Information preservation: NO
Schema embedding is not a mild generalization of graph simulation
              A
  S1                   S2
                             A
                                   Schema embedding: NO
        1          2
                                    Graph simulation: YES
              B              B
   Schema embedding vs. existing mapping methods

 Information preservation for XML mappings:
   – Schema embedding: automatically guarantee both invertibility
     and query preservation w.r.t. regular XPath
   – Other approaches: do not consider information preservation
 Restructuring:
   – Schema embedding: global restructuring
   – Other approaches: typically for schemas with similar structures
 Data integration:
   – Schema embedding: capable of mapping multiple source
     schemas to a single target schema while preserving information
   – Other approaches: single source schema
Schema embedding: the first systematic method to define
  information-preserving XML mappings
        Complexity: finding schema embedding


Input: two DTD schemas S1 and S2, and a similarity matrix att( )
Output: a schema embedding σ: S1 → S2 such that

   qual(σ, att) is maximum, if there is any

   qual(σ, att) is the sum of att(A, λ(A)) for all A in S1

Theorem: It is NP-complete to determine whether or not there is a
  schema embedding from S1 to S2, even when S1 and S2 are
  nonrecursive and they consist of concatenation types only.

Efficient algorithms for finding a schema embedding are necessarily
    heuristic.
 Find a local embedding for each DTD production of S1
 Assemble local embeddings to make a schema embedding
  Computing local embedding – fixed type mapping

 Input: a production A → P(A) in a source schema S1, a target
    schema S2, and λ0 from types in P(A) to S2
Output:   σ0 = (λ0, path0),   a partial embedding from P(A) to S2
Example: given λ0(type) = category, λ0(X) = X, find path0
                                    category
      S1      type
                        S2
                          mandatory         advanced
      regular project
                         regular      lab   seminar    project

There is an O(|P(A)| |S2|) algorithm findPath to compute local
embedding
(depth-first search, checking each S2 subtree only once)
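
A depth-first search of this kind might look like the sketch below. It is not the findPath algorithm itself: it only finds one label path from λ0(A) down to λ0(B) and does not check path types or prefix-freeness. The dictionary encoding of S2 is an assumption, as is the placement of lab under mandatory and seminar under advanced.

```python
def find_path(schema, start, goal):
    """Depth-first search for a label path from `start` down to `goal`.

    schema: target schema as {type: [child types]}.
    Returns the list of labels on the path, or None if there is none.
    Each schema node is expanded at most once.
    """
    stack, visited = [(start, [])], {start}
    while stack:
        node, path = stack.pop()
        for child in schema.get(node, []):
            if child == goal:
                return path + [child]
            if child not in visited:
                visited.add(child)
                stack.append((child, path + [child]))
    return None

s2 = {
    "category": ["mandatory", "advanced"],
    "mandatory": ["regular", "lab"],
    "advanced": ["seminar", "project"],
}
# With lambda0(type) = category and lambda0(X) = X:
to_regular = find_path(s2, "category", "regular")
to_project = find_path(s2, "category", "project")
```

The two paths found agree with path(type, regular) = mandatory/regular and path(type, project) = advanced/project.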

                Computing local embedding

Input: a production A → P(A) in a source schema S1, a target
   schema S2
Output:   σ0 = (λ0, path0),   a partial embedding from P(A) to S2
Example: λ0 is not given, find both λ0 and path0
                                    category
      S1      type
                        S2
                          mandatory         advanced
      regular project
                         regular      lab   seminar    project

Theorem. When λ0 is not fixed, the local embedding problem is
NP-hard
Heuristic: randomized findPath to find both λ0 and path0
(randomly pick a possible type-node match in the search)
                 Assembling local embeddings


Input: C(A), a set of local embeddings for each A in the source
   schema S1; a target schema S2.
Output:     σ = (λ, path),   a schema embedding from S1 to S2 if there
   is any

Theorem: The assemble-embedding problem is NP-complete even
   when S1 and S2 are nonrecursive.

Conflict:
 type mapping

 prefix free

Efficient assembling algorithms have to be heuristic
      Heuristic for assembling local embeddings


Heuristic algorithms:

1. Fix an order O on S1 types via qual( ), pick a local embedding

   σA from C(A) in the order O,   and increment σ with σA in the
   absence of conflict

2. Assume a random order O on S1 types, then do the same as (1)

3. Reduction to the Maximum Weight Independent Set problem,
   leveraging an existing tool for that problem.

Use randomized findPath to
 initialize C(A), and
 generate new local embeddings in case of failure
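
Heuristic 1 can be sketched as a greedy loop. This is a simplified illustration: local embeddings are reduced to their type mappings, only type-mapping conflicts are detected (prefix-free conflicts are not), and the sketch gives up instead of generating new local embeddings on failure. The names qual, assemble and the sample att values are assumptions.

```python
def qual(lam, att):
    """Sum of att(A, lam(A)) over all mapped source types A."""
    return sum(att.get((a, b), 0.0) for a, b in lam.items())

def assemble(candidates, att):
    """Greedy assembly of local type mappings into one mapping.

    candidates: {A: [local mappings]}, each a dict {source type: target type},
    with at least one local mapping per source type.
    Returns the merged mapping, or None if some type has no conflict-free choice.
    """
    # Visit source types whose best local embedding has the highest quality first.
    order = sorted(candidates,
                   key=lambda a: max(qual(l, att) for l in candidates[a]),
                   reverse=True)
    merged = {}
    for a in order:
        for local in sorted(candidates[a], key=lambda l: qual(l, att),
                            reverse=True):
            # Conflict: a type is already mapped to a different target type.
            if all(merged.get(s, t) == t for s, t in local.items()):
                merged.update(local)
                break
        else:
            return None
    return merged

att = {("type", "category"): 0.9, ("regular", "regular"): 1.0,
       ("project", "project"): 1.0, ("regular", "lab"): 0.2}
ok = assemble({
    "type": [{"type": "category", "regular": "regular", "project": "project"}],
    "regular": [{"regular": "lab"}, {"regular": "regular"}],
}, att)
clash = assemble({
    "type": [{"type": "category", "regular": "regular"}],
    "regular": [{"regular": "lab"}],
}, att)
```

In the second call the only candidate for regular maps it to lab, which conflicts with the choice already made for type, so the sketch reports failure.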
     Information-preserving schema embedding


 Information preservation:
    – more intriguing than its relational counterparts: separation,
      equivalence, complexity of invertibility and query
      preservation
    – important for data exchange, migration, integration, P2P, …
 Schema embedding:
    – automatically guarantee information preservation
    – capture various DTD schema constructs
    – support global restructuring
    – accommodate integration: multiple sources to a single target
    – NP-complete, but with efficient and effective heuristics
  Schema mapping, data exchange and integration



 Schema mapping and data exchange

 Schema matching

 Data integration: an introduction




         Schema mapping vs. schema matching

Where can we get similarity matrix att( ) on element type names?

Schema matching:
 Input: a source schema S1 and a target schema S2
 Output: a pairing (association) of elements (attributes, tags) from
   S1 to elements (attributes, tags) of S2
Correspondence between attributes/tags in the source and target

Schema mapping: to define data transformation
 Input: a source schema S1 and a target schema S2
 Output: a mapping from instances of S1 to instances of S2
Schema matching is a first step toward finding a schema mapping

 We focus on relational schema matching – already hard
         Schema Matching vs. Schema Mapping

 Schema Matching means “computer-suggested arrows”
                      Source
     RS.Person      Schema: RS           RT.Student


                            .88
        First                              Name
                                                             Target
        Last                .93            Address         Schema: RT
        City                               City
          .                  .97             .
          .                                  .
          .                                  .



 Arrows inferred based on meta-data or sample instance data
 Associated confidence score
 Meaning (variant of): RS.Person.City  RT.Student.City
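
As a rough illustration of how arrows with confidence scores could be suggested from meta-data alone, the sketch below scores attribute names by string similarity using difflib from the Python standard library. This is an assumption made for the example, not the matching technique of the slides or of the cited paper; real matchers also use types, constraints and sample instance data. The names att and suggest_arrows and the threshold are hypothetical.

```python
from difflib import SequenceMatcher

def att(a, b):
    """Similarity in [0, 1] between two attribute names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def suggest_arrows(source_attrs, target_attrs, threshold=0.5):
    """For each source attribute, suggest its best target if the score is high enough.

    Returns a list of (source, target, score) triples.
    """
    arrows = []
    for s in source_attrs:
        best = max(target_attrs, key=lambda t: att(s, t))
        score = att(s, best)
        if score >= threshold:
            arrows.append((s, best, score))
    return arrows

arrows = suggest_arrows(["First", "Last", "City"], ["Name", "Address", "City"])
```

A purely name-based score finds the City arrow but cannot relate First and Last to Name, which is one reason instance data is also used.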
          Schema Mapping: “From Arrows to Queries”


RS.Person: First, Last, City, ...
RT.Student: Name, Address, City, ...

Q: RS -> RT
    select concat(First, " ", Last) as Name,
           City as City
    from RS.Person, RS.Education, …
    where …



   Given a set of arrows as user input, produce a query (a schema
     mapping) that maps instances of RS into instances of RT

   Reference: Putting Context into Schema Matching, VLDB 06
   http://homepages.inf.ed.ac.uk/wenfei/papers/vldb06-matching.pdf
                     Inventory mapping example


 Consider two inventory schemas
 Books, music in separate tables in RT
 Run some nice schema match software
                                  RT.book
     RS.inv
                                title: string
   id: integer                  isbn: string
   name: string                 price: float
   code: string                 format: string
   type: integer
   instock: string                     RT.music
   descr: string                      title: string
                                      asin: string
   arrival: date
                                      price: float
                                      sale: float
                                      label: string
                       Inventory where clause

 The lines are helpful (schema matching is a best-effort affair),
   but…
 lines are semantically correct only in the context of a selection
   condition
                     where type=1     RT.book
     RS.inv
                                    title: string
   id: integer                      isbn: string
   name: string                     price: float
   code: string                     format: string
   type: integer
   instock: string                         RT.music
   descr: string                          title: string
                                          asin: string
   arrival: date
                                          price: float
                                          sale: float
                     where
                      type = 2            label: string
                      Definition and Goals

 Contextual schema match: A set of arrows between source and
   target schema elements, annotated with logical conditions
       (RS.aa, RT.bb, RS.c = 3)
    – In a standard schema match, the condition “true” is always
      used
       (RS.aa, RT.bb, true)

 Goal: Adapt instance-driven schema matching techniques to
   infer semantically valid contextual schema matches, and create
   maps from those matches




                        Grade mapping example


 Consider integrating data about grade assignments
 Again context is needed, but semantics are slightly different
                                                  where Assgn=2

                        where Assgn=1                              =3      =…


 Name   Assgn   Grade                   Name   Grade1   Grade2    Grade3   …
 Joe     1                              Bob
                 84
 Joe     2       86                     Sue
 Joe     3       75
 Mary    1       92
 Mary    2       94
 Mary    3       85
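
The grade example can be reproduced with one contextual view per assignment, joined on Name. The sketch below uses SQLite from the Python standard library; the table name src, the helper pivot_query and the use of left joins over the distinct names are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table src (Name text, Assgn integer, Grade integer)")
conn.executemany("insert into src values (?, ?, ?)", [
    ("Joe", 1, 84), ("Joe", 2, 86), ("Joe", 3, 75),
    ("Mary", 1, 92), ("Mary", 2, 94), ("Mary", 3, 85),
])

def pivot_query(n):
    """Build a query with one view per context 'where Assgn = j', j = 1..n."""
    cols = ", ".join(f"v{j}.Grade as Grade{j}" for j in range(1, n + 1))
    joins = " ".join(
        f"left join (select Name, Grade from src where Assgn = {j}) v{j} "
        f"on v{j}.Name = names.Name" for j in range(1, n + 1))
    return (f"select names.Name, {cols} "
            f"from (select distinct Name from src) names {joins} "
            f"order by names.Name")

rows = conn.execute(pivot_query(3)).fetchall()
```

Each source row contributes to a different target column depending on its Assgn value, which is the sense in which the arrow from Grade is valid only in a context.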




     From schema matching to schema mapping


Schema mapping:
 Input: a collection of matches (RS.aa, RT.bb, c)
 Output: a mapping (query) Q( ) from instances of S1 to
   instances of S2
Approach: for each target relation RT
 For each RS, define an SQL query R(S, T) from I(RS) to I(RT),
   based on the matches
 Schema mapping for RT is the union of all SQL queries
   defined for RT
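
A minimal sketch of this approach, assuming each match is given as (source table, source attribute, target table, target attribute, condition): matches are grouped by target, source and condition, one select is produced per group, and the selects for a target are unioned. The function name mapping_queries and the attribute-level arrows in the usage example are hypothetical; logical tables, outer joins and default values are not handled.

```python
from collections import defaultdict

def mapping_queries(matches):
    """Return {target table: SQL text} built from contextual matches."""
    groups = defaultdict(list)
    for s_tab, s_attr, t_tab, t_attr, cond in matches:
        groups[(t_tab, s_tab, cond)].append(f"{s_attr} as {t_attr}")
    per_target = defaultdict(list)
    for (t_tab, s_tab, cond), cols in groups.items():
        per_target[t_tab].append(
            f"select {', '.join(cols)} from {s_tab} where {cond}")
    # The mapping for a target is the union of the queries defined for it.
    return {t: " union ".join(qs) for t, qs in per_target.items()}

# Hypothetical arrows for the inventory example; only the conditions
# type = 1 (books) and type = 2 (music) come from the example itself.
matches = [
    ("RS.inv", "name", "RT.book", "title", "type = 1"),
    ("RS.inv", "code", "RT.book", "isbn", "type = 1"),
    ("RS.inv", "name", "RT.music", "title", "type = 2"),
    ("RS.inv", "code", "RT.music", "asin", "type = 2"),
]
queries = mapping_queries(matches)
```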



 From schema matching to schema mapping (cont.)


Define R(S, T)
 Create a “logical table” based on
    – Inclusion constraints RS.aa ⊆ RT.bb from the matches
    – Extend inclusion by semantic association rules: outer joins
         •   Attributes in the same source table
         •   key-foreign keys: if RS1 is a logical table, and RS1 has a
             foreign key referencing RS2, then extend RS1 by outer
             join with RS2 on the foreign key
 For attributes in RT that are not mapped from any attributes in
   RS, add a default value (using a Skolem function)
 Derive the SQL query from the logical table
From schema matching to schema mapping (cont.)


Semantic association based on context. Recall (RS.aa, RT.bb, c)
 Contextual foreign key: RS.aa ⊆ RT.bb where c
 Extended semantic rules:
    view: select aa from RS where c
    Source: student(name, email, address)
            project(name, assign, grade, inst)
    Target: proj(name, inst, assign0, grade0, …, assign9, grade9)
    – view Vj: select name, grade from project where assign = j
    Logical table: group the 10 views by outer-join on name
    – view Uj: select name, inst from project where assign = j
    Logical table: group Vi and Uj only if i = j
    – ...
  Schema mapping, data exchange and integration



 Schema mapping and data exchange

 Schema matching

 Data integration: an introduction




                        Data Integration


Data exchange: from a single source to target.
What about multiple distributed and heterogeneous sources?

Integration: distributed systems with middleware architecture
 Warehouse: AIG
 Mediator: Enosys
 Hybrid: Active XML




               Middleware: data warehouse


 Data warehouse: a repository of integrated information,
   available for querying and analysis
 Data warehousing: architectures, algorithms and tools for
   integrating data from multiple databases or other information
   sources into a single repository

Heterogeneous sources
 structured data
    object-oriented databases, relational databases, ...
 semistructured data
   XML documents, Web data, ...
 unstructured data
    video, audio, ...
              Warehouse architecture
                  client applications




                  data warehouse



                       integrator



monitor/wrapper     monitor/wrapper     monitor/wrapper




     RDB                   OODB             XML
                       Monitor/wrapper


A monitor/wrapper for each data source: incrementally added
 translation: translate an information source into a common
   integrating model
 change detection: detect changes to the underlying data source
   and propagate the changes to the integrator
    – active databases (triggers: event, condition, action)
    – logged sources: inspecting logs
    – periodic polling, periodic dumps/snapshots
 Data cleaning:
    – detect erroneous/incomplete information to ensure validity
    – back flushing: return cleaned data to the source
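
For the periodic dumps/snapshots option, change detection amounts to comparing two consecutive snapshots. A minimal sketch, assuming each snapshot is a dict keyed by the primary key of the source relation (the function name and the sample data are hypothetical):

```python
def diff_snapshots(old, new):
    """Return the changes that turn snapshot `old` into snapshot `new`.

    Both snapshots map a primary key to the tuple stored under that key.
    """
    changes = []
    for key in new.keys() - old.keys():
        changes.append(("insert", key, new[key]))
    for key in old.keys() - new.keys():
        changes.append(("delete", key, old[key]))
    for key in old.keys() & new.keys():
        if old[key] != new[key]:
            changes.append(("update", key, new[key]))
    # Sort by operation and key so the output is deterministic.
    return sorted(changes, key=lambda change: (change[0], change[1]))


old = {1: "ann@x.org", 2: "bob@x.org"}
new = {1: "ann@y.org", 3: "eve@x.org"}
changes = diff_snapshots(old, new)
print(changes)
```

The resulting list is what a monitor would propagate to the integrator. Triggers or log inspection deliver the same kind of notifications without the cost of comparing full dumps.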

                            Integrator


Receive change notifications from the wrapper/monitors and reflect
  the changes in the data warehouse.
Typically a rule-based engine:
 merging information (e.g., Skolemization)
 handling references
 Data cleaning:
    – removing redundancies and inconsistencies
    – inserting default values
    – blocking sources
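
A toy integrator, to make the merging and default-value steps concrete. It is a sketch under assumptions, not a rule-based engine: the Skolem function skolem_id, the DEFAULTS table and the record layout are invented for illustration.

```python
def skolem_id(name, email):
    # Hypothetical Skolem function: equal arguments always yield the same id,
    # so records describing the same person merge under one key.
    return f"person({name.strip().lower()},{email.strip().lower()})"


DEFAULTS = {"phone": "unknown"}  # assumed default values


class Integrator:
    def __init__(self):
        self.warehouse = {}  # Skolem id -> {"data": ..., "sources": ...}

    def apply(self, source, op, record):
        """Reflect one change notification; op is 'insert' or 'delete'."""
        oid = skolem_id(record["name"], record["email"])
        if op == "insert":
            entry = self.warehouse.setdefault(
                oid, {"data": dict(DEFAULTS), "sources": set()}
            )
            entry["data"].update(record)
            entry["sources"].add(source)
        elif op == "delete" and oid in self.warehouse:
            entry = self.warehouse[oid]
            entry["sources"].discard(source)
            if not entry["sources"]:  # no source supports the object any more
                del self.warehouse[oid]


integrator = Integrator()
integrator.apply("RDB", "insert", {"name": "Ann", "email": "ann@x.org"})
integrator.apply("XML", "insert",
                 {"name": "ann", "email": "ANN@x.org", "phone": "555"})
integrator.apply("RDB", "delete", {"name": "Ann", "email": "ann@x.org"})
print(integrator.warehouse)
```

Both sources describe the same person, so the warehouse holds a single object. It survives the deletion at the RDB source because the XML source still supports it.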



                           Warehouse

Data from data sources are imported into the warehouse
 the underlying data sources are still operational
 the data is replicated in the warehouse
The warehouse data is not typically in the same form and volume
  as in the underlying sources:
 metadata and subject-oriented: for analytical purpose (e.g.,
   sales, marketing, finance, distribution, …)
 historical data (timespan): cover a long time frame
 multi-dimensional: data cubes or hypercubes
 highly integrated and summarized: derived data
 granularity: roll up and drill down
 large volume of data: VLDB (very large database)
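
Roll up and drill down change the granularity at which the same facts are summarized. A small sketch with sqlite3; the sales table and its columns are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(year INTEGER, quarter INTEGER, region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        (2023, 1, 'east', 10), (2023, 1, 'west', 20),
        (2023, 2, 'east', 30), (2024, 1, 'east', 40);
""")


def aggregate(dimensions):
    """Total amount grouped by the given dimension columns."""
    cols = ", ".join(dimensions)
    sql = (f"SELECT {cols}, SUM(amount) FROM sales "
           f"GROUP BY {cols} ORDER BY {cols}")
    return conn.execute(sql).fetchall()


by_year = aggregate(["year"])                     # rolled up: coarse granularity
by_year_quarter = aggregate(["year", "quarter"])  # drilled down: finer granularity
print(by_year)
print(by_year_quarter)
```

A warehouse would typically store such summaries as derived data instead of recomputing them for every query.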
                            Applicability

Problem: potential inconsistency with the sources.
Commonly used for relatively “static” data
 when clients require specific, predictable portions of the available
   information
 when clients require high query performance but not necessarily
   the most recent state of the information
 when clients want summarized/aggregated information such as
   historical information
Examples:
 scientific data
 historical enterprise data
 caching frequently requested information
         Data warehouse vs. materialized views

 materialized view is over an individual structured database,
   while a warehouse is over a collection of heterogeneous,
   distributed data sources
 materialized view typically has the same form as in the
   underlying database, while a warehouse stores highly
   integrated and summarized data
 materialized view modifications occur within the same
   transaction updating its underlying database, while a
   warehouse may have to deal with independent sources:
    – sources simply report changes
    – sources may not have locking capability
    – integrator is loosely coupled with the sources
               Mediated system architecture


Virtual approach: data is not stored in the middle tier


                          client applications



                               Mediator



     wrapper                     wrapper                  wrapper




        RDB                       OODB                    XML
                Lazy vs. eager approaches


Lazy approach (mediated systems):
 accept a query, determine the appropriate set of data sources,
   generate sub-queries for each data source
 obtain results from the data sources, perform translation,
   filtering and composing, and return the final answer


Eager approach (warehouses):
 information from each source that may be of interest is
   extracted in advance, translated, filtered, merged with relevant
   sources, and stored in a repository
 query is evaluated directly against the repository, without
   accessing the original information sources
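
The contrast between the two approaches can be sketched with toy classes. Everything here (Source, Mediator, Warehouse and the sample data) is a hypothetical illustration; a real system would also translate and decompose queries per source.

```python
class Source:
    """A data source that answers selection queries and counts accesses."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.calls = 0

    def query(self, predicate):
        self.calls += 1
        return [row for row in self.rows if predicate(row)]


class Mediator:
    """Lazy: sub-queries are sent to the sources at query time."""

    def __init__(self, sources):
        self.sources = sources

    def answer(self, predicate):
        result = []
        for source in self.sources:
            result.extend(source.query(predicate))
        return sorted(result)


class Warehouse:
    """Eager: data is extracted in advance and stored in a repository."""

    def __init__(self, sources):
        self.repository = []
        for source in sources:
            self.repository.extend(source.query(lambda row: True))

    def answer(self, predicate):
        return sorted(row for row in self.repository if predicate(row))


s1 = Source([("pen", 2), ("ink", 9)])
s2 = Source([("pad", 5)])
warehouse = Warehouse([s1, s2])
mediator = Mediator([s1, s2])

s2.rows.append(("cap", 1))  # a source changes after the warehouse was loaded


def cheap(row):
    return row[1] < 6


lazy_answer = mediator.answer(cheap)
eager_answer = warehouse.answer(cheap)
print(lazy_answer, eager_answer)
```

The mediator sees the new tuple but has to contact every source for each query, while the warehouse answers from its repository and may return stale data.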
         Data warehouse vs. mediated systems


 Efficiency
   – response time: at the warehouse, queries can be answered
      efficiently without accessing original data sources.
      Advantageous when data sources are slow, expensive or
      periodically unavailable, or when translation, filtering and
      merging require significant processing
   – space: warehousing consumes extra storage space
 Extensibility: warehouse
 consistency with the sources: warehouse data may become out
  of date
 applicability:
    – warehouses: for high query performance and static data
    – mediated systems: for information that changes rapidly
                     Data integration in XML


 multiple, heterogeneous data sources – multi-source queries,
  query decomposition, object fusion, …
 distributed sources: scheduling of query execution
 schema-conformance, …

[Figure: integration middleware placed between an XML interface (query, answer, schema) and the sources DB1, DB2, DB3, DB4. Labels on the middleware: integration, query translation, updates.]
              AIG: Schema-directed integration


[Figure: middleware over DB1, DB2 and DB3, driven by an Attribute Integration Grammar, produces an XML view that conforms to (D, Σ).]



 Integration:
   – extract relevant data from distributed, multiple databases
   – construct an XML view
 Schema-directed: conformance to a predefined schema (D, Σ)
   – D: a DTD, type constraints
   – Σ: a set of XML integrity constraints (keys, foreign keys)
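
To illustrate what the integrity constraints demand of the XML view, the sketch below checks one key and one foreign key on a small document. It only checks constraints on a finished view; it does not show how an AIG builds the view. The element and attribute names (course, enroll, cno, sid) are hypothetical.

```python
import xml.etree.ElementTree as ET

view = ET.fromstring(
    "<db>"
    "<course cno='c1'/><course cno='c2'/>"
    "<enroll sid='s1' cno='c1'/><enroll sid='s2' cno='c3'/>"
    "</db>"
)


def key_holds(root, tag, attr):
    """Key: no two `tag` elements share the same value of `attr`."""
    values = [element.get(attr) for element in root.iter(tag)]
    return len(values) == len(set(values))


def foreign_key_violations(root, tag, attr, ref_tag, ref_attr):
    """Foreign key: every tag.attr value must occur as some ref_tag.ref_attr."""
    targets = {element.get(ref_attr) for element in root.iter(ref_tag)}
    return [element.get(attr) for element in root.iter(tag)
            if element.get(attr) not in targets]


course_key_ok = key_holds(view, "course", "cno")
dangling = foreign_key_violations(view, "enroll", "cno", "course", "cno")
print(course_key_ok, dangling)
```

Here the key on course holds, but the enrollment in c3 refers to no course, so this view would not conform to a schema that includes the foreign key.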
                 Enosys – a mediated system

 XMLizer: wrapper, converting source to virtual XML view
 Mediator: export virtual integrated XML (VIX) database
 Translator: rewrites XML queries to intermediate algebraic exps
 Rewriter: decompose multi-source queries
 Optimizer: query plan generation
 Execution: sends requests to wrappers, compose results, tagging, …

[Figure: a query enters the Mediator, which contains the Translator, the Rewriter/optimizer and the Execution engine; below it, one XMLizer sits on top of each source.]
    Active XML – a hybrid of warehouse and mediator


 XML doc template: a mix of
      – data nodes -- materialization
      – function nodes: embedded calls to
         Web services, queries -- virtual
    Integration:
      – data nodes: concrete data (static)
      – function nodes extract up-to-date data (dynamic)
    Data exchange: materialization before/after sending the doc
      – After: smaller doc (less transmission cost) -- Web services
         can be accessed from the receiver site
      – Before: for security (access control) and capability reasons
    Optimization: very hard
    Target-schema conformance? Fixed doc template
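
A document template with data nodes and function nodes, and its materialization, can be sketched with the standard xml.etree.ElementTree module. The call element, its attributes and the SERVICES table are hypothetical stand-ins for embedded Web service calls; they are not the actual Active XML syntax.

```python
import xml.etree.ElementTree as ET

# Stand-in for remote Web services: service name -> Python function.
SERVICES = {"temperature": lambda city: "17"}

doc = ET.fromstring(
    "<city><name>Edinburgh</name>"
    "<temp><call service='temperature' arg='Edinburgh'/></temp></city>"
)


def materialize(root):
    """Replace every function node by the data its call returns."""
    for parent in list(root.iter()):  # snapshot, since the tree is modified
        for child in list(parent):
            if child.tag == "call":
                result = SERVICES[child.get("service")](child.get("arg"))
                parent.remove(child)
                parent.text = (parent.text or "") + result
    return root


virtual = ET.tostring(doc, encoding="unicode")  # sent with the call embedded
materialized = ET.tostring(materialize(doc), encoding="unicode")
print(materialized)
```

Sending the virtual form defers the call to the receiver. Materializing first, as in the second string, suits receivers that lack access rights or the capability to invoke the service.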
                   Summary and review

 What are the differences between schema mapping and
   schema matching? Why should we care about these?
 Information preservation: what? Why?

 Contextual schema matching: what? Why?

 What is the main difference between data exchange and data
  integration?
 What are the main approaches to integrating data? What are the
  major difficulties?
 For what applications is warehousing preferable to a mediator-
  based approach?



                   Summary and review

 Understand AIG, Enosys and Active XML
 Given an AIG and relational sources, you should be able to
   – understand how AIG works by providing the integrated XML
     data
   – understand the optimization and evaluation process
 Can you combine Active XML and ATG to ensure target
  schema-conformance?
 Find and read papers about the GUI of Enosys
 Find and read papers on data integration based on Active XML





								