Embed
Email

Data Integration

Document Sample

Shared by: ewghwehws
Categories
Tags
Stats
views:
0
posted:
2/9/2012
language:
pages:
16
Data Integration

Overview

• Virtual Tables and Mediators

• On the fly data integration

Data Warehouse Architecture

OLTP OLAP





OLTP Applications GUI, Spreadsheets









DB1



DB2

Data Warehouse

DB3

Data Warehouse Architecture

• Data Warehouse = Materialized View(s)

– base data comes from different sources

– schema of data warehouse can be defined as

SQL over base data (modulo data cleaning)

• What is the alternative?

– data warehouse = (normal) View

• Advantages of Materialized Views

– guaranteed response time after ETL

• Disadvantages of Materialized Views

– investment for ETL (needs to be amortized)

– freshness of data for „real-time warehousing“

– additional resources to store / query views

Heterogeneous DBMS

• Idea: Transparently query n databases as if it

were one database (SQL or XQuery)

– hide heterogeneity

– hide distribution

• Kinds of heterogeneity

– heterogeneous hardware / OS / platform

– heterogeneous DBMS products (DB2, Oracle, ...)

– heterogeneous data models (XML, hierarchical, ...)

– semantic heterogeneity (EUR, USD, ...)

Mediator Architecture



Client Client Client



SQL



Mediator



SQL



wrapper wrapper wrapper wrapper wrapper



proprietary



DB2 LDAP EXCEL SAP MySQL

Mediator Architecture

• Key idea

– hide heterogeneity in wrappers

• data format conversions, capabilities

• wrapper makes data source look like DB

– mediator operates like a distributed DB

• optimizes join orders

• optimizes data movement

• needs to respect capabilities of data sources (!)

• Does not help for semantic heterogeneity!

Heterogeneous DB Products

• IBM Data Joiner

– today, part of DB2

• Products from all major DB vendors

• Research Prototypes

– Garlic, TSIMMIS, Info Manifold, ...

– very popular in the late 90‘s

What is the Schema of a HDB?

• Which schema is supported by mediator?

• Local as View

– the local schema is a view on the global schema

– involves answering queries using views

– change local schema independent of global schema

• Global as View

– the global schema is a view over the local schemas

– involves view unfolding (much simpler)

– change to local schema -> change of global schema

• N.B.

– often difficult to find global schema (no matter)

none of these approaches scale at the Web-level

Schema Examples

• Global Schema

– emp(eno, name, salary, dno)

– dept(dno, manager, budget)

• Local Schema 1

– empL1(eno, name, salary, level, dno)

– deptL1(dno, budget)

• Local Schema 2

– empL2(eno, name, salary, dno)

– deptL2(dno, manager, budget)

Global as View

create view emp as

(select eno, name, (salary * 1.6), dno

from empL1)

UNION

(select * from empL2)



create view dept as

(select d.dno, e.eno, d.budget * 1.6

from empL1, deptL2

where e.dno = d.dno and e.level = „manager“)

UNION

(select * from deptL2)

GaV: Query Processing

select max(budget) from dept



is equivalent to (after optimization+assumpt.)



select max(budget)

from (select max(budget * 1.6) as budget)

from deptL1) union

(select max(budget) as budget

from deptL2)

Local as View

create view deptL2 as

(select dno, budget / 1.6

from dept)



create view empL2 as

(select eno, name, salary, dno

from emp)



• Query processing complicated, but

ultimately the same after AQUV.

Open Questions

• How do you execute updates?

– difficult in the GaV approach

• How do you define integrity constraints?

– typically done at the data source level only

– global integrity constraints very difficult

• How do you execute transactions?

– the usual distributed DBMS shabang (2PC, etc.)

Capability-based Query Opt.

• Let us assume that your empL1 DB is

served by a Web-based interface

– only look-ups by eno and dno possible

• Query

select max(salary)

from emp

where dno = 4711;





• What is the best feasible plan for this query?

Solution

select max(salary)

from (select eno, name, salary*1.6 as salary)

from empL1) union

(select max(salary)

from empL2)



• How is this done?

– wrappers provide rules to enumerate possible plans

– wrappers provide cost model for their plans

– reuse query optimizer (DP) infrastructure

– [Haas et al., VLDB 1997]



Related docs
Other docs by ewghwehws
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!