Data Integration
Overview
• Virtual Tables and Mediators
• On the fly data integration
Data Warehouse Architecture
OLTP OLAP
OLTP Applications GUI, Spreadsheets
DB1
DB2
Data Warehouse
DB3
Data Warehouse Architecture
• Data Warehouse = Materialized View(s)
– base data comes from different sources
– schema of data warehouse can be defined as
SQL over base data (modulo data cleaning)
• What is the alternative?
– data warehouse = (normal) View
• Advantages of Materialized Views
– guaranteed response time after ETL
• Disadvantages of Materialized Views
– investment for ETL (needs to be amortized)
– freshness of data for „real-time warehousing“
– additional resources to store / query views
Heterogeneous DBMS
• Idea: Transparently query n databases as if it
were one database (SQL or XQuery)
– hide heterogeneity
– hide distribution
• Kinds of heterogeneity
– heterogeneous hardware / OS / platform
– heterogeneous DBMS products (DB2, Oracle, ...)
– heterogeneous data models (XML, hierarchical, ...)
– semantic heterogeneity (EUR, USD, ...)
Mediator Architecture
Client Client Client
SQL
Mediator
SQL
wrapper wrapper wrapper wrapper wrapper
proprietary
DB2 LDAP EXCEL SAP MySQL
Mediator Architecture
• Key idea
– hide heterogeneity in wrappers
• data format conversions, capabilities
• wrapper makes data source look like DB
– mediator operates like a distributed DB
• optimizes join orders
• optimizes data movement
• needs to respect capabilities of data sources (!)
• Does not help for semantic heterogeneity!
Heterogeneous DB Products
• IBM Data Joiner
– today, part of DB2
• Products from all major DB vendors
• Research Prototypes
– Garlic, TSIMMIS, Info Manifold, ...
– very popular in the late 90‘s
What is the Schema of a HDB?
• Which schema is supported by mediator?
• Local as View
– the local schema is a view on the global schema
– involves answering queries using views
– change local schema independent of global schema
• Global as View
– the global schema is a view over the local schemas
– involves view unfolding (much simpler)
– change to local schema -> change of global schema
• N.B.
– often difficult to find global schema (no matter)
none of these approaches scale at the Web-level
Schema Examples
• Global Schema
– emp(eno, name, salary, dno)
– dept(dno, manager, budget)
• Local Schema 1
– empL1(eno, name, salary, level, dno)
– deptL1(dno, budget)
• Local Schema 2
– empL2(eno, name, salary, dno)
– deptL2(dno, manager, budget)
Global as View
create view emp as
(select eno, name, (salary * 1.6), dno
from empL1)
UNION
(select * from empL2)
create view dept as
(select d.dno, e.eno, d.budget * 1.6
from empL1, deptL2
where e.dno = d.dno and e.level = „manager“)
UNION
(select * from deptL2)
GaV: Query Processing
select max(budget) from dept
is equivalent to (after optimization+assumpt.)
select max(budget)
from (select max(budget * 1.6) as budget)
from deptL1) union
(select max(budget) as budget
from deptL2)
Local as View
create view deptL2 as
(select dno, budget / 1.6
from dept)
create view empL2 as
(select eno, name, salary, dno
from emp)
• Query processing complicated, but
ultimately the same after AQUV.
Open Questions
• How do you execute updates?
– difficult in the GaV approach
• How do you define integrity constraints?
– typically done at the data source level only
– global integrity constraints very difficult
• How do you execute transactions?
– the usual distributed DBMS shabang (2PC, etc.)
Capability-based Query Opt.
• Let us assume that your empL1 DB is
served by a Web-based interface
– only look-ups by eno and dno possible
• Query
select max(salary)
from emp
where dno = 4711;
• What is the best feasible plan for this query?
Solution
select max(salary)
from (select eno, name, salary*1.6 as salary)
from empL1) union
(select max(salary)
from empL2)
• How is this done?
– wrappers provide rules to enumerate possible plans
– wrappers provide cost model for their plans
– reuse query optimizer (DP) infrastructure
– [Haas et al., VLDB 1997]