Learning to Map Between Schemas and Ontologies
Alon Halevy University of Washington
Joint work with Anhai Doan and Pedro Domingos
Agenda
Ontology mapping is a key problem in many applications:
– Data integration
– Semantic web
– Knowledge management
– E-commerce
LSD:
– Solution that uses multi-strategy learning.
– We’ve started with schema matching (i.e., very simple ontologies).
– Currently extending to more expressive ontologies.
– Experiments show the approach is very promising!
The Structure Mapping Problem
Types of structures:
– Database schemas, XML DTDs, ontologies, …
Input:
– Two (or more) structures, S1 and S2
– Data instances for S1 and S2
– Background knowledge
Output:
– A mapping between S1 and S2
– Should enable translating between data instances.
– Semantics of mapping?
Semantic Mappings between Schemas
Source schemas = XML DTDs
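[Figure: two house DTDs. One has the elements address, contact-info, agent-name, agent-phone, and num-baths; the other has location, contact, name, phone, full-baths, and half-baths. Arrows mark a 1-1 mapping and a non 1-1 mapping between them.]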
Motivation
Database schema integration
– A problem as old as databases themselves.
– database merging, data warehouses, data migration
Data integration / information gathering agents
– On the WWW, in enterprises, large science projects
Model management:
– Model matching: key operator in an algebra where models and mappings are first-class objects.
– See [Bernstein et al., 2000] for more.
The Semantic Web
– Ontology mapping.
System interoperability
– E-services, application integration, B2B applications, …
Desiderata from Proposed Solutions
Accuracy, efficiency, ease of use.
Realistic expectations:
– Unlikely to be fully automated. Need user in the loop.
Some notion of semantics for mappings.
Extensibility:
– Solution should exploit additional background knowledge.
“Memory”, knowledge reuse:
– System should exploit previous manual or automatically generated matchings.
– Key idea behind LSD.
LSD Overview
L(earning) S(ource) D(escriptions)
Problem: generating semantic mappings between a mediated schema and a large set of data source schemas.
Key idea: generate the first mappings manually, and learn from them to generate the rest.
Technique: multi-strategy learning (extensible!)
Step 1:
– [SIGMOD, 2001]: 1-1 mappings between XML DTDs.
Current focus:
– Complex mappings
– Ontology mapping.
Outline
Overview of structure mapping
Data integration and source mappings
LSD architecture and details
Experimental results
Current work
Data Integration
Find houses with four bathrooms priced under $500,000
mediated schema
Query reformulation and optimization.
source schema 1
wrappers
source schema 2
source schema 3
realestate.com
homeseekers.com
homes.com
Applications: WWW, enterprises, science projects
Techniques: virtual data integration, warehousing, custom code.
Semantics (preliminary)
Semantics of mappings has received no attention.
Semantics of 1-1 mappings. Given:
– R(A1,…,An) and S(B1,…,Bm)
– 1-1 mappings (Ai,Bj)
Then, we postulate the existence of a relation W, s.t.:
– Π_{C1,…,Ck}(W) = Π_{A1,…,Ak}(R),
– Π_{C1,…,Ck}(W) = Π_{B1,…,Bk}(S),
– W also includes the unmatched attributes of R and S.
In English: R and S are projections on some universal relation W, and the mappings specify the projection variables and correspondences.
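The universal-relation reading can be checked on a toy instance. The following is a minimal sketch, not part of LSD: the relation W, its attribute names ("place", "cost", "yard") and the helper project are all hypothetical, chosen only to show that the matched attributes of R and S project to the same tuples as W.

```python
def project(relation, attrs):
    """Relational projection with set semantics: keep only attrs, drop duplicates."""
    return {tuple(row[a] for a in attrs) for row in relation}

# Hypothetical universal relation W; its attribute names are made up for this sketch.
W = [
    {"place": "Miami, FL", "cost": "$250,000", "phone": "(305) 729 0831", "yard": "large"},
    {"place": "Boston, MA", "cost": "$110,000", "phone": "(617) 253 1429", "yard": "small"},
]

# R and S expose the matched attributes under their own names,
# plus one unmatched attribute each that also lives in W.
R = [{"location": w["place"], "listed-price": w["cost"], "phone": w["phone"]} for w in W]
S = [{"address": w["place"], "price": w["cost"], "yard": w["yard"]} for w in W]

# 1-1 mappings (Ai, Bj) and the W attributes (C1, C2) they correspond to.
mappings = [("location", "address"), ("listed-price", "price")]
C = ["place", "cost"]
A = [a for a, _ in mappings]
B = [b for _, b in mappings]

# Both projection equalities from the slide hold on this instance.
holds = project(W, C) == project(R, A) == project(S, B)
print(holds)
```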
Why Matching is Difficult
Aims to identify same real-world entity
– using names, structures, types, data values, etc
Schemas represent same entity differently
– different names => same entity:
– area & address => location
– same names => different entities:
– area => location or square-feet
Schema & data never fully capture semantics!
– not adequately documented, not sufficiently expressive
Intended semantics is typically subjective!
– IBM Almaden Lab = IBM?
Cannot be fully automated.
Often hard for humans.
Committees are required!
Current State of Affairs
Finding semantic mappings is now the bottleneck!
– largely done by hand
– labor intensive & error prone
– GTE: 4 hours/element for 27,000 elements [Li&Clifton00]
Will only be exacerbated
– data sharing & XML become pervasive
– proliferation of DTDs
– translation of legacy data
– reconciling ontologies on semantic web
Need semi-automatic approaches to scale up!
The LSD Approach
User manually maps a few data sources to the mediated schema.
LSD learns from the mappings, and proposes mappings for the rest of the sources.
Several types of knowledge are used in learning:
– Schema elements, e.g., attribute names
– Data elements: ranges, formats, word frequencies, value frequencies, length of texts.
– Proximity of attributes
– Functional dependencies, number of attribute occurrences.
One learner does not fit all. Use multiple learners and combine with meta-learner.
Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments

realestate.com
location     listed-price   phone            comments
Miami, FL    $250,000       (305) 729 0831   Fantastic house
Boston, MA   $110,000       (617) 253 1429   Great location
...          ...            ...              ...

homes.com
price        contact-phone    extra-info
$550,000     (278) 345 7215   Beautiful yard
$320,000     (617) 335 2315   Great beach
...          ...              ...

Learned hypotheses:
– If “phone” occurs in the name => agent-phone
– If “fantastic” & “great” occur frequently in data values => description
Multi-Strategy Learning
Use a set of base learners:
– Name learner, Naïve Bayes, Whirl, XML learner
And a set of recognizers:
– County name, zip code, phone numbers.
Each base learner produces a prediction weighted by confidence score.
Combine base learners with a meta-learner, using stacking.
Base Learners
Name Learner
– training examples: (contact-info, office-address), (contact, agent-phone), (phone, agent-phone), (listed-price, price)
– new element: (contact-phone, ?)
– prediction: contact-phone => (agent-phone,0.7), (office-address,0.3)
Naive Bayes Learner [Domingos&Pazzani 97]
– “Kent, WA” => (address,0.8), (name,0.2)
Whirl Learner [Cohen&Hirsh 98]
XML Learner
– exploits hierarchical structure of XML data
Training the Base Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data instances: Miami, FL; $250,000; (305) 729 0831; Fantastic house; Boston, MA; $110,000; (617) 253 1429; Great location
Name Learner training examples:
– (location, address), (listed-price, price), (phone, agent-phone), ...
Naive Bayes Learner training examples:
– (“Miami, FL”, address), (“$ 250,000”, price), (“(305) 729 0831”, agent-phone), ...
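To make the training step concrete, here is a minimal sketch of a token-based Naive Bayes learner trained on (data value, mediated-schema element) pairs like the ones above. It is an illustration under assumptions, not LSD's implementation: the tokenizer, the Laplace smoothing, and the class name NaiveBayesLearner are choices made for this example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split a data value into lowercase words, digit runs, and single symbols."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())

class NaiveBayesLearner:
    """Multinomial Naive Bayes over tokens, with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()             # training examples per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocab = set()

    def train(self, examples):
        for value, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(value):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, value):
        """Return a dict label -> confidence score; scores sum to 1."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(value):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        unnormalized = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(unnormalized.values())
        return {l: v / z for l, v in unnormalized.items()}

learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"), ("Boston, MA", "address"), ("Seattle, WA", "address"),
    ("$ 250,000", "price"), ("$110,000", "price"),
    ("(305) 729 0831", "agent-phone"), ("(617) 253 1429", "agent-phone"),
])
prediction = learner.predict("Kent, WA")
print(max(prediction, key=prediction.get))
```

With so few training values the confidence scores are only indicative; the point is the shape of the input (value, label) pairs and of the output (label, score) pairs.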
Entity Recognizers
Use pre-programmed knowledge to identify specific types of entities
– date, time, city, zip code, name, etc.
– house-area (30 X 70, 500 sq. ft.)
– county-name recognizer
Recognizers often have nice characteristics
– easy to construct
– many off-the-shelf research & commercial products
– applicable across many domains
– help with special cases that are hard to learn
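A recognizer can be as small as a regular expression. The sketch below is a hypothetical illustration: the patterns assume the phone format used in the examples, "(305) 729 0831", and US five-digit zip codes, and the function names and confidence values are invented for this example.

```python
import re

# Assumed formats: "(305) 729 0831" for phones, "98195" or "98195-1234" for zip codes.
PHONE_RE = re.compile(r"^\(\d{3}\) \d{3} \d{4}$")
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def recognize(value):
    """Return (entity type, confidence) for a data value, or None if nothing matches."""
    value = value.strip()
    if PHONE_RE.match(value):
        return ("phone", 1.0)
    if ZIP_RE.match(value):
        return ("zip-code", 1.0)
    return None

def column_score(values, entity_type):
    """Fraction of a column's values that the recognizer labels as entity_type."""
    hits = sum(1 for v in values if (recognize(v) or (None,))[0] == entity_type)
    return hits / len(values) if values else 0.0

print(recognize("(305) 729 0831"))
print(column_score(["(278) 345 7215", "(617) 335 2315", "call agent"], "phone"))
```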
Meta-Learner: Stacking
Training of meta-learner produces a weight for every pair of:
– (base-learner, mediated-schema element)
– weight(Name-Learner,address) = 0.1
– weight(Naive-Bayes,address) = 0.9
Combining predictions with the meta-learner:
– computes weighted sum of base-learner confidence scores
Example, for the instance “Seattle, WA”:
– Name Learner: (address,0.6)
– Naive Bayes: (address,0.8)
– Meta-Learner: (address, 0.6*0.1 + 0.8*0.9 = 0.78)
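The weighted sum is short enough to write out. This sketch uses the weights and scores from the slide; the function name combine and the dictionary layout are assumptions made for illustration.

```python
def combine(predictions, weights):
    """Weighted sum of base-learner confidence scores, per mediated-schema element.

    predictions: {learner: {element: confidence}}
    weights:     {(learner, element): weight}; missing pairs count as weight 0.
    """
    combined = {}
    for learner, scores in predictions.items():
        for element, confidence in scores.items():
            w = weights.get((learner, element), 0.0)
            combined[element] = combined.get(element, 0.0) + w * confidence
    return combined

weights = {("Name-Learner", "address"): 0.1, ("Naive-Bayes", "address"): 0.9}
predictions = {
    "Name-Learner": {"address": 0.6},
    "Naive-Bayes": {"address": 0.8},
}
result = combine(predictions, weights)
print(round(result["address"], 2))  # 0.6*0.1 + 0.8*0.9 = 0.78
```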
Training the Meta-Learner
For address:

Extracted XML instance   Name Learner   Naive Bayes   True prediction
Miami, FL                0.5            0.8           1
$250,000                 0.4            0.3           0
Seattle, WA              0.3            0.9           1
Kent, WA                 0.6            0.8           1
3                        0.3            0.3           0
...                      ...            ...           ...

Least-Squares Linear Regression
Weight(Name-Learner,address) = 0.1
Weight(Naive-Bayes,address) = 0.9
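The regression step can be sketched for the two-learner case by solving the 2x2 normal equations directly. This is an illustrative assumption about the setup (no intercept term, plain least squares), and the five rows shown on the slide are only a fragment of the training data, so fitting them alone is not expected to reproduce the 0.1 and 0.9 weights.

```python
def fit_two_weights(name_scores, bayes_scores, truth):
    """Least-squares weights (w_name, w_bayes) minimizing
    sum((w_name * n + w_bayes * b - y) ** 2), with no intercept term."""
    a11 = sum(n * n for n in name_scores)
    a12 = sum(n * b for n, b in zip(name_scores, bayes_scores))
    a22 = sum(b * b for b in bayes_scores)
    b1 = sum(n * y for n, y in zip(name_scores, truth))
    b2 = sum(b * y for b, y in zip(bayes_scores, truth))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("learner scores are collinear; weights are not unique")
    # Cramer's rule on the normal equations [[a11, a12], [a12, a22]] w = [b1, b2].
    w_name = (b1 * a22 - a12 * b2) / det
    w_bayes = (a11 * b2 - a12 * b1) / det
    return w_name, w_bayes

# Confidence scores for "address" on the five instances shown on the slide.
name_scores = [0.5, 0.4, 0.3, 0.6, 0.3]
bayes_scores = [0.8, 0.3, 0.9, 0.8, 0.3]
truth = [1, 0, 1, 1, 0]  # 1 if the instance really is an address

w_name, w_bayes = fit_two_weights(name_scores, bayes_scores, truth)
print(w_name, w_bayes)
```

On this fragment the fit puts far more weight on Naive Bayes than on the Name Learner, which is the qualitative pattern the slide's weights show.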
Applying the Learners
Schema of homes.com: area, day-phone, extra-info
Mediated schema: address, price, agent-phone, description
Each instance goes through Name Learner and Naive Bayes, then the Meta-Learner:
– area (“Seattle, WA”, “Kent, WA”, “Austin, TX”): (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3); combined: (address,0.7), (description,0.3)
– day-phone (“(278) 345 7215”, “(617) 335 2315”, “(512) 427 1115”): (agent-phone,0.9), (description,0.1)
– extra-info (“Beautiful yard”, “Great beach”, “Close to Seattle”): (description,0.8), (address,0.2)
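The per-instance predictions for a source element have to be merged into one prediction for the element. A simple way to do that, assumed here because it agrees with the numbers on the slide (0.8, 0.6 and 0.7 give 0.7 for address), is to average the scores per label. The function name is hypothetical.

```python
from collections import defaultdict

def combine_instance_predictions(instance_predictions):
    """Average per-instance {label: score} dicts into one element-level prediction."""
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, score in prediction.items():
            totals[label] += score
    count = len(instance_predictions)
    return {label: total / count for label, total in totals.items()}

# Meta-learner output for three instances of the homes.com element "area".
area_predictions = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area = combine_instance_predictions(area_predictions)
print({label: round(score, 2) for label, score in area.items()})
```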
The Constraint Handler
Extends learning to incorporate constraints
– hard constraints
– a = address & b = address => a = b
– a = house-id => a is a key
– a = agent-info & b = agent-name => b is nested in a
– soft constraints
– a = agent-phone & b = agent-name => a & b are usually close to each other
– user feedback = hard or soft constraints
Details in [Doan et al., SIGMOD 2001]
The Current LSD System
Training Phase
Mediated schema Source schemas Domain Constraints User Feedback
Base-Learner1 Base-Learnerk Meta-Learner
Matching Phase
Data listings
Constraint Handler
Mappings
Empirical Evaluation
Four domains
– Real Estate I & II, Course Offerings, Faculty Listings
For each domain
– create mediated DTD & domain constraints
– choose five sources
– extract & convert data listings into XML (faithful to schema!)
– mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
Ten runs for each experiment - in each run:
– manually provide 1-1 mappings for 3 sources
– ask LSD to propose mappings for remaining 2 sources
– accuracy = % of 1-1 mappings correctly identified
Matching Accuracy
[Figure: bar chart of average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings]
LSD’s accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%
Sensitivity to Amount of Available Data
[Figure: average matching accuracy (%) vs. number of data listings per source (Real Estate I), from 0 to 500 listings]
Contribution of Schema vs. Data
[Figure: bar chart of average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings, comparing LSD with only schema info, LSD with only data info, and the complete LSD]
More experiments in the paper [Doan et al. 01]
Reasons for Incorrect Matching
Unfamiliarity
– suburb
– solution: add a suburb-name recognizer
Insufficient information
– correctly identified general type, failed to pinpoint exact type
– example instances: “Richard Smith”, “(206) 234 5412”
– solution: add a proximity learner
Subjectivity
– house-style = description?
Moving Up the Expressiveness Ladder
Schemas are very simple ontologies.
More expressive power = More domain constraints.
Mappings become more complex, but constraints provide more to learn from.
Non 1-1 mappings:
– F1(A1,…,Am) = F2(B1,…,Bm)
Ontologies (of various flavors):
– Class hierarchy (i.e., containment on unary relations)
– Relationships between objects
– Constraints on relationships
Finding Non 1-1 Mappings (Current Work)
Given two schemas, find
– 1-many mappings: address = concat(city,state)
– many-1: half-baths + full-baths = num-baths
– many-many: concat(addr-line1,addr-line2) = concat(street,city,state)
1-many mappings
– expressed as query
– value correspondence expression: room-rate = rate * (1 + tax-rate)
– relationship: state of tax-rate = state of hotel that has rate
– special case: 1-many mappings between two relational tables
Mediated schema: address, description, num-baths
Source schema: city, state, comments, half-baths, full-baths
Brute-Force Solution
Define a set of operators
– concat, +, -, *, /, etc
For each set of mediated-schema columns
– enumerate all possible mappings
– evaluate & return best mapping
Mediated-schema columns Source-schema columns
m1 m1, m2, ..., mk
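The brute-force idea can be sketched for two-column mappings. Everything specific here is an assumption made for illustration: only two operators are defined, candidates combine exactly two source columns, and a candidate is scored by the fraction of rows where it reproduces the mediated-schema column, which presumes row-aligned sample data that a real system would not have.

```python
from itertools import permutations

def concat(a, b):
    """String concatenation with a ', ' separator; undefined for non-strings."""
    return f"{a}, {b}" if isinstance(a, str) and isinstance(b, str) else None

def add(a, b):
    """Integer addition; undefined for non-integers."""
    return a + b if isinstance(a, int) and isinstance(b, int) else None

OPERATORS = {"concat": concat, "+": add}

def best_mapping(target, source):
    """Enumerate op(col1, col2) over all ordered column pairs; return (score, mapping)."""
    best = (0.0, None)
    for (name1, col1), (name2, col2) in permutations(source.items(), 2):
        for op_name, op in OPERATORS.items():
            produced = [op(a, b) for a, b in zip(col1, col2)]
            score = sum(p == t for p, t in zip(produced, target)) / len(target)
            if score > best[0]:
                best = (score, f"{op_name}({name1}, {name2})")
    return best

source = {
    "city": ["Miami", "Boston"],
    "state": ["FL", "MA"],
    "half-baths": [1, 0],
    "full-baths": [2, 1],
}
print(best_mapping(["Miami, FL", "Boston, MA"], source))
print(best_mapping([3, 1], source))
```

Even in this toy form the cost is visible: the number of candidates grows with the number of operators times the number of ordered column tuples, which is what motivates the search-based solution.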
Search-Based Solution
States = columns
– goal state: mediated-schema column
– initial states: all source-schema columns
– use 1-1 matching to reduce the set of initial states
Operators: concat, +, -, *, /, etc.
Column-similarity:
– use all base learners + recognizers
Multi-Strategy Search
Use a set of expert modules: L1, L2, ..., Ln
Each module
– applies to only certain types of mediated-schema column
– searches a small subspace
– uses a cheap similarity measure to compare columns
Example
– L1: text; concat; TF/IDF
– L2: numeric; +, -, *, /; [Ho et al. 2000]
– L3: address; concat; Naive Bayes
Search techniques
– beam search as default
– specialized, do not have to materialize columns
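As a rough illustration of one text module, the sketch below runs a beam search over concatenations of source columns. It is not the module described on the slide: the similarity measure is a positional token match invented for this example (the slide names TF/IDF), columns are materialized at every step, and the sample data and function names are hypothetical.

```python
def token_match(produced, target):
    """Cheap, order-sensitive similarity: share of token positions that agree."""
    p, t = produced.lower().split(), target.lower().split()
    same = sum(1 for a, b in zip(p, t) if a == b)
    return same / max(len(p), len(t), 1)

def similarity(state, source, target):
    """Average similarity between concat(state columns) and the target column."""
    rows = zip(*(source[name] for name in state))
    produced = [" ".join(parts) for parts in rows]
    return sum(token_match(p, t) for p, t in zip(produced, target)) / len(target)

def beam_search(source, target, beam_width=2, max_depth=3):
    """States are tuples of column names; a step appends one unused column."""
    beam = sorted(((similarity((n,), source, target), (n,)) for n in source),
                  reverse=True)[:beam_width]
    best = beam[0]
    for _ in range(max_depth - 1):
        candidates = []
        for _, state in beam:
            for name in source:
                if name not in state:
                    new_state = state + (name,)
                    candidates.append((similarity(new_state, source, target), new_state))
        if not candidates:
            break
        beam = sorted(candidates, reverse=True)[:beam_width]
        if beam[0][0] > best[0]:
            best = beam[0]
    return best

source = {
    "street": ["12 Main St", "9 Oak Ave"],
    "city": ["Miami", "Boston"],
    "state": ["FL", "MA"],
    "comments": ["Fantastic house", "Great location"],
}
target = ["12 Main St Miami FL", "9 Oak Ave Boston MA"]
print(beam_search(source, target))
```

Keeping only beam_width states per level is what makes each module search a small subspace; the price is that a good mapping can be pruned if its prefix scores poorly.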
Multi-Strategy Search (cont’d)
Apply all applicable expert modules
L1: m11, m12, m13, ..., m1x
L2: m21, m22, m23, ..., m2y
L3: m31, m32, m33, ..., m3z
Combine modules’ predictions & select the best one
m11, m12, m21, m22, m31, m32
m11
Related Work
Recognizers + Schema + 1-1 Matching:
– TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98]
Hybrid + 1-1 Matching:
– CUPID [Madhavan et al. 01]
Single Learner + 1-1 Matching:
– SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95], DELTA [Clifton et al. 97]
Multi-Strategy Learning, Learners + Recognizers, Schema + Data, 1-1 + non 1-1 Matching:
– LSD [Doan et al. 2000, 2001]
Schema + Data, 1-1 + non 1-1 Matching, Sophisticated Data-Driven User Interaction:
– CLIO [Miller et al. 00], [Yan et al. 01]
Summary
LSD:
– uses multi-strategy learning to semi-automatically generate semantic mappings.
– is extensible and incorporates domain and user knowledge, and previous techniques.
– Experimental results show the approach is very promising.
Future work and issues to ponder:
– Accommodating more expressive languages: ontologies
– Reuse of learned concepts from related domains.
– Semantics?
Data management is a fertile area for Machine Learning research!
Backup Slides
Mapping Maintenance
Mediated-schema M Source-schema S
m1 m2 m3
Ten months later ...
– are the mappings still correct?
Mediated-schema M’ Source-schema S’
m1 m2 m3
Information Extraction from Text
Extract data fragments from text documents
– date, location, & victim’s name from a news article
Intensive research on free-text documents
Many documents do have substantial structure
– XML pages, name card, tables, list
Each such document = a data source
– structure forms a schema
– only one data value per schema element
– “real” data source has many data values per schema element
Ongoing research in the IE community
Contribution of Each Component
[Figure: bar chart of average matching accuracy (%) for Real Estate I, Course Offerings, Faculty Listings, and Real Estate II, comparing LSD without Name Learner, without Naive Bayes, without Whirl Learner, without Constraint Handler, and the complete LSD system]
Exploiting Hierarchical Structure
Existing learners flatten out all structures
Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors.
Gail Murphy MAX Realtors
Developed XML learner
– similar to the Naive Bayes learner
– input instance = bag of tokens
– differs in one crucial aspect
– consider not only text tokens, but also structure tokens
Domain Constraints
Impose semantic regularities on sources
– verified using schema or data
Examples
– a = address & b = address => a = b
– a = house-id => a is a key
– a = agent-info & b = agent-name => b is nested in a
Can be specified up front
– when creating mediated schema
– independent of any actual source schema
The Constraint Handler
Predictions from Meta-Learner
– area: (address,0.7), (description,0.3)
– contact-phone: (agent-phone,0.9), (description,0.1)
– extra-info: (address,0.6), (description,0.4)
Domain Constraints
– a = address & b = address => a = b
Candidate mapping combinations and their scores

area              contact-phone     extra-info        score
address 0.7       agent-phone 0.9   address 0.6       0.378
address 0.7       agent-phone 0.9   description 0.4   0.252
description 0.3   description 0.1   description 0.4   0.012

Can specify arbitrary constraints
User feedback = domain constraint
– ad-id = house-id
Extended to handle domain heuristics
– a = agent-phone & b = agent-name => a & b are usually close to each other
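The scores on this slide are products of the per-element confidences (0.7 * 0.9 * 0.6 = 0.378), and the constraint rules out the top-scoring combination because it maps two elements to address. The sketch below reproduces that selection by exhaustive enumeration, which is an assumption made for clarity and would not scale to large schemas; the function names are hypothetical.

```python
from itertools import product

def best_consistent_mapping(predictions, constraints):
    """Try every combination of labels; return the highest-scoring one that
    satisfies all constraints, scored as the product of confidences."""
    elements = list(predictions)
    best_assignment, best_score = None, 0.0
    for labels in product(*(predictions[e] for e in elements)):
        assignment = dict(zip(elements, labels))
        if not all(check(assignment) for check in constraints):
            continue
        score = 1.0
        for element, label in assignment.items():
            score *= predictions[element][label]
        if score > best_score:
            best_assignment, best_score = assignment, score
    return best_assignment, best_score

def at_most_one_address(assignment):
    """Hard constraint: a = address & b = address => a = b."""
    return list(assignment.values()).count("address") <= 1

predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}
mapping, score = best_consistent_mapping(predictions, [at_most_one_address])
print(mapping, round(score, 3))
```

User feedback such as ad-id = house-id fits the same interface: it is one more function that accepts or rejects an assignment.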