
Putting Context into Schema Matching


Philip Bohannon*
Yahoo! Research

Wenfei Fan
Univ of Edinburgh and Bell Labs

Eiman Elnahrawy*
Rutgers University

Michael Flaster*
Google

*Work performed at Lucent Technologies -- Bell Laboratories.

Overview
1. Motivation
2. Background
3. Strawman
4. Framework
5. Experimental Evaluation
6. Related Work
7. Conclusions

Slide 2

Schema Matching vs. Schema Mapping
Schema Matching means "computer-suggested arrows"

[Diagram: arrows from the source schema RS.Person (First, Last, City, ...) to the target schema RT.Student (Name, Address, City, ...), each annotated with a confidence score such as .88, .93, .97]

- Arrows inferred based on meta-data or sample instance data
- Associated confidence score
- Meaning (variant of): RS.Person.City → RT.Student.City
Slide 3

Schema Mapping: “From Arrows to Queries”
[Diagram: the arrows from RS.Person (First, Last, City, ...) to RT.Student (Name, Address, City, ...) are realized as a query Q: RS -> RT]

    select concat(First, " ", Last) as Name, City as City
    from RS.Person, RS.Education, ...
    where ...

- Given a set of arrows as user input, produce a query that maps instances of RS into instances of RT
- Transformations, joins [Miller, Haas, Hernandez, VLDB 2000] added by, or with help from, the user
- Most of this talk is about matching, some implications for mapping later
Slide 4

Motivation: inventory mapping example
- Consider integrating two inventory schemas
- Books, music in separate tables in RT
- Run some nice schema match software

Source schema:
    RS.inv(id: integer, name: string, code: string, type: integer, instock: string, descr: string, arrival: date)
Target schemas:
    RT.book(title: string, isbn: string, price: float, format: string)
    RT.music(title: string, asin: string, price: float, sale: float, label: string)

Slide 5

Inventory where clause
- The lines are helpful (schema matching is a best-effort affair), but...
- the lines are semantically correct only in the context of a selection condition

[Diagram: RS.inv(id, name, code, type, instock, descr, arrival) matches RT.book(title, isbn, price, format) under "where type = 1" and RT.music(title, asin, price, sale, label) under "where type = 2"]
Slide 6

Definition and Goals
Contextual schema match: an arrow between source and target schema elements, annotated with a selection condition

    M: RS.aa → RT.bb  [RS.c = 3]

- In a standard schema match, the condition "true" is always used

    M: RS.aa → RT.bb  [true]

- Goal: Adapt instance-driven schema matching techniques to infer semantically valid contextual schema matches, and create schema maps from those matches
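A contextual match is just a standard match plus a condition; a minimal sketch of the triple as a data structure (Python, with illustrative field names not taken from the paper):

    from dataclasses import dataclass

    @dataclass
    class ContextualMatch:
        """An arrow from a source element to a target element, annotated
        with the selection condition under which it is semantically valid."""
        source: str          # e.g. "RS.inv.name"
        target: str          # e.g. "RT.book.title"
        condition: str       # e.g. "type = 1"; a standard match uses "true"
        score: float = 0.0   # matcher confidence

    # A standard match is the special case with condition "true":
    standard = ContextualMatch("RS.Person.City", "RT.Student.City", "true", 0.97)
    # A contextual match restricts the arrow to a selection view:
    contextual = ContextualMatch("RS.inv.name", "RT.book.title", "type = 1", 0.93)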

Slide 7

Attribute promotion example
- Consider integrating data about grade assignments [Fletcher, Wyss, SIGMOD 2005 demo]
- Again context is needed, but the semantics are slightly different: attribute promotion

Source table:

    Name  Assgn  Grade
    Joe   1      84
    Joe   2      86
    Joe   3      75
    Mary  1      92
    Mary  2      94
    Mary  3      85

Target table: Name, Grade1, Grade2, Grade3, ... where Grade maps to Grade1 under "where Assgn = 1", to Grade2 under "where Assgn = 2", to Grade3 under "where Assgn = 3", and so on.
Slide 8

Overview
1. Motivation
2. Background
3. Strawman
4. Framework
5. Experimental Evaluation
6. Related Work
7. Conclusions

Slide 9

Background: Instance-level matching
    Dubious, at best!   M: RS.ac → RT1.bb  [true]
    Nice match!         M: RS.ac → RT1.ac  [true]

[Instance data: a source column of city names (San Jose, Cupertino, Palo Alto, Gilroy, Pleasanton, Sunnyvale), a source column of phone numbers ((408) 123-4456, (212) 223-3455, (408) 123-2222, (408) 324-4444), and a target column of city names (Sunnyvale, Los Angeles, Cupertino, Gilroy, San Diego); pairing phone numbers with city names is the dubious match, pairing city names with city names is the nice one]
Slide 10

Background: Instance-level matching
    Dubious, at best!   M: RS.ac → RT1.bb  [true]
    Perfect match!      M: RS.ac → RT1.ac  [true]

[Diagram: a battery of component matchers feeds each arrow's score: Bayesian Tri-gram, Type Expert, String Edit Distance, Cosine Similarity, "Whatever", "More Whatever"]

Coming up with a good score is far from simple!
- Derive comparable scores across sample size, data types, etc.

Slide 11

StandardMatch(RS, RT, θ)

1. Consider all |RS|·|RT| matches, score them, normalize the scores
2. Rank by normalized score
3. Apply θ as a cutoff, and return

[Diagram: the full scored list (RS.bd → RT1.ad, RS.db → RT1.ar, RS.ba → RT1.cd, RS.ac → RT1.ac, RS.ac → RT1.vw, RS.ba → RT1.sb, each with condition true) is re-ranked by normalized score, with RS.ac → RT1.ac on top, and truncated at θ]
Slide 12

Background: Categorical Attributes
- What attributes are candidates for the where clause?
- We focus on "categorical" attributes (leaving non-categorical attributes as future work)
- If not identified by the schema, infer from sample data, as any attribute with
  - more than 1 value
  - most values associated with more than one tuple
- (In RS.inv, this picks out attributes like type and instock, not id, code, or descr)
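A sketch of that inference rule over a sampled column (the function name and the "most values" threshold are illustrative choices):

    from collections import Counter

    def is_categorical(values, repeat_fraction=0.5):
        """Categorical if the column takes more than one value and most
        of its values occur in more than one tuple."""
        counts = Counter(values)
        if len(counts) < 2:
            return False
        repeated = sum(1 for c in counts.values() if c > 1)
        return repeated / len(counts) >= repeat_fraction

    print(is_categorical([1, 2, 1, 1, 2]))                   # type -> True
    print(is_categorical(["0195128", "B002UAX", "0486611"])) # code -> False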

Slide 13

Overview
1. Motivation
2. Background
3. Strawman
4. Framework
5. Experimental Evaluation
6. Related Work
7. Conclusions

Slide 14

Strawman Algorithm
1. Use an instance-based matching algorithm to compute a set of matches, L = M1..Mn, along with associated scores
2. For each Mi in L, of the form (RS.s, RT.t, true):
   - For each categorical attribute c in the source (or target):
     - For each value v taken by c in the sample:
       - Restrict the sample of RS to tuples where c = v
       - Re-compute the match score on the new sample
3. For the (c, v) that most improves the score, replace Mi with (RS.s, RT.t, c=v)
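A runnable rendering of the strawman (Python; the match-tuple layout and the score_match helper signature are assumptions):

    def strawman(sample, matches, categorical, score_match):
        """For each match, try every (attribute, value) restriction of the
        source sample and keep the single condition that helps most."""
        out = []
        for (src, tgt, _cond, base_score) in matches:
            best_cond, best_score = "true", base_score
            for c in categorical:
                for v in {row[c] for row in sample}:
                    view = [r for r in sample if r[c] == v]  # restrict sample
                    s = score_match(view, src, tgt)          # re-score on view
                    if s > best_score:
                        best_cond, best_score = f"{c}={v!r}", s
            out.append((src, tgt, best_cond, best_score))
        return out

As the next slides show, this over-fires: some restriction of the sample almost always scores higher by chance.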

Slide 15

ContextMatch(RS, RT, θ)

(Steps 1-3 are StandardMatch: score, rank by normalized score, apply θ as a cutoff.)

4. Try each context condition, e.g. RS.c = 2; RS.t = 1; RS.d = "open"; RS.c = 2 or RS.c = 3; RS.t = 0
5. Evaluate the quality of each match, e.g. RS.ba → RT1.cd  [RS.t = 1]
6. Keep the best!

[Diagram: the ranked standard matches (RS.ac → RT1.ac, RS.ba → RT1.cd, RT2.sb, RT2.ar, RT1.vw, RS.bd → RT1.ad) re-scored under each candidate condition]
Slide 16

Problems with Strawman
- False Positives
  - the increase in score may not be meaningful, since some random subsets of the corpus will match better than the whole (even with size-adjusted metrics)
- False Negatives
  - the original matching algorithm only returned matches with quality above some threshold θ to be in L, but a match that didn't make the cut may improve greatly with contextual matching
- Time
  - with disjuncts, there are too many expressions to test

Slide 17

Strawman 2.0
- Like Strawman, but require an improvement threshold, w, to cut down on false positives
- Not much better
- Setting w is problematic, as matcher scores are not perfect

Slide 18

Overview
1. Motivation
2. Background
3. Strawman
4. Framework
5. Experimental Evaluation
6. Related Work
7. Conclusions

Slide 19

Our approach:
1. Pre-filter conditions based on classification
2. Find conditions that improve several matches from the same table

[Schemas: RS.inv(id, name, code, type, instock, descr, arrival) against RT.book(title, isbn, price, format) and RT.music(title, asin, price, sale, label)]

Slide 20

View-oriented contextual mapping (cont'd)

[Diagram: RS.inv(id, name, code, type, instock, descr, arrival) is split into two selection views, "RS.inv where type = 1" and "RS.inv where type = 2"; the type = 1 view matches RT.book(title: string, isbn: string, price: float, format: string) and the type = 2 view matches RT.music(title: string, asin: string, price: float, sale: float, label: string)]

Slide 21

Algorithm ContextMatch(RS, RT, θ)

    L = ∅; M = StandardMatch(RS, RT, θ);
    C = InferCandidateViews(RS, M, EarlyDisjuncts);
    for c ∈ C do
        Vc = select * from RS where c;
        for m ∈ M do
            m' := m with RS replaced by Vc;
            s := ScoreMatch(m');
            L = L ∪ {(m', s)};
    return SelectContextualMatches(M, L, EarlyDisjuncts)
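A self-contained Python rendering of this loop; the candidate-view inference is reduced to enumerating (attribute, value) conditions, and score_match/select_contextual are assumed helpers (the classifier-based filtering of InferCandidateViews is elided):

    def context_match(sample, matches, categorical, score_match,
                      select_contextual, early_disjuncts=True):
        """Re-score every standard match against each candidate selection
        view of the source, then select the winning contextual matches."""
        # InferCandidateViews stand-in: one condition per (attribute, value).
        C = [(c, v) for c in categorical for v in {row[c] for row in sample}]
        L = []
        for attr, value in C:
            view = [r for r in sample if r[attr] == value]  # Vc = "where attr = value"
            for (src, tgt, _cond, _score) in matches:
                s = score_match(view, src, tgt)             # score m' on the view
                L.append(((src, tgt, f"{attr}={value!r}"), s))
        return select_contextual(matches, L, early_disjuncts)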

Slide 22

ContextMatch(RS, RT, θ)

(Steps 1-3 are StandardMatch: score, rank by normalized score, apply θ as a cutoff.)

InferCandidateViews proposes candidate conditions, e.g. RS.c = 2; RS.d = "open"; RS.c = 2 or RS.c = 3; RS.t = 0; RS.t = 1

For each candidate view V:
4. Re-compute summaries for V, e.g. as "select * from RS where RS.t = 1"
5. Evaluate the quality of the matches, e.g. RS.ba → RT1.cd  [RS.t = 1]

[Diagram: the ranked standard matches re-scored under each candidate view]
Slide 23

How to Filter Candidate Views
- Naïve
  - Any Boolean condition involving a categorical attribute (the strawman approach)
- SourceClassifier, TargetClassifier
  - Check for categorical attributes that do a "good job" categorizing other attributes
- Disjunct Handling (early or late)
- Conjunct Handling

Slide 24

Source Classifier Intuition
RS.inv sample:

    id  name               type  instock  code     descr
    0   leaves of grass    1     y        0195128  hardcover
    1   the white album    2     y        B002UAX  audio cd
    2   heart of darkness  1     n        0486611  paperback
    3   wasteland          1     y        039995   paperback
    4   hotel california   2     n        B002GVO  electra

- How well do the categorical attributes serve as classifier labels for the other attributes?

Slide 25

Source Classifier Intuition: type
(Same RS.inv sample as the previous slide.)

- How about 'type'? It separates the other attributes cleanly: type = 1 rows have descr hardcover/paperback and numeric codes, type = 2 rows have descr audio cd/electra and codes starting with "B"

Slide 26

Source Classifier Intuition: instock
(Same RS.inv sample.)

- How about 'instock'? It is categorical too, but its y/n values do not predict the other attributes

Slide 27

What do we really mean by a "good job"?
- Split the sample into a training set and a testing set (randomly)
- For each categorical attribute C and non-categorical attribute A
  - Train a classifier H by treating the value of A as the document and the value of C as the label
  - Test H against the test set; determine precision, p, and recall, r
  - Score(C) w.r.t. A is based on a combination of precision and recall (F = 2pr/(p+r))
  - Compare Score(C) to Score(NC), where NC is a Naïve Classifier that always chooses the most frequent label
  - C does a good job with H if H's improvement over Naïve is statistically significant with 95% confidence
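A minimal sketch of this test using scikit-learn (the library, the char-n-gram featurization, and the bare F comparison are assumptions; the paper treats the classifier as a black box, and the 95% significance test is elided here):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score

    def classifier_vs_naive(docs, labels):
        """Train H to predict categorical attribute C (labels) from
        non-categorical attribute A (docs); compare F against the Naive
        Classifier that always guesses the most frequent label."""
        Xtr, Xte, ytr, yte = train_test_split(docs, labels,
                                              test_size=0.5, random_state=0)
        vec = CountVectorizer(analyzer="char", ngram_range=(1, 3))
        H = MultinomialNB().fit(vec.fit_transform(Xtr), ytr)
        f_h = f1_score(yte, H.predict(vec.transform(Xte)), average="micro")
        majority = max(set(ytr), key=ytr.count)        # Naive Classifier NC
        f_nc = f1_score(yte, [majority] * len(yte), average="micro")
        return f_h, f_nc   # require f_h > f_nc, with statistical significance

    descr = ["hardcover", "audio cd", "paperback", "paperback", "electra"]
    types = ["1", "2", "1", "1", "2"]
    print(classifier_vs_naive(descr, types))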
Slide 28

Target Classifier Intuition
[Illustration: the trained target classifier assigns each source descr value a best-guess target attribute, e.g. labels drawn from Book.comment and Music.label]

- Train a new classifier, T, treating each target schema attribute as a class of documents
- Check source values against this classifier
- Label each value with the best-guess label
- Use labels instead of values in the same framework
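The same machinery, pointed at target instance data, gives a sketch of that labeling step (the toy training rows are invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Train T on target instance data: each target attribute is a document class.
    target_docs = ["hardcover reprint", "softcover", "emi", "electra records"]
    target_attr = ["Book.comment", "Book.comment", "Music.label", "Music.label"]
    vec = CountVectorizer(analyzer="char", ngram_range=(2, 4))
    T = MultinomialNB().fit(vec.fit_transform(target_docs), target_attr)

    # Label each source value with its best-guess target attribute; downstream,
    # these labels replace the raw values in the classification framework.
    source_values = ["hardcover", "audio cd", "paperback", "electra"]
    print(list(zip(source_values, T.predict(vec.transform(source_values)))))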
Slide 29

Handling Disjunctive Conditions
- Why disjuncts? What if the type field had separate categories for hardback and paperback?
- Two approaches to handling disjunctive conditions, "early" and "late"
- Early Disjuncts
  - InferCandidateViews is responsible for identifying "interesting" disjuncts
  - Each interesting disjunct is evaluated separately; no overlapping conditions are output
- Late Disjuncts
  - InferCandidateViews returns no disjuncts
  - All high-scoring conditions are unioned together (Clio semantics), effectively creating a disjunct

Slide 30

Early Disjuncts: A Heuristic Approach
- When evaluating the trained classifier on the test set for some categorical attribute C, make note of misclassifications of the form "should be A, but guessed B"
- Consider merging the (A, B) pair that would repair the most errors
  - by merge, we mean replace A and B values with (A, B)
- Re-evaluate
- Repeat
- Keep all alternatives formed this way that score well
- Only accept 1 view that mentions attribute C (don't union)
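A sketch of the merge loop over recorded (should-be, guessed) errors (the greedy policy and data shapes are assumptions; re-scoring of each alternative is elided):

    from collections import Counter

    def merge_labels(errors, labels):
        """Greedily merge the (A, B) pair that repairs the most
        misclassifications, yielding disjuncts like "c in {A, B}"."""
        groups = [frozenset([l]) for l in labels]
        while errors:
            (a, b), _n = Counter(errors).most_common(1)[0]  # most-repairing pair
            ga = next(g for g in groups if a in g)
            gb = next(g for g in groups if b in g)
            if ga is gb:
                break
            groups = [g for g in groups if g not in (ga, gb)] + [ga | gb]
            errors = [(x, y) for (x, y) in errors           # drop repaired errors
                      if not {x, y} <= (ga | gb)]
        return groups

    # e.g. hardback (type=1) and paperback (type=3) confused with each other:
    print(merge_labels([("1", "3"), ("1", "3"), ("3", "1"), ("2", "4")],
                       ["1", "2", "3", "4"]))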

Slide 31

Handling Conjuncts
- Proposed approach:
  - Assumes that a good conjunctive view has a good disjunctive view as one of the terms in the conjunct
- Run ContextMatch repeatedly
- At stage i, consider the views VC identified by the previous, (i-1)-th, run as the input base tables
  - where C was the select condition defining the view
- When considering candidate attributes for a run, only consider categorical attributes not in C
- (Conjunct handling is not in the current experiments; a sketch of the outer loop follows)
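Under that assumption, the outer loop might look like this (a hypothetical sketch; run_context_match stands in for one full ContextMatch pass and is not the paper's API):

    def conjunct_context_match(sample, run_context_match, max_stages=2):
        """Stage i re-runs ContextMatch with each stage-(i-1) view as the
        base table, extending its condition C by one more attribute."""
        views = [(sample, [])]            # (rows, attrs already used in C)
        for _ in range(max_stages):
            next_views = []
            for rows, used in views:
                # run_context_match yields (attr, value, view_rows) for each
                # accepted contextual view, excluding attributes already in C.
                for attr, value, vrows in run_context_match(rows, exclude=used):
                    next_views.append((vrows, used + [attr]))
            if not next_views:
                break
            views = next_views
        return views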

Slide 32

Selecting Contextual Matches
- Each view V based on condition c is evaluated, rather than each match
- Compute the overall confidence of the matches from V, and compare to the overall confidence from the base table
- If the overall confidence is better by more than w, use V instead of the base table
- If more than one view qualifies
  - If EarlyDisjunct, choose the best
  - Else, take all that are over w
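A sketch of this selection rule (aggregating confidences by mean, and the tuple shapes, are assumptions):

    from statistics import mean

    def select_contextual_matches(base_matches, view_matches,
                                  early_disjunct=True, w=0.05):
        """Keep a view only if the overall confidence of its matches beats
        the base table's overall confidence by more than w."""
        base_conf = mean(s for (_src, _tgt, _cond, s) in base_matches)
        by_view = {}
        for ((src, tgt, cond), s) in view_matches:
            by_view.setdefault(cond, []).append(s)
        qualified = [(cond, mean(ss)) for cond, ss in by_view.items()
                     if mean(ss) > base_conf + w]
        if early_disjunct:                    # EarlyDisjunct: at most one view
            return [max(qualified, key=lambda q: q[1])] if qualified else []
        return qualified                      # LateDisjunct: all views over w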

Slide 33

Comments on Schema Mapping
- Seek to apply the Clio [Popa et al, VLDB 2002] approach to mapping construction
- Create 'logical tables' based on key-foreign key constraints
- Two challenges
  - Extend the notion of foreign-key constraints to selection views (with an attendant undecidability result)
  - Extend the join rules of [Popa et al, VLDB 2002] to handle the selection views
- See the paper for details

Slide 34

Overview
1. Motivation
2. Background
3. Strawman
4. Framework
5. Experimental Evaluation
6. Related Work
7. Conclusions

Slide 35

Experimental Study
- Used schemas from the retail domain
  - schemas created by students at UW (Aaron, Ryan, Barrett)
  - Populated code and descr info by scraping web sites; used some name data from the Illinois Semantic Integration Archive
- ItemType is split, so that instead of just CD and BOOK there are, e.g., CD1, CD2, BOOK1, BOOK2 (4 values)
- Compare matched edges to correct edges
  - Accuracy: how many of the BOOKi edges go to the book target table?
  - Precision: of the BOOKi edges, how many go to the book target?
  - Fmeas: 2(Accuracy × Precision)/(Accuracy + Precision)

Slide 36

View improvement threshold: w

[Charts for the Aaron, Barrett, and Ryan schemas: match quality as w varies]

- How sensitive is the technique to w?
- Depends on the disjunct strategy
- Easier to pick w with EarlyDisjunct
Slide 37

Strawman

- Strawman means
  - Late disjunct (EarlyDisjunct = false)
  - Pick the best arrow from each source attribute on a per-attribute basis (MultiTable)
Slide 38

Sensitivity to Decoy Categorical Attributes
[Charts: quality under LateDisjunct and EarlyDisjunct as decoy correlation varies]

- Add 3 extra categorical attributes
- Vary their correlation with ItemType (higher correlation makes it harder)
- Naïve is not only slow, it is overly confusing to the quality metrics
- The EarlyDisjunct heuristic based on classification helps with quality
Slide 39

Varying schema size

- Add n non-categorical attributes to every table, all taken from the same domain
- Add n/4 categorical attributes to the tables with categorical attributes
- The early dip occurs before the non-categorical attributes begin to match each other
Slide 40

Runtime as schema gets larger

- Same experiment, comparing runtimes
- TgtClass is somewhat higher quality (not shown), but takes much longer for large schemas
Slide 41

Grades Example
- Create an experiment based on the grades example
- Artificial data
  - the mean of assignment i is 40 + 10(i-1) (as grades improve)
  - the standard deviation is varied

[Diagram: the grades source table (Name, Assgn, Grade; rows for Joe, Mary, Bob, Sue) mapped to the target (Name, Grade1, Grade2, Grade3) under where Assgn = 1, 2, 3, ...]
Slide 42

Grades accuracy as std. dev. increases

[Chart: accuracy as the standard deviation of grades increases]
Slide 43

Overview
1. Motivation
2. Background
3. Strawman
4. Framework
5. Experimental Evaluation
6. Related Work
7. Conclusions

Slide 44

Related Work
- Instance-level schema matching
  - Survey [Rahm, Bernstein, VLDB Journal 2001]; Coma [Do, Rahm, VLDB 02]; Coma++ [SIGMOD 05]; iMAP [Doan et al, SIGMOD 01]; Cupid [Madhavan, Bernstein, Rahm, VLDB 01]; etc.
- Schema mapping
  - Clio [Popa et al, VLDB 02], [Haas et al, SIGMOD 2005], etc.
  - Model Management (many papers)
- Overcoming heterogeneity during the match process
  - Schema Mapping as Query Discovery [Miller, Haas, Hernandez, VLDB 2000]: presents the user with examples to derive join conditions
  - MIQIS [Fletcher, Wyss, (demo) SIGMOD 2005]: searches through a large space of schema transformations (beyond what is given here), but requires the same data to appear in both source and target
  - We focus on inferring selection views only, but are very compatible with existing schema match work

Slide 45

Conclusions
- Contributions
  - Introduced contextual matching as an important extension to schema matching
  - Defined a general framework in which the instance-level match technique is treated as a black box
  - Identified two techniques based on classification to find good conditions
  - Identified filtering criteria for contextual matches
  - Defined contextual foreign keys and new join rules to extend a Clio-style schema mapper to better handle contextual matches
  - Experimental study illustrating time/quality tradeoffs
- Future Work
  - More complex view conditioning (horizontal partitioning + attribute promotion)
  - Consider taking constraints on the target into account in the quality functions
Slide 46

The End

Thank you, any questions?


Standard Match Algorithm
- StandardMatch(RS, RT, θ)
  - Evaluate the quality of the match between all pairs of (source, target) attributes
    - Ignore complex (multi-attribute) matches for simplicity
  - Return matches between source table RS and target schema RT that have confidence >= θ
Slide 49

[Example output: the full list of scored standard matches, each with condition true: RS.ac → RT1.ac, RS.ba → RT1.cd, RS.ba → RT1.sb, RS.db → RT1.ar, RS.ac → RT1.vw, RS.bd → RT1.ad, RS.af → RT1.ca]
Slide 50

Background: Instance-level matching
- Instance-level schema matching requires sample data for the source and target schemas
- Train a variety of matchers by treating each (source, target) column as a set of documents labeled by the column name
  - e.g. text matchers based on string similarity, token similarity, format similarity, number of tokens, etc., or
  - numeric matchers based on value distribution, etc.
- Apply source matchers to sample target data, and vice versa
- Combine the resulting scores (with machine-learned weightings [Doan, Domingos, Halevy, SIGMOD 2001]) to score each arrow, e.g. M: RS.ac → RT1.bb [true] with its score, up to a "perfect match"
Slide 51


