Learning to Map Between

Shared by: tao peng
posted:
10/22/2009
Learning to Map Between Schemas and Ontologies
Alon Halevy, University of Washington
Joint work with Anhai Doan and Pedro Domingos

Agenda
- Ontology mapping is a key problem in many applications:
  - Data integration
  - Semantic web
  - Knowledge management
  - E-commerce
- LSD:
  - A solution that uses multi-strategy learning.
  - We started with schema matching (i.e., very simple ontologies).
  - Currently extending to more expressive ontologies.
  - Experiments show the approach is very promising!

The Structure Mapping Problem
- Types of structures:
  - Database schemas, XML DTDs, ontologies, ...
- Input:
  - Two (or more) structures, S1 and S2
  - Data instances for S1 and S2
  - Background knowledge
- Output:
  - A mapping between S1 and S2
  - Should enable translating between data instances.
  - Semantics of mapping?

Semantic Mappings between Schemas
- Source schemas = XML DTDs
[Figure: two "house" DTDs side by side. Elements shown for one: address, contact-info, agent-name, agent-phone, num-baths. Elements shown for the other: location, contact, name, phone, full-baths, half-baths. Arrows mark one 1-1 mapping and one non 1-1 mapping.]

Motivation
- Database schema integration
  - A problem as old as databases themselves.
  - Database merging, data warehouses, data migration
- Data integration / information gathering agents
  - On the WWW, in enterprises, large science projects
- Model management:
  - Model matching: key operator in an algebra where models and mappings are first-class objects.
  - See [Bernstein et al., 2000] for more.
- The Semantic Web
  - Ontology mapping.
- System interoperability
  - E-services, application integration, B2B applications, ...

Desiderata from Proposed Solutions
- Accuracy, efficiency, ease of use.
- Realistic expectations:
  - Unlikely to be fully automated. Need user in the loop.
- Some notion of semantics for mappings.
- Extensibility:
  - Solution should exploit additional background knowledge.
- "Memory", knowledge reuse:
  - System should exploit previous manual or automatically generated matchings.
  - Key idea behind LSD.

LSD Overview
- L(earning) S(ource) D(escriptions)
- Problem: generating semantic mappings between a mediated schema and a large set of data source schemas.
- Key idea: generate the first mappings manually, and learn from them to generate the rest.
- Technique: multi-strategy learning (extensible!)
- Step 1:
  - [SIGMOD, 2001]: 1-1 mappings between XML DTDs.
- Current focus:
  - Complex mappings
  - Ontology mapping.

Outline
- Overview of structure mapping
- Data integration and source mappings
- LSD architecture and details
- Experimental results
- Current work.

Data Integration
[Figure: the query "Find houses with four bathrooms priced under $500,000" is posed over the mediated schema, goes through query reformulation and optimization, and reaches source schemas 1, 2 and 3 through wrappers over realestate.com, homeseekers.com and homes.com.]
- Applications: WWW, enterprises, science projects
- Techniques: virtual data integration, warehousing, custom code.

Semantics (preliminary)
- Semantics of mappings has received no attention.
- Semantics of 1-1 mappings:
  - Given:
    - R(A1,...,An) and S(B1,...,Bm)
    - 1-1 mappings (Ai,Bj)
  - Then, we postulate the existence of a relation W, s.t.:
    - $\Pi_{C_1,\ldots,C_k}(W) = \Pi_{A_1,\ldots,A_k}(R)$,
    - $\Pi_{C_1,\ldots,C_k}(W) = \Pi_{B_1,\ldots,B_k}(S)$,
    - W also includes the unmatched attributes of R and S.
- In English: R and S are projections on some universal relation W, and the mappings specify the projection variables and correspondences.

Why Matching is Difficult
- Aims to identify same real-world entity
  - using names, structures, types, data values, etc.
- Schemas represent same entity differently
  - different names => same entity:
    - area & address => location
  - same names => different entities:
    - area => location or square-feet
- Schema & data never fully capture semantics!
  - not adequately documented, not sufficiently expressive
- Intended semantics is typically subjective!
  - IBM Almaden Lab = IBM?
- Cannot be fully automated. Often hard for humans. Committees are required!

Current State of Affairs
- Finding semantic mappings is now the bottleneck!
  - largely done by hand
  - labor intensive & error prone
  - GTE: 4 hours/element for 27,000 elements [Li&Clifton00]
- Will only be exacerbated
  - data sharing & XML become pervasive
  - proliferation of DTDs
  - translation of legacy data
  - reconciling ontologies on semantic web
- Need semi-automatic approaches to scale up!

The LSD Approach
- User manually maps a few data sources to the mediated schema.
- LSD learns from the mappings, and proposes mappings for the rest of the sources.
- Several types of knowledge are used in learning:
  - Schema elements, e.g., attribute names
  - Data elements: ranges, formats, word frequencies, value frequencies, length of texts.
  - Proximity of attributes
  - Functional dependencies, number of attribute occurrences.
- One learner does not fit all. Use multiple learners and combine with meta-learner.

Example
Mediated schema: address, price, agent-phone, description

Schema of realestate.com (with manual mappings):
  location     => address
  listed-price => price
  phone        => agent-phone
  comments     => description

Data listings from realestate.com:
  location     listed-price   phone            comments
  Miami, FL    $250,000       (305) 729 0831   Fantastic house
  Boston, MA   $110,000       (617) 253 1429   Great location
  ...          ...            ...              ...

Learned hypotheses:
- If “phone” occurs in the name => agent-phone
- If “fantastic” & “great” occur frequently in data values => description

Data listings from homes.com:
  price        contact-phone    extra-info
  $550,000     (278) 345 7215   Beautiful yard
  $320,000     (617) 335 2315   Great beach
  ...          ...              ...
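
The name-based hypothesis in the example (“phone” occurs in the name => agent-phone) can be illustrated with a small sketch. This is not LSD's name learner: the token-overlap (Jaccard) score, the normalisation, and the function names `tokens`, `train_name_learner` and `predict` are all assumptions made for illustration. Only the training pairs come from the slides.

```python
import re
from collections import defaultdict


def tokens(name):
    """Split a tag name such as 'contact-phone' into lowercase word tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}


def train_name_learner(training_pairs):
    """For each mediated-schema label, collect the tokens seen in the source
    tag names mapped to it, plus the tokens of the label itself."""
    label_tokens = defaultdict(set)
    for source_tag, label in training_pairs:
        label_tokens[label] |= tokens(source_tag) | tokens(label)
    return dict(label_tokens)


def predict(model, source_tag):
    """Score each label by Jaccard overlap with the new tag's tokens
    (an assumed similarity measure) and normalise the non-zero scores."""
    tag = tokens(source_tag)
    raw = {}
    for label, known in model.items():
        shared = tag & known
        if shared:
            raw[label] = len(shared) / len(tag | known)
    total = sum(raw.values())
    return {label: score / total for label, score in raw.items()}


# Manual mappings for realestate.com, as listed in the example.
training_pairs = [
    ("location", "address"),
    ("listed-price", "price"),
    ("phone", "agent-phone"),
    ("comments", "description"),
]
model = train_name_learner(training_pairs)
print(predict(model, "contact-phone"))
print(predict(model, "extra-info"))
```

With these four training pairs, `contact-phone` from homes.com shares the token "phone" only with agent-phone, while `extra-info` shares no token with any label, which is the kind of case where a data-based learner has to take over.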

Multi-Strategy Learning
- Use a set of base learners:
  - Name learner, Naive Bayes, Whirl, XML learner
- And a set of recognizers:
  - County name, zip code, phone numbers.
- Each base learner produces a prediction weighted by a confidence score.
- Combine base learners with a meta-learner, using stacking.

Base Learners
- Name Learner
  - Training pairs: (contact-info,office-address), (contact,agent-phone), (phone,agent-phone), (listed-price,price)
  - Query: (contact-phone, ?)
  - contact-phone => (agent-phone,0.7), (office-address,0.3)
- Naive Bayes Learner [Domingos&Pazzani 97]
  - “Kent, WA” => (address,0.8), (name,0.2)
- Whirl Learner [Cohen&Hirsh 98]
- XML Learner
  - exploits hierarchical structure of XML data

Training the Base Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments

Data listings from realestate.com:
  Miami, FL    $250,000   (305) 729 0831   Fantastic house
  Boston, MA   $110,000   (617) 253 1429   Great location

- Name Learner training examples:
  - (location, address)
  - (listed-price, price)
  - (phone, agent-phone)
  - ...
- Naive Bayes Learner training examples:
  - (“Miami, FL”, address)
  - (“$ 250,000”, price)
  - (“(305) 729 0831”, agent-phone)
  - ...

Entity Recognizers
- Use pre-programmed knowledge to identify specific types of entities
  - date, time, city, zip code, name, etc.
  - house-area (30 X 70, 500 sq. ft.)
  - county-name recognizer
- Recognizers often have nice characteristics
  - easy to construct
  - many off-the-shelf research & commercial products
  - applicable across many domains
  - help with special cases that are hard to learn

Meta-Learner: Stacking
- Training of meta-learner produces a weight for every pair of:
  - (base-learner, mediated-schema element)
  - weight(Name-Learner,address) = 0.1
  - weight(Naive-Bayes,address) = 0.9
- Combining predictions in the meta-learner:
  - computes weighted sum of base-learner confidence scores
- Example for the instance "Seattle, WA":
  - Name Learner: (address,0.6)
  - Naive Bayes: (address,0.8)
  - Meta-Learner: (address, 0.6*0.1 + 0.8*0.9 = 0.78)

Training the Meta-Learner
- For address:

  Extracted XML instance   Name Learner   Naive Bayes   True prediction
  Miami, FL                0.5            0.8           1
  $250,000                 0.4            0.3           0
  Seattle, WA              0.3            0.9           1
  Kent, WA                 0.6            0.8           1
  3                        0.3            0.3           0
  ...                      ...            ...           ...

- Least-Squares Linear Regression yields:
  - Weight(Name-Learner,address) = 0.1
  - Weight(Naive-Bayes,address) = 0.9

Applying the Learners
Schema of homes.com: area, day-phone, extra-info
Mediated schema: address, price, agent-phone, description

- area (Seattle, WA; Kent, WA; Austin, TX)
  - Per-instance meta-learner predictions: (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3)
  - Combined: (address,0.7), (description,0.3)
- day-phone ((278) 345 7215; (617) 335 2315; (512) 427 1115)
  - Combined: (agent-phone,0.9), (description,0.1)
- extra-info (Beautiful yard; Great beach; Close to Seattle)
  - Combined: (description,0.8), (address,0.2)

The Constraint Handler
- Extends learning to incorporate constraints
  - hard constraints
    - a = address & b = address => a = b
    - a = house-id => a is a key
    - a = agent-info & b = agent-name => b is nested in a
  - soft constraints
    - a = agent-phone & b = agent-name => a & b are usually close to each other
  - user feedback = hard or soft constraints
- Details in [Doan et al., SIGMOD 2001]

The Current LSD System
[Figure: system architecture. Training phase: mediated schema, source schemas, and data listings feed Base-Learner 1 through Base-Learner k and the Meta-Learner. Matching phase: the Constraint Handler takes domain constraints and user feedback and outputs the mappings.]

Empirical Evaluation
- Four domains
  - Real Estate I & II, Course Offerings, Faculty Listings
- For each domain
  - create mediated DTD & domain constraints
  - choose five sources
  - extract & convert data listings into XML (faithful to schema!)
  - mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
- Ten runs for each experiment. In each run:
  - manually provide 1-1 mappings for 3 sources
  - ask LSD to propose mappings for remaining 2 sources
  - accuracy = % of 1-1 mappings correctly identified

Matching Accuracy
[Figure: bar chart of average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings.]
- LSD's accuracy: 71 - 92%
- Best single base learner: 42 - 72%
- + Meta-learner: + 5 - 22%
- + Constraint handler: + 7 - 13%
- + XML learner: + 0.8 - 6%

Sensitivity to Amount of Available Data
[Figure: average matching accuracy (%) versus number of data listings per source (0 to 500), Real Estate I.]

Contribution of Schema vs. Data
[Figure: bar chart of average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings, comparing LSD with only schema info, LSD with only data info, and complete LSD.]
- More experiments in the paper [Doan et al. 01]

Reasons for Incorrect Matching
- Unfamiliarity
  - suburb
  - solution: add a suburb-name recognizer
- Insufficient information
  - correctly identified general type, failed to pinpoint exact type
  - Richard Smith (206) 234 5412
  - solution: add a proximity learner
- Subjectivity
  - house-style = description?
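
The stacking step from the Meta-Learner slides can be sketched in a few lines. The `combine` function reproduces the weighted sum from the "Seattle, WA" example. The `fit_two_weights` function is an assumed, simplified form of the least-squares step (two learners, no intercept term, solved through the normal equations). The slides name the technique but do not give this level of detail, and the function names are hypothetical.

```python
def combine(scores, weights):
    """Weighted sum of base-learner confidence scores for one label."""
    return sum(weights[learner] * score for learner, score in scores.items())


def fit_two_weights(x1, x2, y):
    """Least-squares weights (w1, w2) minimising sum((w1*x1 + w2*x2 - y)^2).

    Assumption: no intercept term; solved with the 2x2 normal equations.
    """
    a11 = sum(a * a for a in x1)
    a12 = sum(a * b for a, b in zip(x1, x2))
    a22 = sum(b * b for b in x2)
    b1 = sum(a * t for a, t in zip(x1, y))
    b2 = sum(b * t for b, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("learner scores are collinear; weights are not unique")
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


# Weights and scores taken from the "Seattle, WA" example on the slide.
weights = {"Name-Learner": 0.1, "Naive-Bayes": 0.9}
scores = {"Name-Learner": 0.6, "Naive-Bayes": 0.8}
print(round(combine(scores, weights), 2))  # 0.78

# The five visible training rows for "address" (the slide elides the rest).
name_scores = [0.5, 0.4, 0.3, 0.6, 0.3]
bayes_scores = [0.8, 0.3, 0.9, 0.8, 0.3]
is_address = [1, 0, 1, 1, 0]
w_name, w_bayes = fit_two_weights(name_scores, bayes_scores, is_address)
print(w_name, w_bayes)
```

The five rows are only the visible fragment of the training table, so the fitted values are not expected to reproduce the slide's 0.1 and 0.9. They do show the same qualitative outcome: Naive Bayes receives the larger weight for address.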

Moving Up the Expressiveness Ladder
- Schemas are very simple ontologies.
- More expressive power = more domain constraints.
- Mappings become more complex, but constraints provide more to learn from.
- Non 1-1 mappings:
  - F1(A1,...,Am) = F2(B1,...,Bm)
- Ontologies (of various flavors):
  - Class hierarchy (i.e., containment on unary relations)
  - Relationships between objects
  - Constraints on relationships

Finding Non 1-1 Mappings (current work)
- Given two schemas, find
  - 1-many mappings: address = concat(city,state)
  - many-1: half-baths + full-baths = num-baths
  - many-many: concat(addr-line1,addr-line2) = concat(street,city,state)
- 1-many mappings
  - expressed as query
    - value correspondence expression: room-rate = rate * (1 + tax-rate)
    - relationship: state of tax-rate = state of hotel that has rate
  - special case: 1-many mappings between two relational tables
[Figure: mediated schema (address, description, num-baths) and source schema (city, state, comments, half-baths, full-baths).]

Brute-Force Solution
- Define a set of operators
  - concat, +, -, *, /, etc.
- For each set of mediated-schema columns
  - enumerate all possible mappings
  - evaluate & return best mapping
[Figure: mediated-schema column m1 matched against source-schema columns m1, m2, ..., mk.]

Search-Based Solution
- States = columns
  - goal state: mediated-schema column
  - initial states: all source-schema columns
  - use 1-1 matching to reduce the set of initial states
- Operators: concat, +, -, *, /, etc.
- Column-similarity:
  - use all base learners + recognizers

Multi-Strategy Search
- Use a set of expert modules: L1, L2, ..., Ln
- Each module
  - applies to only certain types of mediated-schema column
  - searches a small subspace
  - uses a cheap similarity measure to compare columns
- Example
  - L1: text; concat; TF/IDF
  - L2: numeric; +, -, *, /; [Ho et al. 2000]
  - L3: address; concat; Naive Bayes
- Search techniques
  - beam search as default
  - specialized, do not have to materialize columns

Multi-Strategy Search (cont’d)
- Apply all applicable expert modules
  - L1: m11, m12, m13, ..., m1x
  - L2: m21, m22, m23, ..., m2y
  - L3: m31, m32, m33, ..., m3z
- Combine modules’ predictions & select the best one
  - m11, m12, m21, m22, m31, m32 => m11

Related Work
[Diagram; labels and systems listed in the order they appear.]
- Recognizers + Schema + 1-1 Matching
- TRANSCM [Milo&Zohar98]
- ARTEMIS [Castano&Antonellis99]
- [Palopoli et al. 98]
- CUPID [Madhavan et al. 01]
- Hybrid + 1-1 Matching
- Single Learner + 1-1 Matching
- SEMINT [Li&Clifton94]
- ILA [Perkowitz&Etzioni95]
- DELTA [Clifton et al. 97]
- Multi-Strategy Learning; Learners + Recognizers; Schema + Data; 1-1 + non 1-1 Matching
- LSD [Doan et al. 2000, 2001]
- Schema + Data; 1-1 + non 1-1 Matching; Sophisticated Data-Driven User Interaction
- CLIO [Miller et al. 00], [Yan et al. 01]
- ?

Summary
- LSD:
  - uses multi-strategy learning to semi-automatically generate semantic mappings.
  - is extensible and incorporates domain and user knowledge, and previous techniques.
  - Experimental results show the approach is very promising.
- Future work and issues to ponder:
  - Accommodating more expressive languages: ontologies
  - Reuse of learned concepts from related domains.
  - Semantics?
- Data management is a fertile area for Machine Learning research!

Backup Slides

Mapping Maintenance
[Figure: mappings m1, m2, m3 between mediated schema M and source schema S, and the same mappings between M’ and S’ ten months later.]
- Ten months later ...
  - are the mappings still correct?

Information Extraction from Text
- Extract data fragments from text documents
  - date, location, & victim’s name from a news article
- Intensive research on free-text documents
- Many documents do have substantial structure
  - XML pages, name card, tables, list
- Each such document = a data source
  - structure forms a schema
  - only one data value per schema element
  - “real” data source has many data values per schema element
- Ongoing research in the IE community

Contribution of Each Component
[Figure: bar chart of average matching accuracy (%) for Real Estate I, Course Offerings, Faculty Listings, and Real Estate II, comparing LSD without the Name Learner, without Naive Bayes, without the Whirl Learner, without the Constraint Handler, and the complete LSD system.]

Exploiting Hierarchical Structure
- Existing learners flatten out all structures
  - Flat text: Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors.
  - Nested elements: Gail Murphy; MAX Realtors
- Developed XML learner
  - similar to the Naive Bayes learner
    - input instance = bag of tokens
  - differs in one crucial aspect
    - considers not only text tokens, but also structure tokens

Domain Constraints
- Impose semantic regularities on sources
  - verified using schema or data
- Examples
  - a = address & b = address => a = b
  - a = house-id => a is a key
  - a = agent-info & b = agent-name => b is nested in a
- Can be specified up front
  - when creating mediated schema
  - independent of any actual source schema

The Constraint Handler
Predictions from Meta-Learner:
  area:          (address,0.7), (description,0.3)
  contact-phone: (agent-phone,0.9), (description,0.1)
  extra-info:    (address,0.6), (description,0.4)

Domain constraint: a = address & b = address => a = b

Candidate mapping combinations (score = product of confidences):
  area: address 0.7, contact-phone: agent-phone 0.9, extra-info: address 0.6         score 0.378
  area: address 0.7, contact-phone: agent-phone 0.9, extra-info: description 0.4     score 0.252
  area: description 0.3, contact-phone: description 0.1, extra-info: description 0.4 score 0.012

- Can specify arbitrary constraints
- User feedback = domain constraint
  - ad-id = house-id
- Extended to handle domain heuristics
  - a = agent-phone & b = agent-name => a & b are usually close to each other
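
The constraint-handler example can be replayed with a short sketch. It assumes an exhaustive enumeration of label combinations, scores each combination by the product of confidences (which matches the 0.378, 0.252 and 0.012 figures on the slide), and encodes the hard constraint "a = address & b = address => a = b" as "at most one element maps to address". The helper names `at_most_one` and `best_assignment` are hypothetical, and the search procedure used in LSD itself is not described in these slides.

```python
from itertools import product
from math import prod

# Meta-learner predictions from the slide.
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one(label):
    """Hard constraint: no two source elements may both map to `label`."""
    def check(assignment):
        return sum(1 for chosen in assignment.values() if chosen == label) <= 1
    return check


def best_assignment(predictions, constraints=()):
    """Enumerate every label combination, drop those violating a hard
    constraint, and return the one with the highest confidence product."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        assignment = dict(zip(elements, labels))
        if not all(check(assignment) for check in constraints):
            continue
        score = prod(predictions[e][label] for e, label in assignment.items())
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


unconstrained, score_a = best_assignment(predictions)
constrained, score_b = best_assignment(predictions, [at_most_one("address")])
print(unconstrained, round(score_a, 3))
print(constrained, round(score_b, 3))
```

Without the constraint, the top-scoring combination maps both area and extra-info to address (0.378). With it, extra-info falls back to description (0.252), as on the slide. Exhaustive enumeration grows exponentially with the number of elements, so it is only suitable for small examples like this one.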
