                    Automatically Extracting
                   Ontologically Specified Data
                       from HTML Tables
                    with Unknown Structure

                            David W. Embley, Cui Tao, Stephen W. Liddle
                                     Brigham Young University



BYU Data Extraction Group                     Funded by NSF               ER 2002
                            Information Exchange
       [Figure: information exchange between a Source and a Target. Leverage information extraction to do schema matching.]

                            Information Extraction




              Extracting Pertinent Information
                      from Documents




           A Conceptual-Modeling Solution
       [Figure: conceptual-model diagram. Car is the central object set, connected by "has" relationship sets to Year, Make, Model, Mileage, Price, and Feature; PhoneNr "is for" Car, and PhoneNr "has" Extension. Edges carry participation constraints such as 0..1, 0..*, and 1..*.]

                               Car-Ads Ontology
                            Car [->object];
                            Car [0..1] has Year [1..*];
                            Car [0..1] has Make [1..*];
                            Car [0..1] has Model [1..*];
                            Car [0..1] has Mileage [1..*];
                            Car [0..*] has Feature [1..*];
                            Car [0..1] has Price [1..*];
                            PhoneNr [1..*] is for Car [0..*];
                            PhoneNr [0..1] has Extension [1..*];
                            Year matches [4]
                              constant {extract "\d{2}";
                                          context "([^\$\d]|^)[4-9]\d[^\d]";
                                          substitute "^" -> "19"; },
                                          …
                              …
                            End;
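
As a rough illustration of how the Year rule above might behave, the following Python sketch applies the context expression, extracts the two-digit constant inside it, and applies the substitution by prefixing "19". The function name `recognize_years` and the sample ad text are hypothetical, and the slides do not show the actual extraction engine, so this is only one plausible reading of the rule.

```python
import re

# Expressions copied from the Year rule on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d[^\d]")
EXTRACT = re.compile(r"\d{2}")


def recognize_years(text):
    """Return four-digit years for two-digit years found in a valid context."""
    years = []
    for context_match in CONTEXT.finditer(text):
        constant = EXTRACT.search(context_match.group(0))
        if constant:
            # substitute "^" -> "19": prefix the extracted constant with "19"
            years.append("19" + constant.group(0))
    return years


# Hypothetical ad text: the price and phone number are not mistaken for years.
print(recognize_years("'89 Subaru SW, $1900, call (336)835-8597"))
```

The context expression is what keeps the price and the phone number out of the result: a candidate must not follow a dollar sign or a digit, and must not be followed by another digit.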
                        Recognition and Extraction

   Car      Year     Make     Model       Mileage   Price   PhoneNr
   0001     1989     Subaru   SW                    $1900   (336)835-8597
   0002     1998              Elantra                       (336)526-5444
   0003     1994     HONDA    ACCORD EX   100K              (336)526-1081

   Car      Feature
   0001     Auto
   0001     AC
   0002     Black
   0002     4 door
   0002     tinted windows
   0002     Auto
   0002     pb
   0002     ps
   0002     cruise
   0002     am/fm
   0002     cassette stereo
   0002     a/c
   0003     Auto
   0003     jade green
   0003     gold

      Schema Matching for HTML Tables
           with Unknown Structure




                            Table-Schema Matching
                                  (Basic Idea)
       • Many Tables on the Web
       • Ontology-Based Extraction
              – Works well for unstructured or semistructured data
              – What about structured data – tables?
       • Method
              – Form attribute-value pairs
              – Do extraction
              – Infer mappings from extraction patterns


                     Problem: Different Schemas
            Target Database Schema
                        {Car, Year, Make, Model, Mileage, Price, PhoneNr},
                        {PhoneNr, Extension}, {Car, Feature}
            Different Source Table Schemas
                   – {Run #, Yr, Make, Model, Tran, Color, Dr}
                   – {Make, Model, Year, Colour, Price, Auto, Air Cond.,
                     AM/FM, CD}
                   – {Vehicle, Distance, Price, Mileage}
                   – {Year, Make, Model, Trim, Invoice/Retail, Engine,
                     Fuel Economy}



                      Problem: Attribute is Value




         Problem: Attribute-Value is Value

                    Problem: Value is not Value

                            Problem: Implied Values

                       Problem: Missing Attributes

                 Problem: Compound Attributes

                            Problem: Factored Values

                            Problem: Split Values

                            Problem: Merged Values

               Problem: Values not of Interest

       Problem: Information Behind Links
       [Figure labels: table extending over several pages; single-column table (formatted as list)]

                            Solution
       • Form attribute-value pairs (adjust if necessary)
       • Do extraction
       • Infer mappings from extraction patterns




      Solution: Remove Internal Factoring



           [Figure: source table in which Make (e.g. ACURA) and Model (e.g. Legend) are factored out of the rows they apply to]

           Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*

           Unnest: μ_{(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} μ_{(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} Table
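
The effect of the two unnest steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the nested-dictionary representation, the function name `unnest`, and the ACURA rows are assumptions made for the example (the Honda row reuses values shown later in the slides).

```python
def unnest(nested):
    """Flatten a Make -> Model -> rows nesting into one flat record per row."""
    flat = []
    for make, models in nested.items():
        for model, rows in models.items():
            for row in rows:
                # Copy the factored Make and Model values into every row.
                flat.append({"Make": make, "Model": model, **row})
    return flat


# Hypothetical nested table; only the Honda values come from the slides.
nested_table = {
    "ACURA": {"Legend": [{"Year": "1992"}, {"Year": "1994"}]},
    "Honda": {"Civic EX": [{"Year": "1995", "Colour": "White", "Price": "$6300"}]},
}

records = unnest(nested_table)
for record in records:
    print(record)
```

After unnesting, every row carries its own Make and Model, so each row can be treated as a self-contained description of one car.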

          Solution: Replace Boolean Values


           [Figure: the source table with each "Yes" in the Auto, Air Cond., AM/FM, and CD columns replaced by the column's attribute name]

           β^{Auto}_{Yes,} β^{Air Cond}_{Yes,} β^{AM/FM}_{Yes,} β^{CD}_{Yes,} Table
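
One way to read the β operator is: in the named column, replace the "true" value with the attribute name and everything else with the empty string. The Python sketch below follows that reading. The function name `replace_boolean`, the row representation, and the sample cell values are assumptions for illustration only.

```python
def replace_boolean(rows, attribute, true_value="Yes"):
    """Replace true_value in one column by the attribute name; other values become empty."""
    return [
        {**row, attribute: attribute if row.get(attribute) == true_value else ""}
        for row in rows
    ]


# Hypothetical row; the original cell values are not legible in the slides.
table = [{"Model": "Civic EX", "Auto": "Yes", "Air Cond.": "Yes", "AM/FM": "Yes", "CD": "No"}]

# Apply the operator once per Boolean column, as in the composed expression above.
for column in ["Auto", "Air Cond.", "AM/FM", "CD"]:
    table = replace_boolean(table, column)

print(table[0])
```

Replacing "Yes" with the attribute name turns a Boolean flag into text that an ontology-based recognizer for Feature could match directly.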




     Solution: Form Attribute-Value Pairs






                   <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>,
                     <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >

  Solution: Adjust Attribute-Value Pairs






                   <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>,
                     <Auto>, <Air Cond>, <AM/FM>
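
Forming and then adjusting the pairs for the row shown above might look like the following Python sketch. The helper names `form_pairs` and `adjust_pairs` are hypothetical, and the adjustment rules (drop empty values, collapse a pair whose value repeats its attribute) are inferred from the two example lines rather than stated in the slides.

```python
def form_pairs(header, row):
    """Pair each column attribute with the row's value in that column."""
    return list(zip(header, row))


def adjust_pairs(pairs):
    """Drop empty pairs and collapse <A, A> into <A>."""
    adjusted = []
    for attribute, value in pairs:
        if value == "":
            continue  # <CD, > carries no information
        if value == attribute:
            adjusted.append((attribute,))  # <Auto, Auto> becomes <Auto>
        else:
            adjusted.append((attribute, value))
    return adjusted


header = ["Make", "Model", "Year", "Colour", "Price", "Auto", "Air Cond.", "AM/FM", "CD"]
row = ["Honda", "Civic EX", "1995", "White", "$6300", "Auto", "Air Cond.", "AM/FM", ""]

adjusted = adjust_pairs(form_pairs(header, row))
print(adjusted)
```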

                              Solution: Do Extraction



                      Solution: Infer Mappings



           Each row is a car.

           π_{Make} μ_{(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} μ_{(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} Table
           π_{Model} μ_{(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} Table
           π_{Year} Table

 {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

           Note: Mappings produce sets for attributes. Joining to form records
           is trivial because we have OIDs for table rows (e.g. for each Car).
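
The join on object identifiers (OIDs) mentioned in the note can be sketched as below. This assumes each mapping yields (OID, value) pairs for one target attribute; the function name `join_on_oid`, the OID strings, and the ACURA values are hypothetical.

```python
def join_on_oid(attribute_sets):
    """Combine per-attribute (oid, value) pairs into one record per OID."""
    records = {}
    for attribute, pairs in attribute_sets.items():
        for oid, value in pairs:
            # Create the record on first sight of an OID, then add the attribute.
            records.setdefault(oid, {"Car": oid})[attribute] = value
    return records


# Hypothetical output of three inferred mappings, keyed by target attribute.
attribute_sets = {
    "Make": [("0001", "ACURA"), ("0002", "Honda")],
    "Model": [("0001", "Legend"), ("0002", "Civic EX")],
    "Year": [("0001", "1992"), ("0002", "1995")],
}

records = join_on_oid(attribute_sets)
print(records["0002"])
```

Because every value already carries the OID of the row it came from, no value matching is needed to reassemble records.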
                            Solution: Do Extraction



                               π_{Model} μ_{(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} Table

  {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

                            Solution: Do Extraction



                                         π_{Price} Table
  {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

                            Solution: Do Extraction


  ρ_{Colour←Feature} π_{Colour} Table
    ∪ ρ_{Auto←Feature} π_{Auto} β^{Auto}_{Yes,} Table
    ∪ ρ_{Air Cond.←Feature} π_{Air Cond.} β^{Air Cond.}_{Yes,} Table
    ∪ ρ_{AM/FM←Feature} π_{AM/FM} β^{AM/FM}_{Yes,} Table
    ∪ ρ_{CD←Feature} π_{CD} β^{CD}_{Yes,} Table


  {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
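
The union over columns that populates {Car, Feature} can be approximated in Python as follows. This is a sketch under assumptions: rows are keyed by a hypothetical OID, the function name `feature_relation` is invented for the example, and empty values are simply dropped rather than carried through the union.

```python
def feature_relation(rows, plain_columns, boolean_columns, true_value="Yes"):
    """Build a set of (car_oid, feature) pairs from several source columns."""
    features = set()
    for oid, row in rows.items():
        for column in plain_columns:
            # Projection plus renaming: the cell value itself is the Feature.
            if row.get(column):
                features.add((oid, row[column]))
        for column in boolean_columns:
            # Boolean replacement first: a "Yes" contributes the attribute name.
            if row.get(column) == true_value:
                features.add((oid, column))
    return features


# Hypothetical source row keyed by a hypothetical OID.
rows = {"0001": {"Colour": "White", "Auto": "Yes", "Air Cond.": "Yes", "AM/FM": "Yes", "CD": "No"}}

features = feature_relation(rows, ["Colour"], ["Auto", "Air Cond.", "AM/FM", "CD"])
print(sorted(features))
```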
                               Experiment
       •    Tables from 60 sites
       •    10 “training” tables
       •    50 test tables
       •    357 mappings (from all 60 sites)
              – 172 direct mappings (same attribute and meaning)
              – 185 indirect mappings (29 attribute synonyms, 5 “Yes/No”
                columns, 68 unions over columns for Feature, 19 factored
                values, and 89 columns of merged values that needed to be
                split)



                                        Results
           • 10 “training” tables
                  – 100% of the 57 mappings (no false mappings)
                  – 94.6% of the values in linked pages (5.4% false
                    declarations)
           • 50 test tables
                  – 94.7% of the 300 mappings (no false mappings)
                  – On the basis of sampling 3,000 values in linked pages,
                    we obtained 97% recall and 86% precision
           • 16 missed mappings
                  –    4 partial (not all unions included)
                  –    6 non-U.S. car-ads (unrecognized makes and models)
                  –    2 U.S. unrecognized makes and models
                  –    3 prices (missing $ or found MSRP instead)
                  –    1 mileage (mileages less than 1,000)
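
The mapping figures above can be cross-checked with a few lines of arithmetic. The sketch assumes that all 16 missed mappings fall among the 300 test-table mappings, which the 100% figure for the training tables suggests; the variable names are illustrative.

```python
training_mappings = 57
test_mappings = 300

# Missed mappings by cause, as listed on the slide.
missed = {
    "partial unions": 4,
    "non-U.S. car-ads": 6,
    "U.S. unrecognized makes and models": 2,
    "prices": 3,
    "mileage": 1,
}

total_missed = sum(missed.values())
found_in_test = test_mappings - total_missed

# Recall on the test tables alone, and over all 60 sites.
test_recall = found_in_test / test_mappings
overall_recall = (training_mappings + found_in_test) / (training_mappings + test_mappings)

print(total_missed, f"{test_recall:.1%}", f"{overall_recall:.1%}")
```

Under that assumption the test-table recall rounds to the reported 94.7%, and the combined figure over all 357 mappings comes out near 95%.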
                             Conclusions
  • Summary
         – Transformed schema-matching problem to extraction
         – Inferred semantic mappings
         – Discovered source-to-target mapping rules
  • Evidence of Success
         – Tables (mappings): 95% (Recall); 100% (Precision)
         – Linked Text (value extraction): ~97% (Recall); ~86% (Precision)
  • Future Work
         – Discover and exploit structure in linked text
         – Broaden table understanding
         – Integrate with current extraction tools

                                                      www.deg.byu.edu
