TANGO
Table ANalysis for
Generating Ontologies

Yuri A. Tijerino*,
David W. Embley*,
Deryle W. Lonsdale* and
George Nagy**
* Brigham Young University
** Rensselaer Polytechnic Institute
List of contents
 Motivation
 Applications
 Table understanding
 Concept matching
 Ontology merging/growing
 Example
 Future direction
Motivation
   Semi-automated ontological engineering through
    Table Analysis for Generating Ontologies
    (TANGO)
   Keyword search and link analysis are not enough to
    find information in tables
   Structure in tables can lead to domain knowledge,
    which includes concepts, relationships and
    constraints (ontologies)
   Tables on the web, created for human use, can lead to
    robust domain ontologies
TANGO Applications
 Extraction ontologies (generation)
 Data integration
 Semantic web
 Multiple-source query processing
 Document image analysis for documents
  that contain tables
Table understanding
 What is a table?
 Why table normalization?
 What is table understanding?
 What is mini-ontology generation?
Table understanding:
What is a table?
   “…a two-dimensional assembly of cells
    used to present information…”
       Lopresti and Nagy
 Normalized tables (row-column format)
 Small paper (using OCR) and/or electronic
  tables (marked up) intended for human
  use
Table understanding:
What is table normalization?
 Table normalization means to take any table and
  produce a standard row-column table with all data
  cells containing expanded values and type information

[Figure: raw table]

Normalized table:

Country                         GDP/PPP    GDP/PPP   Real-Growth   Inflation
                                        Per Capita   Rate
Afghanistan             $21,000,000,000       $800   ?             ?
Albania                 $13,200,000,000     $3,800   7.3%          3.0%
Algeria                $177,000,000,000     $5,600   3.8%          3.0%
Andorra                  $1,300,000,000    $19,000   3.8%          4.3%
Angola                  $13,300,000,000     $1,330   5.4%          110.0%
Antigua and Barbuda        $674,000,000    $10,000   3.5%          0.4%
…                                     …          …   …             …
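A minimal sketch of what "expanded values and type information" could look like for single cells of the normalized table above. The function name `normalize_cell` and the type labels are assumptions made for illustration; the slides do not specify an implementation.

```python
import re
from typing import Optional, Tuple, Union

Value = Optional[Union[float, str]]


def normalize_cell(text: str) -> Tuple[Value, str]:
    """Return (expanded value, type label) for one raw table cell.

    The type labels are hypothetical; they only mirror the kinds of
    values that appear in the example table.
    """
    s = text.strip()
    if s in {"?", ""}:
        # Unknown value, as in the Afghanistan row.
        return None, "unknown"
    if re.fullmatch(r"\$[\d,]+(\.\d+)?", s):
        # Drop the currency sign and thousands separators.
        return float(s[1:].replace(",", "")), "dollar amount"
    if re.fullmatch(r"-?[\d,]+(\.\d+)?%", s):
        return float(s[:-1].replace(",", "")), "percentage"
    # Anything else is kept as text (for example a country name).
    return s, "text"


row = ["Albania", "$13,200,000,000", "$3,800", "7.3%", "3.0%"]
normalized_row = [normalize_cell(cell) for cell in row]
```

Each cell keeps its value and gains an explicit type, so later matching steps do not have to re-parse symbols such as $ or %.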
Table understanding:
What is table normalization?
??                  Population   Population    Population           Birth   Death   Migration   Life          Life          Infant
                                 Growth Rate   Density              Rate    Rate    Rate        Expectancy    Expectancy    Mortality
                                                                                                Male          Female
Afghanistan         25,824,882   3.95%         39.88 persons/km2    4.19%   1.70%   1.46%       47.82 years   46.82 years   14.06%
Albania              3,364,571   1.05%         122.79 persons/km2   2.07%   0.74%   -0.29%      65.92 years   72.33 years   4.29%
Algeria             31,133,486   2.10%         13.07 persons/km2    2.70%   0.55%   -0.05%      68.07 years   70.46 years   4.38%
American Samoa          63,786   2.64%         320.53 persons/km2   2.65%   0.40%   0.39%       71.23 years   79.95 years   1.02%
Andorra                 65,939   2.24%         146.53 persons/km2   1.03%   0.55%   1.76%       80.55 years   86.55 years   0.41%
Angola                  11,510   2.84%         8.97 persons/km2     4.31%   1.64%   0.16%       46.08 years   50.82 years   12.92%
…                            …   …             …                    …       …       …           …             …             …
Western Sahara         239,333   2.34%         0.90 persons/km2     4.54%   1.66%   -0.54%      47.98 years   50.57 years   13.67%
World            5,995,544,836   1.30%         14.42 persons/km2    2.20%   0.90%   ?           61.00 years   65.00 years   5.60%
Yemen               16,942,230   3.34%         32.09 persons/km2    4.33%   0.99%   0.00%       58.17 years   61.88 years   6.98%
Zambia               9,663,535   2.12%         13.05 persons/km2    4.45%   2.26%   0.08%       36.72 years   37.21 years   9.19%
Zimbabwe            11,163,160   1.02%         28.87 persons/km2    3.06%   2.04%   ?           38.77 years   38.94 years   6.12%
Table understanding:
Information useful for normalization
 Captions – in the vicinity of the table (above,
  below, etc.)
 Footnotes – on annotated column labels or
  data cells
 Embedded information – in rows, columns
  or cells (e.g., $, %, (1,000), billions)
 Links to other views of the table, possibly
  with new information
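One way embedded information such as "(1,000)" or "billions" in a column label might be used to expand cell values is sketched here. The helper names `header_scale` and `expand`, the `SCALE_WORDS` table, and the sample labels are hypothetical and not taken from the slides.

```python
import re

# Assumed scale words; real tables may use others.
SCALE_WORDS = {
    "thousands": 1_000,
    "millions": 1_000_000,
    "billions": 1_000_000_000,
}


def header_scale(header: str) -> int:
    """Find a multiplier embedded in a column label, else return 1."""
    # A parenthesized number such as "(1,000)".
    match = re.search(r"\((\d[\d,]*)\)", header)
    if match:
        return int(match.group(1).replace(",", ""))
    # A scale word such as "billions".
    lowered = header.lower()
    for word, factor in SCALE_WORDS.items():
        if word in lowered:
            return factor
    return 1


def expand(value: str, header: str) -> float:
    """Expand a cell value using the scale found in its column label."""
    return float(value.replace(",", "")) * header_scale(header)
```

For example, `expand("25,825", "Population (1,000)")` applies the factor 1,000 from the label to the value 25,825.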
What is table understanding?
    Normalize table
    Take a table as input and produce standard records in the form of
     attribute-value pairs as output
    Discover constraints among columns
    Understand the data values

Constraints (left-most column is the primary key):
    {has(Country, GDP/PPP), has(Country, GDP/PPP Per Capita),
     has(Country, Real-growth rate*), has(Country, Inflation*)}

Record produced for the first row:
    {<Country: Afghanistan>, <GDP/PPP: $21,000,000,000>,
     <GDP/PPP per capita: $800>, <Real-growth rate: ?>, <Inflation: ?>}

Country                         GDP/PPP    GDP/PPP   Real-Growth   Inflation
                                        Per Capita   Rate
Afghanistan             $21,000,000,000       $800   ?             ?
Albania                 $13,200,000,000     $3,800   7.3%          3.0%
Algeria                $177,000,000,000     $5,600   3.8%          3.0%
Andorra                  $1,300,000,000    $19,000   3.8%          4.3%
Angola                  $13,300,000,000     $1,330   5.4%          110.0%
Antigua and Barbuda        $674,000,000    $10,000   3.5%          0.4%
…                                     …          …   …             …

Data values:
    Country names (from data frame)
    Dollar amount (from data frame)
    Percentage (from data frame)
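The steps on this slide can be sketched roughly as follows: turn rows into attribute-value records, check that the left-most column is a key, and list has(...) constraints. The function names are hypothetical, and the sketch assumes that the asterisk in the constraints marks attributes with unknown ("?") values; the slides do not state this explicitly.

```python
from typing import Dict, List

header = ["Country", "GDP/PPP", "GDP/PPP Per Capita", "Real-growth rate", "Inflation"]
rows = [
    ["Afghanistan", "$21,000,000,000", "$800", "?", "?"],
    ["Albania", "$13,200,000,000", "$3,800", "7.3%", "3.0%"],
    ["Algeria", "$177,000,000,000", "$5,600", "3.8%", "3.0%"],
]


def to_records(header: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """One attribute-value record per table row."""
    return [dict(zip(header, row)) for row in rows]


def is_key(rows: List[List[str]], col: int = 0) -> bool:
    """A column is a candidate key if its values are all distinct."""
    values = [row[col] for row in rows]
    return len(set(values)) == len(values)


def has_constraints(header: List[str], rows: List[List[str]], key_col: int = 0) -> List[str]:
    """List has(key, attribute) constraints for every non-key column.

    Assumption: '*' is appended when the column contains unknown values.
    """
    key = header[key_col]
    constraints = []
    for i, attribute in enumerate(header):
        if i == key_col:
            continue
        optional = any(row[i] == "?" for row in rows)
        constraints.append(f"has({key}, {attribute}{'*' if optional else ''})")
    return constraints


records = to_records(header, rows)
constraints = has_constraints(header, rows)
```

With the three sample rows, `is_key(rows)` holds for the Country column, which is only evidence from the data and not a guarantee for the full table.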
Example:
Creating a domain ontology
[Figure: domain ontology diagram]
 Concepts: Name, Geopolitical Entity, Location, Longitude,
  Latitude, Time, Distances, Country, City
 Relationship labels: names, has, Has GMT, Duration between
  Time zones, Latitude and longitude designates location
 Annotations: Includes procedural knowledge; Has associated
  data frames
Example:
Table understanding to mini-ontology generation
Agglomeration           Population   Continent           Country
Tokyo                   31,139,900   Asia                Japan
New York-Philadelphia   30,286,900   The Americas        United States of America
Mexico                  21,233,900   The Americas        Mexico
Seoul                   19,969,100   Asia                Korea (South)
Sao Paulo               18,847,400   The Americas        Brazil
Jakarta                 17,891,000   Asia                Indonesia
Osaka-Kobe-Kyoto        17,621,500   Asia                Japan
…                                …   …                   …
Niigata                    503,500   Asia                Japan
Raurkela                   503,300   Asia                India
Homjel                     502,200   Europe              Belarus
Zunyi                      501,900   Asia                China
Santiago                   501,800   The Americas        Dominican Republic
Pingdingshan               501,500   Asia                China
Fargona                    501,000   Asia                Uzbekistan
Kirov                      500,200   Europe              Russia
Newcastle                  500,000   Australia/Oceania   Australia

[Figure: resulting mini-ontology with the concepts Agglomeration,
 Population, Country and Continent]
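To illustrate how column structure might suggest relationships for a mini-ontology, the sketch below looks for functional dependencies between columns in a few rows of the agglomeration table. This is an assumed technique chosen for illustration, not necessarily how TANGO derives its mini-ontologies, and a dependency found in sample data is only evidence, not proof.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

header = ["Agglomeration", "Population", "Continent", "Country"]
rows = [
    ("Tokyo", "31,139,900", "Asia", "Japan"),
    ("New York-Philadelphia", "30,286,900", "The Americas", "United States of America"),
    ("Mexico", "21,233,900", "The Americas", "Mexico"),
    ("Seoul", "19,969,100", "Asia", "Korea (South)"),
    ("Osaka-Kobe-Kyoto", "17,621,500", "Asia", "Japan"),
]


def determines(rows: Sequence[Sequence[str]], a: int, b: int) -> bool:
    """True if every value in column a maps to exactly one value in column b."""
    seen = {}
    for row in rows:
        # setdefault stores the first b-value seen for this a-value.
        if seen.setdefault(row[a], row[b]) != row[b]:
            return False
    return True


def functional_dependencies(header: List[str], rows) -> List[Tuple[str, str]]:
    """All ordered column pairs (A, B) where A determines B in the data."""
    return [
        (header[a], header[b])
        for a, b in permutations(range(len(header)), 2)
        if determines(rows, a, b)
    ]


fds = functional_dependencies(header, rows)
```

In these rows Country determines Continent but not the reverse, which is the kind of asymmetry a mini-ontology could record.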
Example:
Concept matching to ontology merging
[Figure: merge of the domain ontology with the mini-ontology]
 Domain ontology: Name, Geopolitical Entity, Location,
  Longitude, Latitude, Time, Country, City
 Mini-ontology: Agglomeration, Population, Country, Continent
 Results: Name, Geopolitical Entity, Location, Population,
  Longitude, Latitude (Latitude and longitude designates
  location), Continent, Country, Agglomeration, City
Concept matching
   We use exhaustive concept matching
    techniques to match concepts from
    different mini-ontologies, including:
       Lexical and Natural Language Processing
       Value Similarity
       Value Features
       Data Frame Comparison
       Constraints
Concept Matching
(Lexical & NLP)
   Lexical
       Direct comparisons (substring/superstring)
       WordNet (Synonyms, Word Senses,
        Hypernyms/Hyponyms)
   Natural Language Processing
       Phrases in column headers
       Footnotes (for columns, rows, values)
       Explanations of symbols, rows, columns
       Titles and subtitles
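A minimal sketch of the direct lexical comparison (substring/superstring) between two column labels. The function names and the label normalization are assumptions; WordNet lookups and the NLP steps are not covered here.

```python
import re


def normalize_label(label: str) -> str:
    """Lower-case a label and replace punctuation runs with single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def lexical_match(a: str, b: str) -> str:
    """Classify label a relative to label b.

    Returns "equal", "substring" (a is contained in b),
    "superstring" (a contains b) or "none".
    """
    x, y = normalize_label(a), normalize_label(b)
    if not x or not y:
        return "none"
    if x == y:
        return "equal"
    if x in y:
        return "substring"
    if y in x:
        return "superstring"
    return "none"
```

For instance, `lexical_match("Population", "Population Density")` reports a substring relation, which a matcher could treat as weak evidence that the two concepts are related.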
Concept Matching
(Value Similarity)
 Compute overlap for string values by
  comparing data sets
 Compute overlap for numeric values by
  comparing Gaussian probability
  distributions
 Compute similarity of numeric values
  using regression
Concept Matching
(Value Similarity)
Real-world example

A                 B
Afghanistan       Afghanistan
Albania           Albania
Algeria           Algeria
(In B not in A)   American Samoa
Andorra           (In A not in B)
…                 …
(In B not in A)   World
Yemen             Yemen
Zambia            Zambia
Zimbabwe          Zimbabwe

 Total of 193 cells in A
 Total of 267 cells in B
 77 fields in B not in A
 3 fields in A not in B
 190 total matches
 Proportion of matches with respect to A = 190/193 = 98%
 Proportion of matches with respect to B = 190/267 = 71%
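The overlap figures in this example can be reproduced with plain set operations. The sketch below assumes each column is treated as a set of distinct strings; the function name `overlap` and the short sample lists are illustrative only.

```python
from typing import Dict, Iterable


def overlap(a: Iterable[str], b: Iterable[str]) -> Dict[str, float]:
    """Count matching values and the proportion of matches per column.

    Assumes both columns contain at least one value.
    """
    a_set, b_set = set(a), set(b)
    matches = len(a_set & b_set)
    return {
        "matches": matches,
        "only_a": len(a_set - b_set),
        "only_b": len(b_set - a_set),
        "wrt_a": matches / len(a_set),
        "wrt_b": matches / len(b_set),
    }


column_a = ["Afghanistan", "Albania", "Algeria", "Andorra", "Yemen", "Zambia", "Zimbabwe"]
column_b = ["Afghanistan", "Albania", "Algeria", "American Samoa", "World",
            "Yemen", "Zambia", "Zimbabwe"]
result = overlap(column_a, column_b)

# The proportions quoted on the slide, rounded to whole percent.
slide_wrt_a = round(190 / 193 * 100)
slide_wrt_b = round(190 / 267 * 100)
```

Reporting the proportion in both directions matters because a small column can be almost fully contained in a much larger one.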
Concept Matching
(Value Similarity)
Gaussian PDF

A                 B
31,900,600        31,500,900
30,521,550        30,400,111
25,335,200        25,500,100
(In B not in A)   21,000,900
12,300,555        (In A not in B)
…                 …
(In B not in A)   7,000,000
3,567,203         3,500,050
2,300,531         2,300,000
1,400,112         1,500,000

 Total of 170 cells in A
 Total of 240 cells in B
 50 fields in B not in A
 2 fields in A not in B
 168 total matches
 Proportion of matches with respect to A = 168/170 = 99%
 Proportion of matches with respect to B = 168/240 = 70%
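The slides do not say how the Gaussian comparison is computed. As one possible illustration, the sketch below fits a normal distribution to each numeric column and compares the two with the Bhattacharyya coefficient, which is 1.0 for identical distributions and approaches 0 as they diverge. This choice of measure is an assumption, as is the function name `gaussian_overlap`.

```python
import math
from statistics import mean, variance
from typing import Sequence


def gaussian_overlap(a: Sequence[float], b: Sequence[float]) -> float:
    """Bhattacharyya coefficient of normal distributions fitted to a and b.

    Requires at least two values per column and non-zero variance.
    """
    mu_a, mu_b = mean(a), mean(b)
    var_a, var_b = variance(a), variance(b)
    # sqrt(2 * sigma_a * sigma_b / (sigma_a^2 + sigma_b^2))
    scale = math.sqrt(2 * math.sqrt(var_a * var_b) / (var_a + var_b))
    # exp(-(mu_a - mu_b)^2 / (4 * (sigma_a^2 + sigma_b^2)))
    return scale * math.exp(-((mu_a - mu_b) ** 2) / (4 * (var_a + var_b)))


# Visible values from columns A and B of the example.
column_a = [31_900_600, 30_521_550, 25_335_200, 12_300_555,
            3_567_203, 2_300_531, 1_400_112]
column_b = [31_500_900, 30_400_111, 25_500_100, 21_000_900,
            7_000_000, 3_500_050, 2_300_000, 1_500_000]
similarity = gaussian_overlap(column_a, column_b)
```

Unlike the exact-match overlap used for strings, this compares the overall shape of the two value distributions, so columns such as two population listings can match even when no individual numbers are equal.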
Concept Matching
(Value Features)
   We can also compute similarities from
    value characteristics such as:
       Character/numeric length and ratio
       Mean, variance and standard deviation
        of numeric values
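A rough sketch of how such value features might be computed for one column. The feature names, the cleaning of "$", "%" and "," before parsing, and the use of population variance are assumptions made for illustration.

```python
from statistics import mean, pstdev, pvariance
from typing import Dict, List


def value_features(values: List[str]) -> Dict[str, float]:
    """Simple character and numeric features of a non-empty column."""
    total_chars = sum(len(v) for v in values)
    digit_chars = sum(ch.isdigit() for v in values for ch in v)

    # Keep only the values that parse as numbers after cleaning.
    numbers = []
    for v in values:
        cleaned = v.replace(",", "").lstrip("$").rstrip("%")
        try:
            numbers.append(float(cleaned))
        except ValueError:
            pass

    features = {
        "avg_length": total_chars / len(values),
        "digit_ratio": digit_chars / total_chars if total_chars else 0.0,
    }
    if numbers:
        features["mean"] = mean(numbers)
        features["variance"] = pvariance(numbers)
        features["stdev"] = pstdev(numbers)
    return features


growth = value_features(["7.3%", "3.8%", "3.8%", "5.4%", "3.5%"])
names = value_features(["Afghanistan", "Albania"])
```

Two columns with similar feature vectors (for example short values that are half digits and end in %) are candidates for the same concept even if their labels differ.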
Concept Matching
(Data frames)
 Snippets of real-world knowledge about
  data (type, length, nearby keywords,
  patterns [as in regexps], functional, etc)
 We have used data frames to
       Recognize data types
       Include recognizers for values (dates, times,
        longitude, latitude, countries, cities, etc)
       Provide conversion routines
       Match headers, labels, footnotes and values
       Compose or split columns (e.g., addresses)
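As a hedged illustration of the idea, a data frame can be pictured as a named recognizer (here a regular expression) bundled with a conversion routine. The class `ValueFrame`, the three frames, and their patterns are hypothetical and much simpler than real data frames would be.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ValueFrame:
    """Hypothetical, reduced data frame: a recognizer plus a converter."""
    name: str
    pattern: str
    convert: Callable[[str], float]


def _dollars(text: str) -> float:
    return float(text.replace("$", "").replace(",", ""))


def _percent(text: str) -> float:
    return float(text.rstrip("%"))


def _latitude(text: str) -> float:
    # "33.5 S" -> -33.5; northern latitudes stay positive.
    degrees = float(text[:-2])
    return degrees if text.endswith("N") else -degrees


FRAMES = [
    ValueFrame("dollar amount", r"\$\d{1,3}(,\d{3})*(\.\d+)?", _dollars),
    ValueFrame("percentage", r"-?\d+(\.\d+)?%", _percent),
    ValueFrame("latitude", r"\d{1,2}(\.\d+)? [NS]", _latitude),
]


def recognize(value: str) -> Optional[Tuple[str, float]]:
    """Return (frame name, converted value) for the first matching frame."""
    text = value.strip()
    for frame in FRAMES:
        if re.fullmatch(frame.pattern, text):
            return frame.name, frame.convert(text)
    return None
```

Recognizing a value and converting it in the same place keeps type detection and unit handling consistent across tables.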
Concept Matching
(Constraints)
 Keys in tables (as well as nonkeys)
 Functional relationships
 1-1, 1-*, *-1 or *-* correspondences
 Subset/superset of value sets
 Unknown and null values
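Two of these constraints can be checked directly on the data: the cardinality of the correspondence between two columns and the subset/superset relation between value sets. The sketch below is illustrative; the function names and return labels are assumptions.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def correspondence(pairs: Iterable[Tuple[str, str]]) -> str:
    """Classify value pairs (a, b) as "1-1", "1-*", "*-1" or "*-*".

    "1-*" means one a-value may pair with several b-values;
    "*-1" means several a-values may share one b-value.
    """
    a_to_b = defaultdict(set)
    b_to_a = defaultdict(set)
    for a, b in pairs:
        a_to_b[a].add(b)
        b_to_a[b].add(a)
    left = "*" if any(len(v) > 1 for v in b_to_a.values()) else "1"
    right = "*" if any(len(v) > 1 for v in a_to_b.values()) else "1"
    return f"{left}-{right}"


def value_set_relation(a: Iterable[str], b: Iterable[str]) -> str:
    """Relation of value set a to value set b."""
    a_set, b_set = set(a), set(b)
    if a_set == b_set:
        return "equal"
    if a_set < b_set:
        return "subset"
    if a_set > b_set:
        return "superset"
    return "overlap" if a_set & b_set else "disjoint"


country_continent = [("Japan", "Asia"), ("India", "Asia"),
                     ("China", "Asia"), ("Belarus", "Europe")]
```

Here `correspondence(country_continent)` yields a many-to-one result, since several countries share a continent while each country has a single continent.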
Ontology merging/growing
   Direct merge (no conflicts)
       Use results of matching phase to find similar concepts in
        ontologies (e.g., data value similarities, data frames,
        NLP, etc)
   Conflict resolution
       Interactively identify evidence and counter evidence of
        functional relationships among mini-ontologies using
        constraint resolution
   IDS Interaction with human knowledge engineer
       Issues – identify
       Default strategy – apply
       Suggestions – make
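A very reduced sketch of a direct merge, assuming each ontology is represented as a set of (concept, relationship, concept) triples and that the matching phase has produced a mapping from mini-ontology concepts to existing ones. The representation, the function names, the relationship labels on the mini-ontology, and the "Place Name" to "Name" match are all assumptions for illustration; conflict resolution is not modeled.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def concepts(ontology: Set[Triple]) -> Set[str]:
    """All concept names that appear in an ontology."""
    return {c for subj, _, obj in ontology for c in (subj, obj)}


def direct_merge(base: Set[Triple], mini: Set[Triple],
                 mapping: Dict[str, str]) -> Set[Triple]:
    """Union of both ontologies after renaming matched concepts."""
    merged = set(base)
    for subj, rel, obj in mini:
        merged.add((mapping.get(subj, subj), rel, mapping.get(obj, obj)))
    return merged


def unmatched_concepts(base: Set[Triple], mini: Set[Triple],
                       mapping: Dict[str, str]) -> Set[str]:
    """Concepts the merge would add; candidates to show as issues."""
    renamed = {mapping.get(c, c) for c in concepts(mini)}
    return renamed - concepts(base)


base = {("Name", "names", "Geopolitical Entity"),
        ("Geopolitical Entity", "has", "Location")}
mini = {("Place Name", "names", "Place"),
        ("Place", "has", "Elevation")}
mapping = {"Place Name": "Name"}  # assumed result of the matching phase

merged = direct_merge(base, mini, mapping)
issues = unmatched_concepts(base, mini, mapping)
```

Listing the unmatched concepts separately gives a human knowledge engineer a short set of items to confirm or correct.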
Example:
Another mini-ontology generation
[Figure: mini-ontology for places]
 Concepts: Place, Place Name, Longitude, Latitude, Elevation,
  State, USGS Quad, Country, Area
 Place specializations (⊎): City/town, Lake, Reservoir, Mine
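Example:
Another mini-ontology generation
[Figure: merge of the places mini-ontology with the growing ontology]
 Mini-ontology: Place, Place Name, Longitude, Latitude,
  Elevation, State, USGS Quad, Country, Area;
  specializations (⊎): City/town, Lake, Reservoir, Mine
 Growing ontology: Name, Geopolitical Entity, Location,
  Population, Longitude, Latitude, Time, Continent, Country,
  Agglomeration, City
 Relationship labels: names, has, has GMT, Latitude and
  longitude designates location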
Example:
Concept Mapping to Ontology Merging
[Figure: merged ontology]
 Concepts: Name, Geopolitical Entity, Geopolitical Entity with
  population, Population, Location, Longitude, Latitude, Time,
  Elevation, State, Place, USGS Quad, Country, Area
 Relationship labels: names, has, has GMT, Latitude and
  longitude designates location
 Bottom row: Continent, Country, Agglomeration and the
  specializations (⊎) City/town, Lake, Reservoir, Mine
Future direction
 Start with multiple tables (or URLs) and
  generate mini-ontologies
 Identify most suitable mini-ontologies to
  merge by calculating which tables have
  most overlap of concepts
 Generate multiple domain ontologies
 Integrate with form-based data extraction
  tools (smarter Web search engines)
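Choosing which mini-ontologies to merge first could be sketched as picking the pair with the largest concept overlap. The slides do not name a measure, so the Jaccard coefficient, the function names, and the three concept sets below (taken from the example tables and diagrams) are assumptions.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared concepts divided by all concepts of the two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_pair(ontologies: Dict[str, Set[str]]) -> Tuple[str, str]:
    """Names of the two mini-ontologies with the highest overlap."""
    return max(
        combinations(sorted(ontologies), 2),
        key=lambda pair: jaccard(ontologies[pair[0]], ontologies[pair[1]]),
    )


mini_ontologies = {
    "gdp": {"Country", "GDP/PPP", "GDP/PPP Per Capita",
            "Real-Growth Rate", "Inflation"},
    "agglomerations": {"Agglomeration", "Population", "Continent", "Country"},
    "places": {"Place", "Place Name", "Longitude", "Latitude", "Elevation",
               "State", "USGS Quad", "Country", "Area"},
}
pair = best_pair(mini_ontologies)
```

This compares concept names only; in practice the overlap would presumably come from the concept matching techniques described earlier, not from exact label equality.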

				