          Concepts, Ontologies, and Project TANGO
                        Deryle Lonsdale
   BYU Linguistics and English Language
                             lonz@byu.edu




                                            1
Outline
l   NSF projects
l   Semantic Web
    l   Concepts
l   Project TIDIE
    l   Ontologies
l   Project TANGO
    l   Tables
    l   Ontology generation

                              2
Acknowledgements
l   NSF
l   David Embley (BYU CS), Steve Liddle (BYU
    Marriott School) and Yuri Tijerino
l   BYU Data Extraction Group members




                                               3
The National Science Foundation
l   Federal agency, $5.5 billion budget, funds 20% of
    all federally supported basic research conducted
    by America’s colleges and universities
l   7 directorates
    l   Biological Sciences, Computer and Information Science
        and Engineering, Engineering, Geosciences, Mathematics
        and Physical Sciences, Social, Behavioral and Economic
        Sciences, and Education and Human Resources
l   200,000 scientists, engineers, educators and
    students at universities, laboratories and field
    sites
l   10,000 awards/year, 3 years duration (avg.)
                                                             4
The NSF Nifty 50 (general)
l   ACCELERATING, EXPANDING UNIVERSE
l   ANTARCTIC OZONE HOLE RESEARCH
l   ARABIDOPSIS: A PLANT GENOME PROJECT
l   BAR CODES
l   BLACK HOLES CONFIRMED
l   BUCKY BALLS
l   COMPUTER VISUALIZATION TECHNIQUES
l   DATA COMPRESSION TECHNOLOGY
l   DISCOVERY OF PLANETS
l   DOPPLER RADAR
l   EFFECTS OF ACID RAIN
l   EL NIÑO AND LA NIÑA PREDICTIONS
l   FIBER OPTICS
l   GEMINI TELESCOPES
l   HANTAVIRUS IDENTIFICATION
l   DNA FINGERPRINTING
l   MRI: MAGNETIC RESONANCE IMAGING
l   NANOTECHNOLOGY
l   THE NATIONAL OBSERVATORIES
l   OVERCOMING HEAVY METALS
l   OVERCOMING SALT TOXICITY
l   TISSUE ENGINEERING
l   TUMOR DETECTION
l   VOLCANIC ERUPTION DETECTION
l   YELLOW BARRELS
                                                           5
Language-related Nifty 50
l   AMERICAN SIGN LANGUAGE DICTIONARY
    DEVELOPMENT
l   COMPUTER VISUALIZATION TECHNIQUES
l   THE DARCI CARD
l   DATA COMPRESSION TECHNOLOGY
l   THE "EYE CHIP" OR RETINA CHIP
l   THE INTERNET
l   PERSONS WITH DISABILITIES ACCESS
    TO THE WEB
l   PROJECT LISTEN
l   SPEECH RECOGNITION TECHNOLOGY
l   vBNS—VERY HIGH SPEED BACKBONE
     NETWORK SYSTEM
l   WEB BROWSERS
                                        6
Browsing the Semantic Web


[Figure labels: Hypernym, Synonym, The search query, Annotation]




                                      7
  Browsing the Semantic Web


[Figure labels: movie, astronomy, sports]


l   Ranking based on content data and structure
l   Using lexical semantics for similarity search
l   Grouping results by their conceptual relationships
                                                       8
Desirable, not (yet) possible
l   Word sense disambiguation
l   Other types of queries (e.g. services)
    l   What is the cheapest available round-trip
        flight to Cancun the day after finals this
        semester?
    l   Set up an appointment with my optometrist for
        next week.
    l   List available 4-person BYU-approved
        apartments in Orem for under $150/month.
    l   Find me a linguistics job in Tahiti.
                                                        9
     Project TIDIE
Apr 10, 2001 – May 12, 2005




                              10
Overview of TIDIE
l   3-year NSF project at BYU
l   Total amount about $430,000
l   PI David Embley (BYU CS), 4 co-PIs from
    BYU
l   18 grad students, 45 publications
l   Demos, tools, papers, presentations at
    website (www.deg.byu.edu/)


                                               11
Goal of TIDIE
l   Target-Based Independent-of-Document
    Information Extraction
l   Target-based: user specifies what to find
    l   Not just keyword search, but concept-based search
        using an ontology
l   Document independent
    l   Should work even if pages change over time, on new
        documents
l   IE: match, merge, retrieve, format information
l   Present results in a way that lets the user search
    and query them
                                                             12
Document-based IE




                    13
  Recognition and extraction


Car    Year   Make Model      Mileage Price PhoneNr
0001   1989   Subaru SW               $1900 (336)835-8597
0002   1998   Elantra                       (336)526-5444
0003   1994   HONDA ACCORD EX 100K          (336)526-1081

Car    Feature
0001   Auto
0001   AC
0002   Black
0002   4 door
0002   tinted windows
0002   Auto
0002   pb
0002   ps
0002   cruise
0002   am/fm
0002   cassette stereo
0002   a/c
0003   Auto
0003   jade green
0003   gold
                                                                              14
Concepts
l   What drives the matching process for IE
l   Inherent in words, numbers, phrases, text
l   Linguistics: lexical semantics
l   Denotations: entities, attributes
l   Location: relationships
l   Occurrences: constraints



                                                15
Concept matching
l   We use exhaustive concept matching
    techniques to find concepts in documents
    including:
    l   Lexical information (lexicons)
    l   Natural language processing (NLP) techniques
    l   Similarity of values
    l   Features of value
    l   Data frames
    l   Constraints
                                                       16
Lexicons
l   Repositories of enumerable classes of
    lexical information
l   FirstNames, LastNames, USStates,
    ProvoOremApts, CarMakes, Drugs,
    CampGroundFeats, etc.
l   WordNet (synonyms, word senses,
    hypernyms/hyponyms)


                                            17
The data-frame library
 l   Snippets of real-world knowledge about data
     (type, length, nearby keywords, patterns [as in
     regexps], functional relations, etc)
 l   Low-level patterns implemented as regular
     expressions
 l   Match items such as email addresses, phone
     numbers, names, etc.
 Mileage matches [8]
     constant { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; },
              { extract "[1-9]\d{0,2}?,\d{3}";
                context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
              { extract "[1-9]\d{0,2}?,\d{3}";
                context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
              { extract "[1-9]\d{3,6}";
                context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b|les\b)"; },
              { extract "[1-9]\d{3,6}";
                context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
     keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
 end;
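
To make the idea of a data frame concrete, here is a minimal Python sketch that loosely mirrors two of the Mileage rules above. It is not the project's implementation. The names `MILEAGE_RULES`, `find_mileage` and `has_mileage_keyword` are hypothetical, and the lookaround assertions are an assumed stand-in for the `context` expressions.

```python
import re

# Hypothetical, simplified stand-in for two of the Mileage rules above.
# Each entry pairs an extraction regex with a normalizer for the matched text.
MILEAGE_RULES = [
    # "100K" -> "100000"
    (re.compile(r"\b([1-9]\d{0,2})[kK]\b"), lambda s: s + "000"),
    # "7,000" -> "7000"; the lookarounds play the role of the context
    # expression and reject values preceded by "$" or embedded in longer numbers.
    (re.compile(r"(?<![$\d,])([1-9]\d{0,2},\d{3})(?![\d,])"),
     lambda s: s.replace(",", "")),
]

# Keywords that signal a mileage value nearby (taken from the keyword line).
MILEAGE_KEYWORD = re.compile(r"\bmiles\b|\bmi\.|\bmi\b|\bmileage\b", re.IGNORECASE)


def find_mileage(text):
    """Return normalized mileage candidates in order of appearance."""
    hits = []
    for pattern, normalize in MILEAGE_RULES:
        for match in pattern.finditer(text):
            hits.append((match.start(1), normalize(match.group(1))))
    return [value for _, value in sorted(hits)]


def has_mileage_keyword(text):
    """True if the text contains one of the mileage keywords."""
    return MILEAGE_KEYWORD.search(text) is not None
```

For example, `find_mileage("HONDA ACCORD EX 100K")` yields the expanded value `"100000"`, while a dollar amount such as `$11,995` is skipped because of the `$` in front of it.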




                                                                                                18
Isolated concepts are OK, but...
l   We’re also interested in the relations
    between concepts
l   This is often best done graphically
l   Ontology: arrangement of concepts that makes
    their relations and constraints explicit
l   Conceptual modeling: field of CS /
    linguistics that deals with formalizing
    concepts, using such information
l   BYU has its own well-known conceptual
    modeling framework (OSM)
                                                19
Conceptual modeling (OSM)

[Diagram: OSM model centered on Car. Car "has" Year, Make, Model,
Mileage, Price and Feature; PhoneNr "is for" Car; PhoneNr "has"
Extension. Each link is labeled with participation constraints
(0..1, 0..*, 1..*).]



                                                                              20
Ontologies and IE
         Source     Target




                             21
Constant/keyword recognition
    '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles.
    Previous owner heart broken! Asking only $11,995. #1415.
    JERRY SEINER MIDVALE, 566-3800 or 566-3888

                                     Descriptor/String/Position(start/end)

                                     Year|97|2|3
                                     Make|CHEV|5|8
                                     Make|CHEVY|5|9
                                     Model|Cavalier|11|18
                                     Feature|Red|21|23
                                     Feature|5 spd|26|30
                                     Mileage|7,000|38|42
                                     KEYWORD(Mileage)|miles|44|48
                                     Price|11,995|100|105
                                     Mileage|11,995|100|105
                                     PhoneNr|566-3800|136|143
                                     PhoneNr|566-3888|148|155
                                                                22
 Database instance generator
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155

                      insert into Car values(1001, '97', 'CHEVY', 'Cavalier',
                         '7,000', '11,995', '566-3800')
                      insert into CarFeature values(1001, 'Red')
                      insert into CarFeature values(1001, '5 spd')
                                                                           23
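
The slide does not say how the generator chooses among competing candidates (CHEV vs. CHEVY, or 11,995 as both Price and Mileage). The Python sketch below shows one possible set of heuristics, all of which are assumptions: prefer the candidate nearest a keyword for its descriptor, then the earliest one, then the longest string, and never reuse a text span. `SINGLE_VALUED` and `build_record` are hypothetical names.

```python
# Recognizer output copied from the slide: (descriptor, string, start, end).
MATCHES = [
    ("Year", "97", 2, 3),
    ("Make", "CHEV", 5, 8),
    ("Make", "CHEVY", 5, 9),
    ("Model", "Cavalier", 11, 18),
    ("Feature", "Red", 21, 23),
    ("Feature", "5 spd", 26, 30),
    ("Mileage", "7,000", 38, 42),
    ("KEYWORD(Mileage)", "miles", 44, 48),
    ("Price", "11,995", 100, 105),
    ("Mileage", "11,995", 100, 105),
    ("PhoneNr", "566-3800", 136, 143),
    ("PhoneNr", "566-3888", 148, 155),
]

# Object sets for which this sketch keeps a single value per car (assumption).
SINGLE_VALUED = ["Year", "Make", "Model", "Mileage", "Price", "PhoneNr"]


def build_record(matches):
    keywords = {}    # descriptor -> start positions of its keywords
    candidates = {}  # descriptor -> list of (string, start, end)
    for desc, text, start, end in matches:
        if desc.startswith("KEYWORD("):
            keywords.setdefault(desc[len("KEYWORD("):-1], []).append(start)
        else:
            candidates.setdefault(desc, []).append((text, start, end))

    def rank(desc, cand):
        text, start, end = cand
        # Distance to the nearest keyword for this descriptor (0 if it has none).
        distance = min((abs(k - end) for k in keywords.get(desc, [])), default=0)
        return (distance, start, -len(text))

    record, claimed = {}, set()
    # sorted() is stable, so keyword-backed descriptors simply move to the front.
    for desc in sorted(SINGLE_VALUED, key=lambda d: d not in keywords):
        free = [c for c in candidates.get(desc, []) if (c[1], c[2]) not in claimed]
        if free:
            text, start, end = min(free, key=lambda c: rank(desc, c))
            record[desc] = text
            claimed.add((start, end))
    record["Feature"] = [text for text, _, _ in candidates.get("Feature", [])]
    return record


record = build_record(MATCHES)
```

With these heuristics, `record` holds the same values as the insert statements on the slide: CHEVY wins over CHEV, 7,000 is taken as the mileage because it sits next to the keyword "miles", and 11,995 is left for Price.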
Car ads extraction ontology




                              24
Car ads ontology (textual)
    Car [->object];
    Car [0..1] has Year [1..*];
    Car [0..1] has Make [1..*];
    Car [0..1] has Model [1..*];
    Car [0..1] has Mileage [1..*];
    Car [0..*] has Feature [1..*];
    Car [0..1] has Price [1..*];
    PhoneNr [1..*] is for Car [0..*];
    PhoneNr [0..1] has Extension [1..*];
    Year matches [4]
       constant { extract "\d{2}";
                  context "([^\$\d]|^)[4-9]\d[^\d]";
                  substitute "^" -> "19"; },
                  …
      …
    End;
                                                       25
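
As a rough illustration of how the relationship lines could be read by a program, the Python sketch below parses them and derives how many values of each object set a single Car may have. It assumes that the bracketed range after an object set is its participation constraint, and it writes the Model range as `[0..1]`. The names `RELATIONSHIP`, `parse_relationships` and `car_value_limits` are hypothetical.

```python
import re

# The relationship lines of the textual ontology above.
ONTOLOGY_TEXT = """
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
"""

# <ObjectSet> [min..max] <relationship name> <ObjectSet> [min..max];
RELATIONSHIP = re.compile(
    r"(\w+) \[(\d+)\.\.(\d+|\*)\] (.+?) (\w+) \[(\d+)\.\.(\d+|\*)\];"
)


def parse_relationships(text):
    """Parse each relationship line into a small dictionary."""
    rels = []
    for line in text.splitlines():
        m = RELATIONSHIP.fullmatch(line.strip())
        if m:
            left, lmin, lmax, name, right, rmin, rmax = m.groups()
            rels.append({
                "left": left, "left_range": (lmin, lmax),
                "name": name,
                "right": right, "right_range": (rmin, rmax),
            })
    return rels


def car_value_limits(rels):
    """Maximum number of values per Car for each related object set (None = unbounded)."""
    limits = {}
    for r in rels:
        if r["left"] == "Car":
            other, upper = r["right"], r["left_range"][1]
        elif r["right"] == "Car":
            other, upper = r["left"], r["right_range"][1]
        else:
            continue
        limits[other] = None if upper == "*" else int(upper)
    return limits


relationships = parse_relationships(ONTOLOGY_TEXT)
limits = car_value_limits(relationships)
```

Under that reading, a Car has at most one Year, Make, Model, Mileage and Price, and any number of Features and phone numbers.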
A gene ontology




                  26
A genealogy data model




                         27
Finding jobs in linguistics
l   Built ontology for linguistics jobs: what
    defines a linguistics job
l   Data frames and lexicons: language names
    (www.ethnologue.com), subfields of
    linguistics (www.linguistlist.org), tools
    linguists use, programming languages,
    activities, responsibilities, country names
l   Documents: 3500 web pages + emails to me
l   Complete results reported in DLLS 2003

                                              28
Sample query




               29
Sample output




                30
Subfield expertise sought




                            31
Technical skills sought




                          32
Sample observations
l   270 don’t have linguist* (!)
l   Computer/computational background required for
    almost 1/3 (1116)
l   Noticeable amount of headhunting, particularly in
    Seattle, DC areas
l   Often a job title is not even listed (!)
l   Great need for ontologies related to linguistics
    l   job titles
    l   theoretical frameworks, subfields
    l   typical linguist job activities
    l   linguistic research/development venues
                                                    33
An engineering discipline?
l   160 linguistics jobs ending in “engineer”
l   Software development cycle
    l   research e., software design e.
    l   development e., software e.
    l   software quality e., linguistic test e., linguistic quality e.
    l   linguistic support e., user experience e.
    l   presales e., technical sales e.
l   Specific subfields
    l   web site e.
    l   speech e., voice recognition e., speech recognition application e., speech e., ASR
        tuning e., audio e.
    l   dialog e.
l   tools e.
l   AI e., NLP e.
l   knowledge e., ontology e.
l   linguist e., natural language e.
l   staff e.
l   human factors e., user interface e.
                                                                                             34
A recent ontologist job ad
l   Date: Thu, 28 Jul 2005 11:44:40
l   Subject: General Linguistics: Ontologist, Denver, USA

l   Job Rank: Ontologist
l   Specialty Areas: General Linguistics

l   Position Summary: Ontologist

l   This person will be responsible for modifying & editing Ontology structures.

l   Skills:
    Ø    Basic computer skills such as Internet, email, and spreadsheet programs
    Ø    In-depth knowledge of any major industry, such as Health Care, Automotive, Legal, Construction, and so forth
         helpful
    Ø    Superior communication skills, both oral and written. Ability to communicate effectively with reports, peers,
         superiors, and customers essential
    Ø    Travel &/or foreign language experience desired

l   Personal Characteristics:
    Ø    A healthy sense of logic, and a love for details
    Ø    A deep and abiding love of language, and of rule-governed classification systems. This person should be excited
         by the challenge of figuring out the precise place where a word belongs, and be delighted with the prospect of
         performing such tasks as the major part of their job

l   Position Qualifications:
    Ø    -Bachelor's degree, preferably in Linguistics, Library Science, English, or related field
                                                                                                                         35
Another recent ontologist ad
l   Position Summary: Lead Ontologist

l   The Lead Ontologist will be responsible for creating & designing Ontology and
    Ontology structures. This person will be responsible for innovation and general
    Ontology development as Ontology requirements change. They will serve as Team Lead
    on various Ontology projects, and they will assist the Director with certain aspects of
    management, including the development of department culture and standards. They
    will also serve as a liaison between the Director and the rest of the team.

l   Skills:
    Ø   Ability to edit & manipulate text highly desired, using tools such as Emacs and Perl. High level
        programming language experience and SQL also desired
    Ø   Knowledge of Ontology structures, and experience with developing and maintaining such
        structures
    Ø   Ability to assist with Ontology development and use problem-solving skills to overcome
        obstacles
    Ø   Ability to QA own Ontology work, and work of others
    Ø   Ability to lead projects from set-up through to QA
    Ø   Leadership or management experience a plus

l   Position Qualifications:
    Ø   -Bachelor's degree in Linguistics, Library Science, or related field
    Ø   -2-3 years experience in Ontology or related field
l   Application Deadline: Open until filled.
                                                                                                       36
Matching request with ontology

  “Tell me about cruises on San Francisco Bay. I’d like to know
     scheduled times, cost, and the duration of cruises on Friday of
     next week.”




                                                                       37
Building a query
Projection: scheduled times, cost, duration
Selection Constants: San Francisco Bay; Friday, Oct. 29th
Join Path: [diagram]

             π    σ   (   ×         ) = Result
                                                                    38
StartTime                                   Price     Duration   Source

10:45 am, 12:00 pm, 1:15, 2:30, 4:00        $20.00,                 1
                                            $16.00,
                                            $12.00


10:00 am, 10:45 am, 11:15 am, 12:00 pm,     $17.00,   1 Hour       2
12:30 pm, 1:15 pm, 1:45 pm, 2:30 pm, 3:00   $16.00,
pm, 3:45 pm, 4:15 pm, 5:00 pm               $12.00




                                                                          39
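
The projection, selection and join on the "Building a query" slide follow the usual relational-algebra pattern, which can be sketched in a few lines of Python. The two relations below are hypothetical toy data invented for the illustration, not data from the slides, and `cross`, `select` and `project` are assumed helper names.

```python
from itertools import product


def cross(r, s):
    """Cartesian product (×) of two relations given as lists of dicts."""
    return [{**a, **b} for a, b in product(r, s)]


def select(rel, predicate):
    """Selection (σ): keep the rows that satisfy the predicate."""
    return [row for row in rel if predicate(row)]


def project(rel, attrs):
    """Projection (π): keep only the requested attributes."""
    return [{a: row[a] for a in attrs} for row in rel]


# Hypothetical toy relations, for illustration only.
cruise = [
    {"CruiseId": 1, "Location": "San Francisco Bay", "Price": "$20.00", "Duration": "1 Hour"},
    {"CruiseId": 2, "Location": "Lake Tahoe", "Price": "$35.00", "Duration": "2 Hours"},
]
departure = [
    {"DepCruiseId": 1, "Day": "Friday", "StartTime": "10:45 am"},
    {"DepCruiseId": 1, "Day": "Saturday", "StartTime": "12:00 pm"},
    {"DepCruiseId": 2, "Day": "Friday", "StartTime": "9:00 am"},
]

# π[StartTime, Price, Duration] σ[join condition and selection constants] (cruise × departure)
result = project(
    select(
        cross(cruise, departure),
        lambda r: r["CruiseId"] == r["DepCruiseId"]
        and r["Location"] == "San Francisco Bay"
        and r["Day"] == "Friday",
    ),
    ["StartTime", "Price", "Duration"],
)
```

The selection constants recognized in the request become the predicate, and the requested concepts (scheduled times, cost, duration) become the projection list.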
Another example
l   Service Request
          I want to see a dermatologist next week; any day would
          be ok for me, at 4:00 p.m. The dermatologist must be
          within 20 miles from my home and must accept my
          insurance.

l   Match with Task Ontology
    l   Domain Ontology
    l   Process Ontology
l   Complete, Negotiate, Finalize


                                                                   40
Service domain ontology




                          41




                        42
Relevant mini-ontology




                         43
Ontologies: issues
l   Most successful in data-rich, narrow-domain
    applications
l   Ambiguities are problematic; context only
    partially eliminates them
l   Incompleteness: implicit information
l   Commonsense world pragmatics are elusive
l   Knowledge prerequisites are steep
l   Major efforts in creation, maintenance
    l   Must be created by experts
    l   Experts are biased in knowledge, agreement needed
    l   Ontologies continually change; upkeep a massive task
                                                               44
Ontologies: possible solutions
l   Some automation is needed
l   Current automatic generation of ontologies is not
    successful because the ontologies are extracted from
    free-form, unstructured text.
l   A more effective alternative is to extract
    ontologies from structured data on the web
    (tables, charts, etc.)
l   TANGO project
    l   Part 1: Extract tables from the web
    l   Part 2: Define mini-ontologies from tables
    l   Part 3: Merge into growing domain ontology
                                                     45
Project TANGO




                46
Overview
l   Table ANalysis for Generating Ontologies
l   3-year NSF-funded project
l   Joint BYU/RPI project
l   Uses and extends TIDIE concepts,
    ontologies
l   Goal is to process tables, generate
    ontologies, use results for IE


                                               47
Motivation
l   Keyword or link-analysis search is not enough
    to find information in tables
l   Structure in tables can lead to domain
    knowledge which includes concepts,
    relationships and constraints (ontologies)
l   Tables on web created for human use can
    lead to robust domain ontologies


                                                 48
Table understanding
l   What is a table?
l   Why table normalization?
l   What is table understanding?
l   What is mini-ontology generation?




                                        49
What is a table?
l   “…a two-dimensional assembly of cells used
    to present information…”
    l   Lopresti and Nagy
l   Normalized tables (row-column format)
l   Small paper (using OCR) and/or electronic
    tables (marked up) intended for human use



                                             50
?

    Olympus C-750 Ultra Zoom

    Sensor Resolution:   4.2 megapixels
    Optical Zoom:        10 x
    Digital Zoom:        4x
    Installed Memory:    16 MB
    Lens Aperture:       F/8-2.8/3.7
    Focal Length min:    6.3 mm
    Focal Length max:    63.0 mm


                                          51
?

    Olympus C-750 Ultra Zoom

    Sensor Resolution   4.2 megapixels
    Optical Zoom        10 x
    Digital Zoom        4x
    Installed Memory    16 MB
    Lens Aperture       F/8-2.8/3.7
    Focal Length min    6.3 mm
    Focal Length max    63.0 mm


                                         54
Digital Camera

      Olympus C-750 Ultra Zoom

      Sensor Resolution:   4.2 megapixels
      Optical Zoom:        10 x
      Digital Zoom:        4x
      Installed Memory:    16 MB
      Lens Aperture:       F/8-2.8/3.7
      Focal Length min:    6.3 mm
      Focal Length max:    63.0 mm


                                            55
   ?


Flight #    Class   From   Time/Date            To    Time/Date            Stops
Delta 16    Coach   JFK    6:05 pm, 02 01 04    CDG   7:35 am, 03 01 04    0
Delta 119   Coach   CDG    10:20 am, 09 01 04   JFK   1:00 pm, 09 01 04    0



                                                         56
   Airline Itinerary


Flight #    Class   From   Time/Date            To    Time/Date            Stops
Delta 16    Coach   JFK    6:05 pm, 02 01 04    CDG   7:35 am, 03 01 04    0
Delta 119   Coach   CDG    10:20 am, 09 01 04   JFK   1:00 pm, 09 01 04    0



                                                         58
?

    Place       Bonnie Lake
    County      Duchesne
    State       Utah
    Type        Lake
    Elevation   10,000 feet
    USGS Quad   Mirror Lake
    Latitude    40.711ºN
    Longitude   110.876ºW



                              59
Maps



  Place       Bonnie Lake
  County      Duchesne
  State       Utah
  Type        Lake
  Elevation   10,100 feet
  USGS Quad   Mirror Lake
  Latitude    40.711ºN
  Longitude   110.876ºW

                            62
Table normalization
Take any table and produce a standard row-column table in which all data
cells contain expanded values and type information.

Raw table: [figure]

Normalized table:

Country               GDP/PPP             GDP/PPP Per Capita   Real-Growth Rate   Inflation
Afghanistan           $21,000,000,000     $800                 ?                  ?
Albania               $13,200,000,000     $3,800               7.3%               3.0%
Algeria               $177,000,000,000    $5,600               3.8%               3.0%
Andorra               $1,300,000,000      $19,000              3.8%               4.3%
Angola                $13,300,000,000     $1,330               5.4%               110.0%
Antigua and Barbuda   $674,000,000        $10,000              3.5%               0.4%
…                     …                   …                    …                  …
                                                                                   63
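
One way to read "expanded values and type information" is that each raw cell string is turned into a typed value. The Python sketch below is an assumed illustration of that step for the cell formats visible in the table. `expand_cell` is a hypothetical name, and the type labels "Dollar amount" and "Percentage" are borrowed from a later slide.

```python
import re


def expand_cell(raw):
    """Return (type_name, value) for one table cell; unknown cells give (None, None)."""
    text = raw.strip()
    if text in {"?", ""}:
        return (None, None)
    m = re.fullmatch(r"\$([\d,]+)", text)
    if m:
        # "$21,000,000,000" -> 21000000000
        return ("Dollar amount", int(m.group(1).replace(",", "")))
    m = re.fullmatch(r"(-?\d+(?:\.\d+)?)%", text)
    if m:
        # "7.3%" -> 7.3
        return ("Percentage", float(m.group(1)))
    # Anything else is kept as plain text, e.g. a country name.
    return ("Text", text)
```

A fuller version would also handle embedded units and scale factors such as "(1,000)" or "billions".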
Normalizing across hyperlinks




                                64
Normalized table
??              Population      Population Growth Rate  Population Density  Birth Rate  Death Rate  Migration Rate  Life Expectancy Male   Life Expectancy Female  Infant Mortality
Afghanistan     25,824,882      3.95%                   39.88 persons/km2   4.19%       1.70%       1.46%           47.82 years            46.82 years             14.06%
Albania         3,364,571       1.05%                   122.79 persons/km2  2.07%       0.74%       -0.29%          65.92 years            72.33 years             4.29%
Algeria         31,133,486      2.10%                   13.07 persons/km2   2.70%       0.55%       -0.05%          68.07 years            70.46 years             4.38%
American Samoa  63,786          2.64%                   320.53 persons/km2  2.65%       0.40%       0.39%           71.23 years            79.95 years             1.02%
Andorra         65,939          2.24%                   146.53 persons/km2  1.03%       0.55%       1.76%           80.55 years            86.55 years             0.41%
Angola          11,510          2.84%                   8.97 persons/km2    4.31%       1.64%       0.16%           46.08 years            50.82 years             12.92%
…               …               …                       …                   …           …           …               …                      …                       …
Western Sahara  239,333         2.34%                   0.90 persons/km2    4.54%       1.66%       -0.54%          47.98 years            50.57 years             13.67%
World           5,995,544,836   1.30%                   14.42 persons/km2   2.20%       0.90%       ?               61.00 years            65.00 years             5.60%
Yemen           16,942,230      3.34%                   32.09 persons/km2   4.33%       0.99%       0.00%           58.17 years            61.88 years             6.98%
Zambia          9,663,535       2.12%                   13.05 persons/km2   4.45%       2.26%       0.08%           36.72 years            37.21 years             9.19%
Zimbabwe        11,163,160      1.02%                   28.87 persons/km2   3.06%       2.04%       ?               38.77 years            38.94 years             6.12%
                                                                                                                              65
How to understand tables
l   Captions – in vicinity of table (above, below
    etc)
l   Footnotes – on annotated column labels or
    data cells
l   Embedded information – in rows, columns
    or cells {e.g., $, %, (1,000), billions, etc}
l   Links to other views of the table, possibly
    with new information

                                                66
Use of normalized data
l       Take a table as an input and produce standard records in
        the form of attribute-value pairs as output
l       Discover constraints among columns
l       Understand the data values
Left-most column: primary key

{has(Country, GDP/PPP), has(Country, GDP/PPP Per Capita),
 has(Country, Real-growth rate*), has(Country, Inflation*)}

{<Country: Afghanistan>, <GDP/PPP: $21,000,000,000>,
 <GDP/PPP per capita: $800>, <Real-growth rate: ?>, <Inflation: ?>}

    Country               GDP/PPP             GDP/PPP Per Capita   Real-Growth Rate   Inflation
    Afghanistan           $21,000,000,000     $800                 ?                  ?
    Albania               $13,200,000,000     $3,800               7.3%               3.0%
    Algeria               $177,000,000,000    $5,600               3.8%               3.0%
    Andorra               $1,300,000,000      $19,000              3.8%               4.3%
    Angola                $13,300,000,000     $1,330               5.4%               110.0%
    Antigua and Barbuda   $674,000,000        $10,000              3.5%               0.4%
    …                     …                   …                    …                  …

    Country names         Dollar amount                            Percentage
    (from data frame)     (from data frame)                        (from data frame)
                                                                                                       67
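
The step from a normalized table to attribute-value records and `has(...)` relationships can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's algorithm: the left-most column is taken as the key, and a trailing `*` is assumed to mark attributes for which some row has a missing value (`?`). `to_records` and `has_relations` are hypothetical names.

```python
HEADER = ["Country", "GDP/PPP", "GDP/PPP Per Capita", "Real-Growth Rate", "Inflation"]

# First three data rows of the normalized table on the slide.
ROWS = [
    ["Afghanistan", "$21,000,000,000", "$800", "?", "?"],
    ["Albania", "$13,200,000,000", "$3,800", "7.3%", "3.0%"],
    ["Algeria", "$177,000,000,000", "$5,600", "3.8%", "3.0%"],
]


def to_records(header, rows):
    """Turn each row into a record of attribute-value pairs."""
    return [dict(zip(header, row)) for row in rows]


def has_relations(header, rows, missing="?"):
    """Relate the key (left-most column) to every other column."""
    key = header[0]
    relations = []
    for i, attr in enumerate(header[1:], start=1):
        # Assumption: '*' flags attributes with at least one missing value.
        optional = any(row[i] == missing for row in rows)
        relations.append(f"has({key}, {attr}{'*' if optional else ''})")
    return relations


records = to_records(HEADER, ROWS)
relations = has_relations(HEADER, ROWS)
```

Here `records[0]` corresponds to the Afghanistan record shown on the slide, and `relations` reproduces the four `has(...)` entries, with the star on Real-Growth Rate and Inflation.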
Ontology generation overview
[Diagram labels: Sample Documents; Concepts of Interest; Concepts with
Relations; Data extraction ontology]



                                                            68
Example:
Creating a domain ontology
[Diagram labels: Name, Geopolitical Entity, Country, City, Location,
Longitude, Latitude, Distances, Time, GMT, Duration between Time zones;
relationship labels "names" and "has". Annotations: "Latitude and
longitude designates location"; "Includes procedural knowledge";
"Has associated data frames".]

                                                                                                69
Example:
Table understanding to mini-ontology generation
                              Agglomeration        Population   Continent        Country

                              Tokyo                31,139,900   Asia             Japan

                              New York-            30,286,900   The Americas     United States of
                                    Philadelphia                                        America

                              Mexico               21,233,900   The Americas     Mexico

                              Seoul                19,969,100   Asia             Korea (South)

                              Sao Paulo            18,847,400   The Americas     Brazil

                              Jakarta              17,891,000   Asia             Indonesia

                              Osaka-Kobe-Kyoto     17,621,500   Asia             Japan

                              …                            …    …                …

                              Niigata                 503,500   Asia             Japan

                              Raurkela                503,300   Asia             India

                              Homjel                  502,200   Europe           Belarus

                              Zunyi                   501,900   Asia             China

                              Santiago                501,800   The Americas     Dominican
                                                                                       Republic

                              Pingdingshan            501,500   Asia             China

                              Fargona                 501,000   Asia             Uzbekistan

                              Kirov                   500,200   Europe           Russia

                              Newcastle               500,000   Australia/Oceania   Australia

Agglomeration   Population



  Country       Continent
                                                                                                    70
 Example:
 Concept Matching to Ontology Merging
                                  Longitude      Latitude
                                                                      Agglomeration               Population
                     Latitude and longitude
                     designates location
                                                            Merge
Name       Geopolitical Entity          Location                         Country                  Continent
   names                          has
                                                  Has
                                                  GMT
                                                                                         Results

                                          Time

                                                                Population           Longitude     Latitude
       Country             City
                                                                         Latitude and longitude
                                                                         designates location


                                                   Name        Geopolitical Entity         Location




                                              Continent        Country         Agglomeration            City
                                                                                                               71
Ontology merging/growing
l   Direct merge (no conflicts)
    l   Use results of matching phase to find similar concepts
        in ontologies (e.g., data value similarities, data frames,
        NLP, etc.)
l   Conflict resolution
    l   Interactively identify evidence and counterevidence of
        functional relationships among mini-ontologies using
        constraint resolution
l   IDS Interaction with human knowledge engineer
    l   Issues – identify
    l   Default strategy – apply
    l   Suggestions – make

                                                                     72
Example:
Another mini-ontology generation




                     Longitude      Latitude

       Place Name                         Elevation


             State          Place         USGS Quad

          Country            ⊎            Area


         City/town   Lake    Reservoir    Mine

                                                      73
 Example:
 Another mini-ontology generation
              Longitude      Latitude

Place Name                         Elevation


      State          Place         USGS Quad

   Country            ⊎            Area


  City/town   Lake    Reservoir    Mine


                                               Merge
                                                               Population           Longitude      Latitude

                                                                        Latitude and longitude
                                                                        designates location


                                               Name           Geopolitical Entity         Location
                                                      names                         has
                                                                                                   has
                                                                                                   GMT
                                                                                            Time

                                          Continent           Country         Agglomeration              City
                                                                                                         74
Example:
 Concept Mapping to Ontology Merging

                               Population           Longitude       Latitude

                                        Latitude and longitude
                                        designates location


              Name            Geopolitical Entity           Location
                      names                         has
                                                                    has
                                                                    GMT
                                                             Time

                 Geopolitical
                 Entity with
                 population                                                                   Elevation


                                                             State               Place        USGS Quad

                                                           Country                ⊎           Area


Continent   Country           Agglomeration               City/town       Lake    Reservoir   Mine



                                                                                                     75
Recognize Table Information




                                                     Religion
                Population     Albanian            Roman      Shi’a    Sunni
 Country      (July 2001 est.) Orthodox   Muslim   Catholic   Muslim   Muslim   other

 Afghanistan   26,813,057                                    15%        84%      1%
 Albania        3,510,484      20%         70%      30%
                                                                                        76
Construct Mini-Ontology
                                                     Religion
                Population     Albanian            Roman      Shi’a    Sunni
 Country      (July 2001 est.) Orthodox   Muslim   Catholic   Muslim   Muslim   other

 Afghanistan   26,813,057                                    15%        84%      1%
 Albania        3,510,484      20%         70%      30%




                                                                                        77
Discover Mappings




                    78
Merge




        79
Review: the TANGO process
l   Start out with normalized table
l   Generate likely candidates for:
    l   Object Sets
    l   Relationship Sets
    l   Functional Constraints
    l   Inclusion Constraints/Hierarchical Structure
l   Get help from user when needed
l   Choose best candidate for the ontology
                                                       80
Generate concepts




Create list of candidate concepts (usually column names)

                                                           81
Example 1: Generate
Concepts




Determine lexicalization (columns with associated values
are lexical)
                                                           82
Example 1: Generate
Concepts




           Current ontology

                              83
Example 1: Generate
Relationships




l   Decide relationship sets
    l   Exponential number of combinations
    l   Basic assumption: one main concept relates to all others
        (attributes)
    l   Goal: find central column of interest
                                                               84
Example 1: Generate
Relationships




Look for mapping between one column and title of table
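
A minimal sketch of this title-matching heuristic, assuming a plain case-insensitive substring test (the slides do not say how the mapping is actually computed). The function name `central_column` and the table title used below are hypothetical.

```python
def central_column(title, columns):
    """Return the first column header found inside the table title, else None.

    Assumption: a case-insensitive substring match stands in for whatever
    matching technique is really used.
    """
    title_lower = title.lower()
    for column in columns:
        if column.lower() in title_lower:
            return column
    return None


# Hypothetical title; the column names come from Example 2 in these slides.
columns = ["SubFamily", "Group", "SubGroup", "Language"]
match = central_column("Languages of Europe", columns)
print(match)
```

When no header appears in the title, the function returns None, which is the kind of case where help from the user would be needed.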

                                                         85
Example 1: Generate
Relationships




           Current ontology

                              86
Example 1: Generate
Constraints




l   FDs and Participation Constraints
    l   FD definition: X → Y iff (X[i] = X[j]) → (Y[i] = Y[j]) for all
        row indexes i and j.
    l   Unless solid case (two or more same values), only consider
        FDs from central object to attributes
    l   Use heuristics for setting exact participation (0:1, 1:*, etc.)
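
The FD definition above can be checked directly against table rows. The following is a sketch only: the function name `holds_fd` and the dictionary-per-row representation are assumptions, and the sample rows are taken from the agglomeration table shown earlier.

```python
def holds_fd(rows, x, y):
    """Return True if column x functionally determines column y.

    Implements: X -> Y iff X[i] = X[j] implies Y[i] = Y[j] for all rows i, j.
    """
    seen = {}  # maps each x value to the y value first seen with it
    for row in rows:
        key, value = row[x], row[y]
        if key in seen and seen[key] != value:
            return False  # same x value, different y value: FD violated
        seen.setdefault(key, value)
    return True


rows = [
    {"Agglomeration": "Tokyo", "Country": "Japan", "Continent": "Asia"},
    {"Agglomeration": "Osaka-Kobe-Kyoto", "Country": "Japan", "Continent": "Asia"},
    {"Agglomeration": "Mexico", "Country": "Mexico", "Continent": "The Americas"},
    {"Agglomeration": "Sao Paulo", "Country": "Brazil", "Continent": "The Americas"},
]

print(holds_fd(rows, "Country", "Continent"))   # Japan appears twice with the same continent
print(holds_fd(rows, "Continent", "Country"))   # one continent, several countries
```

Country → Continent is the "solid case" mentioned above, because a repeated value (Japan) supports it. A dependency that holds only because no value repeats carries much weaker evidence.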

                                                                         87
Example 1: Generate
Constraints




Numerical values are usually functionally determined by the
column of interest and have a 0:* participation constraint.
                                                            88
Example 1: Generate
Constraints




         Completed mini-ontology

                                   89
Example 2: Generate
Concepts

                      l   SubFamily,
                          Group, and
                          SubGroup are
                          generic types
                      l   Enumerate
                          column values as
                          object sets
                          because there are
                          fewer than 5 divisions
                          (recursively)

                                       90
Example 2: Generate
Relationships
                      l   Found mapping of
                          central column of
                          interest to title
                          (Language)
                      l   Exceptions to basic
                          assumption
                          l   Hierarchy
                              (enumerated
                              object sets)
                          l   Transitive FDs (X
                              → Y, Y → Z,
                              remove X → Z)
                      l   Create ISA
                          hierarchy from
                          table structure
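
A sketch of the transitive-FD rule above (given X → Y and Y → Z, remove X → Z). The pair-of-names representation and the function name `remove_transitive` are assumptions, and the example dependencies are hypothetical ones over the column names on this slide.

```python
def remove_transitive(fds):
    """Drop X -> Z whenever X -> Y and Y -> Z are both present.

    fds is a collection of (determinant, dependent) pairs. This simple
    version assumes the dependencies contain no cycles.
    """
    fds = set(fds)
    redundant = {
        (x, z)
        for (x, y) in fds
        for (y2, z) in fds
        if y == y2 and len({x, y, z}) == 3 and (x, z) in fds
    }
    return fds - redundant


# Hypothetical FDs among the columns of Example 2.
fds = {
    ("Language", "SubGroup"),
    ("SubGroup", "Group"),
    ("Language", "Group"),  # implied by the two above
}
reduced = remove_transitive(fds)
print(sorted(reduced))
```

What remains is a chain of direct dependencies, which matches the shape of the hierarchy being built from the table structure.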
                                            91
Example 2: Generate
Relationships




           Current ontology   92
Example 2: Generate
Hierarchical Constraints
                       l   Assign members to
                           each object set for
                           easy calculation
                       l   Find inclusion
                           dependencies:
                           l   Union – All members
                               of the parent are
                               members of one or
                               more children
                           l   Intersection (less
                               common) – Child
                               members are always
                               in both parents
                           l   Mutual exclusion –
                               Intersection of any
                               two children is
                               empty.
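
Once members are assigned to each object set, the three inclusion dependencies reduce to ordinary set tests. This is a sketch under that assumption; the function names and the sample members are hypothetical, while the object set names (Place, City/town, Lake, Reservoir) come from the earlier mini-ontology.

```python
from itertools import combinations


def is_union(parent, children):
    """Every member of the parent belongs to at least one child."""
    return parent <= set().union(*children)


def is_intersection(child, parents):
    """Every member of the child belongs to all of its parents."""
    return all(child <= parent for parent in parents)


def is_mutually_exclusive(children):
    """No two children share a member."""
    return all(a.isdisjoint(b) for a, b in combinations(children, 2))


# Hypothetical members for illustration only.
place = {"Provo", "Utah Lake", "Deer Creek"}
city_town = {"Provo"}
lake = {"Utah Lake"}
reservoir = {"Deer Creek"}
children = [city_town, lake, reservoir]

print(is_union(place, children), is_mutually_exclusive(children))
```

When both union and mutual exclusion hold, the children partition the parent, which is what the ⊎ symbol in the Place diagram denotes.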

                                               93
Example 2: Generate
Hierarchical Constraints




         Completed mini-ontology   94
Future direction
l   Start with multiple tables (or URLs) and
    generate mini-ontologies
l   Identify most suitable mini-ontologies to
    merge by calculating which tables have
    most overlap of concepts
l   Generate multiple domain ontologies
l   Integrate with form-based data
    extraction tools (smarter Web search
    engines)
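
One way to rank merge candidates by concept overlap is sketched below. Jaccard similarity over concept names is an assumption here, since the slides only say "most overlap", and the function names and dictionary keys are hypothetical. The concept names are taken from the tables shown earlier.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared concepts divided by all concepts across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_pair(mini_ontologies):
    """Return the names of the two mini-ontologies with the highest overlap."""
    return max(
        combinations(mini_ontologies, 2),
        key=lambda pair: jaccard(mini_ontologies[pair[0]], mini_ontologies[pair[1]]),
    )


mini_ontologies = {
    "agglomerations": {"Agglomeration", "Population", "Country", "Continent"},
    "religions": {"Country", "Population", "Religion"},
    "places": {"Place", "Longitude", "Latitude", "Elevation", "State", "Country"},
}
pair = best_pair(mini_ontologies)
print(pair)
```

Matching on exact concept names is the crudest possible test. The matching phase described earlier (data value similarities, data frames, NLP) would replace the name comparison in practice.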
                                                95

								