
									Ontology construction from text

                       Blaz Fortuna
Outline
       Big picture
       OntoGen
       Future work




    2
    Big picture




3
Vision

            Extracting structured information from text

       What is “text”?
           From single documents to large corpora
                different granularity
       What is “structured information”?
           From topic taxonomies to full-blown ontologies
                different expressivity




    4
Available tools
       Text mining
           … for dealing with large corpora
       Natural Language Processing (NLP)
           … for dealing with sentence level structure
       Machine learning
           … for abstracting structure from data (modeling)
           … inside of many text mining and NLP algorithms
       Visualization
           … for user interactions




    5
The Plan

    [Roadmap figure: a 2D map with granularity on the vertical axis
    (corpus at the bottom, document at the top) and expressiveness on the
    horizontal axis. OntoGen sits at corpus granularity and low
    expressiveness; Semantic Graphs, Template Extraction, and Q&A lie at
    higher granularity and expressiveness.]
 6
    OntoGen




7
OntoGen
       Tool for semi-automatic
        ontology construction from
        large text corpora

       Integrates several text-mining methods
           Clustering
           Active learning
           Classification
           Visualizations

       Publicly available at
           ontogen.ijs.si
                                     [Fortuna, Mladenić, Grobelnik, 2005]
    8
Ontology construction with OntoGen

       Semi-Automatic
           provide suggestions and insights into domain
           user interacts with parameters of methods
           final decisions taken by user

       Data-Driven
           most of the aid provided by the system is based on some
            underlying data
           instances are described by features extracted from the data
            (e.g. word vectors)




    9
Ontology model in OntoGen
    Ontology is a data model representing:
        a set of concepts within a domain
        the relationships between these concepts

    OntoGen models an ontology as a graph/network
     structure consisting of:
        a set of concepts (vertices in a graph),
        a set of instances assigned to particular concepts (data
         records assigned to vertices in a graph),
        a set of relationships connecting concepts (directed edges
         in a graph)
        each instance is described by a set of features
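A minimal sketch of this graph model in Python (the class and field names are illustrative, not OntoGen's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A vertex in the ontology graph."""
    name: str
    instances: list = field(default_factory=list)  # data records assigned to this vertex

@dataclass
class Ontology:
    concepts: dict = field(default_factory=dict)   # concept name -> Concept (vertices)
    relations: list = field(default_factory=list)  # (source, label, target) directed edges

    def add_concept(self, name):
        self.concepts[name] = Concept(name)
        return self.concepts[name]

    def relate(self, source, label, target):
        self.relations.append((source, label, target))

onto = Ontology()
onto.add_concept("Domain")
onto.add_concept("Concept A")
onto.relate("Domain", "has-subconcept", "Concept A")
onto.concepts["Concept A"].instances.append({"features": {"word": 1}})
```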


    10
Example of a Topic Ontology




  11
Instance representation
        Bag of words:
            Vocabulary: {wi | i = 1, …, N }
            Documents are represented with vectors in word space:
             xj = (xj,1, …, xj,N), where xj,i counts (or weights) the
             occurrences of word wi in document dj

        Example:

    Document set:                                         Document vector representation:
       d1 = “Canonical Correlation Analysis”                 x1 = (1, 1, 1, 0, 0, 0)
       d2 = “Numerical Analysis”                             x2 = (0, 0, 1, 1, 0, 0)
       d3 = “Numerical Linear Algebra”                       x3 = (0, 0, 0, 1, 1, 1)

    Vocabulary:
    {“Canonical”, “Correlation”, “Analysis”, “Numerical”, “Linear”, “Algebra”}
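The example above can be reproduced with a short bag-of-words sketch (plain occurrence counts, as on the slide):

```python
def bag_of_words(docs):
    """Build a shared vocabulary and occurrence vectors for a document set."""
    vocab = []
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab.append(word)
    # one vector per document, one component per vocabulary word
    vectors = [[doc.split().count(w) for w in vocab] for doc in docs]
    return vocab, vectors

docs = ["Canonical Correlation Analysis",
        "Numerical Analysis",
        "Numerical Linear Algebra"]
vocab, vectors = bag_of_words(docs)
```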




    12
Basic idea behind OntoGen

    [Figure: a text corpus is transformed into an ontology, with a root
    Domain concept and sub-concepts A, B and C.]
13
Concept discovery – unsupervised

    Clustering based approach
        K-means clustering of the
         instances
        Clusters offered as
         suggestions
        User selects relevant
         suggestions
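A minimal k-means sketch over the word vectors (cosine similarity; OntoGen's actual implementation details may differ):

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iters=20, seed=0):
    """Cluster vectors into k groups; each cluster is a concept suggestion."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iters):
        # assign each instance to the most similar centroid
        assignment = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        # recompute centroids as the mean of their members
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# the three document vectors from the bag-of-words example
vectors = [[1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 0, 0, 1, 1, 1]]
clusters = kmeans(vectors, k=2)
```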




    14
Concept discovery – unsupervised
    Visualization based
        Topic-landscape based
         visualization
            One instance one yellow
             point on the map
            Similar instances appear
             closer together
        User can make a concept
         by selecting a region of the
         map
            Pink points on the map are
             selected instances


    15
Concept discovery – supervised
    Active learning based
     approach
        User enters a query
        System ranks the instances
         according to the query
        User labels instances:
            Yes – belongs to the concept
            No – does not belong to the
             concept
        Once there are enough
         instances, system switches to
         SVM based active learning
        When done, concept added to
         the ontology.
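The core selection step can be sketched as uncertainty sampling; here a plain linear scorer stands in for the SVM (an assumption for illustration, not OntoGen's actual classifier):

```python
def uncertainty_sample(weights, unlabeled):
    """Pick the unlabeled vector whose linear score is closest to the
    decision boundary (score 0) -- the most informative one to label next."""
    score = lambda v: sum(w * x for w, x in zip(weights, v))
    return min(unlabeled, key=lambda v: abs(score(v)))

# toy model: positive weight on the first word; candidates to label next
weights = [1.0, 0.0]
unlabeled = [[2.0, 0.0], [0.1, 1.0], [-3.0, 0.5]]
query = uncertainty_sample(weights, unlabeled)
```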

    16
Concept discovery – supervised

    Classification based
     approach
        Instances are classified into a
         background ontology
            called OntoLight
        Concepts with the most
         instances provided as
         sub-concept suggestions




17
Concept naming – unsupervised
    Automatic extraction of
     keywords for describing the
     concepts
        First approach based on TFIDF
         weights of words
        Second approach based on an
         SVM-based feature selection
         algorithm
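The first (TFIDF) approach can be sketched as:

```python
import math
from collections import Counter

def tfidf_keywords(concept_docs, all_docs, top=3):
    """Rank words in a concept by term frequency inside the concept
    times inverse document frequency over the whole corpus."""
    df = Counter()                 # document frequency per word
    for doc in all_docs:
        df.update(set(doc.split()))
    tf = Counter(w for doc in concept_docs for w in doc.split())
    n = len(all_docs)
    scores = {w: c * math.log(n / df[w]) for w, c in tf.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top]]

corpus = ["Canonical Correlation Analysis",
          "Numerical Analysis",
          "Numerical Linear Algebra"]
# keywords for a concept holding the two "Numerical" documents
keywords = tfidf_keywords(corpus[1:], corpus)
```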




    18
Concept naming – supervised

    Classification based
     approach
        Concept’s instances are classified
         into a background ontology
            called OntoLight
        Names from background
         ontology, with most classified
         instances, are provided as
         suggestions
        Shows what the concept is
         called in some pre-defined
         vocabulary




19
     Concept
     visualization
     Instances are visualized as
     points on 2D map.

     The distance between two
     instances on the map
     corresponds to their
     similarity.

     Characteristic keywords
     are shown for all parts of
     the map.

     User can select groups of
     instances on the map to
     create sub-concepts.




20
     Ontology
     visualization
     Ontology concepts
     visualized as points on the
     2D topic map.

     Topic map generated from a
     set of text documents.




21
Multiple views of the same data
    Simple taxonomy on top of Reuters news articles

    Two different views: one focuses on topics, one focuses on geography

    Each view yields a different taxonomy on the data

    An SVM-based method detects the importance of keywords for each view

    Countries view (example article):
        Lloyd’s CEO questioned in recovery suit in U.S.
        Ronald Sandler, chief executive of Lloyd's of London, on Tuesday
        underwent a second day of court interrogation about …

    Topics view (example article):
        UK takeovers and mergers
        The following are additions and deletions to the takeovers and
        mergers list for the week beginning August 19, as provided by the
        Takeover …
    22
Word weight learning
    The word weight learning method is based on SVM feature selection.
     Besides ranking the words it also assigns them weights based on the
     SVM classifier.

    Notation:
        N – number of documents
        {x1, …, xN} – documents
        C(xi) – set of categories for document xi
        n – number of words
        {w1, …, wn} – word weights
        {nj1, …, njn} – SVM normal vector for the j-th category

    Algorithm:
    1.  Calculate a linear SVM classifier for each category.
    2.  Calculate word weights for each category from the SVM normal
        vectors. The weight for the i-th word and j-th category is:

            w_{i,j} = (1/N) · Σ_{k=1..N} x_{k,i} · n_{j,i}

    3.  Final word weights are calculated separately for each document:

            x_{k,i} = ( Σ_{j ∈ C(x_k)} w_{i,j} ) · TF_i
    23
Relations – preprocessing
    Named-entity profile
        Extracted sentences from articles in which the named entity appears
        Example: Agassi
            Olympic champion Agassi meets of Morocco in the first round.

    Co-occurrence profiles
        Extracted sentences from articles in which two named entities appear together
        Example: Sampras – Agassi
            There will be no repeat of last year's men's final with eighth-ranked Agassi landing in
             Sampras's half of the draw.

    Relationship
        By extracting keywords from co-occurrence profiles we can get a summary of the
         relationship between two named entities.
        Keywords are extracted from the co-occurrence profiles' bag-of-words vectors
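Profile construction is essentially sentence filtering; a sketch:

```python
def entity_profile(sentences, entity):
    """Sentences in which the named entity appears."""
    return [s for s in sentences if entity in s]

def cooccurrence_profile(sentences, e1, e2):
    """Sentences in which both named entities appear together."""
    return [s for s in sentences if e1 in s and e2 in s]

sentences = [
    "Olympic champion Agassi meets of Morocco in the first round.",
    "There will be no repeat of last year's men's final with eighth-ranked "
    "Agassi landing in Sampras's half of the draw.",
]
profile = cooccurrence_profile(sentences, "Sampras", "Agassi")
```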




    24
Relations – example
    Bill Clinton
        Iraq [476]: president, missiles, attacks, Kurdish, northern
        Bob Dole [294]: republican, president, presidential, candidates, poll
        United States [204]: president, Monday, southern, move, election
        White House [146]: president, spokesman, reporters, Friday, campaign
        Iran [74]: president, investment, gas, law, penalize
        Congress [66]: president, calling, billion, republican, democrat
        Chicago [42]: president, conventional, democrat, drug, campaign
        Al Gore [40]: president, vice, bus, tour, election

    Chicago
        Clinton [236]: conventional, democrat, training, day, campaign
        U.S. [164]: trader, markets, purchasers, index, future
        New York [100]: variety, mixed, critical, poll, bulletproof
        Dole [70]: conventional, democrat, campaign, drug, Sunday
        Kansas City [70]: basis, wheat, bushels, fob, red
        Los Angeles [60]: variety, mixed, critical, poll, stg
        Illinois [34]: democrat, state, conventional, trip, mayor
        Chicago Board of Trade [34]: future, deliverable, stocks, bus, reporters
        San Francisco [34]: operations, municipal, full, remain, services
        Boston [32]: fared, comparatively, game, existed, American
    25
Relations – abstraction
    Clustering of named entities using k-means
     clustering
    Relations between clusters are established
     based on the named entities’ co-occurrence
     profiles:
        Let C1 and C2 be two clusters
        Let pij be the co-occurrence profile between
         entities di and dj
        P = {pij | di from C1 and dj from C2}
        The relation is defined by the profile set P
        A summary of the relation is extracted from the
         centroid vector of the profiles in P
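The last two steps can be sketched as taking the centroid of the profile vectors in P and reading off its strongest words (the vocabulary and vectors below are toy values):

```python
def relation_summary(profiles, vocab, top=3):
    """Summarize a relation by the highest-weighted words in the
    centroid of its co-occurrence profile vectors."""
    n = len(vocab)
    centroid = [sum(p[i] for p in profiles) / len(profiles) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: -centroid[i])
    return [vocab[i] for i in ranked[:top]]

vocab = ["peace", "war", "trade", "sport"]
profiles = [[1, 2, 0, 0], [3, 2, 0, 1]]   # toy bag-of-words profile vectors
summary = relation_summary(profiles, vocab, top=2)
```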





    26
Relations – example
    Example of clusters:
        Cluster 1:
            Name Entities: Bosnia, Bosnian, Sarajevo
            Keywords: serbs, moslems, bosnian, election
        Cluster 2:
            Name Entities: Russia, Britain, Germany, France
            Keywords: meeting, country, government, told
        Cluster 3:
            Name Entities: Washington, United States
            Keywords: spokesman, military, missiles

    Example of relations:
        Cluster 1 vs. Cluster 3:
            Name Entities: U.N., U.S., American, Washington, Bosnia, Turkey,
             Richard Holbrooke, U.N. Security Council, White House
            Keywords: election, serb, war, bosnians, moslem, peace, tribunal,
             police, spokesman, crime
        Cluster 1 vs. Cluster 2:
            Name Entities: NATO, Yugoslavia, Bosnia, Croatia, Serbia, Belgrade,
             Balkan, OSCE, Burns
            Keywords: country, election, state, international, peace, meeting,
             secretary, foreign, talks, member
    27
Relations – example
    [Relation graph figure: the central cluster {Russia, Britain, Germany,
    France, China, EU} (keywords: meeting, country, government, told,
    officials, union, minister, secretary, trade, report) is connected to
    four clusters:
        {Hashimoto, Romano Prodi, Benjamin Netanyahu, Jim Bolger}:
         minister, prime, meeting, foreign, talks, president, peace, visit,
         told, officials
        {Bill Clinton, Jacques Chirac, Suharto, Hosni Mubarak, Leonid
         Kuchma}: president, meeting, visit, talks, leaders, minister,
         secretary, officials, state
        {Supreme Court, U.S. District Court, Simpson, Justice Department}:
         courts, case, year, told, rules, trials, charges, sentenced, law,
         file
        {Tennessee Valley Authority, New Hill, TVA, Florida Power & Light
         Co, St Lucie}: plant, powerful, company, venture, electrical,
         projects, million, joint, province, state]
28
Relations – example
    [Abstracted relation graph: Minister and President both connect to
    Country via "Visit"; Court connects to Country via "Rule"; Power plant
    connects to Country via "Invest".]
29
Evaluation
    First prototype was successfully used:
        Applied in multiple domains:
            business, legislation and digital libraries (SEKT project)
        Users were always domain experts
            with limited knowledge of and experience with ontology
             construction / knowledge engineering
        Feedback from the first trials used as input for the second
         prototype
            the one presented here

    User study performed for the second prototype
        Main impression
            the tool saves time
            it is especially useful when working with large collections
             of documents
        Main disadvantages
            abstraction
            unattractive interface design

    Used in several EU projects
        SWING, TAO, NEON, ECOLEAD, E4, TOOLEAST
    30
From the users
    Many users use the program for exploration
        The New York Times uses it for
            analyzing user comments,
            segmenting website users

    Also used by people from:
        Microsoft
        Honda, Japan
        Siemens Austria
        University of Washington
        University of Melbourne, Victoria, Australia
        FIAT crf, Italy
        University of Haifa, Israel
        Motilal Nehru National Institute of Technology, India
        Slovenian Army
        Shanghai Jiao Tong University, China
        University of Cyprus
        Mehiläinen Medical Center, Finland
        Food Safety Division, Alberta Agriculture, Canada
        University of South Carolina
        National Institute of Telecommunications, Poland
        Katholieke Universiteit Leuven
        University of Amsterdam
        Txt eSolutions, Italy
        Insiel, Italy
        AMI communities (~1500 development engineers)
        Virtuele fabrik
        Avtomobilski grozd
        University of Nova Gorica
        ISOIN (cluster of 1600 companies, suppliers for Airbus)
    31
     Future work




32
The Plan

    [Roadmap figure repeated; see slide 6.]
 33
Move towards bigger granularity
    Semantic graphs
        Extract data points at the
         sentence level
            OntoGen does it at the document
             level
        Based on triplets extracted from
         sentence structure
            Subject
            Predicate
            Object
        Extraction can be done with
            Parsers
            Structured learning
        Triplets from one document can be
         merged into semantic graphs
    Stronger than bag-of-words
    Example application:
        Document summarization
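A toy subject-predicate-object extractor over a POS-tagged sentence (real systems use full parsers or structured learning, as noted above; this flat SVO heuristic is only an illustration):

```python
def extract_triplet(tagged):
    """Naive SVO extraction: first noun = subject, first verb = predicate,
    first noun after the verb = object."""
    subject = predicate = obj = None
    for word, tag in tagged:
        if tag == "NOUN" and subject is None:
            subject = word
        elif tag == "VERB" and predicate is None:
            predicate = word
        elif tag == "NOUN" and predicate is not None and obj is None:
            obj = word
    return (subject, predicate, obj)

triplet = extract_triplet([("Clinton", "NOUN"), ("visits", "VERB"),
                           ("Chicago", "NOUN")])
```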

    34
The Plan

    [Roadmap figure repeated; see slide 6.]
 35
36
The Plan

    [Roadmap figure repeated; see slide 6.]

 37
Template extraction
    Hypothesis:
        People view events through “templates”
            Models of how things evolve, relate
            Use these models to understand, predict


    Goal:
        automatic extraction of such models from texts




    38
Search over triplets
    Triplet extraction was run over the Reuters corpus
        800k news articles from 1996 to 1997




39
Search over triplets




40
Template earthquake
    [Template graph: central node Earthquake, connected to Places (hits),
    Time-period (hits in), Government, Richter scale (measured by,
    registered in), People (kills), and Buildings (collapses).]
41
     Thank you!

         Questions?




42
