Toward Large-Scale Shallow Semantics for Higher-Quality NLP

What is this 'Semantics' in the Semantic Web, and How can You Get It?

            Eduard Hovy
     Information Sciences Institute
    University of Southern California
     www.isi.edu/natural-language
The Knowledge Base of the World…

• We live in the infosphere
• …but it's unstructured, inconsistent, often outdated,
• …in other words…a mess!

  Address        Latitude    Longitude
  642 Penn St    33.923413   -118.409809
  640 Penn St    33.923412   -118.409809
  636 Penn St    33.923412   -118.409809
  604 Palm Ave   33.923414   -118.409809
  610 Palm Ave   33.923414   -118.409810
  645 Sierra St  33.923413   -118.409810
  639 Sierra St  33.923412   -118.409810

Is this the best we can do?
Frank's two Semantic Webs
1. The Semantic Web as data definer:
   • Applies to circumscribed, structured data types:
      – Numbers, lists, tables, inventories, picture annotations…
   • Suitable for constrained, 'context-free' semantics
   • Amenable to OWL, etc. — 'closed' vocabularies and controllable relations

2. The Semantic Web as text enhancer:
   • Applies to open-ended, unstructured information
   • Requires open-ended, 'context-sensitive' semantics
   • Requires what exactly? Where to find it?
Where's the semantics?
• It's in the words: insert standardized symbols for each (? content) word
   – Need: symbols, vocabularies, …ontologies
• It's in the links: create standardized set of links and use (? only) them
   – Need: links, operational semantics, …link interpreters
• It will somehow emerge, by magic, if we just do enough stuff with OWL and RDF
   – Need: formalisms, definitions, operational semantics, …notation interpreters

[Figure: word graph linking terms such as "run", "carry", "shop", "pleased", "teenage", "live", "eat", "nothing" to URIs like <xxx.yy.zzz/ajd8>, <xxx.yy.zzz/ffgh:56>, <xxx.yy.zzz/fff:3>]
NO to controlled vocabulary, says IR!
• 1960s: Cleverdon and the Cranfield aeronautics evaluations of text retrieval engines (Cleverdon 67):
   – Tested algorithms and lists of controlled vocabularies, also all words
   – SURPRISE: all words better than controlled vocabs!
   – …which led to Salton's vector space approach to IR
   – …which led to today's web search engines
• The IR position: forget ontologies and controlled lists…the semantics lies in multi-word combinations!
   – There's no benefit in artificial or controlled languages
   – Multi-word combinations ("kitchen knife") are good enough
   – Build 'language models': frequency distributions of words in corpus/doc (Callan et al. 99; Ponte and Croft 98); see the sketch below

Nonetheless…for Semantic Web uses, we need semantics. But WHAT is it? And how do we obtain it?
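As a concrete illustration of the language-model bullet, here is a minimal sketch in the Ponte-and-Croft spirit: a frequency distribution over words in a document, smoothed against whole-collection counts, used to score queries. This is an assumed simplification for illustration, not the original systems' code.

```python
from collections import Counter

def unigram_lm(doc_tokens, coll_counts, coll_total, lam=0.5):
    """Build P(word | doc): the document's word frequency distribution,
    Jelinek-Mercer smoothed with the whole-collection distribution."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    def prob(word):
        p_doc = doc_counts[word] / doc_len if doc_len else 0.0
        p_coll = coll_counts.get(word, 0) / coll_total
        return lam * p_doc + (1.0 - lam) * p_coll
    return prob

def query_likelihood(query_tokens, prob):
    """Score a document for a query: product of P(word | doc)."""
    score = 1.0
    for w in query_tokens:
        score *= prob(w)
    return score
```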
Toward semantics: Layers of interpretation 1

Layers: syntax / POS / surface

PN PN PRO AUX ADV DT PN P DT PN V P DT N N PUN PRO V V PN DT AJ N N PUN
Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony "we want to make Dubai a new trading center"
Layers of interpretation 2

Adds shallow semantics (word senses in the Ontology; frames and case roles in the Ontology; instances in the Instance Base) and coreference on top of the syntax, POS, and surface layers:

P0:  act: announce1; agent: P1(Sheikh Mohammed); theme: P9; time: present
P9:  act: want3; agent: P6(we); theme: P10
P10: act: make8; theme: P7(Dubai); result: P8(center)

Instances: P1(Sheikh Mohammed), P2(who), P3(Defense Minister), P4(United Arab Emirates), P5(inaug. ceremony), P6(we), P7(Dubai), P8(trading center)

Coreference: P1(Sheikh Mohammed) = P2(who); P2(who) = P3(Defense Minister); P4(United Arab Emirates) = P6(we)

PN PN PRO AUX ADV DT PN P DT PN V P DT N N PUN PRO V V PN DT AJ N N PUN
Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony "we want to make Dubai a new trading center"
Layers of interpretation 3

Adds deep(er) semantics and information structure on top of the layers above:

P0:  act: say-act3; agent: P1(Sheikh); theme: P9; authortime: T1; eventtime: T2 < T1
P9:  state: desire1; experiencer: P1(Sheikh); theme: P10; statetime: T2
P10: act: change-state; theme: P7(Dubai); old-state: ?; new-state: P11; eventtime: T3 > T2
P11: state: essence1; experiencer: P7(Dubai); theme: P8(center); statetime: T4 > T3

Info structure: topic (theme) = [ Sheikh Mohammed, who is also the Defense Minister ] of the United Arab Emirates; rheme/focus = [ announced at the inauguration ceremony "we want to make Dubai a new trading center" ]

(Shallow semantics, coreference, syntax, POS, and surface layers as in Layers 2.)
Layers of interpretation 4

Adds pragmatics and style on top of the layers above:

Style: formality: medium; complexity: medium-high; opinion: neutral
Pragmatics: author: New York Times; medium: paper-print; readingtime: T0 > T1; author-expertise: expert; trust-in-author: high

(Deep(er) semantics, info structure, shallow semantics, coreference, syntax, POS, and surface layers as in Layers 2 and 3.)
Shallow and deep semantics

• She sold him the book / He bought the book from her
      Which symbols?
   (X1 :act Sell :agent She :patient (X1a :type Book) :recip He)
      Which roles?
   (X2a :act Transfer :agent She :patient (X2c :type Book) :recip He)
   (X2b :act Transfer :agent He :patient (X2d :type Money) :recip She)
      How define states and state changes?
• He has a headache / He gets a headache
   (X3a :prop Headache :patient He)    (…?…)
      How handle relations?
   (X4a :type State :object (X4c :type Head :owner He) :state -3)
   (X4b :type StateChange :object X4c :fromstate 0 :tostate -3)
      How handle negation? How handle comparatives?
• Though it's not perfect, democracy is the best system
   (X4 :type Contrast :arg1 (X4a …?…) :arg2 (X4b …?…))
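To make the notation above concrete, a minimal sketch of the shallow-vs-deep contrast for the sell/buy example, with frames as plain Python dicts (an assumed encoding for illustration, not ISI's actual representation):

```python
# Shallow semantics: one frame per surface verb, so "She sold him the
# book" and "He bought the book from her" get different representations.
shallow_sell = {"id": "X1", "act": "Sell", "agent": "She",
                "patient": {"id": "X1a", "type": "Book"}, "recip": "He"}

# Deep(er) semantics: both paraphrases normalize to the same pair of
# Transfer events (goods one way, money the other).
deep_sale = [
    {"id": "X2a", "act": "Transfer", "agent": "She",
     "patient": {"id": "X2c", "type": "Book"}, "recip": "He"},
    {"id": "X2b", "act": "Transfer", "agent": "He",
     "patient": {"id": "X2d", "type": "Money"}, "recip": "She"},
]
```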
Some semantic phenomena

Somewhat easier:
 – Bracketing (scope) of predications
 – Word sense selection (incl. copula)
 – NP structure: genitives, modifiers…
 – Concepts: ontology definition
 – Concept structure (incl. frames and thematic roles)
 – Coreference (entities and events)
 – Pronoun classification (ref, bound, event, generic, other)
 – Identification of events
 – Temporal relations (incl. discourse and aspect)
 – Manner relations
 – Spatial relations
 – Direct quotation and reported speech
 – Opinions and subjectivity

More difficult:
 – Quantifier phrases and numerical expressions
 – Comparatives
 – Coordination
 – Information structure (theme/rheme)
 – Focus
 – Discourse structure
 – Other adverbials (epistemic modals, evidentials)
 – Identification of propositions (modality)
 – Pragmatics/speech acts
 – Polarity/negation
 – Presuppositions
 – Metaphors
Talk overview
 1. Introduction: Semantics and the Semantic Web
 2. Approach: General methodology for building the
    resources
 3. Ontology framework: Terminology ontology as start
    • Creating Omega: recent work on connecting ontologies
 4. Concept level: terms and relations:
    • Learning concepts by clustering
    • Learning and using concept associations
 5. Instance level: instances and more:
    • Harvesting instances from text
    • Harvesting relations
 6. Corpus: manual shallow semantic annotation
    • OntoNotes project
 7. Conclusion
     2. Approach:
  General methodology
for building the resources
What's needed?

• Set of semantic symbols: democracy, eat   [Ontology]
• For each symbol, some kind of definition, or at least rules for its combination and treatment during notation transformations   [Formalism]
• Notational conventions for each phenomenon of meaning: comparatives, time/tense, negation, number, etc.   [Corpus]
• A collection of examples, as training data for learning systems to learn to do the work   [Instance base]
Credo and methodology
• Ontologies (and even concepts) are too complex to build all in one step…
• …so build them bit by bit, testing each new (kind of) addition empirically…
• …and develop appropriate learning techniques for each bit, so you can automate the process…
• …so next time (since there's no ultimate truth) you can build a new one more quickly
Plan: Stepwise accretion of knowledge
• Initial framework:
   – Start with existing (terminological) ontologies as pre-metadata
   – Weave them together
• Build metadata/concepts:
   – Define/extract concept 'cores'
   – Extract/learn inter-concept relationships
   – Extract/learn definitional and other info
• Build (large) data/instance base:
   – Extract instance 'cores'
   – Link into ontology; store in databases
   – Extract more information, guided by …

[Diagram: existing ontologies, plus dictionaries, glossaries, encyclopedias, plus the web, feed into the framework]
Omega ontology: Content and framework

• Concepts: 120,604 concept/term entries [76 MB]:
   – Upper: own; Penman Upper Model (ISI; Bateman et al.)
   – Upper: SUMO (Pease et al.); DOLCE (Guarino et al.); …
   – Middle: WordNet (Princeton; Miller & Fellbaum)
   – Upper & Middle: Mikrokosmos (NMSU; Nirenburg et al.)
   – Middle: 25,000+ noun-noun compounds (ISI; Pantel)
• Lexicon / sense space:
   – 156,142 English words; 33,822 Spanish words
   – 271,243 word senses
• 13,000 frames of verb arg structure with case roles:
   – LCS case roles (Dorr) [6.3 MB]
   – PropBank roleframes (Palmer et al.) [5.3 MB]
   – FrameNet roleframes (Fillmore et al.) [2.8 MB]
   – WordNet verb frames (Fellbaum) [1.8 MB]
• Associated information (not all complete):
   – WordNet subj domains (Magnini & Cavaglia) [1.2 MB]
• Instances [10.1 GB]:
   – 1.1 million persons harvested from text
   – 900,000+ facts harvested from text
   – 5.7 million locations from USGS and NGA
• Framework (over 28 million statements of concepts, relations, & instances):
   – Available in PowerLoom
   – Instances in RDF
   – With database/MySQL
   – Online browser
   – Clustering software
   – Term and ontology alignment software

http://omega.isi.edu
Talk overview
 1. Introduction: Semantics and the Semantic Web
 2. Approach: General methodology for building the
    resources
 3. Ontology framework: Terminology ontology as start
    • Creating Omega: recent work on connecting ontologies
 4. Concept level: terms and relations:
    • Learning concepts by clustering
    • Learning and using concept associations
 5. Instance level: instances and more:
    • Harvesting instances from text
    • Harvesting relations
 6. Corpus: manual shallow semantic annotation
    • OntoNotes project
 7. Conclusion
       3. Framework
Terminology ontology as starting
             point:
 semi-automated alignment and
            merging
      (This work with Andrew Philpot,
   Michael Fleischman, and Jerry Hobbs)
       4a. Concept level:
   Learning terms/concepts
 by clustering web information


(This work by Patrick Pantel, Marco Pennacchiotti,
 and Dekang Lin)
Where/how to find new concepts/terms?
• Potential sources:
   – Existing ontologies (AI efforts, Yahoo!, etc.) and lists (SIC codes, etc.)
   – Manual entry, esp. with reference to foreign-language text (EuroWordNet, IL-Annot, etc.)
   – Dictionaries and thesauri (Webster's, Roget's, etc.)
   – Automated discovery by text clustering (Pantel and Lin, etc.)
• Issues:
   – How large do you want it? — tradeoff size vs. consistency and ease of use
   – How detailed? — tradeoff granularity/domain-specificity vs. portability and wide acceptance (Semantic Web)
   – How language-independent? — tradeoff independence vs. utility for non/shallow-semantic NLP
Clustering By Committee (Pantel and Lin 02)
• CBC clustering procedure:
   – Parse entire corpus using MINIPAR (D. Lin)
   – Define syntactic/POS patterns as features:
      • N-N; N-subj-V; Adj-N; etc.
   – Cluster words, using Pointwise Mutual Information on features:
      pmi(e, f) = log [ P(e, f) / (P(e) · P(f)) ]   (e = word, f = pattern; sketch below)
   – Disambiguate:
      • find cluster centroids: word committee
      • for non-centroid words, match their pattern features to committee words' features; if match, include word in cluster, remove features
      • if no match, then word has remaining features: so try to include in other clusters as well — split ambiguous words' senses
• Complexity: O(n²k) for n words in corpus, k features
www.isi.edu/~pantel/
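A minimal sketch of the PMI step, computing a feature vector per word from (word, pattern-feature) co-occurrence counts; illustrative only, not Pantel and Lin's implementation (which also handles frequency thresholds and committee selection):

```python
import math
from collections import Counter

def pmi_vectors(cooccurrences):
    """cooccurrences: list of (word, feature) events, e.g. from MINIPAR
    dependencies. Returns {word: {feature: pmi}} vectors for clustering."""
    pairs = list(cooccurrences)
    joint = Counter(pairs)
    word_marg = Counter(w for w, _ in pairs)
    feat_marg = Counter(f for _, f in pairs)
    total = len(pairs)
    vectors = {}
    for (w, f), c in joint.items():
        # pmi(e, f) = log P(e, f) / (P(e) * P(f))
        pmi = math.log((c / total) /
                       ((word_marg[w] / total) * (feat_marg[f] / total)))
        vectors.setdefault(w, {})[f] = pmi
    return vectors
```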
OMEGA::Lincoln
                                         Grammatical templates
-V:obj:N 1869 times:
•   {V1662 offer, provide, make} 156, have 108, {V1650 go, take, fly} 51,
    sell 45, {V1754 become, remain, seem} 34, … give 24, {V1647
    oppose, reject, support} 24, buy 21, {V1653 allocate, earmark, owe}
    21, win 20 …
-N:conj:N 536 times:
•   {N719 Toyota, Nissan, BMW} 65, {N257 Cadillac, Buick, Lexus} 59,
    {N549 Philadelphia, Seattle, Chicago} 41, American Continental 20,
    Cadillacs 11, …
-V:by:N 50 times:
•   {V1662 offer, provide, make} 12, own 5, hire 4, target 4, write 3, buy
    2, …
From words to concepts

• How to find a name for a cluster?
   – Given term instances, search for frequently co-occurring terms, using apposition patterns:
      • "the President, Thomas Jefferson, …"
      • "Kobe Bryant, famous basketball star…"
   – Extract terms, check if present in ontology
   – Examples for Lincoln:
      • PRESIDENT(N891)           0.187331
      • BORROWER / THRIFT(N724)   0.166958
      • CAR / DIVISION(N257)      0.137333

• Works ok for nouns, less so for others (regex sketch below)
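A minimal regex sketch of the naming step (hypothetical patterns and names, not the actual ISI extractor): count common-noun labels that appear in apposition with a cluster's instances, then check the winners against the ontology.

```python
import re
from collections import Counter

def label_candidates(instance_names, sentences, top_n=5):
    """Count labels in appositions like 'the president, Thomas Jefferson'
    or 'Kobe Bryant, famous basketball star'."""
    counts = Counter()
    for name in instance_names:
        before = re.compile(r"the ([a-z]+(?: [a-z]+)?), " + re.escape(name))
        after = re.compile(re.escape(name) +
                           r", (?:\w+ )?([a-z]+(?: [a-z]+)?)[,.]")
        for sent in sentences:
            counts.update(m.group(1) for m in before.finditer(sent))
            counts.update(m.group(1) for m in after.finditer(sent))
    return counts.most_common(top_n)

# Example: label_candidates(["Thomas Jefferson"],
#   ["the president, Thomas Jefferson, wrote..."]) -> [("president", 1)]
```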
Problems with clustering

• No text-based clustering is ever perfect…
  – How many concepts are there?
  – How are they arranged? (there is no reason to
    expect that a clustering taxonomy should
    correspond with an ISA hierarchy!)
  – What interrelationships exist between them?
• Clustering is only the start…
Talk overview
 1. Introduction: Semantics and the Semantic Web
 2. Approach: General methodology for building the
    resources
 3. Ontology framework: Terminology ontology as start
    • Creating Omega: recent work on connecting ontologies
 4. Concept level: terms and relations:
    • Learning concepts by clustering
    • Learning and using concept associations
 5. Instance level: instances and more:
    • Harvesting instances from text
    • Harvesting relations
 6. Corpus: manual shallow semantic annotation
    • OntoNotes project
 7. Conclusion
     4b. Concept level:
 Learning and using concept
        associations


(This work with Chin-Yew Lin, Mike Junk,
  Michael Fleischman, and Tom Murray)
Topic signature

Related words in texts show a Poisson distribution: in a large set of texts, topic keywords concentrate around topics, so families of related words appear in 'bursts'. To find a family, compare topical word frequency distributions against global background counts.

Word family built around inter-word relations.
• Def: Head word (or concept), plus set of related words (or concepts), each with strength:
      { Tk, (tk1, wk1), (tk2, wk2), …, (tkn, wkn) }

• Problem: Scriptal co-occurrence, etc. — how to find it?
• Approximate by simple textual term co-occurrence...
Learning signatures

Procedure:
 1. Collect texts, sorted by topic            (need texts, sorted)
 2. Identify families of co-occurring words   (how to count co-occurrence?)
 3. Evaluate their purity                     (how to evaluate?)
 4. Find the words' concepts in the Ontology  (need disambiguator)
 5. Link together the concept signatures
Calculating weights: approximate relatedness using various formulas

tf.idf:  w_jk = tf_jk · idf_j                                   (Hovy & Lin, 1997)
χ²:      w_jk = (tf_jk − m_jk)² / m_jk   if tf_jk > m_jk
         w_jk = 0                        otherwise
   • tf_jk: count of term j in text k ("waiter" often only in some texts)
   • idf_j = log(N / n_j): within-collection frequency ("the" often in all texts); n_j = number of docs with term j, N = total number of documents
   • tf.idf is the best for IR, among 287 methods (Salton & Buckley, 1988)
   • m_jk = (Σ_j tf_jk · Σ_k tf_jk) / Σ_jk tf_jk: mean (expected) count for term j in text k

likelihood ratio λ:  −2 log λ = 2N · I(R; T)                    (Lin & Hovy, 2000)
   (more appropriate for sparse data; −2 log λ is asymptotic to χ²)
   • N = total number of terms in corpus
   • I = mutual information between text relevance R and given term T:
     I(R; T) = H(R) − H(R | T), where H(R) = entropy of terms over relevant texts R and H(R | T) = entropy of term T over relevant and nonrelevant texts

(Code sketch of tf.idf and χ² below.)
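A minimal sketch of the two weighting schemes just defined (illustrative, not Hovy and Lin's original code). Here `tf` maps term j to a dict of counts per text set k:

```python
import math

def tfidf_weights(tf, docs_with_term, n_docs):
    """w_jk = tf_jk * idf_j, with idf_j = log(N / n_j)."""
    return {j: {k: c * math.log(n_docs / docs_with_term[j])
                for k, c in row.items()}
            for j, row in tf.items()}

def chi2_weights(tf):
    """w_jk = (tf_jk - m_jk)^2 / m_jk if tf_jk > m_jk, else 0, where
    m_jk is the expected count from the row and column marginals."""
    term_total = {j: sum(row.values()) for j, row in tf.items()}
    text_total = {}
    for row in tf.values():
        for k, c in row.items():
            text_total[k] = text_total.get(k, 0) + c
    grand = sum(term_total.values())
    weights = {}
    for j, row in tf.items():
        weights[j] = {}
        for k, c in row.items():
            m = term_total[j] * text_total[k] / grand
            weights[j][k] = (c - m) ** 2 / m if c > m else 0.0
    return weights
```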
Early signature study (Hovy & Lin 97)

• Corpus
   – Training set WSJ 1987: 16,137 texts (32 topics)
   – Test set WSJ 1988: 12,906 texts (31 topics)
   – Texts indexed into categories by humans
• Signature data
   – 300 terms each, using tf.idf
   – Word forms: single words, demorphed words, multi-word phrases
• Topic distinctness...
   – Topic hierarchy (e.g., ENV and TEL separate; FIN above BNK and STK)

RANK  ARO        BNK          ENV            TEL
 1    contract   bank         epa            at&t
 2    air_force  thrift       waste          network
 3    aircraft   banking      environmental  fcc
 4    navy       loan         water          cbs
 5    army       mr.          ozone          cable
 6    space      deposit      state          bell
 7    missile    board        incinerator    long-distance
 8    equipment  fslic        agency         telephone
 9    mcdonnell  fed          clean          telecomm.
10    northrop   institution  landfill       mci
11    nasa       federal      hazardous      mr.
12    pentagon   fdic         acid_rain      doctrine
13    defense    volcker      standard       service
14    receive    henkel       federal        news
Evaluating signatures

Solution: perform a text categorization task:
 – create N sets of texts, one per topic,
 – create N topic signatures TS_k,
 – for each new document, create document signature DS_i,
 – compare DS_i against all TS_k; assign document to best match

Match function: vector space similarity measure:
 – cosine similarity, cos θ = (TS_k · DS_i) / (|TS_k| |DS_i|)   (sketch below)

Test 1 (Hovy & Lin, 1997, 1999):
 – Training set: 10 topics; ~3,000 texts (TREC)
 – Contrast set (background): ~3,000 texts
 – Conclusion: tf.idf and χ² signatures work ok but depend on signature length

Test 2 (Lin & Hovy, 2000):
 – 4 topics; 6,194 texts; uni/bi/trigram signatures

[Figure: precision-recall curves, "Average Recall and Precision Trend of Test Set WSJ", for signature lengths from 5 to 300]
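A minimal sketch of the categorization test above: represent each signature as a sparse {term: weight} dict and assign a document to the topic with the highest cosine similarity (illustrative, not the original evaluation code):

```python
import math

def cosine(sig_a, sig_b):
    """Cosine similarity between two sparse signatures {term: weight}."""
    dot = sum(w * sig_b.get(t, 0.0) for t, w in sig_a.items())
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def categorize(doc_sig, topic_sigs):
    """Assign the document to the best-matching topic signature."""
    return max(topic_sigs, key=lambda t: cosine(doc_sig, topic_sigs[t]))
```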
Text pollution on the web

Goal: Create word families (signatures) for each concept in the Ontology. Get texts from Web.
Main problem: text pollution. What's the search term?

<MORTICE, w=33.7982>      <STAR, w=75.1358>        <AIRCRAFT, w=207.998>
<WOODWORKING, w=20.9227>  <ORION, w=55.8937>       <ENGINE, w=178.677>
<TENNON, w=20.9227>       <PYRAMID, w=42.1494>     <WING, w=138.36>
<JOINERY, w=17.7038>      <DNA, w=41.2331>         <PROPELLER, w=122.317>
<WOOD, w=15.8356>         <SOUL, w=31.1539>        <FLY, w=103.187>
<HARDWOOD, w=14.4849>     <IMPLOSION, w=23.8236>   <AIRPLANE, w=98.0431>
<JASON, w=14.4849>        <KHUFU, w=19.3133>       <AVIATION, w=96.5663>
<DOTH, w=12.8755>         <GOLD, w=18.3897>        <FLIGHT, w=85.3079>
<BRASH, w=12.8755>        <RECURSION, w=18.3258>   <AIR, w=80.1996>
<OAK, w=12.8281>          <BELLATRIX, w=17.7038>   <WARBIRDS, w=72.4247>
<WEDGE, w=11.9118>        <OSIRIS, w=17.7038>      <PILOT, w=71.4707>
<FURNITURE, w=10.0792>    <PHI, w=16.4932>         <MPH, w=65.987>
<TOOL, w=9.19486>         <EMBED, w=16.4932>       <CONTROL, w=65.9729>
<SHAFT, w=8.17321>        <MAGNETIC, w=16.4932>    <FUEL, w=62.3078>

Purifying: In later work, used Latent Semantic Analysis (next slide).
Purifying with Latent Semantic Analysis
• Technique used by psychologists to determine basic cognitive conceptual primitives (Deerwester et al., 1990; Landauer et al., 1998).
• Singular Value Decomposition (SVD) used for text categorization, lexical priming, language learning…
• LSA automatically creates collections of items that are correlated or anti-correlated, with strengths:
       ice cream, drowning, sandals → summer
• Each such collection is a 'semantic primitive' in terms of which objects in the world are understood.

• We tried LSA to find the most reliable signatures in a collection, to reduce the number of signatures in the contrast set.
LSA for signatures
• Create matrix A, one signature per column (words × topics).
• Apply SVDPACK to compute the SVD A = U Σ Vᵀ:
   – U: m × n orthonormal matrix of left singular vectors that span the space
   – Vᵀ: n × n orthonormal matrix of right singular vectors
   – Σ: diagonal matrix with exactly rank(A) nonzero singular values; σ₁ > σ₂ > … > σₙ
• Use only the first k of the new concepts: Σ′ = {σ₁, σ₂, …, σₖ}.
• Create matrix A′ out of these k vectors: A′ = U Σ′ Vᵀ ≈ A.
• A′ is a new (words × topics) matrix, with different weights. (numpy sketch below)
Some results with LSA (Hovy and Junk 99)

• Contrast set (for idf and χ²): set of documents on a very different topic, for good idf.
• Partitions: collect documents within each topic set into partitions, for faster processing. /n is a collecting parameter.
• U function: function for creation of LSA matrix.

TREC texts:
Function  Demorph?  Partitions  U function  Recall    Precision
--- Without contrast set ---
tf        no                                0.748447  0.628782
tf        yes                               0.766428  0.737976
tf        yes       10          tf          0.820609  0.880663
tf        yes       20          tf          0.824180  0.882533
tf        yes       30          tf          0.827752  0.884352
--- With contrast set ---
tf.idf    no        10          tf.idf      0.626888  0.681446
tf.idf    no        20          tf.idf      0.635875  0.682134
tf.idf    yes       10          tf.idf      0.718177  0.760925
tf.idf    yes       20          tf.idf      0.715399  0.762961
χ²        no        10          χ²          0.847393  0.841513
χ²        no        20          χ²          0.853436  0.849575
χ²        yes       10          χ²          0.822615  0.828412
χ²        yes       20          χ²          0.839114  0.839055
--- Varying partitions ---
χ²        yes       30/0        χ²          0.912525  0.881494
χ²        yes       30/3        χ²          0.903534  0.879115
χ²        yes       30/6        χ²          0.903611  0.873444
χ²        yes       30/9        χ²          0.899407  0.868053

Results:
 – Demorphing helps.
 – χ² better than tf and tf.idf.
 – LSA improves results, but not dramatically.
Weak semantics: Signature for every concept

Procedure:
1. Create query from Ontology concept (word + defn. words)
2. Retrieve ~5,000 documents (8 web search engines)
3. Purify results (remove duplicates, html, etc.)
4. Extract word family (using tf.idf, χ², LSA, etc.)
5. Purify
6. Compare to siblings and parents in the Ontology

Problem: raw signatures overlap…
   – average parent-child node overlap: ~50%
   – Bakery—Edifice: ~35% …too far: missing generalization
   – Airplane—Aircraft: ~80% …too close?
Remaining problem: web signatures still not pure...
WordNet: In 2002–04, Agirre and students (U of the Basque Country) built signatures for WordNet senses.
Recent work using signatures
• Multi-document summarization (Lin and Hovy, 2002)
   – Create a signature for each set of texts
   – Create IR query from signature terms; use IR to extract sentences
   – (Then filter and reorder sentences into single summary)
   – Performance: DUC-01: tied first; DUC-02: tied second place
• Wordsense disambiguation (Agirre, Ansa, Martinez, Hovy, 2001)
   – Try to use WordNet concepts to collect text sets for signature creation: (word+synonym > def-words > word .AND. synonym .NEAR. def-word > etc…)
   – Built competing signatures for various noun senses: (a) WordNet synonyms; (b) SemCor tagged corpus (χ²); (c) web texts (χ²); (d) WSJ texts (χ²)
   – Performance: Web signatures > random, WordNet baseline
• Email clustering (Murray and Hovy, 2004)
   – Social Network Analysis: cluster emails and create signatures
   – Infer personal expertise, project structure, experts omitted, etc.
   – Corpora: ENRON (240K emails), ISI corpus, NSF eRulemaking corpus
Semantics from signatures
• Assuming we can create signatures and
  use them in some applications…
• How to integrate signatures into an
  ontology?
• How to employ signatures in inheritance,
  classification, inference, and other
  operations?
• How to compose signatures into new
  concepts?
• How to match signatures across…?
Talk overview
 1. Introduction: Semantics and the Semantic Web
 2. Approach: General methodology for building the
    resources
 3. Ontology framework: Terminology ontology as start
    • Creating Omega: recent work on connecting ontologies
 4. Concept level: terms and relations:
    • Learning concepts by clustering
    • Learning and using concept associations
 5. Instance level: instances and more:
    • Harvesting instances from text
    • Harvesting relations
 6. Corpus: manual shallow semantic annotation
    • OntoNotes project
 7. Conclusion
     5a. Instance level:
Harvesting instances from text


   (This work with Michael Fleischman)
Instance extraction++ (Fleischman & Hovy 03)

• Goal: extract all instances from the web
• Method:
   – Download text from web (15GB)
   – Identify named entities (BBN's IdentiFinder (Bikel et al. 93))
   – Extract ones with descriptive phrases (<APOS>, <CN/PN>)
      ("the vacuum manufacturer Horeck" / "Saddam's physician Abdul")
   – Cluster them, and categorize in ontology
• Result: over 900,000 instances
   – Average: 2 mentions per instance, 40+ for George W. Bush
• Evaluation:
   – Tested with 200 "who is X?" questions

[Figure: bar chart "Performance on a Question Answering Task", % correct (partial / correct / incorrect) for a state-of-the-art system vs. the extraction system]
   5b. Instance level:
  Harvesting relations



(This work with Deepak Ravichandran,
  Donghui Feng, and Patrick Pantel)
Shallow patterns for information extraction
• Goal: learn relationship data from the web
   – (when was someone born? Where does he live?)
• Procedure: automatically learn word-level patterns
      When was Mozart born?
      "Mozart (1756–1791)…"
      [ <NAME> ( <BIRTHYEAR> – <DEATHYEAR> ) ]
• Apply patterns to Omega concepts/instances
• Evaluation: test in TREC QA competition
• Main problem: learning patterns
   – (In TREC QA 2001, Soubbotin and Soubbotin got very high score with over 10,000 patterns built by hand)
Learning extraction patterns from the
web
                                              (Ravichandran and Hovy 02)
• Prepare:
   – Select example for target relation: Q term (Mozart) and A term
     (1756)
• Collect data:
   – Submit Q and A terms as queries to a search engine (Altavista)
   – Download top 1000 web documents
• Preprocess:
   – Apply a sentence breaker to the documents
   – Retain only sentences with both Q and A terms
   – Pass retained sentences through suffix tree constructor
• Select and create patterns:
   – Filter each phrase in the suffix tree to retain only those phrases
     that contain both Q and A terms
   – Replace the Q term by the tag "<NAME>" and the A term by the tag "<ANSWER>" (simplified sketch below)
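A much-simplified sketch of the pattern-creation step, with plain substring extraction standing in for the suffix-tree constructor (function and variable names are hypothetical):

```python
from collections import Counter

def learn_patterns(sentences, q_term, a_term, context=5, max_len=60):
    """From sentences containing both the question term (e.g. 'Mozart')
    and the answer term (e.g. '1756'), extract the surrounding text and
    generalize it with <NAME> and <ANSWER> tags."""
    patterns = Counter()
    for sent in sentences:
        if q_term in sent and a_term in sent:
            tagged = sent.replace(q_term, "<NAME>").replace(a_term, "<ANSWER>")
            i, j = tagged.find("<NAME>"), tagged.find("<ANSWER>")
            start = min(i, j)
            end = max(i, j) + (len("<ANSWER>") if j > i else len("<NAME>"))
            span = tagged[max(0, start - context):end + context]
            if len(span) <= max_len:
                patterns[span] += 1
    return patterns.most_common()

# learn_patterns(["Mozart (1756-1791) was a composer."], "Mozart", "1756")
# yields the pattern "<NAME> (<ANSWER>-1791" with count 1.
```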
Some results

BIRTHYEAR:
 1.0  <NAME> ( <ANS> –
 0.85 <NAME> was born on <ANS>
 0.6  <NAME> was born in <ANS>
 …
DEFINITION:
 1.0  <NAME> and related <ANS>s
 1.0  <ANS> ( <NAME> ,
 0.9  as <NAME> , <ANS> and
 …
LOCATION:
 1.0  <ANS>'s <NAME> .
 1.0  regional : <ANS> : <NAME>
 0.9  the <NAME> in <ANS> ,

Testing (TREC-10 questions):
Question type  Num Qs  TREC MRR  Web MRR
BIRTHYEAR         8    0.479     0.688
INVENTOR          6    0.167     0.583
DISCOVERER        4    0.125     0.875
DEFINITION      102    0.345     0.386
WHY-FAMOUS        3    0.667     0.0
LOCATION         16    0.75      …
Regular expressions (Ravichandran et al. 2004)

• New process: learn regular expression patterns

  Surface:  Babe Ruth was born in Baltimore , on February 6 , 1895
  NE tags:  <NAME> … <LOCATION> … <DATE>
  POS:      NNP NNP VBD VBN IN NNP , IN NNP CD

  Surface:  George Herman "Babe" Ruth was born here in 1895
  NE tags:  <NAME> … <DATE>
  POS:      NNP NNP NNP NNP VBD VBN RB IN CD

  Generalized pattern: <NAME> "was born" <?>_IN <DATE>

• Results: over 2 million instances from 15GB corpus
• Complexity: O(y²), for max string length y
• Later work: downloaded and cleaned 1 TB text from web; created 119MB corpus; used for …
Comparing clustering and surface patterns (Ravichandran and Pantel)

• Precision: took random 50 words, each with system's learned superconcepts (top 3 of system); added top 3 from WordNet, 1 human superconcept. Used 2 judges (Kappa = 0.78–0.85)
• Recall: Relative Recall = Recall_Patt / Recall_Co-occ = C_P / C_C
• Relative Recall: Patt up to 52%; Co-Occ up to 44%

Precision (correct+partial) and MRR on TREC-03 definitions:

          Pattern System          Co-occurrence System
Training  Prec   Top-3  MRR       Prec   Top-3  MRR
1.5MB     56.6%  60.0%  60.0%     12.4%  20.0%  15.2%
15MB      57.3%  63.0%  61.0%     23.2%  50.0%  37.3%
150MB     50.7%  56.0%  55.0%     60.6%  78.0%  73.2%
1.5GB     52.6%  51.0%  51.0%     69.7%  93.0%  85.8%
Relation extraction from a small corpus

The challenge: apply RegExp pattern induction to a small corpus (Chemistry textbook) (Pantel and Pennacchiotti 06)

Sample seeds used for each semantic relation and sample outputs from Espresso. The number in parentheses for each relation denotes the total number of seeds.

Relation         SEEDS                               ESPRESSO
Is-a (12)        NaCl :: ionic compounds             Na :: element
                 diborane :: substance               protein :: biopolymer
                 nitrogen :: element                 HCl :: strong acid
                 gold :: precious metal              electromagnetic radiation :: energy
Part-Of (12)     ion :: matter                       oxygen :: air
                 oxygen :: water                     powdered zinc metal :: battery
                 light particle :: gas               atom :: molecule
                 element :: substance                ethylene glycol :: automotive antifreeze
Reaction (13)    magnesium :: oxygen                 hydrogen :: oxygen
                 hydrazine :: water                  Ni :: HCl
                 aluminum metal :: oxygen            carbon dioxide :: methane
                 lithium metal :: fluorine gas       boron :: fluorine
Production (14)  bright flame :: flares              electron :: ions
                 hydrogen :: solid metal hydrides    glycerin :: nitroglycerin
                 ammonia :: nitric oxide             kidneys :: kidney stones
                 copper :: brown gas                 ions :: charge
Espresso procedure (Pantel and Pennacchiotti 06)

• Phase 1: Pattern Extraction, like Ravichandran, using MI:
      pmi(i, p) = log [ |x, p, y| / (|x, *, y| · |*, p, *|) ]
   – Measure reliability of each pattern based on an approximation of pattern recall:
      r_π(p) = [ Σ_{i ∈ I} (pmi(i, p) / max_pmi) · r_ι(i) ] / |I|
• Phase 2: Instance Extraction
   – Instantiate all patterns to extract all possible instances
   – Identify generic patterns using Google redundancy check with previously accepted patterns
   – Measure reliability of each instance (sketch below):
      r_ι(i) = [ Σ_{p ∈ P} (pmi(i, p) / max_pmi) · r_π(p) ] / |P|
   – Select top-K instances
• Phase 3: Instance Expansion (if too few instances extracted in phase 2):
   – Syntactic: drop nominal modifiers: proton is-a small particle → proton is-a particle
   – WordNet: expand using hypernyms: hydrogen is-a element → nitrogen is-a element
   – Web: apply patterns to the Web to extract additional instances
• Phase 4: Axiomatization (transform relations into axioms in HNF form)
      e.g., R is-a S becomes R(x) → S(x)
      e.g., R part-of S becomes (∀x) R(x) → (∃y) [S(y) & part-of(x, y)]
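A minimal sketch of the mutual recursion between the two reliability scores above (illustrative; the real Espresso adds the generic-pattern check, top-K selection, and iteration control):

```python
def pattern_reliability(p, instances, pmi, r_inst, max_pmi):
    """r_pi(p): average over instances of (pmi(i,p)/max_pmi) * r_iota(i)."""
    return sum(pmi.get((i, p), 0.0) / max_pmi * r_inst[i]
               for i in instances) / len(instances)

def instance_reliability(i, patterns, pmi, r_pat, max_pmi):
    """r_iota(i): average over patterns of (pmi(i,p)/max_pmi) * r_pi(p)."""
    return sum(pmi.get((i, p), 0.0) / max_pmi * r_pat[p]
               for p in patterns) / len(patterns)

def bootstrap(instances, patterns, pmi, seeds, rounds=5):
    """Alternate: score patterns from instance scores, then instances
    from pattern scores, starting from seed instances (reliability 1)."""
    r_inst = {i: 1.0 if i in seeds else 0.0 for i in instances}
    max_pmi = max(pmi.values(), default=1.0)
    for _ in range(rounds):
        r_pat = {p: pattern_reliability(p, instances, pmi, r_inst, max_pmi)
                 for p in patterns}
        r_inst = {i: instance_reliability(i, patterns, pmi, r_pat, max_pmi)
                  for i in instances}
    return r_pat, r_inst
```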
IE by pattern (Feng, Ravichandran, Hovy 2005)

[Figure: bar chart "Precisions voted by Pattern's Precision", showing top-1/top-5/top-10 precision (0 to 1) for Birthdate, Birthplace, Deathdate, Deathplace, Spouse, Attribute]

Why not Gorbachev?        …gender
Why not Mrs. Roosevelt?   …period
Why not Maggie Thatcher?  …home?
Which semantics to check?
Talk overview
 1. Introduction: Semantics and the Semantic Web
 2. Approach: General methodology for building the
    resources
 3. Ontology framework: Terminology ontology as start
    • Creating Omega: recent work on connecting ontologies
 4. Concept level: terms and relations:
    • Learning concepts by clustering
    • Learning and using concept associations
 5. Instance level: instances and more:
    • Harvesting instances from text
    • Harvesting relations
 6. Corpus: manual shallow semantic annotation
    • OntoNotes project
 7. Conclusion
        6. OntoNotes:
Creating a Semantic Corpus by
      Manual Annotation

 (This work with Ralph Weischedel (BBN),
 Martha Palmer (U Colorado), Mitch Marcus
     (UPenn), and various colleagues)
Corpus creation by annotation
• Goal: create corpus of (sentence + semantic rep) pairs
• Use: enable machine learning algorithms to do this
• Process: humans add information into sentences (and their parses)
• Recent projects:
   – Penn Treebank (Marcus et al. 99): syntax
   – PropBank (Palmer et al. 03–): verb frames
   – FrameNet (Fillmore et al. 04): noun frames
   – OntoNotes (Weischedel et al. 05–): word senses, coref links, ontology
   – Related efforts: Interlingua Annotation (Dorr et al. 04); I-CAB, Greek… banks; TIGER/SALSA Bank (Pinkal et al. 04–); Prague Dependency Treebank (Hajic et al. 02–); NomBank (Myers et al. 03–)
OntoNotes: large-scale annotation
• Partners: BBN (Weischedel), U of Colorado (Palmer), U of Penn (Marcus), ISI (Hovy)
• Goal: In 4 years, annotate nouns and verbs and corefs in 1 million words of English, Chinese, and Arabic text:
   – Manually provide semantic symbols for nouns, verbs, adjs, advs
   – Manually connect sentence structure in verb and noun frames
   – Manually link anaphoric references
• Validation: inter-annotator agreement of 90%
• Outcomes (2004–):
   – PropBank: verb annotation procedure developed
   – Pilot corpus built, with coref annotation
   – New project started October 2005 (English, Chinese; Arabic in 2006)
• Potential for the near future: semantics 'bank'
   – May energize lots of research on semantic analysis, reps, etc.
   – May enable semantics-based IR, QA, MT, etc.

[Diagram: pipeline Text → Treebank → {Word Sense wrt Ontology, PropBank, Co-reference} → OntoNotes Annotated Text]
OntoNotes representation of literal meaning

Sentence: "The founder of Pakistan's nuclear department, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya, and North Korea."

Entities (instances of Omega ontology concepts):
   E1: Person3. Names: "Abdul Qadeer Khan". Descriptions: "The founder of Pakistan's nuclear department", "he"
   E2: Agency1. Descriptions: "Pakistan's nuclear department"
   E3: Nation2. Names: "Pakistan"
   E4: Know-How4. Descriptions: "nuclear technology"
   E5: Nation2. Names: "Iran"
   E6: Nation2. Names: "Libya"
   E7: Nation2. Names: "North Korea"

Omega frames used: Establish1 (Agent, Org); Subsidiary (SubOrg, SuperOrg); Admit1 (Speaker, Saying); Transfer2 (Agent, Item, Dest)

Propositions:
   P1: :type Person3 :name "Abdul Qadeer Khan"
   P2: :type Person3 :gender male
   P3: :type Know-How4
   P4: :type Nation2 :name "Iran"
   P5: :type Nation2 :name "Libya"
   P6: :type Nation2 :name "N. Korea"
   X0: :act Admit1 :speaker P1 :saying X2
   X1: :act Transfer2 :agent P2 :patient P3 :dest (P4 P5 P6)
   coref P1 P2

(slide credit to M. Marcus and R. Weischedel, 2004)
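A minimal sketch of how these entity and proposition records might be held in code, assuming a simple dataclass encoding; the class and field names are illustrative, not OntoNotes' actual format.

# Illustrative encoding of the slide's entity/proposition records.
# Class and field names are assumptions, not OntoNotes' actual format.
from dataclasses import dataclass, field

@dataclass
class Entity:
    eid: str                                  # e.g. "E1"
    concept: str                              # Omega concept, e.g. "Person3"
    names: list = field(default_factory=list)
    descriptions: list = field(default_factory=list)

@dataclass
class Proposition:
    pid: str                                  # e.g. "P1" or "X1"
    slots: dict = field(default_factory=dict) # role -> filler

e1 = Entity("E1", "Person3", ["Abdul Qadeer Khan"],
            ["The founder of Pakistan's nuclear department", "he"])
x1 = Proposition("X1", {":act": "Transfer2", ":agent": "P2",
                        ":patient": "P3", ":dest": ["P4", "P5", "P6"]})
corefs = {("P1", "P2")}                       # the founder == "he"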
Even so: Many words untouched!
  WSJ1428
    OPEC's ability to produce more petroleum than it can sell is beginning to
    cast a shadow over world oil markets. Output from the Organization of
    Petroleum Exporting Countries is already at a high for the year and most
    member nations are running flat out. But industry and OPEC officials
    agree that a handful of members still have enough unused capacity to
    glut the market and cause an oil-price collapse a few months from now if
    OPEC doesn't soon adopt a new quota system to corral its chronic
    cheaters. As a result, the effort by some oil ministers to get OPEC to
    approve a new permanent production-sharing agreement next month is
    taking on increasing urgency. The organization is scheduled to meet in
    Vienna beginning Nov. 25. So far this year, rising demand for OPEC oil
    and production restraint by some members have kept prices firm despite
    rampant cheating by others. But that could change if demand for
    OPEC's oil softens seasonally early next year as some think may
    happen. OPEC is currently producing more than 22 million barrels a
    day, sharply above its nominal, self-imposed fourth-quarter ceiling of
    20.5 million, according to OPEC and industry officials at an oil
    conference here sponsored by the Oil Daily and the International Herald
    Tribune. At that rate, a majority of OPEC's 13 members have reached
    their output limits, they said.
OntoNotes annotation: "The 90% Solution"
1. Sense creation:
   – An expert creates the meaning options (shallow semantic senses) for verbs, nouns, [adjectives, adverbs] … following PropBank (Palmer et al.)
   – At the same time, the expert creates concepts and organizes/refines the Omega ontology's content and structure
2. Sense annotation proceeds word by word, across documents, using the process developed in PropBank. Annotators manually…
   – see each sentence in the corpus containing the current word (noun, verb, [adjective, adverb]) to be annotated,
   – select the appropriate sense (= ontology concept) for each one,
   – connect the frame structure (for each verb and relational noun).
3. Coreference annotation proceeds document by document. Annotators…
   – connect co-references within each document.
Sense annotation procedure
• The Sense creator first creates senses for a word
• Loop 1:
   – The Manager selects the next noun from the sensed list and assigns annotators
   – A Programmer randomly selects 50 sentences and creates the initial Task File
   – Annotators (at least 2) do the first 50
   – The Manager checks their performance:
      • 90%+ agreement and few or no NoneOfAbove labels: send on to Loop 2
      • Else: the Adjudicator and Manager identify the reasons (sense problem vs. annotator problem) and send the word back to the Sense creator to fix senses and definitions
• Loop 2:
   – Annotators (at least 2) annotate all the remaining sentences
   – The Manager checks their performance:
      • 90%+ agreement and few or no NoneOfAbove labels: send to the Adjudicator to fix the rest
      • Else: the Adjudicator annotates the differences
      • If the Adjudicator agrees with one Annotator 90%+ of the time, ignore the other Annotator's work (assume a bad day for the other); if the Adjudicator agrees with both about equally often, assume bad senses and send the problematic ones back to the Sense creator
[Flowchart: (re-)partition senses and (re-)create definitions and tests (1 person) → test: annotate 50 sentences (2 people) → >90% agreement? If no, return to repartitioning; if yes → annotate all sentences with this word (2 people) → >90% agreement? If no, analyze the disagreement (sense problem: back to the Sense creator; annotator problem: adjudicate); if yes → adjudicate the disagreements (adjudicator) → all sentences with this word annotated.]
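A sketch of the Loop 1 gate in code. The 90% threshold comes from the slides; the agreement measure (fraction of identically labeled sentences) and the NoneOfAbove cutoff are assumptions for illustration.

# Sketch of the Loop 1 gate. The 0.90 threshold is from the slides; the
# raw-agreement measure and the NoneOfAbove cutoff are our assumptions.
def agreement(labels_a, labels_b):
    """Fraction of sentences on which the two annotators chose the same sense."""
    assert len(labels_a) == len(labels_b) > 0
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def route_after_pilot(labels_a, labels_b, n_none_of_above, cutoff=2):
    if agreement(labels_a, labels_b) >= 0.90 and n_none_of_above <= cutoff:
        return "Loop 2: annotate all remaining sentences"
    return "back to Sense creator: repartition senses, fix definitions"

# 46 of 50 pilot sentences labeled identically -> agreement 0.92 -> Loop 2
print(route_after_pilot(["s1"] * 50, ["s1"] * 46 + ["s2"] * 4, n_none_of_above=1))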
Pre-OntoNotes test: can it be done?
• Annotation process and tools developed and tested in PropBank (Palmer et al.; U Colorado)
• Typical results (10 words of each type, 100 sentences each), across Rounds 1, 2, and 3:

           tagger agreement    # senses         time (min/100 tokens)
           R1    R2    R3      R1   R2   R3     R1   R2   R3
  verbs    .76   .86   .91     4.5  5.2  3.8    30   25   25
  nouns    .71   .85   .95     7.3  5.1  3.3    28   20   15
  adjs     .87   –     .90     2.8  –    5.5    24   –    18

  (by comparison: agreement using WordNet senses is 70%)
Creating the senses
1. Should you create the sense? How many must there be? → Use the 90% rule to limit the degree of delicacy
2. Is the term definition adequate? → See whether annotators can agree
3. Where should the term go relative to the other terms? (its species) → Perform manual insertion
4. What is unique/different about this term? (its differentia(e)) → After manual creation, get annotator feedback

How to do this systematically? We developed a method of graduated refinement: creating sense 'treelets' with differentiae at each level, as sketched below.
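A minimal sketch of such a treelet, mirroring the PRICE example on the next slide; the nested-dict encoding and the helper function are illustrative assumptions, not the project's actual tooling.

# Illustrative treelet for PRICE (see next slide); the encoding and the
# helper are our assumptions, not the project's actual tooling.
PRICE_TREELET = {
    "+abstract": {"+quantity": {"+monetary_value": "group 1"}},
    "+physical": {"+activity": {"+complex": {"+effort": "group 2"}}},
}

def path_to_group(tree, group, path=()):
    """Return the chain of differentiae leading to a given sense group."""
    for feature, sub in tree.items():
        if sub == group:
            return path + (feature,)
        if isinstance(sub, dict):
            hit = path_to_group(sub, group, path + (feature,))
            if hit:
                return hit
    return None

print(path_to_group(PRICE_TREELET, "group 2"))
# -> ('+physical', '+activity', '+complex', '+effort')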
Noun and verb sense creation
• Performed by Ann Houston in Boston (who also does verb sense creation)
• Sense groupings created:
   – 4 nouns sense-created per day
   – Maximum so far: "head", with 15 senses
   – The verb procedure creates senses by grouping WordNet senses (PropBank)
   – The noun procedure taxonomizes senses into treelets, with differentiae at each level, for insertion into the ontology

Treelet for PRICE (differentiae, with WordNet sense groups):
   PRICE
   +abstract
      +quantity
         +monetary_value (group 1)
   +physical
      +activity
         +complex (not a single event or action)
            +effort (group 2)

Sense inventory entry (recording differentiae, examples and tests, and WN groups):
<inventory lemma="price-n">
  <sense n="1" type="" name="cost or monetary value of goods or services" group="1">
    <diff> +quantity +monetary_value </diff>
    <comment> PRICE of NP -> NP's[+good/+service] PRICE[+exchange_value] </comment>
    <examples>
      The price of gasoline has soared lately.
      I don't know the prices of these two fur coats.
      The museum would not sell its Dutch Masters collection for any price.
      The cattle thief has a price on his head in Maine.
      They say that every politician has a price.
    </examples>
    <mappings> <wn version="2.1">1,2,4,5,6</wn> <omega> </omega> </mappings>
  </sense>
  <sense n="2" type="" name="sacrifice required to achieve something" group="1">
    <diff> +activity +complex +effort </diff>
    <comment> PRICE[+effort] PREP(of/for)/SCOMP NP[+goal/+result] </comment>
    <examples>
      John has paid a high price for his risky life style.
      …
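A sketch of reading this inventory format with Python's standard library. It assumes a complete, well-formed file (the snippet above is truncated on the slide); tag and attribute names follow the slide.

# Sketch: parse the sense inventory shown above. Assumes a complete,
# well-formed XML file; tag/attribute names follow the slide.
import xml.etree.ElementTree as ET

def load_inventory(path):
    """Return {sense number: (name, differentiae, example sentences)}."""
    root = ET.parse(path).getroot()            # <inventory lemma="price-n">
    senses = {}
    for sense in root.findall("sense"):
        diffs = (sense.findtext("diff") or "").split()
        examples = [ln.strip()
                    for ln in (sense.findtext("examples") or "").splitlines()
                    if ln.strip()]
        senses[sense.get("n")] = (sense.get("name"), diffs, examples)
    return senses

# e.g. load_inventory("price-n.xml")["1"][1] -> ['+quantity', '+monetary_value']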
Word senses: from lexemes to concepts (the Sense Bridge)

  Lexical space     Sense space                         Concept space
  (monolingual)     (multilingual)                      (interlingual)

  "hang"            hang-hanged                         Cause-to-die
                    hang-hung                           Suspend-body

  "call"            summon: "they called them home"     Summon
                    name: "he is called Joe"            Name-Describe
                    phone: "she called her mother"      Phone
                    name2: "he called her a liar"
                    describe: "she called him ugly"

  Open questions: How many concepts? How do senses relate to concepts?
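A sketch of the bridge as plain dictionaries. The sense and concept labels come from the slide; the exact sense-to-concept alignment for "call" is our reading of it, and the encoding itself is an illustrative assumption.

# Sketch of the Sense Bridge; labels from the slide, the alignment and
# encoding are our assumptions.
LEXEME_TO_SENSES = {
    "hang": ["hang-hanged", "hang-hung"],
    "call": ["call-summon", "call-name", "call-phone",
             "call-name2", "call-describe"],
}
SENSE_TO_CONCEPT = {
    "hang-hanged":   "Cause-to-die",
    "hang-hung":     "Suspend-body",
    "call-summon":   "Summon",
    "call-name":     "Name-Describe",
    "call-name2":    "Name-Describe",   # several senses can share a concept
    "call-describe": "Name-Describe",
    "call-phone":    "Phone",
}

def concepts_for(lexeme):
    """All interlingual concepts reachable from a monolingual lexeme."""
    return {SENSE_TO_CONCEPT[s] for s in LEXEME_TO_SENSES.get(lexeme, ())}

print(concepts_for("call"))   # {'Summon', 'Name-Describe', 'Phone'}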
Omega after OntoNotes
• Current Omega:
   – 120,000 concepts; the Middle Model is mostly WordNet
   – Essentially no formally defined features
• Post-OntoNotes Omega:
   – Perhaps 60,000 concepts (the 90% rule)
   – Each concept a sense cluster, defined with features
   – Each concept linked to many example sentences
• What problems do we face?
   – Sense-to-concept compression
   – Cross-sense identification
   – Reconciling multiple languages' senses
   – etc.
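One way to picture the sense-to-concept compression problem: senses that share WordNet mappings (like the <wn> lists in the inventory entries above) are candidates to merge into a single concept. The grouping below is a hypothetical illustration, not Omega's actual compression method.

# Hypothetical illustration of sense-to-concept compression: greedily merge
# senses that share any WordNet sense number. Not Omega's actual method.
def compress(sense_to_wn):
    """Group senses whose WordNet mappings overlap into shared concepts."""
    concepts = []                       # each concept: (sense set, wn pool)
    for sense, wn in sense_to_wn.items():
        for members, pool in concepts:
            if pool & wn:               # shared WordNet sense -> same concept
                members.add(sense)
                pool |= wn
                break
        else:
            concepts.append(({sense}, set(wn)))
    return [members for members, _ in concepts]

print(compress({"price-n.1": {1, 2, 4}, "cost-n.1": {2, 3}, "price-n.2": {7}}))
# -> [{'price-n.1', 'cost-n.1'}, {'price-n.2'}]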
7. Conclusion
Summary: Obtaining semantics
Ingredients:
   – small ontologies and metadata sets
   – concept families (signatures)
   – information from dictionaries, etc.
   – additional information from text and the web
Method:
   1. Pour all the ingredients into a large database
   2. Stir together in the right way
   3. Bake
   4. Evaluate: in IR, QA, MT, and so on!
My recipe for SW research
• Take two large portions of KR
   – one of ontology work,
   – one of reasoning;
• Add a big slice of databases
   – for all the non-text collections,
• and 1 1/2 slices of NL
   – for the text collections, to insert the semantics.
• Mix with a medium pinch of Correctness / Authority /
  Recency validation,
• and add a large helping of Interfaces
   – to make the results presentable.
• Combine, using creativity and good methodology (taste frequently to evaluate!),
• and deliver to everyone.
Extending your ontology…

No ontology is ever static: we need methods to handle change…

…congratulations to the people of Montenegro, whose newly declared independence gives every geopolitical ontology a new nation to add!
Thank you!

				