									Building Wordnets

   Piek Vossen, Irion Technologies

   Starting points
   Semantic framework
   Process overview
   Methodologies in other projects
   Multilinguality
Starting points

   Purpose of the wordnet database:
       education, science, applications
       formal ontology or linguistic ontology
       making inferences or lexical substitution
       conceptual density or large coverage
   Distributed development
   Reproducibility
   Available resources
   Language-specific features
   (Cross-language) compatibility
   Exploit community resources by projecting
    conceptual relations on a target wordnet
Semantic framework

 Differences in wordnet structures
 [Figure: two hierarchy fragments side by side, listed level by level]
   Wordnet1.5: object
       artifact, artefact (a man-made object); natural object (an object occurring naturally)
       block; instrumentality; body
       implement; device
       tool; instrument
       box; spoon; bag
   Dutch Wordnet: voorwerp
       blok {block}; werktuig {tool}; lichaam {body}
       bak {box}; lepel {spoon}; tas {bag}


- Artificial Classes versus Lexicalized Classes:
           instrumentality; natural object
- Lexicalization differences of classes:
           container and artifact (object) are not lexicalized in Dutch
  Linguistic versus conceptual ontologies
 Conceptual ontology:
    A particular level or structuring may be required to achieve a better
   control or performance, or a more compact and coherent structure.
        Introduce artificial levels for concepts which are not lexicalized in a
       language (e.g. instrumentality, hand tool),
        Neglect levels which are lexicalized but not relevant for the purpose
       of the ontology (e.g. tableware, silverware, merchandise).
    What properties can we infer for spoons?
       spoon -> container; artifact; hand tool; object; made of metal or plastic;
       for eating, pouring or cooking
 Linguistic ontology:
    Exactly reflects the relations between all the lexicalized words and
   expressions in a language.
   Valuable information about the lexical capacity of languages: what is the
   available fund of words and expressions in a language.
    What words can be used to name spoons?
        spoon -> object, tableware, silverware, merchandise, cutlery,
  Wordnets as Linguistic Ontologies
Main purpose is to predict what words can be used as substitutes in language,
considering all the lexicalized words in a language.

Classical Substitution Principle:
    Any word that is used to refer to something can be replaced by its synonyms,
    hyperonyms and hyponyms:
         horse  ->  stallion, mare, pony, mammal, animal, being.
    It cannot be referred to by co-hyponyms and co-hyponyms of its hyperonyms:
         horse  -X->  cat, dog, camel, fish, plant, person, object.
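
A minimal sketch of the substitution principle, assuming a toy single-parent hierarchy in which every node is one word (so synonyms, which would share a node, are left out). The dictionary `HYPERONYM` and the function names are illustrative and not part of any wordnet toolkit.

```python
# Toy hyponymy hierarchy (child -> parent); entries are illustrative only.
HYPERONYM = {
    "stallion": "horse", "mare": "horse", "pony": "horse",
    "horse": "mammal", "cat": "mammal", "dog": "mammal", "camel": "mammal",
    "mammal": "animal", "fish": "animal",
    "animal": "being", "plant": "being", "person": "being",
}

def hyperonyms(word):
    """Return all ancestors of `word`, nearest first."""
    chain = []
    while word in HYPERONYM:
        word = HYPERONYM[word]
        chain.append(word)
    return chain

def hyponyms(word):
    """Return all descendants of `word` (transitively)."""
    direct = [child for child, parent in HYPERONYM.items() if parent == word]
    result = list(direct)
    for child in direct:
        result.extend(hyponyms(child))
    return result

def can_substitute(word, candidate):
    """A candidate qualifies if it is a hyperonym or a hyponym of `word`."""
    return candidate in hyperonyms(word) or candidate in hyponyms(word)
```

With this hierarchy, `can_substitute("horse", "pony")` and `can_substitute("horse", "animal")` hold, while co-hyponyms such as `"cat"` and co-hyponyms of hyperonyms such as `"plant"` are rejected.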
Conceptual Distance Measurement:
    Number of hierarchical nodes between words is a measurement of closeness,
    where the level and the local density of nodes are additional factors.
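
As a rough illustration of distance measurement, the sketch below counts the hierarchy edges between two words via their nearest shared hyperonym. It deliberately ignores the level and local density factors mentioned above, and the hierarchy `PARENT` is an invented example.

```python
# Toy hierarchy (child -> parent); an assumption made for illustration.
PARENT = {
    "horse": "mammal", "cat": "mammal",
    "mammal": "animal", "fish": "animal",
    "animal": "being", "plant": "being",
}

def path_to_root(word):
    """Return the word followed by all of its hyperonyms up to the root."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def path_length(a, b):
    """Number of edges between a and b through their nearest shared ancestor.

    Returns None when the two words share no ancestor.
    """
    up_a = path_to_root(a)
    up_b = path_to_root(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None
```

A weighted variant could scale each edge by its depth or by the number of sibling nodes; that refinement is not shown here.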
Define a semantic framework

   Definition of relations
       Diagnostic frames (Cruse 1986)
       Examples and corpus data
   Top-level ontology
       Constraints on relations
       Type consistency
       Large scale validation
Process overview

   Manual encoding and verification
   Automatic extraction:
       definitions
       synonyms
       distribution and similarity patterns in corpora
       defining contexts, e.g. “cats and other pets”
       parallel corpora, e.g. Bible translations
       morphological structure
       bilingual dictionaries
   Encode source and status of data:
       who, when, based on what algorithm, validated, final
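
The sketch below illustrates two of the points above together: extracting hyponymy candidates from a defining context of the form "X and other Y", and recording the source and status of each candidate. The record fields (`encoded_by`, `algorithm`, `created`, `status`) are hypothetical names chosen for this example, and no lemmatization is attempted.

```python
import re
from dataclasses import dataclass
from datetime import date

@dataclass
class Candidate:
    hyponym: str
    hyperonym: str
    encoded_by: str               # who (hypothetical field name)
    algorithm: str                # based on what algorithm
    created: date                 # when
    status: str = "unvalidated"   # later: "validated", "final"

# Defining context "X and other Y" suggests that X is a kind of Y.
AND_OTHER = re.compile(r"\b(\w+) and other (\w+)\b")

def extract_candidates(text, encoded_by):
    """Return one unvalidated candidate per 'X and other Y' match."""
    return [
        Candidate(m.group(1), m.group(2), encoded_by,
                  "and-other pattern", date.today())
        for m in AND_OTHER.finditer(text)
    ]
```

Every candidate starts as `"unvalidated"`, so manually verified data can always be told apart from automatic output.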
Encoding cycle

   1. Collecting data
       Vocabulary: what is the list of words of a language?
       Concepts: what is the list of concepts related to the
   2. Encoding data:
       Defining synsets
       Defining language-internal relations: hyponymy, meronymy,
        roles, causal relations
       Defining equivalence relations to English
       Defining other relations, e.g. Ontology types, Domains
   3. Validation
   4. Go to 1.
Where to start?

   How to get a first selection:
       Words (alphabetic, frequency) -> concepts -> relations
       Concept (hyperonym, domain, semantic feature) -> words
        -> concepts -> relations
   How to get a complete overview of words and
    expressions that belong to a segment of a wordnet?
       Up to 20 hyperonyms for instrumentality: instrument,
        instrumentality, means, tool, device, machine, apparatus,
       iterative process: collect, structure, collect, restructure...
       using multiple sources of evidence
       comparing results, e.g. tricycle is a toy or a vehicle
Synonymy as a basis?

   Synsets are the core unit of a wordnet database
   Synonymy is only vaguely defined: substitution in a
   Synonyms are very hard to detect
   Other relations (role relations, causal relations):
       easier to detect and encode
       easier to validate within a formal framework
       easier to validate in a corpus
   A rich set of relations per concept helps alignment with
    other resources
Diagnostic frames and examples
Agent Involvement
(A/an) X is the one/that who/which does the Y, typically intentionally.
Conditions:               - X is a noun
                          - Y is a verb in the gerundive form
        A teacher is the one who does the teaching intentionally
        {to teach} (Y) INVOLVED_AGENT {teacher} (X)

Patient Involvement
(A/an) X is the one/that who/which undergoes the Y
Conditions:                - X is a noun
                           - Y is a verb in the gerundive form
         A learner is the one who undergoes the learning
         {to learn} (Y) INVOLVED_PATIENT {learner} (X)
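
A diagnostic frame can be treated as a sentence template that an encoder fills in and then judges for acceptability. The sketch below assumes simplified versions of the two frames above (fixed to "the one who"); the names `FRAMES` and `frame_sentence` are invented for this example.

```python
# Simplified sentence templates for the two frames above (an assumption:
# the "one/that who/which" alternation is fixed to "the one who").
FRAMES = {
    "INVOLVED_AGENT":
        "{article} {x} is the one who does the {y_gerund}, typically intentionally",
    "INVOLVED_PATIENT":
        "{article} {x} is the one who undergoes the {y_gerund}",
}

def frame_sentence(relation, x, y_gerund):
    """Fill the frame for `relation` with noun x and the gerund of verb y."""
    article = "An" if x[0].lower() in "aeiou" else "A"
    return FRAMES[relation].format(article=article, x=x, y_gerund=y_gerund)

print(frame_sentence("INVOLVED_AGENT", "teacher", "teaching"))
print(frame_sentence("INVOLVED_PATIENT", "learner", "learning"))
```

If the generated sentence sounds normal to a native speaker, the relation between the two synsets can be encoded; the code only produces the test sentence and does not judge it.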
Diagnostic frames and examples
Result Involvement
  (A/an) X comes into existence as a result of Y, where X is a noun
  and Y is a verb in the gerundive form and a hyponym of “make”,
  “produce”, “generate”.
      A crystal comes into existence as a result of crystalizing
      A crystal is the result of crystalizing
      A crystal is created by crystalizing
        {to crystalize} (Y) INVOLVED_RESULT {crystal} (X)
 Special kind of patient relation. The entity is not just changed or
  affected but it comes into existence as a result of the event:
 Only applies to concrete entities (1stOrder) or mental objects such
  as ideas (3rdOrder).
 Situations that result from other situations are related by the CAUSE relation
Hyponymy overloading
(Guarino 1998, Vossen and Bloksma 1998).
   The vocabulary does not clearly differentiate
    between orthogonal roles and disjoint types:
       role: passenger, teacher, student
       type: dog; cat
       ?:
           knife -> weapon, cutlery; spoon -> container, cutlery
           food
           material <- building material <-?- stone; <-?-water; <- brick;
   Disjunctive and conjunctive hyperonyms:
       albino -> animal or plant
       spoon -> cutlery & container
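
One possible way to keep disjunctive and conjunctive hyperonyms apart in a database is to label the link itself. The sketch below is an assumption about how that could look: under a conjunctive link every hyperonym holds, while under a disjunctive link only one of them holds for a given instance, so none can be asserted on its own. All names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HyperonymLink:
    targets: tuple
    mode: str  # "conjunctive" (all hold) or "disjunctive" (at least one holds)

LINKS = {
    "spoon": HyperonymLink(("cutlery", "container"), "conjunctive"),
    "albino": HyperonymLink(("animal", "plant"), "disjunctive"),
}

def certain_hyperonyms(word):
    """Hyperonyms that hold for every instance of `word`."""
    link = LINKS[word]
    return set(link.targets) if link.mode == "conjunctive" else set()

def possible_hyperonyms(word):
    """Hyperonyms that may hold for some instance of `word`."""
    return set(LINKS[word].targets)
```

Properties would then be inherited only from `certain_hyperonyms`, which keeps an albino from inheriting plant properties and animal properties at the same time.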
 Hyponymy restructuring

                              ziekte (disease)

 Second level:
   ingewandsziekte (bowel disease)
   dierenziekte (animal disease)
   infectieziekte (infectious disease)

 Third level:
   haringwormziekte (anisakiasis: bowel disease of herrings)
   kolder (staggers: brain disease of cattle)
   veeziekte (cattle disease)
   vuilbroed (infectious disease of bees)
Methodologies in a number of projects

   Princeton Wordnet
   EuroWordNet:
       English, Dutch, German, French, Spanish, Italian,
        Czech, Estonian
       10,000 up to 50,000 synsets
   BalkaNet:
       Romanian, Bulgarian, Turkish, Slovenian, Greek,
       10,000 synsets
Main strategies for building wordnets
   Expand approach: translate WordNet synsets to another
    language and take over the structure
      easier and more efficient method

      compatible structure with WordNet

      vocabulary and structure are close to WordNet but also biased

      can exploit many resources linked to Wordnet: SUMO, Wordnet
       domains, selection restriction from BNC, etc...

   Merge approach: create an independent wordnet in another
    language and align it with WordNet by generating the appropriate
      more complex and labor intensive

      different structure from WordNet

      language specific patterns can be maintained, i.e. very precise
        substitution patterns
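
The expand approach can be sketched as follows: translate the words of each English synset with a bilingual lexicon and copy the hyperonym links between the synsets that received a translation. The synset identifiers, the tiny lexicon and the function name `expand` are assumptions made for this example.

```python
# English side: synset id -> words, and child id -> parent id (toy data).
EN_SYNSETS = {"en-1": ["building"], "en-2": ["house"],
              "en-3": ["church"], "en-4": ["school"]}
EN_HYPERONYM = {"en-2": "en-1", "en-3": "en-1", "en-4": "en-1"}

# Bilingual lexicon English -> Dutch; "school" is left out on purpose.
EN_TO_NL = {"building": ["gebouw"], "house": ["huis"], "church": ["kerk"]}

def expand(synsets, hyperonym, lexicon):
    """Translate synsets and take over the source structure."""
    target_synsets = {}
    for synset_id, words in synsets.items():
        translations = sorted({t for w in words for t in lexicon.get(w, [])})
        if translations:  # synsets without any translation stay empty
            target_synsets[synset_id] = translations
    target_hyperonym = {
        child: parent for child, parent in hyperonym.items()
        if child in target_synsets and parent in target_synsets
    }
    return target_synsets, target_hyperonym
```

Because the target synsets keep the source identifiers, the result is aligned with WordNet by construction, which is also where the bias towards the English structure comes from.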
  Aligning wordnets
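  [Figure: aligning the Dutch wordnet with the English wordnet]
    Dutch wordnet:   hammond orgel -> orgel -> muziekinstrument
    English wordnet: hammond organ -> organ -> musical instrument;
                     further senses of organ; top nodes artifact, object, natural object
    Open alignments: orgel ? organ; organ ? orgel; organ ? orgaan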
General criteria for approach:
   Maximize the overlap with wordnets for other
   Maximize semantic consistency within and
    across wordnets
   Maximally focus the manual effort where
   Maximally exploit automatic techniques
Top-down methodology
   Develop a core wordnet (5,000 synsets):
     all the semantic building blocks or foundation to define the
      relations for all other more specific synsets, e.g. building ->
      house, church, school
     provide a formal and explicit semantics

   Validate the core wordnet:
     does it include the most frequent words?

     are semantic constraints violated?

   Extend the core wordnet (5,000 synsets or more):
     automatic techniques for more specific concepts with high-
      confidence results
     add other levels of hyponymy

     add specific domains

     add 'easy' derivational words

     add 'easy' translation equivalences

   Validate the complete wordnet
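
One of the validation questions above, whether the core wordnet includes the most frequent words, can be checked mechanically. The sketch assumes a frequency-ranked word list is available; the function name and the sample data are invented.

```python
def frequency_coverage(core_words, frequency_list, top_n):
    """Return the covered share of the top_n most frequent words
    and the list of frequent words still missing from the core."""
    top = frequency_list[:top_n]
    missing = [word for word in top if word not in core_words]
    covered = (len(top) - len(missing)) / len(top)
    return covered, missing

# Illustrative data only.
core = {"house", "building", "church"}
ranked = ["house", "school", "building", "church", "tricycle"]
share, missing = frequency_coverage(core, ranked, top_n=4)
print(share, missing)
```

The missing words are the natural next candidates for manual encoding before the wordnet is extended automatically.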
Developing a core wordnet
   Define a set of concepts (so-called Base Concepts) that play an
    important role in wordnets:
     high position in the hierarchy & high connectivity

     represented as English WordNet synsets

     Common base concepts: shared by various wordnets in different
     Local base concepts: not shared

   EuroWordNet: 1024 synsets, shared by 2 or more languages
   BalkaNet: 5000 synsets (including 1024)
   Common semantic framework for all Base Concepts, in the form of a
   Manually translate all Base Concepts (English Wordnet synsets) to
    synsets in the local languages (was applied for 13 Wordnets)
   Manually build and verify the hypernym relations for the Base
   All 13 Wordnets are developed from a similar semantic core closely
    related to the English Wordnet
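
The two selection criteria for Base Concepts, a high position in the hierarchy and high connectivity, can be approximated automatically before manual review. The sketch below is an assumption, not the procedure used in EuroWordNet or BalkaNet: it measures position as depth from the root and connectivity as the number of direct hyponyms.

```python
def base_concept_candidates(parent, max_depth, min_hyponyms):
    """Return nodes that are at most max_depth below the root and have
    at least min_hyponyms direct hyponyms (child -> parent input)."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    def depth(node):
        steps = 0
        while node in parent:
            node = parent[node]
            steps += 1
        return steps

    nodes = set(parent) | set(parent.values())
    return sorted(
        node for node in nodes
        if depth(node) <= max_depth
        and len(children.get(node, [])) >= min_hyponyms
    )

# Toy hierarchy for illustration.
TOY = {
    "object": "entity", "artifact": "object", "natural object": "object",
    "block": "artifact", "instrumentality": "artifact",
    "building": "artifact", "house": "building",
}
print(base_concept_candidates(TOY, max_depth=2, min_hyponyms=2))
```

A fuller measure of connectivity would count all relation types, not only hyponymy.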
     Top-down methodology
 [Figure: two wordnets built top-down around a shared core.
  Centre: the 1024 CBCs and the remaining WordNet1.5 synsets.
  Each side, from top to bottom: hyperonyms; CBC representatives and local BCs;
  first-level hyponyms; remaining hyponyms.
  On the outer edge of each side: WMs related via non-hyponymy.]
Advantages of the approach

   Well-defined semantics that can be inherited
    down to more specific concepts
       Apply consistency checks
       Automatic techniques can use semantic basis
   Most frequent concepts and words are
   High overlap and compatibility with other
   Manual effort is focussed on the most difficult
    concepts and words
   Distribution over the top ontology clusters
                            WN                    NL                          ES                          IT
Top-Concept         TC-Tokens % of wn  TC-Tokens % of nl % of wn  TC-Tokens % of es % of wn  TC-Tokens % of it % of wn
Animal                  14068   3.99%       1193   0.97%    8.5%       2458   1.81%   17.5%       1122   1.44%    8.0%
Artifact                19562   5.55%      10803   8.83%   55.2%       9969   7.36%   51.0%       6494   8.34%   33.2%
Building                 1022   0.29%        707   0.58%   69.2%        628   0.46%   61.4%        434   0.56%   42.5%
Comestible               3377   0.96%       1393   1.14%   41.2%       1614   1.19%   47.8%        624   0.80%   18.5%
Container                1725   0.49%        778   0.64%   45.1%        799   0.59%   46.3%        432   0.55%   25.0%
Covering                 2030   0.58%       1208   0.99%   59.5%       1027   0.76%   50.6%        690   0.89%   34.0%
Creature                  664   0.19%        159   0.13%   23.9%        254   0.19%   38.3%         27   0.03%    4.1%
Function                34081   9.68%      17668  14.44%   51.8%      18904  13.96%   55.5%      11043  14.18%   32.4%
Furniture                 298   0.08%        171   0.14%   57.4%        147   0.11%   49.3%         87   0.11%   29.2%
Garment                   756   0.21%        494   0.40%   65.3%        426   0.31%   56.3%        292   0.37%   38.6%
Gas                        93   0.03%         67   0.05%   72.0%         62   0.05%   66.7%         49   0.06%   52.7%
Group                   27805   7.90%       3357   2.74%   12.1%       3630   2.68%   13.1%       2337   3.00%    8.4%
Human                   11543   3.28%       6372   5.21%   55.2%       7683   5.67%   66.6%       4488   5.76%   38.9%
ImageRepresentation       780   0.22%        412   0.34%   52.8%        426   0.31%   54.6%        294   0.38%   37.7%
Instrument               7036   2.00%       4102   3.35%   58.3%       3590   2.65%   51.0%       2564   3.29%   36.4%
LanguageRepresent.       2844   0.81%       1273   1.04%   44.8%       1218   0.90%   42.8%        691   0.89%   24.3%
Liquid                   1629   0.46%        617   0.50%   37.9%        500   0.37%   30.7%        339   0.44%   20.8%
Living                  47104  13.37%      10225   8.36%   21.7%      13661  10.08%   29.0%       7408   9.51%   15.7%
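
The last column for each language appears to be that wordnet's token count for a top concept divided by the WordNet token count for the same concept. The short check below reproduces this for three Dutch rows copied from the table; the reading of the column is an inference from the numbers, not something the slide states.

```python
# Token counts copied from the table above (WN and NL columns).
WN_TOKENS = {"Animal": 14068, "Artifact": 19562, "Building": 1022}
NL_TOKENS = {"Animal": 1193, "Artifact": 10803, "Building": 707}

def percent_of_wn(local_tokens, wn_tokens):
    """Local token count as a percentage of the WordNet token count."""
    return {
        concept: round(100 * local_tokens[concept] / wn_tokens[concept], 1)
        for concept in local_tokens
    }

print(percent_of_wn(NL_TOKENS, WN_TOKENS))
```

The results match the 8.5%, 55.2% and 69.2% listed for Dutch, which suggests the column measures coverage relative to WordNet per top concept.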
Wordnet Domains    Concepts  Proportion   Wordnet Domains   Concepts  Proportion
acoustics               104      0.092%   linguistics           1545      1.363%
administration         2974      2.624%   literature             686      0.605%
aeronautic              154      0.136%   mathematics            575      0.507%
agriculture             306      0.270%   mechanics              532      0.469%
alimentation             28      0.025%   medicine              2690      2.374%
anatomy                2705      2.387%   merchant_navy          485      0.428%
anthropology            896      0.791%   meteorology            231      0.204%
applied_science          28      0.025%   metrology             1409      1.243%
archaeology              68      0.060%   military              1490      1.315%
archery                   5      0.004%   money                  624      0.551%
architecture            255      0.225%   mountaineering          28      0.025%
art                     420      0.371%   music                  985      0.869%
artisanship             148      0.131%   mythology              314      0.277%
astrology                17      0.015%   number                 220      0.194%
astronautics             29      0.026%   numismatics             43      0.038%
astronomy               376      0.332%   occultism               52      0.046%
athletics                22      0.019%   oceanography            10      0.009%
EWN Interlingual Relations

•   EQ_SYNONYM: there is a direct match between a synset and an ILI-record

•   EQ_NEAR_SYNONYM: a synset matches multiple ILI-records simultaneously,

•   HAS_EQ_HYPERONYM: a synset is more specific than any available ILI-record.

•   HAS_EQ_HYPONYM: a synset can only be linked to more specific ILI-records.

•   other relations:   CAUSES/IS_CAUSED_BY, EQ_SUBEVENT/EQ_ROLE,
Complex equivalence relations
   1. Multiple Targets
        One sense for Dutch schoonmaken (to clean) which simultaneously matches with
        at least 4 senses of clean in WordNet1.5:

        •{make clean by removing dirt, filth, or unwanted substances from}
        •{remove unwanted substances from, such as feathers or pits, as of chickens or
        •{remove in making clean; "Clean the spots off the rug"}
        •{remove unwanted substances from - (as in chemistry)}

        The Dutch synset schoonmaken will thus be linked with an eq_near_synonym
        relation to all these senses of clean.
   2. Multiple Source meanings
        Synsets inter-linked by a near_synonym relation can be linked to the same target
        ILI-record(s), either with an eq_synonym or an eq_near_synonym relation:
        Dutch wordnet: toestel near_synonym apparaat
        ILI-records:   {machine}; {device}; {apparatus}; {tool}
Complex equivalence relations

   3. HAS_EQ_HYPERONYM
 Typically used for gaps in WordNet1.5 or in English:

      • genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which
      is a kind of gin made out of lemon skin,
      • pragmatic, in the sense that the concept is known but is not expressed by a single
      lexicalized form in English, e.g.: Dutch hoofd only refers to human head and Dutch kop
      only refers to animal head, English uses head for both.

   4. HAS_EQ_HYPONYM
 Used when WordNet1.5 only provides narrower terms. In this case there can only be a
 pragmatic difference, not a genuine cultural gap, e.g.: Spanish dedo can be used to refer to both
 finger and toe.
    Overview of equivalence relations to the ILI

Relation           POS     Sources : Targets    Example
eq_synonym         same    1 : 1                auto : voiture
eq_near_synonym    any     many : many          apparaat, machine, toestel :
                                                apparatus, machine, device
eq_hyperonym       same    many : 1 (usually)   citroenjenever :
eq_hyponym         same    (usually) 1 : many   dedo :
                                                toe, finger
eq_metonymy        same    many/1 : 1           universiteit, universiteitsgebouw :
eq_diathesis       same    many/1 : 1           raken (cause), raken :
eq_generalization  same    many/1 : 1           schoonmaken :
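
The cardinalities in the overview lend themselves to an automatic consistency check. The sketch below flags synsets that carry more than one eq_synonym link, which the table lists as a 1 : 1 relation. The triple format and the ILI record names (`car`, `clean-1`, `organ-1`, ...) are placeholders invented for this example.

```python
from collections import defaultdict

def eq_synonym_violations(links):
    """Return synsets with more than one eq_synonym target.

    `links` is a list of (synset, relation, ili_record) triples.
    """
    targets = defaultdict(set)
    for synset, relation, ili_record in links:
        if relation == "eq_synonym":
            targets[synset].add(ili_record)
    return sorted(s for s, records in targets.items() if len(records) > 1)

# Placeholder links; the ILI record names are hypothetical.
LINKS = [
    ("auto", "eq_synonym", "car"),
    ("schoonmaken", "eq_near_synonym", "clean-1"),
    ("schoonmaken", "eq_near_synonym", "clean-2"),
    ("orgel", "eq_synonym", "organ-1"),
    ("orgel", "eq_synonym", "organ-2"),
]
print(eq_synonym_violations(LINKS))
```

A flagged synset either needs to be split into separate senses or relinked with eq_near_synonym, as in the schoonmaken case.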
Filling gaps in the ILI

  Types of GAPS
  1.  genuine, cultural gaps for things not known in English
      culture, e.g. citroenjenever, which is a kind of gin made out of
      lemon skin,
       •      Non-productive
       •      Non-compositional
  2.       pragmatic, in the sense that the concept is known but is not
           expressed by a single lexicalized form in English, e.g.:
           container, borrower, cajera (female cashier)
       •      Productive
       •      Compositional
  3.       Universality of gaps: Concepts occurring in at least 2
   Productive and Predictable Lexicalizations
   exhaustively linked to the ILI
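 [Figure: lexicalizations from different languages linked to the ILI;
  the ILI targets of the first two pairs are not preserved]
   {doodslaanV}NL and {totschlagenV}DE: hypernym links
   {doodstampenV}NL and {tottrampelnV}DE: hypernym links
   {casière}NL and {cajeraN}ES: hypernym cashier, in_state female
   fish: hypernym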
  Top-down methodology
 [Figure: building the Arabic Wordnet from the English Wordnet]
   English Wordnet: EuroWordNet Base Concepts (1000 synsets; 5000 synsets),
     WordNet Domains, Named Entities, WordNet synsets
   Link between the two: English-Arabic lexicon and easy translations,
     e.g. teach - darrasa: 1045678-v {teach} -> 1045678-v {darrasa}
   Arabic Wordnet: core wordnet (5000 synsets) with SBC, CBC and ABC;
     next-level hyponyms; more hyponyms; domains; named entities
   Arabic-side resources: Arabic word frequency; Arabic roots & rules
 Restructured hierarchy under ziekte (disease):
   ingewandsziekte (bowel disease)
   dierenziekte (animal disease)
   infectieziekte (infectious disease)

   haringwormziekte (anisakiasis: bowel disease of herrings)
   veeziekte (cattle disease)
   vuilbroed (infectious disease of bees)

   one level lower: kolder (staggers: brain disease of cattle)

   Monolingual dictionaries:
       definitions
       synonym relations
       other relations
   Bilingual dictionaries: L-English, English-L
   Ontologies
   Thesauri
   Corpora:
       monolingual
       parallel
