Corpora and Statistical Methods
Lecture 6
Semantic similarity, vector space models and word-sense disambiguation
Part 2

Word sense disambiguation
What are word senses?
 Cognitive definition:
   mental representation of meaning
   used in psychological experiments
   relies on introspection (notoriously deceptive)

 Dictionary-based definition:
   adopt sense definitions in a dictionary
   most frequently used resource is WordNet
 Taxonomic representation of words (“concepts”)

 Each word belongs to a synset, which contains near-synonyms

 Each synset has a gloss

 Words with multiple senses (polysemy) belong to multiple synsets

 Synsets organised by hyponymy (IS-A) relations
How many senses?
 Example: interest
   pay 3% interest on a loan
   showed interest in something
   purchased an interest in a company.
   the national interest…
   have X’s best interest at heart
   have an interest in word senses
   The economy is run by business interests
Wordnet entry for interest (noun)
1. a sense of concern with and curiosity about someone or something …
     (Synonym: involvement)
2.   the power of attracting or holding one’s interest… (Synonym:
3.   a reason for wanting something done ( Synonym: sake)
4.   a fixed charge for borrowing money…
5.   a diversion that occupies one’s time and thoughts … (Synonym: pastime)
6.   a right or legal share of something; a financial involvement with something
     (Synonym: stake)
7.   (usually plural) a social group whose members control some field of activity
     and who have common aims, (Synonym: interest group)
Some issues
 Are all these really distinct senses? Is WordNet too fine-grained?

 Would native speakers distinguish all these as different?

 Cf. The distinction between sense ambiguity and
  underspecification (vagueness):
   one could argue that there are fewer senses, but these are
    underspecified out of context
Translation equivalence
 Many WSD applications rely on translation equivalence

   Given: parallel corpus (e.g. English-German)
   if word w in English has n translations in German, then each
    translation represents a sense

   e.g. German translations of interest:
     Zins: financial charge (WN sense 4)
     Anteil: stake in a company (WN sense 6)
     Interesse: all other senses
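
A minimal sketch of the translation-equivalence idea, assuming a hypothetical list of word-aligned (English, German) pairs. The pairs and the function name `induce_senses` are illustrative only and do not come from a real corpus or toolkit:

```python
from collections import defaultdict

# Hypothetical word-aligned pairs (English word, German translation); toy data.
aligned_pairs = [
    ("interest", "Zins"),
    ("interest", "Anteil"),
    ("interest", "Interesse"),
    ("interest", "Zins"),
]

def induce_senses(pairs):
    """Group the distinct translations of each source word; each one stands for a sense."""
    senses = defaultdict(set)
    for source, target in pairs:
        senses[source].add(target)
    return {word: sorted(targets) for word, targets in senses.items()}

sense_inventory = induce_senses(aligned_pairs)
print(sense_inventory)
```

Under this view the inventory for interest has three senses, one per German translation, which is coarser than the seven WordNet senses listed earlier.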
Some terminology
 WSD Task: given an ambiguous word, find the intended sense
  in context

 Sense tagging: task of labelling words as belonging to one
  sense or another.
    needs some a priori characterisation of senses of each relevant word

 Discrimination:
   distinguishes between occurrences of words based on senses
   not necessarily explicit labelling
Some more terminology
 Two types of WSD task:

   Lexical sample task: focuses on disambiguating a small set of
    target words, using an inventory of the senses of those words.

   All-words task: focuses on entire texts and a lexicon, where
    every word in the text has to be disambiguated
     Serious data sparseness problems!
Approaches to WSD
 All methods rely on training data. Basic idea:
   Given word w in context c
   learn how to predict sense s of w based on various features of w

 Supervised learning: training data is labelled with correct senses
    can do sense tagging

 Unsupervised learning: training data is unlabelled
    but many other knowledge sources used
    cannot do sense tagging, since this requires a priori senses
Supervised learning
 Words in training data labelled with their senses
   She pays 3% interest/INTEREST-MONEY on the loan.
   He showed a lot of interest/INTEREST-CURIOSITY in the painting.

 Similar to POS tagging
   given a corpus tagged with senses
   define features that indicate one sense over another
   learn a model that predicts the correct sense given the features
Features (e.g. plant)
 Neighbouring words:
   plant life
   manufacturing plant
   assembly plant
   plant closure
   plant species

 Content words in a larger window
   animal
   equipment
   employee
   automatic
Other features
 Syntactically related words
   e.g. object, subject….

 Topic of the text
   is it about SPORT? POLITICS?

 Part-of-speech tag, surrounding part-of-speech tags
Some principles proposed (Yarowsky 1995)
 One sense per discourse:
   typically, all occurrences of a word will have the same sense in the
    same stretch of discourse (e.g. same document)

 One sense per collocation:
   nearby words provide clues as to the sense, depending on the distance
    and syntactic relationship
   e.g. plant life: all (?) occurrences of plant+life will indicate the botanic
    sense of plant
Training data
   Shared Task competition
   datasets available for WSD, among other things
   annotated corpora in many languages

 Pseudo-words
   create training corpus by artificially conflating words
   e.g. replace all occurrences of man and hammer with man-hammer
   easy way to create training data

 Multi-lingual parallel corpora
   translated texts aligned at the sentence level
   translation indicates sense
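
The pseudo-word construction can be sketched as follows. This is an illustrative sketch only: the function name `make_pseudoword_corpus` and the two toy sentences are assumptions. Each occurrence of either word is replaced by the conflated form, and the original word is kept as the gold "sense" label:

```python
import re

def make_pseudoword_corpus(sentences, word_a, word_b):
    """Replace word_a and word_b with 'word_a-word_b'; keep the original word as the label."""
    pseudo = f"{word_a}-{word_b}"
    pattern = re.compile(rf"\b({re.escape(word_a)}|{re.escape(word_b)})\b")
    examples = []
    for sentence in sentences:
        # One training example per occurrence of either word.
        for match in pattern.finditer(sentence):
            conflated = sentence[:match.start()] + pseudo + sentence[match.end():]
            examples.append((conflated, match.group(1)))
    return examples

# Toy sentences (made up for illustration).
sentences = ["the man walked home", "she picked up a hammer"]
examples = make_pseudoword_corpus(sentences, "man", "hammer")
for text, label in examples:
    print(label, "<-", text)
```

A disambiguator is then trained to recover the label from the conflated sentence, so no manual annotation is needed.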
Data representation
 Example sentence: An electric guitar and bass player stand off to
  one side...
   Target word: bass
   Possible senses: fish, musical instrument...

 Relevant features are represented as vectors, e.g.:
        $\langle w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2} \rangle$

        $\langle$ guitar, NN, and, CC, player, NN, stand, VB $\rangle$
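
A small sketch of how such a feature vector might be extracted from a POS-tagged sentence. The function name `window_features`, the padding symbol and the POS tags of the words outside the window are assumptions made for illustration:

```python
def window_features(tagged, i, size=2):
    """Return words and POS tags at positions i-size..i+size, skipping the target at i."""
    features = []
    for offset in range(-size, size + 1):
        if offset == 0:
            continue  # the target word itself is not a feature
        j = i + offset
        if 0 <= j < len(tagged):
            word, pos = tagged[j]
        else:
            word, pos = "<PAD>", "<PAD>"  # assumed padding for sentence edges
        features.extend([word, pos])
    return features

# The example sentence, tagged by hand for illustration.
tagged_sentence = [
    ("An", "DT"), ("electric", "JJ"), ("guitar", "NN"), ("and", "CC"),
    ("bass", "NN"), ("player", "NN"), ("stand", "VB"), ("off", "RP"),
    ("to", "TO"), ("one", "CD"), ("side", "NN"),
]
features = window_features(tagged_sentence, 4)  # index 4 is the target "bass"
print(features)
```

With a window of two words on each side, this yields the same eight features as the vector for bass given above.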
Supervised methods
Naïve Bayes Classifier
 Identify the features (F)
    e.g. surrounding words
    other cues apart from surrounding context

 Combine evidence from all features

 Decision rule: decide on sense s’ iff

    $$\forall s_k,\ s_k \neq s' : P(s' \mid F) > P(s_k \mid F)$$
 Example: drug. F = words in context
   medication sense: price, prescription, pharmaceutical
   illegal substance sense: alcohol, illicit, paraphernalia
Using Bayes’ rule
 We usually don’t know P(sk|f) but we can compute from
  training data: P(sk) (the prior) and P(f|sk)
    $$P(s_k \mid f) = \frac{P(f \mid s_k)\,P(s_k)}{P(f)}$$

   P(f) can be eliminated because it is constant across all senses:
    $$s_{best} = \arg\max_{s_k} P(s_k \mid f) = \arg\max_{s_k} \frac{P(f \mid s_k)\,P(s_k)}{P(f)} = \arg\max_{s_k} P(f \mid s_k)\,P(s_k)$$
The independence assumption
 It’s called “naïve” because:
    $$P(f \mid s_k) = \prod_{j=1}^{n} P(f_j \mid s_k)$$

   i.e. all n features f_1 … f_n are assumed to be independent of each other, given the sense

   Obviously, this is often not true.
     e.g. finding illicit in the context of drug may not be independent of finding pusher.
     cf. our discussion of collocations!
   Also, topics often constrain word choice.
Training the naive Bayes classifier
 We need to compute:
   P(s) for all senses s of w
    $$P(s_k) = \frac{\mathrm{Count}(s_k, w)}{\mathrm{Count}(w)}$$

   P(f|s) for all features f
    $$P(f_j \mid s_k) = \frac{\mathrm{Count}(f_j, s_k)}{\mathrm{Count}(s_k)}$$
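
The counts and the decision rule can be put together in a short sketch. The function names, the toy training instances and the add-one style smoothing (the `smoothing` parameter) are assumptions: the estimates above are plain relative frequencies, and smoothing is added here only so that unseen features do not produce log(0):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(instances):
    """instances: list of (sense, context_words) for one ambiguous word w."""
    sense_counts = Counter()               # Count(s_k, w)
    feature_counts = defaultdict(Counter)  # Count(f_j, s_k)
    for sense, features in instances:
        sense_counts[sense] += 1
        for f in set(features):
            feature_counts[sense][f] += 1
    return sense_counts, feature_counts

def classify(features, sense_counts, feature_counts, smoothing=1.0):
    """Return arg max over senses of log P(s) + sum_j log P(f_j | s)."""
    total = sum(sense_counts.values())     # Count(w)
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)    # log prior
        for f in set(features):
            # Smoothed estimate of P(f_j | s_k); smoothing is an assumption.
            p = (feature_counts[sense][f] + smoothing) / (count + 2 * smoothing)
            score += math.log(p)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy training data for "drug", using the context words from the example above.
instances = [
    ("medication", {"price", "prescription"}),
    ("medication", {"pharmaceutical", "prescription"}),
    ("illegal", {"illicit", "alcohol"}),
    ("illegal", {"paraphernalia", "illicit"}),
]
sense_counts, feature_counts = train_naive_bayes(instances)
print(classify({"prescription", "price"}, sense_counts, feature_counts))
print(classify({"illicit"}, sense_counts, feature_counts))
```

Working in log space avoids numerical underflow when many features are multiplied together.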
Information-theoretic measures
 Find the single, most informative feature to predict a sense.

 E.g. using a parallel corpus:
   prendre (FR) can translate as take or make
   prendre une décision: make a decision
   prendre une mesure: take a measure [to…]

    Informative feature in this case: direct object
       mesure indicates take
       décision indicates make

 Problem: need to identify the correct value of the feature that
   indicates a specific sense.
Brown et al’s algorithm
1.    Given: translations T of word w

2.    Given: values X of a useful feature (e.g. mesure, décision as values of DO)

3.    Step 1: random partition P of T

4.    While improving, do:
        create partition Q of X that maximises I(P;Q)
        find a partition P of T that maximises I(P;Q)
         comment: relies on mutual info to find clusters of translations mapping to clusters
         of feature values
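
A rough sketch of the alternating search, under several assumptions: partitions are binary, the best partition is found by brute force over all two-way splits (feasible only for tiny sets, and not the efficient search used in the original work), the initial partition is passed in rather than drawn at random, and the co-occurrence counts are invented for illustration:

```python
import math
from itertools import combinations

def mutual_information(counts, part_t, part_x):
    """I(P;Q) in bits; part_t / part_x map translations / feature values to block 0 or 1."""
    total = sum(counts.values())
    joint, p_marg, q_marg = {}, {}, {}
    for (t, x), c in counts.items():
        key = (part_t[t], part_x[x])
        joint[key] = joint.get(key, 0) + c
    for (p, q), c in joint.items():
        p_marg[p] = p_marg.get(p, 0) + c
        q_marg[q] = q_marg.get(q, 0) + c
    return sum((c / total) * math.log2(c * total / (p_marg[p] * q_marg[q]))
               for (p, q), c in joint.items() if c > 0)

def binary_partitions(items):
    """Yield every split of items into two non-empty blocks, as a dict item -> 0/1."""
    items = sorted(items)
    first, rest = items[0], items[1:]
    for r in range(len(rest)):
        for block in combinations(rest, r):
            zero = {first, *block}
            yield {item: (0 if item in zero else 1) for item in items}

def flip_flop(counts, initial_part_t):
    translations = {t for t, _ in counts}
    values = {x for _, x in counts}
    part_t, best = initial_part_t, -1.0
    while True:
        # Fix P, choose Q maximising I(P;Q); then fix Q, choose P.
        part_x = max(binary_partitions(values),
                     key=lambda q: mutual_information(counts, part_t, q))
        part_t = max(binary_partitions(translations),
                     key=lambda p: mutual_information(counts, p, part_x))
        score = mutual_information(counts, part_t, part_x)
        if score <= best + 1e-12:  # stop when no longer improving
            return part_t, part_x, score
        best = score

# Hypothetical counts of (translation of prendre, direct object).
counts = {
    ("take", "mesure"): 8, ("take", "note"): 6, ("take", "décision"): 1,
    ("make", "décision"): 9, ("make", "engagement"): 5, ("make", "mesure"): 1,
}
part_t, part_x, score = flip_flop(counts, {"take": 0, "make": 1})
print(part_x, round(score, 3))
```

On these toy counts the direct objects that mostly co-occur with take end up in one block and those that mostly co-occur with make in the other.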
Using dictionaries and thesauri
    Lesk (1986): one of the first to exploit dictionary definitions
        the definition corresponding to a sense can contain words which
         are good indicators for that sense

    Method:
    1.   Given: ambiguous word w with senses s1…sn with glosses g1…gn.
    2.   Given: the word w in context c
    3.   compute overlap between c & each gloss
    4.   select the maximally matching sense
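
The four steps can be sketched as a simplified overlap count. The stopword list, the tokeniser and the sense labels are assumptions made for illustration; the glosses are abridged from the WordNet entry for interest shown earlier:

```python
# A small, assumed stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "or", "with", "that", "is"}

def tokenize(text):
    """Lower-case, strip punctuation, and drop stopwords."""
    return {tok.strip(".,;:!?()").lower() for tok in text.split()} - STOPWORDS - {""}

def lesk(context, glosses):
    """Return the sense whose gloss shares the most content words with the context."""
    context_words = tokenize(context)
    overlaps = {sense: len(context_words & tokenize(gloss))
                for sense, gloss in glosses.items()}
    return max(overlaps, key=overlaps.get), overlaps

glosses = {
    "curiosity": "a sense of concern with and curiosity about someone or something",
    "charge": "a fixed charge for borrowing money",
    "stake": "a right or legal share of something; a financial involvement with something",
}
context = "She pays interest on the money she is borrowing from the bank"
best, overlaps = lesk(context, glosses)
print(best, overlaps)
```

Here the context shares money and borrowing with the "charge" gloss and nothing with the other two, so that sense is selected.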
Expanding a dictionary

 Problem with Lesk:
   often dictionary definitions don’t contain sufficient information
   not all words in dictionary definitions are good informants

 Solution: use a thesaurus with subject/topic categories
   e.g. Roget’s thesaurus
Using topic categories
 Suppose every sense sk of word w has subject/topic tk

 w can be disambiguated by identifying the words related to tk
  in the thesaurus

 Problems:
   general-purpose thesauri don’t list domain-specific topics
   several potentially useful words can be left out
   e.g. … Navratilova plays great tennis …
   proper name here useful as indicator of topic SPORT
Expanding a thesaurus: Yarowsky 1992
1.    Given: context c and topic t

2.    For all contexts and topics, compute p(c|t) using Naïve Bayes
         by comparing words pertaining to t in the thesaurus with words in c
         if p(c|t) > α, then assign topic t to context c

3.    For all words in the vocabulary, update the list of contexts in which the
      word occurs.
         Assign topic t to each word in c

4.    Finally, compute p(w|t) for all w in the vocabulary
         this gives the “strength of association” of w with t
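
A heavily simplified sketch of the expansion idea. Instead of the Naïve Bayes score p(c|t) > α, it assigns topic t to a context when the context contains at least `min_hits` thesaurus words for t; that substitution, the function name, the toy thesaurus and the toy contexts are all assumptions:

```python
from collections import Counter, defaultdict

def expand_thesaurus(contexts, thesaurus, min_hits=1):
    """Estimate p(w|t) from contexts that were assigned topic t."""
    word_topic_counts = defaultdict(Counter)
    for context in contexts:
        words = context.lower().split()
        for topic, topic_words in thesaurus.items():
            # Stand-in for the p(c|t) > alpha test: count thesaurus words in c.
            hits = sum(1 for w in words if w in topic_words)
            if hits >= min_hits:
                # Assign topic t to each word in c.
                word_topic_counts[topic].update(words)
    return {
        topic: {w: c / sum(counts.values()) for w, c in counts.items()}
        for topic, counts in word_topic_counts.items()
    }

# Toy thesaurus and contexts (made up for illustration).
thesaurus = {"SPORT": {"tennis", "match", "football"}}
contexts = [
    "Navratilova plays great tennis",
    "Navratilova won the match",
    "The bank raised interest rates",
]
assoc = expand_thesaurus(contexts, thesaurus)
print(assoc["SPORT"]["navratilova"])
```

The proper name is not listed in the toy thesaurus, yet it acquires an association with SPORT because it keeps occurring in contexts assigned to that topic.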
Yarowsky 1992: some results
Sense          Thesaurus category   Accuracy
space object   UNIVERSE             96%
celebrity      ENTERTAINER          95%
shape          INSIGNIA             82.%

punishment     LEGAL_ACTION         99%
set of words   GRAMMAR              98%
    Yarowsky (1995) suggested the one sense per
     discourse/collocation constraints.

    Yarowsky’s method:
    1.   select the strongest collocational feature in a specific context
    2.   disambiguate based only on this feature
        (similar to the information-theoretic method discussed earlier)
One sense per collocation
1.   For each sense s of w, initialise F, the collocations found in s’s
     dictionary definition.

2.   One sense per collocation:
        identify the set of contexts containing collocates of s
        for each sense s of w, update F to contain those collocates f such that, for all s’ ≠ s,
          $$\frac{P(s \mid f)}{P(s' \mid f)} > \alpha$$

         (where α is a threshold)
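
The ratio test can be sketched as below. The counts, the sense labels, the threshold value and the small additive smoothing (needed so that a zero count does not cause division by zero) are all assumptions for illustration:

```python
def reliable_collocates(counts, senses, target, alpha=2.0, smoothing=0.1):
    """Keep collocates f with P(target|f) / P(s'|f) > alpha for every other sense s'."""
    selected = []
    for collocate, by_sense in counts.items():
        total = sum(by_sense.get(s, 0) + smoothing for s in senses)
        # Smoothed estimate of P(s | f) for each sense.
        prob = {s: (by_sense.get(s, 0) + smoothing) / total for s in senses}
        if all(prob[target] / prob[s] > alpha for s in senses if s != target):
            selected.append(collocate)
    return sorted(selected)

# Hypothetical counts: counts[f][s] = times collocate f occurs with sense s of "plant".
counts = {
    "life": {"living": 20, "factory": 1},
    "manufacturing": {"living": 0, "factory": 15},
    "new": {"living": 5, "factory": 6},
}
senses = ["living", "factory"]
living_collocates = reliable_collocates(counts, senses, "living")
factory_collocates = reliable_collocates(counts, senses, "factory")
print(living_collocates, factory_collocates)
```

A collocate such as new, which occurs about equally often with both senses, passes the test for neither and is left out of F.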
One sense per discourse

3.   For each document:
        find the majority sense of w out of those found in previous step
        assign all occurrences of w the majority sense

    This is implemented as a post-processing step. Reduces
     error rate by ca. 27%.
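
The post-processing step amounts to a majority vote per document. A minimal sketch, where the function name and the convention that `None` marks an occurrence left untagged by the previous step are assumptions:

```python
from collections import Counter

def one_sense_per_discourse(doc_labels):
    """Relabel every occurrence of w in a document with the document's majority sense."""
    tagged = [s for s in doc_labels if s is not None]
    if not tagged:
        return list(doc_labels)  # nothing to vote on; leave the document unchanged
    # Ties are broken in favour of the sense seen first.
    majority, _ = Counter(tagged).most_common(1)[0]
    return [majority] * len(doc_labels)

# Senses assigned to four occurrences of "plant" in one document (toy data).
labels = ["living", None, "factory", "living"]
relabelled = one_sense_per_discourse(labels)
print(relabelled)
```

Note that the vote both overrides minority labels and fills in occurrences that had no label at all.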
Unsupervised disambiguation
 Recall: unsupervised learning can do sense discrimination,
  not tagging

   akin to clustering occurrences with the same sense

   e.g. Brown et al 1991: cluster translations of a word

   this is akin to clustering senses
Brown et al’s method
    Preliminary categorisation:

1.   Set P(w|s) randomly for all words w and senses s of w.

2.   Compute, for each context c of w the probability P(c|s)
     that the context was generated by sense s.

    Use (1) and (2) as a preliminary estimate. Re-estimate
     iteratively to find best fit to the corpus.
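
One way to read the iterative re-estimation is as an EM procedure over a mixture of senses. The sketch below follows that reading; it is an assumption, not a transcription of the original algorithm. The function name, the uniform initial P(s), the fixed random seed and the toy contexts are all illustrative:

```python
import math
import random

def em_sense_clusters(contexts, n_senses=2, iterations=10, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for c in contexts for w in c})
    # Step 1: random P(w|s), normalised per sense; uniform P(s) (assumed).
    p_word = []
    for _ in range(n_senses):
        raw = {w: rng.random() + 0.1 for w in vocab}
        norm = sum(raw.values())
        p_word.append({w: v / norm for w, v in raw.items()})
    p_sense = [1.0 / n_senses] * n_senses
    history = []
    for _ in range(iterations):
        # Step 2 (E-step): P(c|s) as a product of P(w|s), then the posterior of each sense.
        posteriors, log_likelihood = [], 0.0
        for c in contexts:
            joint = [p_sense[s] * math.prod(p_word[s][w] for w in c)
                     for s in range(n_senses)]
            evidence = sum(joint)
            log_likelihood += math.log(evidence)
            posteriors.append([j / evidence for j in joint])
        history.append(log_likelihood)
        # Re-estimation (M-step): update P(w|s) and P(s) from expected counts.
        for s in range(n_senses):
            expected = {w: 0.0 for w in vocab}
            for c, post in zip(contexts, posteriors):
                for w in c:
                    expected[w] += post[s]
            norm = sum(expected.values())
            if norm > 0:  # keep old estimates if a sense has lost all contexts
                p_word[s] = {w: v / norm for w, v in expected.items()}
            p_sense[s] = sum(post[s] for post in posteriors) / len(contexts)
    return posteriors, history

# Toy contexts of "bass" (made up for illustration).
contexts = [
    ["river", "fish", "water"], ["fish", "water", "boat"],
    ["guitar", "music", "player"], ["music", "player", "band"],
]
posteriors, history = em_sense_clusters(contexts)
for c, post in zip(contexts, posteriors):
    print(c, [round(p, 3) for p in post])
```

The induced clusters carry no names: the procedure discriminates between groups of contexts but does not say which group is the fish sense and which the musical one.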
Characteristics of unsupervised
 Can adapt easily to new domains, not covered by a dictionary
  or pre-labelled corpus
   Very useful for information retrieval

 If there are many senses (e.g. 20 senses for word w), the
  algorithm will split contexts into fine-grained sets
   NB: can go awry with infrequent senses
Some issues with WSD
The task definition
 The WSD task traditionally assumes that a word has one and
  only one sense in a context.
   Is this true?

 Kilgarriff (1993) argues that co-activation (one word
  displaying more than one sense) is frequent:
   this would bring competition to the licensed trade
   competition = “act of competing”; “people/organisations who are
Systematic polysemy
 Not all senses are so easy to distinguish. E.g. competition in the
  “agent competing” vs “act of competing” sense.
    The polysemy here is systematic
    Compare bank/bank where the senses are utterly distinct (and most
     linguists wouldn’t consider this a case of polysemy, but homonymy)

 Can translation equivalence help here?
   depends if polysemy is systematic in all languages
Logical metonymy
 Metonymy = usage of a word to stand for something else
   e.g. the pen is mightier than the sword
   pen = the press

 Logical metonymy arises due to systematic polysemy
   good cook vs. good book
   enjoy the paper vs enjoy the cake

 Should WSD distinguish these? How could they do this?
Which words/usages count?
 Many proper names are identical to common nouns (cf.
  Brown, Bush,…)

 This presents a WSD algorithm with systematic ambiguity
  and reduces performance.

 Also, names are good indicators of senses of neighbouring words
   But this requires a priori categorisation of names.
   Brown’s green stance vs. the cook’s green curry
