									               LING 406
Intro to Computational Linguistics

               Collocations
             Richard Sproat

    URL: http://catarina.ai.uiuc.edu/L406_08
            This Lecture
• What are collocations?
• Measures of association
  – Pointwise Mutual Information
  – Frequency-Weighted Mutual Information
  – Pearson’s χ-square
  – Dunning’s likelihood ratios
  – Non-binary collocations

03/29/08         Linguistics 406            2
           Some characteristics of
               collocations
• Firth (1957): “Collocations of a given word are
  statements of the habitual or customary places
  of that word”
• In plain English: collocations are expressions
  constructed out of two or more words that have
  some special property
     – Non-compositionality: kick the bucket, white wine
     – Non-substitutability: *kick the pail, *yellow wine
     – Non-modifiability: *kick the big bucket, *very white
       wine
           Some kinds of collocations
• Idioms: kick the bucket, red herring
• Nominal compounds: dog catcher, brown
  bread, sump pump, white wine
• Verb particle constructions: give up, bowl
  over, chew out




              Why care?
• Lexicography

• Machine translation

• Word segmentation

• Sense disambiguation

  Simple frequency: NY Times Newswire 1990 (4 months)




   Simple frequency: Justeson-Katz
               filtration
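
The Justeson and Katz (1995) filter keeps only frequent word sequences whose part-of-speech pattern looks like a phrase. A minimal Python sketch, assuming a toy tag set (A = adjective, N = noun, P = preposition), input that is already tagged, and a hypothetical helper name `jk_filter`:

```python
from collections import Counter

# Tag patterns from Justeson & Katz (1995): A = adjective, N = noun, P = preposition.
PATTERNS = {("A", "N"), ("N", "N"),
            ("A", "A", "N"), ("A", "N", "N"), ("N", "A", "N"),
            ("N", "N", "N"), ("N", "P", "N")}

def jk_filter(tagged, n=2):
    """Count n-grams of (word, tag) pairs whose tag sequence is an allowed pattern."""
    counts = Counter()
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if tuple(tag for _, tag in window) in PATTERNS:
            counts[tuple(word for word, _ in window)] += 1
    return counts

# Toy, hand-tagged input; tags D (determiner) and V (verb) never match a pattern.
tagged = [("the", "D"), ("dog", "N"), ("catcher", "N"), ("drank", "V"),
          ("white", "A"), ("wine", "N")]
print(jk_filter(tagged).most_common())
```

On the toy input only "dog catcher" (N N) and "white wine" (A N) survive the filter.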




     Statistical approaches to binary
                collocations
• Frequency in and of itself doesn’t tell you
  that words are particularly associated with
  each other: if both words are frequent you
  might expect their combination to be
  frequent just by chance.
• Statistical measures of association can
  give an estimate of how much more likely
  than chance a given combination is.
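
To make "frequent just by chance" concrete, here is a small Python sketch of the count one would expect for a bigram if its two words occurred independently. The function name and all counts are hypothetical, chosen only for illustration:

```python
def expected_bigram_count(count_x, count_y, n_bigrams):
    """Expected count of the bigram 'x y' if x and y occurred independently."""
    return count_x * count_y / n_bigrams

# Hypothetical counts for illustration only, not taken from any corpus.
n = 1_000_000
expected = expected_bigram_count(count_x=2_000, count_y=500, n_bigrams=n)
observed = 40
print(expected)             # 1.0
print(observed / expected)  # 40.0
```

Here the pair is seen 40 times as often as chance predicts; the association measures in this lecture are different ways of scoring that gap.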

  (Pointwise) Mutual Information
• Mutual Information was originally
  proposed as an information-theoretic
  measure of channel capacity (Fano 1961).
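
As a sketch of how the measure is computed in practice, the following Python function uses the usual definition I(x, y) = log2 [P(x, y) / (P(x) P(y))] with maximum-likelihood estimates from counts. The function name `pmi` and the example counts are assumptions for illustration:

```python
import math

def pmi(count_xy, count_x, count_y, n):
    """Pointwise mutual information in bits, using MLE probabilities."""
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: the pair occurs 40 times where chance predicts once.
print(round(pmi(40, 2000, 500, 1_000_000), 2))  # 5.32
```

A score near 0 means the pair occurs about as often as independence predicts; the function assumes all three counts are nonzero.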




1995 AP Newswire Collocations




           1995 AP Newswire Non-
                Collocations




   Problems with mutual information
• It is unreliable for small counts. (But this is really a
  problem with the MLE)
• The second, and more serious, problem is that mutual
  information relates to estimated probability in a
  counterintuitive way:
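
One standard way to see this, offered as an illustration under the assumption that x and y are perfectly dependent (each occurs only with the other):

```latex
P(x,y) = P(x) = P(y)
\;\Rightarrow\;
I(x,y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)} = \log_2 \frac{1}{P(x)}
```

Among perfectly associated pairs, the score therefore grows as the pair gets rarer, so the least well-attested pairs rank highest.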




           Frequency-weighted MI
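
A minimal Python sketch of one common form of frequency weighting, assumed here for illustration: the joint count multiplied by the pointwise mutual information. Weighting by P(x, y) instead differs only by the constant factor 1/n and gives the same ranking. The name `weighted_mi` and the counts are hypothetical:

```python
import math

def weighted_mi(count_xy, count_x, count_y, n):
    """Joint count times PMI (bits); one assumed form of frequency weighting."""
    pmi = math.log2((count_xy * n) / (count_x * count_y))
    return count_xy * pmi

# Hypothetical counts: a pair seen twice (both words seen only together)
# versus a pair seen 40 times.
rare = weighted_mi(2, 2, 2, 1_000_000)
frequent = weighted_mi(40, 2000, 500, 1_000_000)
print(rare < frequent)  # True
```

Plain PMI would rank the rare pair first; the weighting reverses that order.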




 1995 AP Newswire collocations




Problems with Frequency-Weighted
       Mutual Information



• Main problem is that it tends to overreward
  frequency




           Pearson’s χ-square
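
For a bigram, the test is usually applied to a 2×2 contingency table (x or not-x in first position, y or not-y in second). A minimal Python sketch using the standard closed form for 2×2 tables; the function name and the counts in the usage lines are assumptions for illustration:

```python
def chi_square_2x2(count_xy, count_x, count_y, n):
    """Pearson's chi-square for the bigram 'x y' from a 2x2 table.

    Assumes 0 < count_x < n and 0 < count_y < n so no marginal is zero.
    """
    o11 = count_xy                           # x followed by y
    o12 = count_x - count_xy                 # x followed by something else
    o21 = count_y - count_xy                 # something else followed by y
    o22 = n - count_x - count_y + count_xy   # neither
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

print(chi_square_2x2(1, 1000, 1000, 1_000_000))  # 0.0 (exactly the chance rate)
print(chi_square_2x2(10, 10, 10, 100))           # 100.0 (perfect association gives n)
```

With one degree of freedom, values above 3.84 reject independence at the 0.05 level.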




 1995 AP Newswire collocations




           Problems with χ-square




   Dunning’s (1993) likelihood ratios

    \binom{n}{k} = \frac{n!}{(n-k)!\,k!}
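
A minimal Python sketch of the likelihood ratio for a bigram, following the usual binomial formulation: hypothesis 1 says P(y | x) = P(y | not x), hypothesis 2 says the two differ. The binomial coefficients are the same under both hypotheses and cancel in the ratio, so the code omits them. Function names and counts are assumptions for illustration:

```python
import math

def _xlogy(k, x):
    """k * log(x), with the convention 0 * log(0) = 0."""
    return 0.0 if k == 0 else k * math.log(x)

def _log_l(k, n, p):
    """Log binomial likelihood of k successes in n trials, coefficient omitted."""
    return _xlogy(k, p) + _xlogy(n - k, 1.0 - p)

def log_likelihood_ratio(count_xy, count_x, count_y, n):
    """-2 log lambda for the bigram 'x y'; assumes 0 < count_x < n, 0 < count_y < n."""
    p = count_y / n                            # P(y) under independence
    p1 = count_xy / count_x                    # P(y | x)
    p2 = (count_y - count_xy) / (n - count_x)  # P(y | not x)
    log_lambda = (_log_l(count_xy, count_x, p)
                  + _log_l(count_y - count_xy, n - count_x, p)
                  - _log_l(count_xy, count_x, p1)
                  - _log_l(count_y - count_xy, n - count_x, p2))
    return -2.0 * log_lambda

# Hypothetical counts: 40 observed where chance predicts 1.
print(log_likelihood_ratio(40, 2000, 500, 1_000_000) > 3.84)  # True
```

The statistic is asymptotically χ²-distributed with one degree of freedom, so the 3.84 threshold applies here as well.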




 1995 AP Newswire collocations




   Problems with likelihood ratios




     Some Chinese examples: MI




     Weighted mutual information




           χ-square




           Likelihood ratios




    Errors in the top 500 for each measure (10-million-
            character ROCLING Corpus)




  Extracting non-binary collocations




           Smadja’s 1993 Xtract
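
A simplified Python sketch of the kind of positional statistics that the first stage of Xtract relies on, as described in Smadja (1993): for a target word, count each collocate at every relative offset within a ±5-word window, then measure how peaked that histogram is. The function names are hypothetical, and the thresholds, the strength score, and the later stages are omitted:

```python
from collections import defaultdict

def position_histograms(tokens, target, window=5):
    """For each collocate of `target`, count occurrences at offsets
    -window..+window (offset 0 excluded)."""
    hist = defaultdict(lambda: defaultdict(int))
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                hist[tokens[j]][offset] += 1
    return hist

def spread(offset_counts, window=5):
    """Variance of a collocate's counts over the 2 * window positions."""
    offsets = [o for o in range(-window, window + 1) if o != 0]
    counts = [offset_counts.get(o, 0) for o in offsets]
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)

# Toy token sequence, for illustration only.
hist = position_histograms(["a", "b", "c", "a", "b"], target="a", window=2)
print(dict(hist["b"]))              # {1: 2, -2: 1}
print(spread(hist["b"], window=2))  # 0.6875
```

A high spread means the collocate favors particular positions relative to the target, which is the signature of a fixed expression rather than a loose topical association.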




                Summary
• Various statistical measures of collocation

• Each has its advantages and drawbacks

• Collocations are useful in a number of
  areas, which we’ll turn to next

