
COMP791A: Statistical Language Processing


           Collocations
             Chap. 5




                                 1
A collocation…
    is an expression of 2 or more words that
     correspond to a conventional way of saying
     things.
        broad daylight
        Why not? ?bright daylight or ?narrow darkness

        Big mistake but not ?large mistake
    overlap with the concepts of:
        terms, technical terms & terminological phrases
             Collocations extracted from technical domains
                  Ex: hydraulic oil filter, file transfer protocol



                                                                      2
Examples of Collocations

    strong tea
    weapons of mass destruction
    to make up
    to check in
    heard it through the grapevine
    he knocked at the door
    I made it all up




                                      3
Definition of a collocation

   (Choueka, 1988)
    [A collocation is defined as] “a sequence of two or more
      consecutive words, that has characteristics of a
      syntactic and semantic unit, and whose exact and
      unambiguous meaning or connotation cannot be derived
      directly from the meaning or connotation of its
      components."
   Criteria:
       non-compositionality
       non-substitutability
       non-modifiability
       non-translatable word for word


                                                               4
Non-Compositionality

   A phrase is compositional if its meaning can
    be predicted from the meaning of its parts
   Collocations have limited compositionality
       there is usually an element of meaning added to
        the combination
       Ex: strong tea
   Idioms are the most extreme examples of
    non-compositionality
       Ex: to hear it through the grapevine

                                                          5
Non-Substitutability

   We cannot substitute near-synonyms for the
    components of a collocation.
       Strong is a near-synonym of powerful
          strong tea ?powerful tea
        yellow describes the color of white wine just as well as white does
           white wine but ?yellow wine




                                                                  6
Non-modifiability

   Many collocations cannot be freely modified
    with additional lexical material or through
    grammatical transformations
       weapons of mass destruction --> ?weapons of
        massive destruction
       to be fed up to the back teeth --> ?to be fed up
        to the teeth in the back




                                                           7
Non-translatable (word for word)
   English:
        make a decision   but not ?take a decision
   French:
        prendre une décision   but not ?faire une décision

   to test whether a group of words is a
    collocation:
       translate it into another language
       if we cannot translate it word by word
       then it probably is a collocation

                                                        8
Linguistic Subclasses of Collocations

   Phrases with light verbs:
       Verbs with little semantic content in the collocation
       make, take, do…
   Verb particle/phrasal verb constructions
       to go down, to check out,…
   Proper nouns
       John Smith
   Terminological expressions
       concepts and objects in technical domains
       hydraulic oil filter

                                                                9
Why study collocations?
   In NLG or MT
       The output should be natural
          make a decision, not ?take a decision
   In lexicography
       Identify collocations to list them in a dictionary
       To distinguish the usage of synonyms or near-synonyms
   In parsing
       To give preference to most natural attachments
          plastic (can opener) rather than ?(plastic can) opener
    In corpus linguistics and psycholinguistics
       Ex: To study social attitudes towards different types of
        substances
         strong cigarettes/tea/coffee
         powerful drug


                                                                   10
A note on (near-)synonymy

   To determine if 2 words are synonyms-- Principle
    of substitutability:
        2 words are synonyms if they can be substituted for one
        another in some?/any? sentence without changing the
        meaning or acceptability of the sentence
           How big/large is this plane?
           Would I be flying on a big/large or small plane?

           Miss Nelson became a kind of big / ?? large sister to Tom.
           I think I made a big / ?? large mistake.




                                                                         11
A note on (near-)synonymy (con’t)

   True synonyms are rare...
   Whether 2 words count as (near-)synonyms depends on:
        shades of meaning:
             words may share a central core meaning but have
              different sense accents
       register/social factors
            speaking to a 4-yr old VS to graduate students!
       collocations:
            conventional way of saying something / fixed
             expression



                                                               12
Approaches to finding collocations

   Frequency
   Mean and Variance
   Hypothesis Testing
       t-test
        χ²-test
   Mutual Information




                                 13
Approaches to finding collocations

   --> Frequency
   Mean and Variance
   Hypothesis Testing
       t-test
        χ²-test
   Mutual Information




                                 14
    Frequency

    (Justeson & Katz, 1995)

    Hypothesis:
       if 2 words occur together very often, they must
        be interesting candidates for a collocation

    Method:
       Select the most frequently occurring bigrams
        (sequence of 2 adjacent words)


                                                       15
Results
   Not very interesting…
   Except for “New York”, all bigrams are
    pairs of function words

So, let’s pass the results through a part-of-speech filter
      Tag Pattern   Example
          AN        linear function
          NN        regression coefficient
          AAN       Gaussian random variable
          ANN       cumulative distribution function
          NAN       mean squared error
          NNN       class probability function
          NPN       degrees of freedom



                                                       16
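A minimal sketch in Python of the frequency method with the Justeson & Katz tag-pattern filter shown above. The sample tagged sentence and the collapsed tag set A/N/P/O are illustrative assumptions, not part of the slides.

```python
from collections import Counter

# Hypothetical input: (word, tag) pairs, with tags collapsed to
# A = adjective, N = noun, P = preposition, O = anything else.
tagged = [("linear", "A"), ("function", "N"), ("of", "P"),
          ("a", "O"), ("regression", "N"), ("coefficient", "N")]

# Tag patterns from the table above (Justeson & Katz, 1995)
PATTERNS = {"AN", "NN", "AAN", "ANN", "NAN", "NNN", "NPN"}

def candidate_collocations(tagged, lengths=(2, 3)):
    """Count n-grams whose POS-tag sequence matches one of the patterns."""
    counts = Counter()
    for n in lengths:
        for i in range(len(tagged) - n + 1):
            gram = tagged[i:i + n]
            tags = "".join(tag for _, tag in gram)
            if tags in PATTERNS:
                counts[" ".join(word for word, _ in gram)] += 1
    return counts

# The most frequent surviving n-grams are the collocation candidates
print(candidate_collocations(tagged).most_common(5))
```

On a real corpus the same counting runs over millions of tokens; the tag-pattern filter is what removes the function-word pairs that dominate the raw frequency list.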
Frequency + POS filter



Simple method that
  works very well




                         17
“Strong” versus “powerful”




   On a 14 million word corpus from the New York Times (Aug.-Nov. 1990)


                                                             18
Frequency: Conclusion
   Advantages:
       works well for fixed phrases
        Simple method & accurate results
        Requires little linguistic knowledge


   But: many collocations consist of two words in more
    flexible relationships
     she knocked on his door

     they knocked at the door

     100 women knocked on Donaldson’s door

     a man knocked on the metal front door



                                                     19
Approaches to finding collocations

   Frequency
   --> Mean and Variance
   Hypothesis Testing
       t-test
        χ²-test
   Mutual Information




                                 20
    Mean and Variance
   (Smadja et al., 1993)
   Looks at the distribution of distances between two words in
    a corpus
   looking for pairs of words with low variance
       A low variance means that the two words usually occur at about
        the same distance
       A low variance --> good candidate for collocation
   Need a Collocational Window to capture collocations of
    variable distances

                   knock           door
                   knock                      door



                                                                    21
    Collocational Window
   This is an example of a three word window.

   To capture 2-word collocations
     this is         this an
     is an           is example
      an example      an of
      example of      example a
     of a            of three
     a three         a word
     three word      three window
     word window




                                                 22
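A small sketch of how such word pairs can be generated (Python; the forward-only window and the example sentence are assumptions for illustration):

```python
def window_pairs(tokens, window=3):
    """Yield (w1, w2, offset) for every pair of words at most window - 1
    positions apart, with w1 preceding w2 (offset >= 1)."""
    for i, w1 in enumerate(tokens):
        for offset in range(1, window):
            if i + offset < len(tokens):
                yield w1, tokens[i + offset], offset

sentence = "this is an example of a three word window".split()
for w1, w2, offset in window_pairs(sentence, window=3):
    print(w1, w2, offset)   # reproduces the pair list on the slide above
```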
    Mean and Variance (con’t)
   The mean is the average offset (signed distance) between two
    words in a corpus
   The variance measures how much the individual offsets deviate
    from the mean
                                         i1
                                           n
                                              (di  d )2
                                 var 
                                               n 1
             n is the number of times the two words (two candidates) co-occur
             di is the offset of the ith pair of candidates
             d is the mean offset of all pairs of candidates


   If offsets (di) are the same in all co-occurrences
        --> variance is zero
        --> definitely a collocation
   If offsets (di) are randomly distributed
        --> variance is high
        --> not a collocation

                                                                            23
An Example
   window size = 11 around knock (5 left, 5 right)
       she knocked on his door
       they knocked at the door
       100 women knocked on Donaldson’s door
       a man knocked on the metal front door


    Mean d̄ = (3 + 3 + 5 + 5) / 4 = 4.0
    Std. deviation s = sqrt[ ((3 − 4.0)² + (3 − 4.0)² + (5 − 4.0)² + (5 − 4.0)²) / 3 ] ≈ 1.15



                                                                               24
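The same numbers can be checked with a few lines of Python (a sketch; the offsets are the four signed distances from the example above):

```python
from math import sqrt

# Offsets of "door" relative to "knocked" in the four example sentences
offsets = [3, 3, 5, 5]

n = len(offsets)
mean = sum(offsets) / n                                  # 4.0
var = sum((d - mean) ** 2 for d in offsets) / (n - 1)    # sample variance (n - 1)
std = sqrt(var)

print(mean, round(std, 2))   # 4.0 1.15
```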
Position histograms
   “strong…opposition”
       variance is low
       --> interesting
        collocation


   “strong…support”




   “strong…for”
       variance is high
       --> not interesting
        collocation



                              25
Mean and variance versus Frequency
std. dev. ~0 & mean offset ~1 --> would
be found by frequency method




std. dev. ~0 & high mean offset
--> very interesting, but would
not be found by frequency
method




      high deviation --> not
           interesting



                                          26
Mean & Variance: Conclusion

   good for finding collocations that have:
       looser relationship between words
       intervening material and relative position




                                                     27
Approaches to finding collocations

   Frequency
   Mean and Variance
   --> Hypothesis Testing
       t-test
        χ²-test
   Mutual Information




                                 28
Hypothesis Testing
     If 2 words are individually frequent… they will also frequently occur
      together…
    Frequent bigrams and low variance can be accidental
     (two words can co-occur by chance)
    We want to determine whether the co-occurrence is
     random or whether it occurs more often than chance
    This is a classical problem in statistics called
     Hypothesis Testing
         When two words co-occur, hypothesis testing measures
          how confident we can be that the co-occurrence is (or is not) due to chance




                                                                    29
Hypothesis Testing (con’t)
    We formulate a null hypothesis H0
      H0 : no real association (just chance…)
      H0 states what should be true if two words do not form a
       collocation
       if 2 words w1 and w2 do not form a collocation, then w1
        and w2 are independent of each other:
                   P(w1, w2) = P(w1) P(w2)

    We need a statistical test that tells us how probable or
     improbable it is that a certain combination occurs
    Statistical tests:
        t-test
        χ²-test


                                                                  30
Approaches to finding collocations

   Frequency
   Mean and Variance
   Hypothesis Testing
       --> t-test
        χ²-test
   Mutual Information




                                 31
Hypothesis Testing: the t-test

   (or Student's t-test)

   H0 states that: P(w1, w2) = P(w1) P(w2)
   We calculate the probability p-value that a
    collocation would occur if H0 was true
   If p-value is too low, we reject H0
        Typically if p is under a significance level of 0.05, 0.01, or
         0.001
   Otherwise, retain H0 as possible

                                                                   32
Some intuition
    Assume we want to compare the heights of men and women
    we cannot measure the height of every adult…
    so we take a sample of the population
    and make inferences about the whole population
    by comparing the sample means and the variation of each
     mean



    Ho: women and men are equally tall, on
     average
    We gather data from 10 men and 10 women


                                                               33
Some intuition (con't)
   t-test compares:
       the sample mean (computed from observed values)
        to an expected mean
   determines the likelihood (p-value) that the
    difference between the 2 means occurs by chance.
       a p-value close to 1 --> it is very likely that the expected
        and sample means are the same
       a small p-value (ex: 0.01) --> it is unlikely (only a 1 in 100
        chance) that such a difference would occur by chance
   so the lower the p-value --> the more certain we
    are that there is a significant difference between
    the observed and expected mean, so we reject H0
                                                                         34
Some intuition (con’t)
        t-test assigns a probability to describe the likelihood that
         the null hypothesis is true
                 high p-value --> Accept Ho
                 low p-value --> Reject Ho

        [Figures: t distribution (1-tailed) and t distribution (2-tailed),
         showing the critical value c (the value of t at which we decide to
         reject Ho) and the confidence level a = probability that the
         t-score > critical value c]
                                                                                                       35
Some intuition (con’t)
1.   Compute t score
2.   Consult the table of critical values with
     df = 18 (10+10-2)
3.   If t > critical value (value in table), then
     the 2 samples are significantly different
     at the probability level that is listed

    Assume t = 2.7
    if there is no difference in height
     between women and men (H0 is true)
     then the probability of finding t = 2.7 is
     between 0.025 & 0.01
    … that’s not much…
    so we reject the null hypothesis H0
    and conclude that there is a difference
     in height between men and women

    [Figure: probability table based on the t distribution (2-tailed test)]

                                                                                           36
The t-Test

   looks at the mean and variance of a sample of
    measurements
   the null hypothesis is that the sample is drawn
     from a distribution with mean μ
   The test :
       looks at the difference between the observed and
        expected means, scaled by the variance of the data
       tells us how likely one is to get a sample of that mean and
        variance
       assuming that the sample is drawn from a normal
         distribution with mean μ.

                                                                  37
The t-Statistic

              t = (x̄ − μ) / sqrt(s² / N)

                  the numerator is the difference between the observed
                   mean and the expected mean

                  x̄  is the sample mean
                  μ  is the expected mean of the distribution
                  s² is the sample variance
                  N  is the sample size




 the higher the value of t, the greater the confidence that:
     •there is a significant difference
     •it’s not due to chance
     •the 2 words are not independent


                                                                           38
t-Test for finding Collocations
    is w1 w2 a collocation?
    Think of a corpus of N words as a long
     sequence of N bigrams

    let's randomly generate one bigram:
        if the bigram is w1 w2 ==> success
        if the bigram is not w1 w2 ==> failure
        …in effect a Bernoulli trial



                                                  39
Bernoulli Distribution
    The distribution of a sequence of Bernoulli trials is the binomial distribution
   Each trial has only two outcomes (success or failure)
   The trials are independent
   There are a fixed number of trials

   Distribution has 2 parameters:
       nb of trials n
       probability of success p in 1 trial

   Example: Flipping a coin 10 times and counting the number
    of heads that occur
       Can only get a head or a tail (2 outcomes)
       The probability of success for each trial is p= ½
        The coin flips do not affect each other (independence)
       There are 10 coin flips (n = 10)




                                                                 40
Properties of binomial distribution
   Mean (or expectation) E(X) = μ = np
       Ex: Flipping a coin 10 times ==> E(head) = μ = 10 x ½ = 5
    Variance σ² = np(1−p)
        Ex: Flipping a coin 10 times ==> σ² = 10 x ½ x ½ = 2.5

   A binomial distribution is made of a sequence of
    independent trials (n > 1)
    If we only have 1 trial (n=1), we have a Bernoulli
     trial, and:
        σ² = p x (1−p)
        where p is the probability of success
        and (1−p) is the probability of failure

                                                                    41
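A quick numeric check of these formulas (a sketch only):

```python
def binomial_moments(n, p):
    """Mean and variance of a binomial distribution with n trials,
    success probability p per trial."""
    return n * p, n * p * (1 - p)

print(binomial_moments(10, 0.5))   # (5.0, 2.5)  -- the coin-flip example above
print(binomial_moments(1, 0.5))    # (0.5, 0.25) -- a single Bernoulli trial: p(1-p)
```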
t-Test: Example with collocations
   In a corpus:
       new occurs 15,828 times
       companies occurs 4,675 times
       new companies occurs 8 times
       there are 14,307,668 tokens overall

   Is new companies a collocation?

                   x μ
              t
                      s 2


                      N
                                              42
Example (Cont.)
      x̄ : the observed mean
              x̄ = 8 / 14 307 668 ≈ 5.591 × 10⁻⁷

      μ : if the null hypothesis is true, then:
           Independence assumption -- P(new companies) = P(new) P(companies)
           the probability of having new companies is expected to be
              μ = (15 828 / 14 307 668) × (4 675 / 14 307 668) ≈ 3.615 × 10⁻⁷

      s² = sample variance = p × (1 − p)
           where p is the probability of success according to the observations
            (i.e. getting the bigram new companies)
           p is small for most bigrams, so s² ≈ p
              s² ≈ p = 8 / 14 307 668 ≈ 5.591 × 10⁻⁷

      N : total number of bigrams = 14 307 668

                                                                             43
Example (Cont.)
    By applying the t-test, we have:

         t = (x̄ − μ) / sqrt(s² / N)
           = (5.591 × 10⁻⁷ − 3.615 × 10⁻⁷) / sqrt(5.591 × 10⁻⁷ / 14 307 668)
           ≈ 1



    With a significance level α = 0.005, the critical value is 2.576 (t should be at
     least 2.576)




   Since t=1 < 2.576
        we cannot reject the Ho
        so we cannot claim that new and companies form a collocation




                                                                                            44
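The whole calculation fits in a few lines (a sketch in Python; the counts are the ones from the slides above):

```python
from math import sqrt

N = 14_307_668                                  # total number of bigrams
c_new, c_companies, c_bigram = 15_828, 4_675, 8

x_bar = c_bigram / N                            # observed mean, ~5.591e-7
mu = (c_new / N) * (c_companies / N)            # expected mean under H0, ~3.615e-7
s2 = x_bar * (1 - x_bar)                        # Bernoulli variance, ~x_bar

t = (x_bar - mu) / sqrt(s2 / N)
print(round(t, 3))   # ~1.0, below the critical value 2.576, so H0 is not rejected
```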
t-test: Some results
   t-test applied to 10 bigrams that occur with frequency = 20

          t        C(w1)    C(w2)   C(w1 w2)   w1          w2
        4.4721        42       20         20   Ayatollah   Ruhollah
        4.4721        41       27         20   Bette       Midler
        1.2176     14093    14776         20   like        people
        0.8036     15019    15629         20   time        last

    The first two bigrams pass the t-test (t > 2.576): we can reject the
     null hypothesis, so they form collocations
    The last two fail the t-test (t < 2.576): we cannot reject the null
     hypothesis, so they do not form collocations


   Notes:
      Frequency-based method could not have seen the difference in
       these bigrams, because they all have the same frequency
      the t test takes into account the frequency of a bigram
       relative to the frequencies of its component words
              If a high proportion of the occurrences of both words occurs in
               the bigram, then its t is high.
         The t test is mostly used to rank collocations


                                                                                                   45
    Hypothesis testing of differences
   Used to see if 2 words (near-synonyms) are used in the
    same context or not
      “strong” vs “powerful”

   can be useful in lexicography
   we want to test:
       if there is a difference in 2 populations
             Ex: height of women / height of men
        the null hypothesis is that there is no difference
        i.e. the average difference is 0 (μ = 0)
                     t = (x̄₁ − x̄₂) / sqrt( s₁²/n₁ + s₂²/n₂ )

                     x̄₁ is the sample mean of population 1
                     x̄₂ is the sample mean of population 2
                     s₁² is the sample variance of population 1
                     s₂² is the sample variance of population 2
                     n₁ is the sample size of population 1
                     n₂ is the sample size of population 2

                                                                                 46
Difference test example

   Is there a difference in how we use “powerful”
    and how we use “strong”?
      t        C(w) C(strong w)   C(powerful w)   Word
      3.1622     933       0              10      computers
      2.8284    2377       0               8      computer
      2.4494     289       0               6      symbol
      2.2360    2266       0               5      Germany
      7.0710   3685       50               0      support
      6.3257   3616       58               7      enough
      4.6904    986       22               0      safety
      4.5825    3741      21               0      sales




                                                              47
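Under the same Bernoulli approximation used earlier (s² ≈ p), the corpus size cancels out and the t score reduces to (C(v1 w) − C(v2 w)) / sqrt(C(v1 w) + C(v2 w)). A sketch reproducing a few rows of the table above (the sign depends on which count is passed first):

```python
from math import sqrt

def t_difference(c1, c2):
    """Approximate two-sample t score for comparing how often two
    near-synonyms v1, v2 occur with the same word w, using only the
    counts C(v1 w) and C(v2 w) (s^2 ~ p approximation)."""
    return (c1 - c2) / sqrt(c1 + c2)

print(t_difference(10, 0))   # powerful vs strong "computers": ~3.16
print(t_difference(50, 0))   # strong vs powerful "support":   ~7.07
print(t_difference(58, 7))   # strong vs powerful "enough":    ~6.33
```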
Approaches to finding collocations

   Frequency
   Mean and Variance
   Hypothesis Testing
       t-test
        --> χ²-test
   Mutual Information




                                 48
Hypothesis testing: the χ²-test

     a problem with the t-test is that it assumes
      that probabilities are approximately
      normally distributed…
     the χ²-test does not make this assumption
     The essence of the χ²-test is the same as
      the t-test
        Compare observed frequencies and expected
         frequencies for independence
        if the difference is large
        then we can reject the null hypothesis of
         independence

                                                     49
χ²-test
   In its simplest form, it is applied to a 2x2
    table of observed frequencies
   The χ² statistic:
       sums the differences between the observed frequencies
        (in the table)
       and the frequencies expected under independence
       scaled by the magnitude of the expected frequencies:

               X² = Σᵢ,ⱼ (Obsᵢⱼ − Expᵢⱼ)² / Expᵢⱼ

             i ranges over rows
             j ranges over columns
             Obsᵢⱼ is the observed value for cell (i, j)
             Expᵢⱼ is the expected value




                                                                                       50
χ²-test: Example

   Observed frequencies Obsij

Observed                   w1 = new            w1 ≠ new                  TOTAL
w2 = companies                    8               4 667                    4 675
                    (new companies) (ex: old companies)             c(companies)
w2 ≠ companies              15 820           14 287 181               14 303 001
                 (ex: new machines) (ex: old machines)             c(~companies)
TOTAL                       15 828           14 291 848               14 307 676
                             c(new)             c(~new)   N = 4 675 + 14 303 001
                                                           = 15 828 +14 291 848




                                                                                   51
    χ²-test: Example (con’t)
   Expected frequencies Expᵢⱼ
        Under independence
        Computed from the marginal probabilities (the totals of the rows and columns
         converted into proportions)

          Expected                           w1 = new                        w1 ≠ new
          w2 = companies                         5.17                        4 669.83
                             c(new) x c(companies) / N      c(~new) x c(companies) / N
                            15828 x 4675 / 14307676       14291848 x 4675 / 14307676
          w2 ≠ companies                    15 822.83                   14 287 178.17
                            c(new) x c(~companies) / N     c(~new) x c(~companies) / N
                          15828 x 14303001 / 14307676   14291848 x 14303001 / 14307676

        Ex: expected frequency for cell (1,1) (new companies)
             marginal probability of new occurring as the first word of a bigram, times the
              marginal probability of companies occurring as the second word, times N:

                 Exp₁₁ = ((8 + 15 820) / N) × ((8 + 4 667) / N) × N ≈ 5.17

        If “new” and “companies” occurred completely independently of each other
        we would expect 5.17 occurrences of “new companies” on average



                                                                                                   52
    χ²-test: Example (con’t)
   But is the difference significant?

     χ² = (8 − 5.17)²/5.17 + (4 667 − 4 669.83)²/4 669.83
          + (15 820 − 15 822.83)²/15 822.83 + (14 287 181 − 14 287 178.17)²/14 287 178.17
        ≈ 1.55

      df in an r x c table = (r−1)(c−1) = (2−1)(2−1) = 1 (degree of freedom)




   At a significance level of α = 0.05, the critical value is 3.84
   Since 1.55 < 3.84:
      So we cannot reject H0 (that new and companies occur independently
         of each other)
        So new companies is not a good candidate for a collocation



                                                                                                     53
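A short sketch of the full 2x2 χ² computation (Python; the expected counts are rebuilt from the marginal totals exactly as on the previous slides):

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 table of observed counts,
    with expected counts computed from the marginals (independence)."""
    observed = [[o11, o12], [o21, o22]]
    n = o11 + o12 + o21 + o22
    row_totals = [o11 + o12, o21 + o22]
    col_totals = [o11 + o21, o12 + o22]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# "new companies" table from the slides above
print(round(chi_square_2x2(8, 4_667, 15_820, 14_287_181), 2))   # ~1.55 < 3.84
```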
χ²-test: Conclusion

   Differences between the t-statistic and the χ²
    statistic do not seem to be large
   But:
       the χ²-test is appropriate for large probabilities
            where the t-test fails because of the normality
             assumption
       the χ²-test is not appropriate with sparse data (if the numbers in
        the 2-by-2 table are small)
   The χ²-test has been applied to a wider range of
    problems
       Machine translation
       Corpus similarity

                                                                    54
χ²-test for machine translation
   (Church & Gale, 1991)
   To identify translation word pairs in aligned corpora
   Ex:                                 Nb of aligned sentence pairs
                                                    containing “cow” in English and
                                                          “vache” in French

           Observed        “cow”       ~”cow”        TOTAL
           frequency
           “vache”              59              6            65
           ~”vache”                8   570 934         570 942
           TOTAL                67     570 940         571 007

   2 = 456 400 >> 3.84 (with a= 0.05)
   So “vache” and “cow” are not independent… and so are translations
    of each other

                                                                                  55
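For a 2x2 table the statistic also has a closed form, χ² = N (O11·O22 − O12·O21)² / ((O11+O12)(O21+O22)(O11+O21)(O12+O22)), which is convenient here (a sketch; counts taken from the table above):

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Closed-form Pearson chi-square for a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    return numerator / denominator

# cow / vache sentence-alignment counts
print(round(chi_square_2x2(59, 6, 8, 570_934)))   # ~456 400, far above 3.84
```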
χ²-test for corpus similarity

   (Kilgarriff & Rose, 1998)
   Ex:
           Observed    Corpus 1       Corpus 2   Ratio
           frequency
           Word1                 60          9    60/9 =6.7
           Word2                500         76           6.6
           Word3                124         20           6.2
           …                      …          …            …
           Word500                …          …            …

   Compute 2 for the 2 populations (corpus1 and corpus2)
   Ho: the 2 corpora have the same word distribution



                                                               56
Collocations across corpora

   Ratios of relative frequencies between two or more
    different corpora
    can be used to discover collocations that are characteristic
     of a corpus when compared to another corpus
              Likelihood       NY Times NY Times           w1 w 2
                 ratio          (1990)   (1989)
            0.0241                     2       68      Karim Obeid
            (2/14 307 668) /
            (68/11 731 564)
            0.0372                      2         44   East Berliners
            0.0372                      2         44   Miss Manners
            0.0399                      2         41   17 earthquake
            …                           …          …        ……
            TOTAL              14 307 668 11 731 564




                                                                        57
    Collocations across corpora (con’t)

   most useful for the discovery of subject-
    specific collocations
        Compare a general text with a subject-specific
         text
        words and phrases that (on a relative basis)
         occur most often in the subject-specific text
         are likely to be part of the vocabulary that is
         specific to the domain



                                                           58
Approaches to finding collocations

   Frequency
   Mean and Variance
   Hypothesis Testing
       t-test
        χ²-test
   --> Mutual Information




                                 59
Pointwise Mutual Information
   Uses a measure from information-theory
   Pointwise mutual information between 2 events x
    and y (in our case the occurrence of 2 words) is
    roughly:
       a measure of how much one event (e.g. a word) tells us
        about the other
       or a measure of the independence of 2 events (or 2
        words)
          If 2 events x and y are independent, then I(x,y) = 0




                                                                  60
Essential Information Theory
(back to section 2.2 of book)

   Developed by Shannon in the 40s
   To maximize the amount of information that
    can be transmitted over an imperfect
    communication channel (the noisy channel)
   Notion of entropy (informational content):
       How informative is a piece of information?
            ex. How informative is the answer to a question
            If you already have a good guess about the answer, the
             actual answer is less informative… low entropy


                                                                      61
Entropy - intuition
   Ex: Betting 1$ on the flip of a coin
       If the coin is fair:
            Expected gain is ½ (+1) + ½ (-1) = 0$
             So you’d be willing to pay up to 1$ for advance information
                (1$ - 0$ average win)

       If the coin is rigged
            P(head) = 0.99            P(tail) = 0.01

            assuming you bet on head (!)
            Expected gain is 0.99(+1) + 0.01(-1) = 0.98$
             So you’d be willing to pay up to 2¢ for advance information
                (1$ - 0.98$ average win)

       Entropy of fair coin is 1$ > entropy of rigged coin 0.02$

                                                                            62
Entropy
   Let X be a discrete Random Variable
     (e.g. the outcome of tossing a coin, with possible values xᵢ)
   Entropy (or self-information)
                  H(X) = − Σᵢ₌₁ⁿ p(xᵢ) log₂ p(xᵢ)

   measures the amount of information in a RV
      average uncertainty of a RV

      the average length of the message needed to transmit an
       outcome xi of that variable
      the size of the search space consisting of the possible values
       of a RV and its associated probabilities
   measured in bits
   Properties:
      H(X) ≥ 0

      If H(X) = 0 then it provides no new information


                                                                    63
Example: The coin flip
     Fair coin:    H(X) = − ( ½ log₂ ½  +  ½ log₂ ½ )  =  1 bit

     Rigged coin:  H(X) = − ( 99/100 log₂ 99/100  +  1/100 log₂ 1/100 )  ≈  0.08 bits
            [Figure: entropy H(X) as a function of P(head); maximal (1 bit) at P(head) = 0.5]
                                                                                  64
Example: Simplified Polynesian
   In simplified Polynesian, we have 6 letters with
    frequencies:
                                          p       t       k       a       i       u
                                     1/8 1/4 1/8 1/4 1/8 1/8
        The per-letter entropy is
           H(P) = − Σᵢ∈{p,t,k,a,i,u} p(i) log₂ p(i)
                = − ( 4 × ⅛ log₂ ⅛  +  2 × ¼ log₂ ¼ )
                = 2.5 bits


        We can design a code that on average takes 2.5bits to
         transmit a letter
                                              p       t       k       a       i       u
                                          100     00      101     01      110     111

        Can be viewed as the average nb of yes/no questions you
         need to ask to identify the outcome (ex: is it a ‘t’? Is it a ‘p’?)


                                                                                          65
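The 2.5-bit figure is easy to verify (a sketch; the distribution is the one in the table above):

```python
from math import log2

def entropy(probs):
    """H(X) = -sum p * log2(p), in bits (zero-probability outcomes contribute 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Simplified Polynesian letter probabilities: p t k a i u
print(entropy([1/8, 1/4, 1/8, 1/4, 1/8, 1/8]))   # 2.5 bits

print(entropy([0.5, 0.5]))                        # fair coin: 1.0 bit
print(round(entropy([0.99, 0.01]), 2))            # rigged coin: ~0.08 bits
```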
Entropy in NLP

   Entropy is a measure of uncertainty
       The more we know about something the lower its entropy
   So if a language model captures more of the
    structure of the language, then its entropy should
    be lower
   in NLP, language models are compared by using
    their entropy.
       ex: given 2 grammars and a corpus, we use entropy to
        determine which grammar better matches the corpus.



                                                               66
Mutual Information
   H(X) - H(X|Y) = H(Y) - H(Y|X) = I(X;Y)

   The reduction in uncertainty of a RV by knowing about
    another RV
   e.g. if you see "Merry"… how surprised are you if it is
    followed by:
      "hippopotamus"

            --> very surprised so I(Merry; hippo) ≈ 0
       "Christmas"
            --> not surprised so I(Merry; Christmas) is very high
                 I(x; y) = log₂ [ p(x, y) / ( p(x) p(y) ) ]      (also known as pointwise
                                                                   mutual information)


                                                                      67
Example: Finding Collocations
   Assume:
       c(Ayatollah) = 42
       c(Ruhollah) = 20
       c(Ayatollah, Ruhollah) = 20
       N = 14 307 668
    Then:
           I(x; y) = log₂ [ p(x, y) / ( p(x) p(y) ) ]

           I(Ayatollah; Ruhollah)
               = log₂ [ (20 / 14 307 668) / ( (42 / 14 307 668) × (20 / 14 307 668) ) ]
               ≈ 18.38

   So? The amount of information we have about the occurrence of “Ayatollah” at
    position i increases by 18.38 bits if we are told that “Ruhollah” occurs at position i+1
   works particularly badly with sparse data

                                                                            68
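The PMI computation in a few lines (a sketch; counts from the slide above, probabilities estimated by maximum likelihood):

```python
from math import log2

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information log2( p(x,y) / (p(x) p(y)) )
    from bigram and unigram counts over n bigrams."""
    return log2((c_xy / n) / ((c_x / n) * (c_y / n)))

N = 14_307_668
print(round(pmi(20, 42, 20, N), 2))            # Ayatollah Ruhollah: ~18.38
print(round(pmi(20, 14_093, 14_776, N), 2))    # like people:        ~0.46
```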
Pointwise Mutual Information (con’t)
   With pointwise mutual information:
       I(w1;w2)   C(w1)    C(w2)    C(w1 w2)              w1            w2
          18.38       42       20              20 Ayatollah     Ruhollah
          17.98       41       27              20 Bette         Midler
           0.46   14093     14776              20 like          people
           0.29   15019     15629              20 time          last

   With t-test (see p.43 of slides)
           t      C(w1)    C(w2)    C(w1 w2)              w1           w2
         4.4721       42       20              20 Ayatollah    Ruhollah
         4.4721       41       27              20 Bette        Midler
         1.2176   14093     14776              20 like         people
         0.8036    15019    15629              20 time         last


   Same ranking as t-test




                                                                             69
Pointwise Mutual Information (con’t)
   good measure of independence
       values close to 0 --> independence

   bad measure of dependence
       because PMI does not depend on frequency
       all things being equal, bigrams of low frequency words will
        receive a higher score than bigrams of high frequency words

        so sometimes we use C(w1 w2) × I(w1; w2) instead




                                                                  70
Automatic vs manual detection of collocations

   Manual detection finds a wider variety of grammatical patterns
       Ex: in the BBI combinatory dictionary of English

                      strength              power
                      to build up ~         to assume ~
                      to find ~             emergency ~
                      to save ~             discretionary ~
                      to sap somebody's ~   fire ~
                      brute ~               supernatural ~
                      tensile ~             to turn off the ~
                      the ~ to [do X]       the ~ to [do X]


   The quality of manually detected collocations is better than that of computer-generated ones
   But… manual detection is slow and requires expertise


                                                                     71

								