SIMS Applied Natural Language Processing, Marti Hearst


Applied Natural Language Processing

Marti Hearst
Oct 9, 2006


 Finish Conditional Probabilities and Bayesian Learning
 Intro to Classification; Identification of

    Conditional Probability

        A way to reason about the outcome of an experiment
        based on partial information
             In a word guessing game the first letter for the word
             is a “t”. What is the likelihood that the second letter
             is an “h”?
             How likely is it that a person has a disease given that
             a medical test was negative?
             A spot shows up on a radar screen. How likely is it
             that it corresponds to an aircraft?

Slide adapted from Dan Jurafsky's
    Conditional Probability
         Conditional probability specifies the probability given
         that the values of some other random variables are known.
                P(Sneeze | Cold) = 0.8
                P(Cold | Sneeze) = 0.6
         The probability of a sneeze given a cold is 80%.
         The probability of a cold given a sneeze is 60%.

Slides adapted from Mary Ellen Califf
    More precisely

        Given an experiment, a corresponding sample space S, and the
        probability law
        Suppose we know that the outcome is within some given event B
             The first letter was “t”
        We want to quantify the likelihood that the outcome also belongs
        to some other given event A.
             The second letter will be “h”
        We need a new probability law that gives us the conditional
        probability of A given B
        P(A|B) “the probability of A given B”

    Joint Probability Distribution

      The joint probability distribution for a set of random variables X1…Xn
      gives the probability of every combination of values

                     Sneeze             ¬Sneeze
            Cold       0.08              0.01
           ¬Cold       0.01              0.9

      The probability of all possible cases can be calculated by summing
      the appropriate subset of values from the joint distribution.
      All conditional probabilities can therefore also be calculated
           P(Cold | ¬Sneeze)
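As a small sketch of how a conditional probability falls out of the joint table, the Python below sums the appropriate cells to get the marginal and then divides. The dictionary keys and function names are illustrative choices, not part of the original slides.

```python
# Joint distribution from the table: P(cold status, sneeze status).
joint = {
    ("cold", "sneeze"): 0.08,
    ("cold", "no_sneeze"): 0.01,
    ("no_cold", "sneeze"): 0.01,
    ("no_cold", "no_sneeze"): 0.90,
}

def marginal_sneeze(sneeze):
    """P(sneeze status), summed over both cold states."""
    return sum(p for (_, s), p in joint.items() if s == sneeze)

def cold_given_sneeze(cold, sneeze):
    """P(cold status | sneeze status) = joint cell / marginal."""
    return joint[(cold, sneeze)] / marginal_sneeze(sneeze)

p = cold_given_sneeze("cold", "no_sneeze")
print(round(p, 4))  # 0.01 / 0.91
```

With these table values, P(Cold | ¬Sneeze) comes out to about 0.011, since only 0.01 of the 0.91 probability mass for "no sneeze" also involves a cold.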

    An intuition

    •   Let’s say A is “it’s raining”.
    •   Let’s say P(A) in dry California is .01
    •   Let’s say B is “it was sunny ten minutes ago”
    •   P(A|B) means
          •   “what is the probability of it raining now if it was sunny 10
              minutes ago”
    • P(A|B) is probably way less than P(A)
          • Perhaps P(A|B) is .0001
    • Intuition: The knowledge about B should change our estimate of
      the probability of A.

    Conditional Probability
       Let A and B be events
            P(A, B) and P(A ∩ B) both mean “the probability that
            BOTH A and B occur”
       P(B|A) = the probability of event B occurring given
       that event A occurs
       Definition: P(A|B) = P(A ∩ B) / P(B)

            $$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

            P(A, B) = P(A|B) * P(B)        (simple arithmetic)
            P(A, B) = P(B, A)

Bayes Theorem

 We start with conditional probability definition:

 $$P(A \mid B) = \frac{P(A, B)}{P(B)}$$
 So say we know how to compute P(A|B). What if we
 want to figure out P(B|A)? We can re-arrange the
 formula using Bayes Theorem:

 $$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}$$
Deriving Bayes Rule
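$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

$$P(A \mid B)\, P(B) = P(A \cap B) \qquad P(B \mid A)\, P(A) = P(A \cap B)$$

$$P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$$

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$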
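A quick numeric check of the rule, as a sketch: the helper name `bayes` is an assumption for illustration, and the numbers are read off the Cold/Sneeze joint table shown earlier in the lecture.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Values from the Cold/Sneeze joint table.
p_cold_and_sneeze = 0.08
p_cold = 0.08 + 0.01    # row sum: P(Cold)
p_sneeze = 0.08 + 0.01  # column sum: P(Sneeze)

# Definition of conditional probability.
p_sneeze_given_cold = p_cold_and_sneeze / p_cold

# Flip the conditioning with Bayes' rule.
p_cold_given_sneeze = bayes(p_sneeze_given_cold, p_cold, p_sneeze)

# Bayes' rule must agree with the direct definition P(A ∩ B) / P(B).
direct = p_cold_and_sneeze / p_sneeze
print(abs(p_cold_given_sneeze - direct) < 1e-12)
```

Both routes give the same value because each side of the rule is just P(A ∩ B) rewritten with a different conditional.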
    How to compute probabilities?
         We don’t have the probabilities for most NLP problems
         We can try to estimate them from data
              (that’s the learning part)
         Usually we can’t actually estimate the probability that
         something belongs to a given class given the
         information about it
         BUT we can estimate the probability that something
         in a given class has particular values.

    Simple Bayesian Reasoning
         If we assume there are n possible disjoint tags, t1 … tn
                        P(ti | w) = P(w | ti) P(ti) / P(w)
              Want to know the probability of the tag given the word.

              P(w | ti) = number of times we see this word with this tag
              divided by how often we see the tag

              P(w | ti) = (count of word w with tag i) / (count of tag i in corpus)

              P(ti) = (count of tag i in corpus) / (count of all tags)

              P(w) = (count of word w in corpus) / (count of all words)
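The count-based estimates above can be sketched in a few lines of Python. The toy corpus, the tag names, and the function name `p_tag_given_word` are all hypothetical and chosen only for illustration; the normalizer P(w) is included so the result is a proper probability.

```python
from collections import Counter

# Hypothetical toy corpus of (word, tag) pairs; not from the lecture.
corpus = [
    ("the", "DT"), ("can", "NN"), ("can", "MD"), ("rust", "VB"),
    ("the", "DT"), ("can", "NN"), ("is", "VB"), ("full", "JJ"),
]

pair_counts = Counter(corpus)                       # count(word w with tag t)
tag_counts = Counter(tag for _, tag in corpus)      # count(tag t)
word_counts = Counter(word for word, _ in corpus)   # count(word w)
total = len(corpus)                                 # one tag per word token

def p_tag_given_word(tag, word):
    """P(t | w) = P(w | t) * P(t) / P(w), all estimated from counts.
    Assumes the tag and the word each occur at least once in the corpus."""
    p_w_given_t = pair_counts[(word, tag)] / tag_counts[tag]
    p_t = tag_counts[tag] / total
    p_w = word_counts[word] / total
    return p_w_given_t * p_t / p_w

print(p_tag_given_word("NN", "can"))
print(p_tag_given_word("MD", "can"))
```

In this toy corpus "can" appears three times, twice as NN and once as MD, so the two estimates are about 2/3 and 1/3. Real corpora would need smoothing for unseen word and tag pairs, which this sketch leaves out.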

Some notation

  $\prod_{i=1}^{n} P(f_i \mid \text{Sentence})$
 This means that you multiply the probabilities of all the features:
  P(f1 | S) * P(f2 | S) * … * P(fn | S)

 There is a similar symbol, $\sum$, for summation.

Naïve Bayes Classifier
 The simpler version of Bayes was:
   P(B|A) = P(A|B) P(B) / P(A)
   P(Sentence | feature) = P(feature | Sentence) P(Sentence) / P(feature)

 Using Naïve Bayes, we expand the number of features by
 defining a joint probability distribution:
   $$P(\text{Sentence}, f_1, f_2, \ldots, f_n) = P(\text{Sentence}) \prod_{i=1}^{n} P(f_i \mid \text{Sentence})$$
   We learn P(Sentence) and P(fi | Sentence) in training

 Test: we need to state P(Sentence | f1, f2, … fn)
    P(Sentence | f1, f2, … fn) =
       P(Sentence, f1, f2, … fn) / P(f1, f2, … fn)
     Bayes Independence Example
         If there are many kinds of evidence, we need to combine them
         By assuming independence, we ignore the possible interactions:

         Imagine there are diagnoses ALLERGY, COLD, and WELL
         Symptoms SNEEZE, COUGH, and FEVER

    Prob                    Well        Cold              Allergy
    P(d)                    0.9         0.05              0.05
    P(sneeze|d)             0.1         0.9               0.9
    P(cough | d)            0.1         0.8               0.7
    P(fever | d)            0.01        0.7               0.4

   Bayes Independence Example
         If symptoms are: sneeze & cough & no fever:
         P(well | s, c, not(f)) = P(e | well) P(well) / P(e), where e is the evidence (s, c, not(f))
          = P(s | well) * P(c | well) * (1 - P(f | well)) * P(well) / P(e)
          = (0.1)(0.1)(0.99)(0.9)/P(e) = 0.0089/P(e)

         P(cold | e) = (.05)(0.9)(0.8)(0.3)/P(e) = 0.0108/P(e)
         P(allergy | e) = (.05)(0.9)(0.7)(0.6)/P(e) = 0.0189/P(e)

         P(e) = .0089 + .0108 + .0189 = .0386
              P(well | e) = .23
              P(cold | e) = .28
              P(allergy | e) = .49

         Diagnosis: allergy
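The same calculation can be sketched in Python using the table of priors and symptom likelihoods from the previous slide. The function name `posterior` and the dictionary layout are illustrative assumptions. The code does not round the intermediate products, so hand calculations that do round them can differ in the second decimal place.

```python
# Priors P(d) and likelihoods P(symptom | d) from the example table.
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
likelihoods = {
    "sneeze": {"well": 0.1, "cold": 0.9, "allergy": 0.9},
    "cough": {"well": 0.1, "cold": 0.8, "allergy": 0.7},
    "fever": {"well": 0.01, "cold": 0.7, "allergy": 0.4},
}

def posterior(evidence):
    """evidence maps each symptom to True (present) or False (absent).
    Symptoms are treated as independent given the diagnosis."""
    scores = {}
    for diagnosis, prior in priors.items():
        score = prior
        for symptom, present in evidence.items():
            p = likelihoods[symptom][diagnosis]
            score *= p if present else 1.0 - p
        scores[diagnosis] = score      # P(e | d) * P(d)
    p_e = sum(scores.values())         # P(e): sum over disjoint diagnoses
    return {d: s / p_e for d, s in scores.items()}

post = posterior({"sneeze": True, "cough": True, "fever": False})
best = max(post, key=post.get)
print(best, {d: round(p, 2) for d, p in post.items()})
```

Unrounded, the posteriors are roughly 0.23 for well, 0.28 for cold, and 0.49 for allergy, so the diagnosis is allergy.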

Kupiec et al. Feature Representation
 Fixed-phrase feature
   Certain phrases indicate summary, e.g. “in summary”
 Paragraph feature
   Paragraph initial/final more likely to be important.
 Thematic word feature
   Repetition is an indicator of importance
 Uppercase word feature
   Uppercase often indicates named entities. (Taylor)
 Sentence length cut-off
   Summary sentence should be > 5 words.

   Details: Bayesian Classifier
$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{P(F_1, F_2, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, F_2, \ldots, F_k)}$$

Assuming statistical independence:

$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\; P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$$

P(s ∈ S | F1, F2, … Fk): probability that sentence s is included in summary S, given that sentence’s feature-value pairs
P(Fj | s ∈ S): probability of feature-value pair occurring in a source sentence which is also in the summary
P(s ∈ S): compression rate
P(Fj): probability of feature-value pair occurring in a source sentence
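A minimal sketch of scoring sentences with the independence formula follows. Every number in it is invented for illustration and is not taken from Kupiec et al.; the feature names and the function `log_score` are likewise hypothetical. Each feature is treated as a binary feature-value pair, and the work is done in log space to avoid multiplying many small numbers.

```python
import math

# All probabilities below are made up for illustration only.
P_SUMMARY = 0.2  # P(s in S), the compression rate

# feature -> (P(F = true | s in S), P(F = true))
FEATURE_PROBS = {
    "fixed_phrase": (0.20, 0.05),
    "paragraph_initial": (0.50, 0.25),
    "has_uppercase": (0.40, 0.35),
}

def log_score(values):
    """values maps every feature name to True or False.
    Returns log of  P(s in S) * prod_j P(F_j | s in S) / prod_j P(F_j)."""
    score = math.log(P_SUMMARY)
    for name, present in values.items():
        p_true_given_s, p_true = FEATURE_PROBS[name]
        numerator = p_true_given_s if present else 1.0 - p_true_given_s
        denominator = p_true if present else 1.0 - p_true
        score += math.log(numerator) - math.log(denominator)
    return score

sentences = {
    "s1": {"fixed_phrase": True, "paragraph_initial": True, "has_uppercase": False},
    "s2": {"fixed_phrase": False, "paragraph_initial": False, "has_uppercase": True},
    "s3": {"fixed_phrase": False, "paragraph_initial": False, "has_uppercase": False},
}
ranked = sorted(sentences, key=lambda s: log_score(sentences[s]), reverse=True)
print(ranked)
```

Because independence is only an approximation, the value is best read as a ranking score: the top-ranked sentences are the ones proposed for the summary, and the exponentiated score is not guaranteed to stay below 1.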
Language Identification

Language identification
  Tutti gli esseri umani nascono liberi ed eguali
  in dignità e diritti. Essi sono dotati di
  ragione e di coscienza e devono agire gli uni
  verso gli altri in spirito di fratellanza.

  Alle Menschen sind frei und gleich an Würde und
  Rechten geboren. Sie sind mit Vernunft und
  Gewissen begabt und sollen einander im Geist
  der Brüderlichkeit begegnen.

Universal Declaration of Human Rights, UN, in 363 languages

Language identification


 How do we determine, for a stretch of text, which
 language it is from?

Language Identification
Turns out to be really simple
Just a few character bigrams can do it   (Sibun & Reynar 96)
  Used Kullback-Leibler distance (relative entropy)
  Compare the probability distribution of the test text to
  those of the languages trained on
  Smallest distance determines the language
  Using special character sets helps a bit, but barely
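The idea can be sketched with character bigrams and relative entropy. This is a toy illustration under stated assumptions, not the exact procedure of Sibun & Reynar: the training data is just the two Universal Declaration excerpts quoted earlier, the add-alpha smoothing and the function names are choices made here, and real systems train on far more text.

```python
import math
from collections import Counter

TRAIN = {
    "italian": "Tutti gli esseri umani nascono liberi ed eguali in dignità e "
               "diritti. Essi sono dotati di ragione e di coscienza e devono "
               "agire gli uni verso gli altri in spirito di fratellanza.",
    "german": "Alle Menschen sind frei und gleich an Würde und Rechten geboren. "
              "Sie sind mit Vernunft und Gewissen begabt und sollen einander "
              "im Geist der Brüderlichkeit begegnen.",
}

def bigram_counts(text):
    """Count character bigrams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def kl_divergence(test_counts, train_counts, vocab_size, alpha=0.5):
    """KL(test || train), with add-alpha smoothing on the training side."""
    test_total = sum(test_counts.values())
    train_total = sum(train_counts.values())
    kl = 0.0
    for bigram, count in test_counts.items():
        p = count / test_total
        q = (train_counts[bigram] + alpha) / (train_total + alpha * vocab_size)
        kl += p * math.log(p / q)
    return kl

def identify(text, train=TRAIN):
    """Return (best language, all distances); smallest distance wins."""
    models = {lang: bigram_counts(sample) for lang, sample in train.items()}
    test_counts = bigram_counts(text)
    vocab = set(test_counts)
    for counts in models.values():
        vocab.update(counts)
    scores = {lang: kl_divergence(test_counts, counts, len(vocab))
              for lang, counts in models.items()}
    return min(scores, key=scores.get), scores

print(identify("Sie sind frei und gleich")[0])
print(identify("tutti gli esseri umani")[0])
```

Smoothing matters because a test bigram that never appeared in a language's training text would otherwise give a zero probability and an infinite distance.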

Language Identification
(Sibun & Reynar 96)

Confusion Matrix

 A table that shows, for each class, which ones your
 algorithm got right and which wrong

                             Gold standard

 Algorithm’s guess
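A minimal sketch of building such a table from two parallel label lists. The labels are hypothetical, and the layout (gold-standard labels across the columns, the algorithm's guesses down the rows) is an assumed reading of the slide.

```python
from collections import Counter

def confusion_matrix(gold, guessed):
    """Count how often each (gold label, guessed label) pair occurs."""
    return Counter(zip(gold, guessed))

# Hypothetical gold-standard labels and classifier output.
gold = ["de", "de", "it", "it", "it", "fr"]
guessed = ["de", "it", "it", "it", "fr", "fr"]

cm = confusion_matrix(gold, guessed)
labels = sorted(set(gold) | set(guessed))

print("guess \\ gold", labels)
for guess in labels:  # one row per guessed label
    print(guess, [cm[(g, guess)] for g in labels])

# Correct decisions sit on the diagonal, where gold == guess.
accuracy = sum(cm[(label, label)] for label in labels) / len(gold)
print(round(accuracy, 2))
```

Off-diagonal cells show which classes get confused with which, something a single accuracy figure hides.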

Author Identification

Author Identification

 Also called Stylometry in the humanities

 An example of a Classification Problem

    Decide which of N buckets to put an item in
    (Some classifiers allow for multiple buckets)

The Disputed Federalist Papers
In 1787-1788, Jay, Madison, and Hamilton
wrote a series of anonymous essays to
convince the voters of New York to ratify the
new U.S. Constitution.
Scholars have reached a consensus that:
   5 authored by Jay
  51 authored by Hamilton
  14 authored by Madison
    3 jointly by Hamilton and Madison

  12 remain in dispute … Hamilton or Madison?

Author identification

 Federalist papers
   In 1963 Mosteller and Wallace solved the problem

   They identified function words as good candidates for
   authorship analysis

   Using statistical inference they concluded the author
   was Madison

   Since then, other statistical techniques have
   supported this conclusion.

Function vs. Content Words

   High rates for “by” favor M, low favor H
   High rates for “from” favor M, low says little
   High rates for “to” favor H, low favor M
Function vs. Content Words

      No consistent pattern for “war”
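The word rates discussed above are simple to compute. The sketch below counts occurrences per thousand tokens; the tokenizer and the function name `rates_per_thousand` are assumptions made for illustration, and the statistical inference that Mosteller and Wallace built on top of such rates is not shown.

```python
import re
from collections import Counter

def rates_per_thousand(text, words):
    """Occurrences of each word per 1000 tokens of text.
    Tokens are runs of letters or apostrophes, lowercased."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    return {w: 1000.0 * counts[w] / len(tokens) for w in words}

# Hypothetical sample text, not from the Federalist Papers.
sample = "To be, or not to be"
rates = rates_per_thousand(sample, ["to", "by", "from"])
print(rates)
```

Computing these rates for papers of known authorship and for a disputed paper gives the feature values that a classifier would then compare.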
              Federalist Papers Problem

Fung, The Disputed Federalist Papers: SVM Feature Selection
Via Concave Minimization, ACM TAPIA’03

 Can Pseudonymity Really Guarantee Privacy?
   Rao and Rohatgi, 2000

Next Time

 Guest lecture by Elizabeth Charnock and Steve
 Roberts of Cataphora

