LING 581: Advanced Computational Linguistics
Lecture Notes
January 12th
                   Course
• Webpage for lecture slides
  – http://dingo.sbs.arizona.edu/~sandiway/ling581-12/
• Meeting information
            Course Objectives
• Follow-on course to LING/C SC/PSYC 438/538
  Computational Linguistics:
  – continue with J&M: 25 chapters, a lot of material not covered
    in 438/538
• And gain project experience
  –   dealing with natural language software packages
  –   Installation, input data formatting
  –   operation
  –   project exercises
  –   useful “real-world” computational experience
  – abilities gained will be of value to employers
        Computational Facilities
•   We advise using your own laptop/desktop
    – we can also make use of this computer lab
         • but you don’t have installation rights on these computers
•   Platforms
    Windows is possible but you really should run some variant of Unix…
    (your task #1 for this week)
    – Linux (separate bootable partition or via virtualization software)
         •   de facto standard for advanced/research software
         •   https://www.virtualbox.org/ (free!)
    – Cygwin on Windows
         •   http://www.cygwin.com/
         •   Linux-like environment for Windows making it possible to port software running on POSIX
             systems (such as Linux, BSD, and Unix systems) to Windows.
    – Mac OS X
         •   not quite Linux; some porting issues, especially with C programs; can also use VirtualBox
              Grading
• Completion of all homework tasks and the
  course project will result in a
  satisfactory grade (A)
Homework Task 1: Install Tregex

                    Runs in Java
          Last semester
• 438/538: Language models
    Language Models and N-grams
•   given a word sequence
    – w1 w2 w3 ... wn
•   chain rule
    –   how to compute the probability of a sequence of words
    –   p(w1 w2) = p(w1) p(w2|w1)
    –   p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1w2)
    –   ...
    –   p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2)... p(wn|w1...wn-2 wn-1)

•   note
    – It’s not easy to collect (meaningful) statistics on p(wn|wn-1wn-2...w1)
      for all possible word sequences
    Language Models and N-grams
•   Given a word sequence
    – w1 w2 w3 ... wn
•   Bigram approximation
    –   just look at the previous word only (not all the preceding words)
    –   Markov Assumption: finite length history
    –   1st order Markov Model
    –   p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1...wn-3wn-2wn-1)

    –   p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)


•   note
    – p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than
      p(wn|w1...wn-2 wn-1)
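As a concrete illustration (a minimal Python sketch added here, not from the original slides; the tiny probability table and the <s> start symbol are invented for the example), a sentence's probability under the bigram approximation is just the product of the p(wi|wi-1) terms:

    # Minimal sketch: score a sentence as a product of p(wi | wi-1) terms
    # (1st-order Markov / bigram approximation). The probability table and
    # the <s> start symbol are made up purely for illustration.
    from functools import reduce

    bigram_p = {
        ("<s>", "the"): 0.5,
        ("the", "cat"): 0.1,
        ("cat", "sleeps"): 0.2,
    }

    def sentence_prob(words, table):
        padded = ["<s>"] + words
        probs = [table.get((prev, cur), 0.0) for prev, cur in zip(padded, padded[1:])]
        return reduce(lambda a, b: a * b, probs, 1.0)

    print(sentence_prob(["the", "cat", "sleeps"], bigram_p))   # 0.5 * 0.1 * 0.2 = 0.01 (up to float rounding)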
    Language Models and N-grams
•   Trigram approximation
    – 2nd order Markov Model
    – just look at the preceding two words only
    – p(w1 w2 w3 w4...wn) = p(w1) p(w2|w1) p(w3|w1w2) p(w4|w1w2w3) ... p(wn|w1...wn-3wn-2wn-1)

    – p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w1w2) p(w4|w2w3) ... p(wn|wn-2wn-1)



•   note
    – p(wn|wn-2wn-1) is a lot easier to estimate well than p(wn|w1...wn-2 wn-1)
      but harder than p(wn|wn-1 )
 Language Models and N-grams
• estimating from corpora
  – how to compute bigram probabilities
  –   p(wn|wn-1) = f(wn-1wn) / Σw f(wn-1w)      (sum over all words w)

  –   since Σw f(wn-1w) = f(wn-1)               (f(wn-1) = unigram frequency for wn-1)

  –   p(wn|wn-1) = f(wn-1wn) / f(wn-1)          (relative frequency)



• Note:
  – The technique of estimating (true) probabilities using a
    relative frequency measure over a training corpus is known
    as maximum likelihood estimation (MLE)
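A minimal Python sketch of this MLE computation (added for illustration; the toy corpus is invented): count unigrams and bigrams, then take relative frequencies.

    # MLE (relative-frequency) estimate of bigram probabilities.
    # The toy corpus is invented purely for illustration.
    from collections import Counter

    corpus = "the cat sat on the mat the cat slept".split()
    unigram_f = Counter(corpus)
    bigram_f = Counter(zip(corpus, corpus[1:]))

    def p_mle(w_prev, w):
        # p(w | w_prev) = f(w_prev w) / f(w_prev)
        return bigram_f[(w_prev, w)] / unigram_f[w_prev]

    print(p_mle("the", "cat"))   # f(the cat)/f(the) = 2/3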
     Motivation for smoothing
• Smoothing: avoid zero probability estimates
• Consider
     p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2)...p(wn|wn-1)
• what happens when any individual probability
  component is zero?
   – Arithmetic multiplication law: 0×X = 0
   – very brittle!
• even in a very large corpus, many possible n-grams
  over vocabulary space will have zero frequency
   – particularly so for larger n-grams
        Language Models and N-grams
  •    Example: table of wn-1wn bigram frequencies, wn-1 unigram frequencies,
       and the bigram probabilities computed from them (table omitted)
       – sparse matrix: zeros render probabilities unusable
         (we’ll need to add fudge factors - i.e. do smoothing)
            Smoothing and N-grams
•   sparse dataset means zeros are a problem
    – Zero probabilities are a problem
        •   p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2)...p(wn|wn-1)      bigram model
        • one zero and the whole product is zero
    – Zero frequencies are a problem
        •   p(wn|wn-1) = f(wn-1wn)/f(wn-1)                                 relative frequency
        • bigram f(wn-1wn) doesn’t exist in dataset


•   smoothing
    – refers to ways of assigning zero probability n-grams a non-zero value
               Smoothing and N-grams
•   Add-One Smoothing (4.5.1 Laplace Smoothing)
     –   add 1 to all frequency counts
     –   simple and no more zeros (but there are better methods)
•   unigram
     –   p(w) = f(w)/N                   (before Add-One)
           •   N = size of corpus
     –   p(w) = (f(w)+1)/(N+V)           (with Add-One)
     –   f*(w) = (f(w)+1)*N/(N+V)        (with Add-One)
           •   V = number of distinct words in corpus
           •   N/(N+V) is a normalization factor adjusting for the effective increase in the corpus
               size caused by Add-One (must rescale so that total probability mass stays at 1)
•   bigram
     –   p(wn|wn-1) = f(wn-1wn)/f(wn-1)                          (before Add-One)
     –   p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V)                  (after Add-One)
     –   f*(wn-1wn) = (f(wn-1wn)+1)*f(wn-1)/(f(wn-1)+V)          (after Add-One)
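A minimal Python sketch of the Add-One formulas above (added for illustration; it reuses the invented toy corpus from the earlier MLE sketch):

    # Add-One (Laplace) smoothing: bigram probability and reconstituted count.
    # The toy corpus is invented purely for illustration.
    from collections import Counter

    corpus = "the cat sat on the mat the cat slept".split()
    unigram_f = Counter(corpus)
    bigram_f = Counter(zip(corpus, corpus[1:]))
    V = len(unigram_f)                        # number of distinct words

    def p_addone(w_prev, w):
        # p(w | w_prev) = (f(w_prev w) + 1) / (f(w_prev) + V)
        return (bigram_f[(w_prev, w)] + 1) / (unigram_f[w_prev] + V)

    def f_star(w_prev, w):
        # f* = (f(w_prev w) + 1) * f(w_prev) / (f(w_prev) + V)
        return (bigram_f[(w_prev, w)] + 1) * unigram_f[w_prev] / (unigram_f[w_prev] + V)

    print(p_addone("the", "sat"))   # unseen bigram "the sat" now gets a non-zero probability
    print(f_star("the", "cat"))     # seen bigram "the cat": count discounted from 2 to 1.0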
              Smoothing and N-grams
•   Add-One Smoothing
     –   add 1 to all frequency counts
•   bigram
     –   p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V)
     –   f*(wn-1wn) = (f(wn-1wn)+1)*f(wn-1)/(f(wn-1)+V)
•   frequencies (figures 6.4 and 6.8 omitted)
     –   Remarks: perturbation problem
     –   add-one causes large changes in some frequencies due to the relative size of V (1616)
     –   e.g. f(want to): 786 → 338
              Smoothing and N-grams
•   Add-One Smoothing
     –   add 1 to all frequency counts
•   bigram
     –   p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V)
     –   f*(wn-1wn) = (f(wn-1wn)+1)*f(wn-1)/(f(wn-1)+V)
•   probabilities (figures 6.5 and 6.7 omitted)
     –   Remarks: perturbation problem
     –   add-one causes similar changes in the probabilities
               Smoothing and N-grams
• let’s illustrate the problem (probability mass diagram omitted)
   –   take the bigram case: wn-1wn
   –   p(wn|wn-1) = f(wn-1wn)/f(wn-1)
   –   suppose there are words w'1 ... w'm whose bigrams with wn-1 don’t occur in the corpus:
       f(wn-1w'1) = 0, ..., f(wn-1w'm) = 0
              Smoothing and N-grams
• add-one (probability mass diagram omitted)
   –   “give everyone 1”: f(wn-1wn)+1
   –   f(wn-1w'1) = 1, ..., f(wn-1w'm) = 1
               Smoothing and N-grams
• add-one (probability mass diagram omitted)
   –   “give everyone 1”: f(wn-1wn)+1
   –   f(wn-1w'1) = 1, ..., f(wn-1w'm) = 1
• redistribution of probability mass
   –   p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V)
   –   V = |{wi}| (vocabulary size)
     Smoothing and N-grams

• Excel spreadsheet available
  – addone.xls
             Smoothing and N-grams
•   Good-Turing Discounting (4.5.2)
     –   Nc = number of things (= n-grams) that occur c times in the corpus
     –   N = total number of things seen
     –   Formula: smoothed c for Nc given by c* = (c+1)Nc+1/Nc
     –   Idea: use frequency of things seen once to estimate frequency of things we haven’t seen yet
     –   estimate N0 in terms of N1…
     –   Formula: P*(things with zero freq) = N1/N
     –   smaller impact than Add-One


•   Textbook Example:
     –   Fishing in lake with 8 species
     –   Bass, carp, catfish, eel, perch, salmon, trout, whitefish
     –   Sample data:
     –   10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel

     –   P(new fish unseen) = 3/18
     –   P(next fish=trout) = 1/18 (but, we have reassigned probability mass…)
     –   C*(trout) = 2*N2/N1=2(1/3)=0.67 (discounted from 1)
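A minimal Python sketch reproducing the fishing numbers above (added for illustration; it just applies the two formulas on this slide to the sample counts):

    # Good-Turing discounting on the textbook fishing example.
    from collections import Counter

    counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
    N = sum(counts.values())            # 18 fish observed
    Nc = Counter(counts.values())       # Nc[c] = number of species seen exactly c times

    # probability mass reserved for unseen species: N1 / N
    p_unseen = Nc[1] / N                # 3/18

    def c_star(c):
        # discounted count: c* = (c + 1) * N(c+1) / Nc
        return (c + 1) * Nc[c + 1] / Nc[c]

    print(p_unseen)    # 0.1666...
    print(c_star(1))   # 2 * N2/N1 = 2 * (1/3) = 0.67  (trout discounted from 1)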
 Language Models and N-grams

• N-gram models
  – they’re technically easy to compute
    • (in the sense that lots of training data are
      available)
  – but just how good are these n-gram
    language models?
  – and what can they show us about
    language?
        Language Models and N-grams
•       Approximating Shakespeare
         –   generate random sentences using n-grams
         –   train on complete works of Shakespeare
•       Unigram (pick random, unconnected words): example sentences omitted
•       Bigram: example sentences omitted
        Language Models and N-grams
•       Approximating Shakespeare (section 6.2)
         –   generate random sentences using n-grams
         –   train on complete works of Shakespeare
•       Trigram: example sentences omitted
•       Quadrigram: example sentences omitted
•       Remarks: dataset size problem
         –   training set is small: 884,647 words, 29,066 different words
         –   29,066^2 = 844,832,356 possible bigrams
         –   for the random sentence generator, this means very limited choices for possible
             continuations, which means the program can’t be very innovative for higher n
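As a rough sketch of what such a random sentence generator does (added for illustration; this is not the code behind the slides’ examples, and the tiny training text is only a stand-in for the Shakespeare corpus):

    # Sketch of a bigram random-sentence generator: sample each next word
    # from p(w | previous word) estimated by relative frequency.
    # The training text below is a placeholder, not the Shakespeare corpus.
    import random
    from collections import defaultdict, Counter

    text = "<s> to be or not to be </s> <s> to sleep perchance to dream </s>".split()

    successors = defaultdict(Counter)          # f(w_prev w), grouped by the history word
    for prev, cur in zip(text, text[1:]):
        successors[prev][cur] += 1

    def generate(max_len=20):
        word, out = "<s>", []
        for _ in range(max_len):
            counter = successors[word]
            word = random.choices(list(counter), weights=list(counter.values()))[0]
            if word == "</s>":
                break
            out.append(word)
        return " ".join(out)

    print(generate())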
    Language Models and N-grams
•   Aside: http://hemispheresmagazine.com/contests/2004/intro.htm
 Language Models and N-grams

• N-gram models + smoothing
  – one consequence of smoothing is that
  – every possible concatenation or sequence
    of words has a non-zero probability
           Colorless green ideas
•   examples
     – (1) colorless green ideas sleep furiously
     – (2) furiously sleep ideas green colorless
•   Chomsky (1957):
     – . . . It is fair to assume that neither sentence (1) nor (2) (nor
       indeed any part of these sentences) has ever occurred in an
       English discourse. Hence, in any statistical model for
       grammaticalness, these sentences will be ruled out on
       identical grounds as equally `remote' from English. Yet (1),
       though nonsensical, is grammatical, while (2) is not.
•   idea
     – (1) is syntactically valid, (2) is word salad
•   Statistical Experiment (Pereira 2002)
           Colorless green ideas
•   examples
     – (1) colorless green ideas sleep furiously
     – (2) furiously sleep ideas green colorless
•   Statistical Experiment (Pereira 2002)
     – bigram language model over words wi-1 wi (figure omitted)
    Interesting things to Google
•   example
     – colorless green ideas sleep furiously
•   First hit
    Interesting things to Google
•   example
     – colorless green ideas sleep furiously
•   first hit
     – compositional semantics
     –   a green idea is, according to well-established usage of the word "green", one that
         is new and untried.
     –   again, a colorless idea is one without vividness, dull and unexciting.
     –   so it follows that a colorless green idea is a new, untried idea that is
         without vividness, dull and unexciting.
     –   to sleep is, among other things, to be in a state of dormancy or inactivity,
         or in a state of unconsciousness.
     –   to sleep furiously may seem a puzzling turn of phrase but one reflects that
         the mind in sleep often indeed moves furiously with ideas and images
         flickering in and out.
    Interesting things to Google
•   example
     – colorless green ideas sleep furiously
•   another hit: (a story)
     – "So this is our ranking system," said Chomsky. "As you can see,
       the highest rank is yellow."
     – "And the new ideas?"
     – "The green ones? Oh, the green ones don't get a color until
       they've had some seasoning. These ones, anyway, are still too
       angry. Even when they're asleep, they're furious. We've had to
       kick them out of the dormitories - they're just unmanageable."
     – "So where are they?"
     – "Look," said Chomsky, and pointed out of the window. There below,
        on the lawn, the colorless green ideas slept, furiously."

				