					          Three Basic Problems
1. Compute the probability of a text (observation)
   •   language modeling – evaluate alternative texts and
2. Compute maximum probability tag (state) sequence
   •   Tagging/classification
               $\arg\max_{T_{1,N}} P_m(T_{1,N} \mid W_{1,N})$
3. Compute maximum likelihood model
   •   training / parameter estimation
                     $\arg\max_m P_m(W_{1,N})$
     Compute Text Probability

• Recall: $P(W,T) = \prod_i P(t_{i-1} \to t_i)\, P(w_i \mid t_i)$
• Text probability: need to sum P(W,T) over
  all possible tag sequences – an exponential
  number of them (Nt^N for Nt tags and N words)
• Dynamic programming approach – similar
  to the Viterbi algorithm
• Will be used also for estimating model
  parameters from an untagged corpus
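
To make the exponential sum concrete, here is a minimal brute-force sketch that enumerates every tag sequence and adds up P(W,T). The two-tag model and all of its numbers (`TAGS`, `START`, `TRANS`, `EMIT`) are made up for illustration and are not from the slides; `START` plays the role of the transitions out of t0.

```python
from itertools import product

# Hypothetical toy model; every number here is an assumption for illustration.
TAGS = ["N", "V"]
START = {"N": 0.6, "V": 0.4}                      # m(t0 -> ti)
TRANS = {("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.8, ("V", "V"): 0.2}        # m(ti -> tj)
EMIT = {("N", "fish"): 0.6, ("N", "sleep"): 0.4,
        ("V", "fish"): 0.3, ("V", "sleep"): 0.7}  # m(w | t)


def joint_prob(words, tags):
    """P(W, T): product of transition and emission probabilities."""
    p, prev = 1.0, None
    for w, t in zip(words, tags):
        p *= START[t] if prev is None else TRANS[(prev, t)]
        p *= EMIT[(t, w)]
        prev = t
    return p


def text_prob_brute_force(words):
    """Sum P(W, T) over all len(TAGS) ** len(words) tag sequences."""
    return sum(joint_prob(words, tags)
               for tags in product(TAGS, repeat=len(words)))


print(text_prob_brute_force(["fish", "sleep"]))
```

The loop visits Nt^N sequences, so it is only usable for very short inputs; the dynamic programming approach avoids this blow-up.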
                Forward Algorithm
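Define:          $A_i(k) = P(w_{1,k},\ t_k = t_i)$, where $t_k$ is the tag at position k and $t_i$ is the i-th tag;
                 $N_t$ – total number of tags

1. For i = 1 To $N_t$: $A_i(1) = m(t_0 \to t_i)\, m(w_1 \mid t_i)$
2. For k = 2 To N; For j = 1 To $N_t$:
     i.     $A_j(k) = \left[ \sum_{i=1}^{N_t} A_i(k-1)\, m(t_i \to t_j) \right] m(w_k \mid t_j)$

3.        Then:
          $P_m(W_{1,N}) = \sum_{i=1}^{N_t} A_i(N)$

Complexity = $O(N_t^2 N)$ (like Viterbi, with $\sum$ instead of max)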
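
The forward recursion can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `forward`, the dictionary layout, and the toy two-tag model are assumptions, and the table is 0-based, so `A[k][t]` corresponds to the value at position k+1.

```python
def forward(words, tags, start, trans, emit):
    """Return (A, P(W)), with A[k][t] = P(w_1..w_{k+1}, tag at k+1 = t)."""
    # Initialization: A_i(1) = m(t0 -> ti) * m(w1 | ti)
    A = [{t: start[t] * emit[(t, words[0])] for t in tags}]
    # Recursion: A_j(k) = [sum_i A_i(k-1) m(ti -> tj)] * m(wk | tj)
    for w in words[1:]:
        prev = A[-1]
        A.append({tj: sum(prev[ti] * trans[(ti, tj)] for ti in tags) * emit[(tj, w)]
                  for tj in tags})
    # Termination: P(W) = sum_i A_i(N)
    return A, sum(A[-1].values())


# Hypothetical toy model; all numbers are assumptions for illustration.
tags = ["N", "V"]
start = {"N": 0.6, "V": 0.4}
trans = {("N", "N"): 0.3, ("N", "V"): 0.7, ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("N", "fish"): 0.6, ("N", "sleep"): 0.4,
        ("V", "fish"): 0.3, ("V", "sleep"): 0.7}

A, prob = forward(["fish", "sleep"], tags, start, trans, emit)
print(prob)
```

The two nested loops over tags inside the loop over words give the O(Nt^2 N) cost. For long texts the products underflow, so a practical version would rescale each column or work in log space.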
                        Forward Algorithm
[Figure: forward trellis over words w1, w2, w3 and tags t1 to t5. The node for tag ti in column k holds Ai(k); the arcs into t1 carry the transition probabilities m(t1→t1), m(t2→t1), m(t3→t1), m(t4→t1), m(t5→t1).]
          Backward Algorithm
Define $B_i(k) = P(w_{k+1,N} \mid t_k = t_i)$

1. For i = 1 To $N_t$: $B_i(N) = 1$
2. For k = N-1 To 1; For j = 1 To $N_t$:
  i. $B_j(k) = \sum_{i=1}^{N_t} \left[ m(t_j \to t_i)\, m(w_{k+1} \mid t_i)\, B_i(k+1) \right]$
3. Then:
    $P_m(W_{1,N}) = \sum_{i=1}^{N_t} m(t_0 \to t_i)\, m(w_1 \mid t_i)\, B_i(1)$

Complexity = $O(N_t^2 N)$
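
A matching sketch of the backward recursion follows. As before, the function name `backward`, the data layout, and the toy model are assumptions made for illustration; on the same input it should give the same text probability as a forward pass.

```python
def backward(words, tags, start, trans, emit):
    """Return (B, P(W)), with B[k][t] = P(w_{k+2}..w_N | tag at k+1 = t)."""
    n = len(words)
    B = [None] * n
    # Initialization: B_i(N) = 1
    B[n - 1] = {t: 1.0 for t in tags}
    # Recursion: B_j(k) = sum_i m(tj -> ti) m(w_{k+1} | ti) B_i(k+1)
    for k in range(n - 2, -1, -1):
        B[k] = {tj: sum(trans[(tj, ti)] * emit[(ti, words[k + 1])] * B[k + 1][ti]
                        for ti in tags)
                for tj in tags}
    # Termination: P(W) = sum_i m(t0 -> ti) m(w1 | ti) B_i(1)
    total = sum(start[t] * emit[(t, words[0])] * B[0][t] for t in tags)
    return B, total


# Hypothetical toy model; all numbers are assumptions for illustration.
tags = ["N", "V"]
start = {"N": 0.6, "V": 0.4}
trans = {("N", "N"): 0.3, ("N", "V"): 0.7, ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("N", "fish"): 0.6, ("N", "sleep"): 0.4,
        ("V", "fish"): 0.3, ("V", "sleep"): 0.7}

B, prob = backward(["fish", "sleep"], tags, start, trans, emit)
print(prob)
```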
                        Backward Algorithm
[Figure: backward trellis over words w1, w2, w3 and tags t1 to t5. The node for tag ti in column k holds Bi(k); arcs carry transition probabilities such as m(t2→t1), and the initial arcs m(t0→ti) into the first column yield Pm(W1,3).]
 Estimation from Untagged Corpus:
  EM – Expectation-Maximization
1. Start with some initial model
2. Compute the probability of (virtually) each state
    sequence given the current model
3. Use this probabilistic tagging to produce
    probabilistic counts for all parameters, and use
    these probabilistic counts to estimate a revised
    model, which increases the likelihood of the
    observed output W in each iteration
4. Repeat until convergence
Note: No labeled training required. Initialize by
    lexicon constraints regarding possible POS for
    each word (cf. “noisy counting” for PP’s)
• $a_{ij}$ = Estimate of $P(t_i \to t_j)$
• $b_{jk}$ = Estimate of $P(w_k \mid t_j)$
• $A_i(k) = P(w_{1,k},\ t_k = t_i)$
     (from Forward algorithm)
• $B_i(k) = P(w_{k+1,N} \mid t_k = t_i)$
     (from Backward algorithm)
  Estimating transition probabilities

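Define $p_k(i,j)$ as the probability of traversing arc $t_i \to t_j$ at
  time k given the observations:
$p_k(i,j) = P(t_k = t_i,\ t_{k+1} = t_j \mid W)$
          $= P(t_k = t_i,\ t_{k+1} = t_j,\ W) / P(W)$
          $= \frac{A_i(k)\, a_{ij}\, b_{j,k+1}\, B_j(k+1)}{\sum_{r=1}^{N_t} A_r(k)\, B_r(k)}$
          $= \frac{A_i(k)\, a_{ij}\, b_{j,k+1}\, B_j(k+1)}{\sum_{r=1}^{N_t} \sum_{s=1}^{N_t} A_r(k)\, a_{rs}\, b_{s,k+1}\, B_s(k+1)}$
where $b_{j,k+1}$ is the estimate of $P(w_{k+1} \mid t_j)$, the emission of the word at position k+1.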
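
The arc posterior can be computed directly from the forward and backward tables. The sketch below is one way to do it under stated assumptions: tags and words are encoded as integer indices, `start[i]` stands for the transition out of t0, and the helper names (`forward`, `backward`, `arc_posteriors`) and the toy numbers are hypothetical. Positions are 0-based, so `p[k]` covers the arc between positions k+1 and k+2 of the slides.

```python
def forward(obs, start, a, b):
    n = len(start)
    A = [[start[i] * b[i][obs[0]] for i in range(n)]]
    for w in obs[1:]:
        A.append([sum(A[-1][i] * a[i][j] for i in range(n)) * b[j][w]
                  for j in range(n)])
    return A


def backward(obs, a, b):
    n = len(a)
    B = [[1.0] * n]
    for k in range(len(obs) - 2, -1, -1):
        # B[0] is still the column for position k+1 while this row is built.
        B.insert(0, [sum(a[j][i] * b[i][obs[k + 1]] * B[0][i] for i in range(n))
                     for j in range(n)])
    return B


def arc_posteriors(obs, start, a, b):
    """p[k][i][j] = P(tag at k = i, tag at k+1 = j | W), 0-based k = 0..N-2."""
    A, B = forward(obs, start, a, b), backward(obs, a, b)
    n = len(a)
    p = []
    for k in range(len(obs) - 1):
        norm = sum(A[k][r] * B[k][r] for r in range(n))  # equals P(W)
        p.append([[A[k][i] * a[i][j] * b[j][obs[k + 1]] * B[k + 1][j] / norm
                   for j in range(n)] for i in range(n)])
    return p


# Hypothetical toy model: 2 tags, 2 word types, observation sequence 0 1 0.
start = [0.6, 0.4]
a = [[0.3, 0.7], [0.8, 0.2]]
b = [[0.6, 0.4], [0.3, 0.7]]
p = arc_posteriors([0, 1, 0], start, a, b)
print(p[0])
```

Each `p[k]` is a distribution over tag pairs, so its entries sum to 1; that is a cheap sanity check on any implementation.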
             Expected transitions
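• Define $g_i(k) = P(t_k = t_i \mid W)$, then:
  $g_i(k) = \sum_{j=1}^{N_t} p_k(i,j)$

• Now note that (the sums stop at N-1 because $p_k$ describes the arc from position k to k+1):
   – Expected number of transitions from tag i =
         $\sum_{k=1}^{N-1} g_i(k)$

   – Expected transitions from tag i to tag j =
         $\sum_{k=1}^{N-1} p_k(i,j)$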
      Re-estimation of Maximum
        Likelihood Parameters
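• $a'_{ij} = \frac{\text{expected \# of transitions from tag } i \text{ to } j}{\text{expected \# of transitions from tag } i}$
         $= \frac{\sum_{k=1}^{N-1} p_k(i,j)}{\sum_{k=1}^{N-1} g_i(k)}$

• $b'_{ik} = \frac{\text{expected \# of observations of } w_k \text{ for tag } i}{\text{expected \# of transitions from tag } i}$
         $= \frac{\sum_{r:\, w_r = w_k} \sum_{j=1}^{N_t} p_r(i,j)}{\sum_{r=1}^{N-1} g_i(r)}$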
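
Given the arc posteriors, the two ratios above reduce to sums over a small table. The following sketch assumes the posteriors are supplied as a nested list `p[k][i][j]` with integer-coded tags and words; the function name `reestimate` and the hand-made input are hypothetical. It follows the formulas literally, so emission counts only cover the positions that have an outgoing arc.

```python
def reestimate(p, obs, n_tags, n_words):
    """Return (a_new, b_new) from arc posteriors p[k][i][j], k = 0..N-2."""
    # g[k][i] = sum_j p_k(i, j): probability of being in tag i at position k
    g = [[sum(p_k[i]) for i in range(n_tags)] for p_k in p]
    # Expected number of transitions out of each tag (the shared denominator)
    from_i = [sum(g_k[i] for g_k in g) for i in range(n_tags)]
    a_new = [[sum(p_k[i][j] for p_k in p) / from_i[i] for j in range(n_tags)]
             for i in range(n_tags)]
    b_new = [[sum(g[k][i] for k in range(len(p)) if obs[k] == w) / from_i[i]
              for w in range(n_words)]
             for i in range(n_tags)]
    return a_new, b_new


# Hand-made posteriors for a 3-word text (values chosen only for illustration).
p = [[[0.5, 0.0], [0.0, 0.5]],
     [[0.25, 0.25], [0.5, 0.0]]]
a_new, b_new = reestimate(p, obs=[0, 1, 0], n_tags=2, n_words=2)
print(a_new, b_new)
```

A tag whose expected count is zero would cause a division by zero here; a practical version would smooth or skip such tags.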
             EM Algorithm
1. Choose initial model = <a, b, g(1)>
2. Repeat until results don't improve (much):
  1. Compute p_k based on current model, using
     Forward & Backward algorithms to compute
     A and B (Expectation for counts)
  2. Compute new model <a', b', g'(1)>
     (Maximization of parameters)

Note: Output likelihood is guaranteed not to
   decrease in any iteration, but EM might
   converge to a local maximum!
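
The loop above can be sketched end to end for a single short observation sequence. This is a sketch under assumptions, not the slides' exact procedure: tags and words are integer-coded, all names and toy numbers are hypothetical, and the emission update uses the state posteriors A_i(k)B_i(k)/P(W) at all N positions (the textbook Baum-Welch form), which is the variant for which the likelihood guarantee holds. No rescaling is done, so it only suits short inputs.

```python
def forward(obs, start, a, b):
    n = len(start)
    A = [[start[i] * b[i][obs[0]] for i in range(n)]]
    for w in obs[1:]:
        A.append([sum(A[-1][i] * a[i][j] for i in range(n)) * b[j][w]
                  for j in range(n)])
    return A


def backward(obs, a, b):
    n = len(a)
    B = [[1.0] * n]
    for k in range(len(obs) - 2, -1, -1):
        B.insert(0, [sum(a[j][i] * b[i][obs[k + 1]] * B[0][i] for i in range(n))
                     for j in range(n)])
    return B


def em_step(obs, start, a, b, n_words):
    """One E step plus M step; also returns P(W) under the incoming model."""
    n, N = len(a), len(obs)
    A, B = forward(obs, start, a, b), backward(obs, a, b)
    lik = sum(A[-1])
    # E step: state posteriors g and arc posteriors p
    g = [[A[k][i] * B[k][i] / lik for i in range(n)] for k in range(N)]
    p = [[[A[k][i] * a[i][j] * b[j][obs[k + 1]] * B[k + 1][j] / lik
           for j in range(n)] for i in range(n)] for k in range(N - 1)]
    # M step: ratios of expected counts
    new_a = [[sum(p[k][i][j] for k in range(N - 1)) /
              sum(g[k][i] for k in range(N - 1)) for j in range(n)]
             for i in range(n)]
    new_b = [[sum(g[k][i] for k in range(N) if obs[k] == w) /
              sum(g[k][i] for k in range(N)) for w in range(n_words)]
             for i in range(n)]
    return g[0], new_a, new_b, lik   # g[0] is the new g(1)


def em(obs, start, a, b, n_words, max_iter=10, tol=1e-12):
    history = []
    for _ in range(max_iter):
        start, a, b, lik = em_step(obs, start, a, b, n_words)
        converged = bool(history) and lik - history[-1] < tol
        history.append(lik)
        if converged:
            break
    return start, a, b, history


# Hypothetical initial model and data.
obs = [0, 1, 1, 0, 1, 0, 0, 1]
start, a, b, history = em(obs, [0.6, 0.4], [[0.3, 0.7], [0.8, 0.2]],
                          [[0.6, 0.4], [0.3, 0.7]], n_words=2)
print(history)
```

The `history` list records the likelihood of each successive model, so a quick check is that it never goes down. Different initial models can end at different local maxima.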
     Initialize Model by Dictionary
• Training should be directed to correspond to the
  linguistic perception of POS (recall local max)
• Achieved by a dictionary with possible POS for
  each word
• Word-based initialization:
   – P(w|t) = 1 / #of listed POS for w, for the listed POS;
     and 0 for unlisted POS
• Class-based initialization (Kupiec, 1992):
   – Group all words with the same possible POS into a class
   – Estimate parameters and perform tagging for classes rather than words
   – Frequent words are handled individually
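
Both initialization schemes are easy to sketch from a lexicon. In the example below the lexicon contents and the function names (`init_emissions`, `ambiguity_classes`) are hypothetical; the first function applies the 1 / #listed-POS rule as stated, and the second groups words that share the same set of possible POS.

```python
from collections import defaultdict


def init_emissions(lexicon, tags):
    """Word-based start: 1 / #listed POS for listed tags, 0 for unlisted ones."""
    return {(w, t): (1.0 / len(pos) if t in pos else 0.0)
            for w, pos in lexicon.items() for t in tags}


def ambiguity_classes(lexicon):
    """Class-based start: group words by their set of possible POS."""
    classes = defaultdict(list)
    for w, pos in lexicon.items():
        classes[frozenset(pos)].append(w)
    return dict(classes)


# Hypothetical lexicon listing the possible POS of each word.
lexicon = {"the": {"DT"}, "can": {"MD", "NN", "VB"},
           "fish": {"NN", "VB"}, "reply": {"NN", "VB"}}
tags = ["DT", "MD", "NN", "VB"]

emissions = init_emissions(lexicon, tags)
classes = ambiguity_classes(lexicon)
print(emissions[("fish", "NN")], sorted(classes[frozenset({"NN", "VB"})]))
```

The word-based values are not normalized over words for a given tag; they only serve as a starting point that rules out unlisted POS before EM re-estimates them.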
Some extensions for HMM POS tagging

• Higher-order models: trigrams, possibly
  interpolated with bigrams
• Incorporating text features:
  – Output prob = P(wi,fj | tk) where f is a vector of
    features (capitalized, ends in –d, etc.)
  – Features useful to handle unknown words
• Combining labeled and unlabeled training
  (initialize with labeled then do EM)
  Transformation-Based Learning
         (TBL) for Tagging
• Introduced by Brill (1995)
• Can exploit a wider range of lexical and syntactic
  regularities via transformation rules – triggering
  environment and rewrite rule
• Tagger:
   – Construct initial tag sequence for input – most frequent
     tag for each word
   – Iteratively refine tag sequence by applying
     “transformation rules” in rank order
• Learner:
   – Construct initial tag sequence for the training corpus
   – Loop until done:
      • Try all possible rules and compare to known tags, apply the
        best rule r* to the sequence and add it to the rule ranking
            Some examples
1. Change NN to VB if previous is TO
  – to/TO conflict/NN with → VB
2. Change VBP to VB if MD in previous three
  – might/MD vanish/VBP → VB
3. Change NN to VB if MD in previous two
  – might/MD reply/NN → VB
4. Change VB to NN if DT in previous two
  – the/DT reply/VB → NN
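
The four example rules can be applied mechanically to an initial tag sequence. The sketch below encodes each rule as a hypothetical tuple (from tag, to tag, trigger tag, window size) and assumes that a rule's triggering environment is checked against the tags as they were before that rule started its pass; other application orders are possible.

```python
def apply_rule(tags, rule):
    """Rewrite from_tag to to_tag where trigger occurs in the previous `window` tags."""
    from_tag, to_tag, trigger, window = rule
    out = list(tags)
    for i, t in enumerate(tags):
        if t == from_tag and trigger in tags[max(0, i - window):i]:
            out[i] = to_tag
    return out


def tbl_tag(initial_tags, rules):
    """Apply the ranked rules one after another to the initial tagging."""
    tags = list(initial_tags)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags


# The four example rules, in rank order.
RULES = [("NN", "VB", "TO", 1),
         ("VBP", "VB", "MD", 3),
         ("NN", "VB", "MD", 2),
         ("VB", "NN", "DT", 2)]

# to/TO conflict/NN with/IN
print(tbl_tag(["TO", "NN", "IN"], RULES))
# the/DT reply/VB
print(tbl_tag(["DT", "VB"], RULES))
```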
       Transformation Templates
• Specify which transformations are possible
For example: change tag A to tag B when:
  1.   The preceding (following) tag is Z
  2.   The tag two before (after) is Z
  3.   One of the two previous (following) tags is Z
  4.   One of the three previous (following) tags is Z
  5.   The preceding tag is Z and the following is W
  6.   The preceding (following) tag is Z and the tag
       two before (after) is W
New templates to include dependency on surrounding
   words (not just tags):
Change tag A to tag B when:
  1. The preceding (following) word is w
  2. The word two before (after) is w
  3. One of the two preceding (following) words is w
  4. The current word is w
  5. The current word is w and the preceding (following)
     word is v
  6. The current word is w and the preceding (following) tag
     is X (Notice: word-tag combination)
  7. etc…
        Initializing Unseen Words
•  How to choose most likely tag for unseen words?
Transformation-based approach:
    –   Start with NP for capitalized words, NN for others
    –   Learn “morphological” transformations from:
        Change tag from X to Y if:
           1.   Deleting prefix (suffix) x results in a known word
           2.   The first (last) characters of the word are x
           3.   Adding x as a prefix (suffix) results in a known word
           4.   Word W ever appears immediately before (after) the word
           5.   Character Z appears in the word
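
As a small illustration of the second template (the last characters of the word are x), the sketch below guesses an initial tag for an unseen word and then applies one suffix rule. The function names, the assumption that non-capitalized words start as NN, and the example rule (NN to VBG for the suffix "ing") are all hypothetical; real rules would be learned from data.

```python
def initial_guess(word):
    """NP for capitalized words; NN otherwise (an assumption for this sketch)."""
    return "NP" if word[:1].isupper() else "NN"


def apply_suffix_rule(word, tag, rule):
    """Template: change tag X to Y if the last characters of the word are x."""
    from_tag, to_tag, suffix = rule
    return to_tag if tag == from_tag and word.endswith(suffix) else tag


# Hypothetical learned rule: NN -> VBG when the word ends in "ing".
RULE = ("NN", "VBG", "ing")

for word in ["Boston", "running", "table"]:
    print(word, apply_suffix_rule(word, initial_guess(word), RULE))
```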
                        TBL Learning Scheme
[Figure: TBL learning scheme flowchart. Recoverable labels: "Input Text", "Setting Initial", "Ground Truth for Input Text".]
    Greedy Learning Algorithm
• Initial tagging of training corpus – most
  frequent tag per word
• At each iteration:
  – Compute “error reduction” for each
    transformation rule:
     • #errors fixed - #errors introduced
  – Find the best rule; if its error reduction is greater than a
    threshold (to avoid overfitting):
     • Apply best rule to training corpus
     • Append best rule to ordered list of transformations
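
The greedy loop can be sketched with a single template, "change tag A to B when the preceding tag is Z". Everything named here (`apply_rule`, `errors`, `learn`, the tiny corpus, the default threshold of 0) is an assumption for illustration; a real learner would use many templates and a much larger training corpus.

```python
def apply_rule(tags, rule):
    """Change tag a to b wherever the preceding tag is z."""
    a, b, z = rule
    return [b if t == a and i > 0 and tags[i - 1] == z else t
            for i, t in enumerate(tags)]


def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))


def learn(initial, gold, tagset, threshold=0):
    """Greedily pick the rule with the largest error reduction until gain <= threshold."""
    current, learned = list(initial), []
    candidates = [(a, b, z) for a in tagset for b in tagset if a != b
                  for z in tagset]
    while True:
        base = errors(current, gold)
        # Error reduction = #errors fixed - #errors introduced
        gain, rule = max((base - errors(apply_rule(current, r), gold), r)
                         for r in candidates)
        if gain <= threshold:
            return learned, current
        current = apply_rule(current, rule)
        learned.append(rule)


# Hypothetical training data: initial (most frequent) tags and known gold tags.
initial = ["TO", "NN", "DT", "NN", "MD", "NN"]
gold = ["TO", "VB", "DT", "NN", "MD", "VB"]
rules, final = learn(initial, gold, ["TO", "NN", "VB", "DT", "MD"])
print(rules, final)
```

Each iteration scores every candidate rule against the whole corpus, which is the main cost of the method; the returned list is already in rank order for use by the tagger.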
      Morphological Richness
• Parts of speech really include features:
  – NN2 → Noun(type=common,num=plural)
  This is more visible in other languages with
    richer morphology:
  – Hebrew nouns: number, gender, possession
  – German nouns: number, gender, case, …
  – And so on…