					Dependency Parsing
by Belief Propagation


    David A. Smith (JHU → UMass Amherst)
    Jason Eisner (Johns Hopkins University)


                                          1
Outline

   Edge-factored parsing       Old
       Dependency parses
       Scoring the competing parses: Edge features
       Finding the best parse

   Higher-order parsing        New!
       Throwing in more features: Graphical models
       Finding the best parse: Belief propagation
       Experiments

   Conclusions

                                                      2
   Word Dependency Parsing
   Raw sentence
   He reckons the current account deficit will narrow to only 1.8 billion in September.
                                                 Part-of-speech tagging
   POS-tagged sentence
    He reckons the current account deficit will narrow to only 1.8 billion in September.
    PRP     VBZ   DT   JJ      NN      NN   MD      VB   TO   RB    CD    CD   IN    NNP       .


                                                 Word dependency parsing

   Word dependency parsed sentence
   He reckons the current account deficit will narrow to only 1.8 billion in September .

   [Figure: dependency arcs over the sentence, labeled SUBJ, MOD, SPEC, COMP, S-COMP, and ROOT]


slide adapted from Yuji Matsumoto                                                          4
What does parsing have to do
with belief propagation?


loopy   belief propagation

                               5
Outline

   Edge-factored parsing       Old
       Dependency parses
       Scoring the competing parses: Edge features
       Finding the best parse

   Higher-order parsing       New!
       Throwing in more features: Graphical models
       Finding the best parse: Belief propagation
       Experiments

   Conclusions

                                                      6
Great ideas in NLP: Log-linear models
    (Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972)

   In the beginning, we used generative models.
        p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …
          each choice depends on a limited part of the history

        but which dependencies to allow?  p(D | A,B,C)?
         what if they're all worthwhile?  p(D | A,B,C)?
                           … p(D | A,B) * p(C | A,B,D)?




                                                                         7
Great ideas in NLP: Log-linear models
    (Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972)

   In the beginning, we used generative models.
          p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …
           which dependencies to allow? (given limited training data)

   Solution: Log-linear (max-entropy) modeling
          (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B)
                         * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) *
          …throw them all in!
       Features may interact in arbitrary ways
       Iterative scaling keeps adjusting the feature weights
        until the model agrees with the training data.


                                                                         8
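A minimal sketch (my own toy illustration in Python, not from the paper) of the log-linear idea on the slide above: multiply in any factors we like, written here as exponentiated feature weights, and normalize by Z computed over all assignments. The variable names and weights are invented.

```python
import math
from itertools import product

# Hypothetical binary variables A, B, C and hand-picked feature weights.
weights = {
    ("A", 1): 0.5,             # unary feature on A
    ("B&A", (1, 1)): 1.2,      # pairwise feature on (B, A)
    ("C&A", (1, 1)): 0.3,      # pairwise feature on (C, A)
    ("C&B", (1, 0)): -0.7,     # pairwise feature on (C, B)
}

def score(x):
    """Unnormalized log-linear score: sum of weights of firing features."""
    s = weights.get(("A", x["A"]), 0.0)
    s += weights.get(("B&A", (x["B"], x["A"])), 0.0)
    s += weights.get(("C&A", (x["C"], x["A"])), 0.0)
    s += weights.get(("C&B", (x["C"], x["B"])), 0.0)
    return s

def prob(x):
    """p(x) = exp(score(x)) / Z, with Z summed over all assignments."""
    Z = sum(math.exp(score(dict(zip("ABC", vals))))
            for vals in product([0, 1], repeat=3))
    return math.exp(score(x)) / Z

print(prob({"A": 1, "B": 1, "C": 0}))
```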
How about structured outputs?
   Log-linear models great for n-way classification
   Also good for predicting sequences
        v        a         n
                                 but to allow fast dynamic
                                 programming,
       find   preferred   tags
                                 only use n-gram features



   Also good for dependency parsing
                                 but to allow fast dynamic
                                 programming or MST parsing,
     …find preferred links…      only use single-edge features



                                                                 9
How about structured outputs?
                           but to allow fast dynamic
                           programming or MST parsing,
  …find preferred links…   only use single-edge features




                                                           10
Edge-Factored Parsers (McDonald et al. 2005)
   Is this a good edge?


             yes, lots of green ...




    Byl   jasný studený dubnový den a hodiny odbíjely třináctou




“It was a bright cold day in April and the clocks were striking thirteen”


                                                                     11
Edge-Factored Parsers (McDonald et al. 2005)
   Is this a good edge?

jasný → den
(“bright day”)




    Byl   jasný studený dubnový den a hodiny odbíjely třináctou




“It was a bright cold day in April and the clocks were striking thirteen”


                                                                     12
Edge-Factored Parsers (McDonald et al. 2005)
   Is this a good edge?

jasný → den                    jasný → N
(“bright day”)                 (“bright NOUN”)




    Byl   jasný studený dubnový den a hodiny odbíjely třináctou
    V      A      A        A       N   J      N      V         C


“It was a bright cold day in April and the clocks were striking thirteen”


                                                                     13
Edge-Factored Parsers (McDonald et al. 2005)
   Is this a good edge?

jasný → den                    jasný → N
(“bright day”)                 (“bright NOUN”)

                                                  A → N

    Byl   jasný studený dubnový den a hodiny odbíjely třináctou
    V      A      A        A       N   J      N      V         C


“It was a bright cold day in April and the clocks were striking thirteen”


                                                                     14
   Edge-Factored Parsers (McDonald et al. 2005)
      Is this a good edge?

  jasný → den                    jasný → N
  (“bright day”)                 (“bright NOUN”)
  A → N preceding
  conjunction                                         A → N

       Byl    jasný studený dubnový den a hodiny odbíjely třináctou
       V       A      A       A       N   J      N      V         C


   “It was a bright cold day in April and the clocks were striking thirteen”


                                                                        15
Edge-Factored Parsers (McDonald et al. 2005)
   How about this competing edge?


               not as good, lots of red ...




    Byl   jasný studený dubnový den a hodiny odbíjely třináctou
    V      A       A       A       N   J    N        V         C


“It was a bright cold day in April and the clocks were striking thirteen”


                                                                     16
Edge-Factored Parsers (McDonald et al. 2005)
   How about this competing edge?
jasný → hodiny
 (“bright clocks”)
... undertrained ...




    Byl   jasný studený dubnový den a hodiny odbíjely třináctou
    V       A          A   A       N   J    N        V         C


“It was a bright cold day in April and the clocks were striking thirteen”


                                                                     17
Edge-Factored Parsers (McDonald et al. 2005)
   How about this competing edge?
jasný → hodiny              jasn → hodi
 (“bright clocks”)          (“bright clock,”
                              stems only)
... undertrained ...




    Byl   jasný studený dubnový den a hodiny odbíjely třináctou
    V       A          A     A       N    J     N      V       C
    byl jasn         stud   dubn     den a     hodi   odbí    třin

“It was a bright cold day in April and the clocks were striking thirteen”


                                                                     18
Edge-Factored Parsers (McDonald et al. 2005)
   How about this competing edge?
jasný → hodiny              jasn → hodi
 (“bright clocks”)          (“bright clock,”
                              stems only)             A_plural → N_singular
... undertrained ...




    Byl   jasný studený dubnový den a hodiny odbíjely třináctou
    V       A          A     A       N    J     N        V       C
    byl jasn         stud   dubn     den a     hodi     odbí     třin

“It was a bright cold day in April and the clocks were striking thirteen”


                                                                        19
 Edge-Factored Parsers (McDonald et al. 2005)
    How about this competing edge?
 jasný → hodiny             jasn → hodi
 (“bright clocks”)          (“bright clock,”
 ... undertrained ...         stems only)             A_plural → N_singular
 A → N where N follows
 a conjunction



     Byl   jasný studený dubnový den a hodiny odbíjely třináctou
     V      A         A      A       N    J     N        V       C
     byl jasn     stud     dubn      den a     hodi     odbí     třin

  “It was a bright cold day in April and the clocks were striking thirteen”


                                                                        20
Edge-Factored Parsers (McDonald et al. 2005)
    Which edge is better?
         “bright day” or “bright clocks”?




         Byl   jasný studený dubnový den a hodiny odbíjely třináctou
         V      A      A        A      N     J    N      V         C
         byl jasn    stud     dubn     den a     hodi   odbí      třin

    “It was a bright cold day in April and the clocks were striking thirteen”


                                                                         21
Edge-Factored Parsers (McDonald et al. 2005)
                                           θ = our current weight vector
    Which edge is better?
    Score of an edge e = θ ∙ features(e)
    Standard algos → valid parse with max total score




      Byl   jasný studený dubnový den a hodiny odbíjely třináctou
       V      A      A         A       N     J    N      V         C
      byl   jasn    stud     dubn     den a      hodi   odbí      třin

    “It was a bright cold day in April and the clocks were striking thirteen”


                                                                         22
Edge-Factored Parsers (McDonald et al. 2005)
                               θ = our current weight vector
   Which edge is better?
   Score of an edge e = θ ∙ features(e)
   Standard algos → valid parse with max total score



      can't have both                   can't have both
    (one parent per word)                (no crossing links)


                             Thus, an edge may lose (or win)
    Can't have all three     because of a consensus of other
          (no cycles)        edges.


                                                               23
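The slide above is the whole edge-factored scoring rule; here is a hedged sketch with invented feature templates and toy weights (not McDonald et al.'s actual feature set) showing score(e) = θ ∙ features(e) for one candidate edge.

```python
# Hypothetical sparse features for a candidate edge head -> child.
def edge_features(sent, tags, head, child):
    return {
        f"word-pair={sent[head]}_{sent[child]}": 1.0,
        f"tag-pair={tags[head]}_{tags[child]}": 1.0,
        f"head-tag={tags[head]}": 1.0,
        f"distance={abs(head - child)}": 1.0,
    }

def edge_score(weights, sent, tags, head, child):
    """score(e) = theta . features(e) for one candidate dependency edge."""
    feats = edge_features(sent, tags, head, child)
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

sent = ["Byl", "jasný", "studený", "dubnový", "den"]
tags = ["V", "A", "A", "A", "N"]
weights = {"tag-pair=A_N": 2.0, "word-pair=jasný_hodiny": -0.5}   # toy weights
print(edge_score(weights, sent, tags, head=1, child=4))   # jasný -> den: only tag-pair=A_N fires -> 2.0
```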
Outline

   Edge-factored parsing       Old
       Dependency parses
       Scoring the competing parses: Edge features
       Finding the best parse

   Higher-order parsing        New!
       Throwing in more features: Graphical models
       Finding the best parse: Belief propagation
       Experiments

   Conclusions

                                                      24
Finding Highest-Scoring Parse
   Convert to context-free grammar (CFG)
   Then use dynamic programming


          The cat in the hat wore a stovepipe. ROOT

                                                  let's vertically stretch
                                                  this graph drawing

                                                              ROOT
                                       wore
                cat                               stovepipe
          The         in                      a
                                 hat
                           the         each subtree is a linguistic constituent
                                       (here a noun phrase)
                                                                             25
Finding Highest-Scoring Parse
   Convert to context-free grammar (CFG)
   Then use dynamic programming
       CKY algorithm for CFG parsing is O(n³)
       Unfortunately, O(n⁵) in this case
           to score “cat → wore” link, not enough to know this is NP
           must know it's rooted at “cat”
           so expand nonterminal set by O(n): {NP_the, NP_cat, NP_hat, ...}
           so CKY's “grammar constant” is no longer constant
                                                  ROOT
                                             wore
                      cat                               stovepipe
                The         in                      a
                                       hat
                                 the         each subtree is a linguistic constituent
                                             (here a noun phrase)
                                                                                 26
Finding Highest-Scoring Parse
   Convert to context-free grammar (CFG)
   Then use dynamic programming
       CKY algorithm for CFG parsing is O(n³)
       Unfortunately, O(n⁵) in this case
       Solution: Use a different decomposition (Eisner 1996)
           Back to O(n³)

                                                                    ROOT
                                             wore
                      cat                               stovepipe
                The         in                      a
                                       hat
                                 the         each subtree is a linguistic constituent
                                             (here a noun phrase)
                                                                                 27
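For concreteness, here is a compact sketch of the O(n³) projective decomposition in the spirit of Eisner (1996), assuming a precomputed edge-score matrix with token 0 acting as ROOT; it returns only the best total score and omits back-pointers.

```python
NEG = float("-inf")

def eisner_best_score(score):
    """score[h][m]: score of attaching word m to head h; token 0 is ROOT.
    Returns the total score of the best projective dependency tree in O(n^3)."""
    n = len(score)
    # C = complete spans, I = incomplete spans; index [i][j][d] with
    # d = 1 if the span's head is its left endpoint, d = 0 if it is the right one.
    C = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    I = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            # build an incomplete span by adding the edge j -> i or i -> j
            best = max(C[i][r][1] + C[r + 1][j][0] for r in range(i, j))
            I[i][j][0] = best + score[j][i]
            I[i][j][1] = best + score[i][j]
            # complete a span by attaching a finished half to an incomplete one
            C[i][j][0] = max(C[i][r][0] + I[r][j][0] for r in range(i, j))
            C[i][j][1] = max(I[i][r][1] + C[r][j][1] for r in range(i + 1, j + 1))
    return C[0][n - 1][1]              # everything ultimately hangs off ROOT

# toy example: ROOT, "bright", "day" with made-up edge scores
toy = [[NEG, 1.0, 4.0],                # ROOT -> bright, ROOT -> day
       [NEG, NEG, 3.0],                # bright -> day
       [NEG, 2.0, NEG]]                # day -> bright
print(eisner_best_score(toy))          # 6.0: ROOT -> day and day -> bright
```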
      Spans vs. constituents
Two kinds of substring.
  » Constituent of the tree: links to the rest
    only through its headword (root).

         The cat in the hat wore a stovepipe. ROOT

  » Span of the tree: links to the rest
    only through its endwords.

         The cat in the hat wore a stovepipe. ROOT

                                                     28
Decomposing a tree into spans

        The cat in the hat wore a stovepipe. ROOT



    The cat   +     cat in the hat wore a stovepipe. ROOT



              cat in the hat wore     +   wore a stovepipe. ROOT




        cat in     +   in the hat wore



                  in the hat   +   hat wore
Finding Highest-Scoring Parse
   Convert to context-free grammar (CFG)
   Then use dynamic programming
       CKY algorithm for CFG parsing is O(n³)
       Unfortunately, O(n⁵) in this case
       Solution: Use a different decomposition (Eisner 1996)
           Back to O(n³)
   Can play usual tricks for dynamic programming parsing
       Further refining the constituents or spans
           Allow prob. model to keep track of even more internal information
       A*, best-first, coarse-to-fine      require “outside” probabilities
       Training by EM etc.                 of constituents, spans, or links


                                                                         30
Hard Constraints on Valid Trees
                               θ = our current weight vector

   Score of an edge e = θ ∙ features(e)
   Standard algos → valid parse with max total score



      can't have both                   can't have both
    (one parent per word)                (no crossing links)


                             Thus, an edge may lose (or win)
    Can't have all three     because of a consensus of other
          (no cycles)        edges.


                                                               31
  Non-Projective Parses

ROOT   I   'll   give   a   talk   tomorrow      on bootstrapping
           subtree rooted at “talk”
           is a discontiguous noun phrase



                                              can't have both
                                              (no crossing links)


                                     The “projectivity” restriction.
                                     Do we really want it?




                                                                    32
  Non-Projective Parses

ROOT    I   'll   give    a    talk    tomorrow       on bootstrapping
            occasional non-projectivity in English



 ROOT        ista         meam          norit        gloria        canitiem
             that.NOM     my.ACC        may-know     glory.NOM     going-gray.ACC


                       That glory may-know my going-gray
                        (i.e., it shall last till I go gray)
            frequent non-projectivity in Latin, etc.

                                                                        33
     Finding highest-scoring non-projective tree
        Consider the sentence “John saw Mary” (left).
        The Chu-Liu-Edmonds algorithm finds the maximum-
         weight spanning tree (right) – may be non-projective.
        Can be found in time O(n²).

   [Figure: left, the complete weighted graph over root, John, saw, Mary with the
    edge scores shown; right, its maximum-weight spanning tree.]

                                                      Every node selects best parent
                                                      If cycles, contract them and repeat


slide thanks to Dragomir Radev                                                          34
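A small sketch of the greedy core of Chu-Liu-Edmonds summarized above: every non-root node picks its best parent, and we check whether the chosen edges contain a cycle. The full algorithm would contract any cycle and repeat, which is omitted here; the scores below are toy values loosely modeled on the John/saw/Mary figure.

```python
def greedy_parents(score):
    """score[h][m]: weight of the edge h -> m; node 0 is the root.
    First step of Chu-Liu-Edmonds: every non-root node picks its best parent."""
    n = len(score)
    return [None] + [max((h for h in range(n) if h != m), key=lambda h: score[h][m])
                     for m in range(1, n)]

def find_cycle(parent):
    """Return the set of nodes on a cycle among the chosen edges, else None."""
    for start in range(1, len(parent)):
        path, node = [], start
        while node != 0 and node not in path:
            path.append(node)
            node = parent[node]
        if node != 0:                        # we walked back onto our own path
            return set(path[path.index(node):])
    return None

scores = [[0,  9, 10,  9],     # from root  to John, saw, Mary
          [0,  0, 20,  3],     # from John
          [0, 30,  0, 30],     # from saw
          [0, 11,  0,  0]]     # from Mary
parents = greedy_parents(scores)         # John and saw pick each other
print(parents, find_cycle(parents))      # [None, 2, 1, 2]  {1, 2}
```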
         Summing over all non-projective trees
   Finding highest-scoring non-projective tree
     Consider the sentence “John saw Mary” (left).
     The Chu-Liu-Edmonds algorithm finds the maximum-
      weight spanning tree (right) – may be non-projective.
     Can be found in time O(n²).

     How about total weight Z of all trees?
     How about outside probabilities or gradients?
     Can be found in time O(n³) by matrix determinants
      and inverses (Smith & Smith, 2007).



slide thanks to Dragomir Radev                            35
  Graph Theory to the Rescue!
    O(n³) time!

       Tutte's Matrix-Tree Theorem (1948)
The determinant of the Kirchhoff (aka Laplacian)
adjacency matrix of directed graph G without row and
column r is equal to the sum of scores of all directed
spanning trees of G rooted at node r.

                    Exactly the Z we need!


                                                         36
                Building the Kirchhoff
                 (Laplacian) Matrix

   Start from the matrix of edge scores s(i,j), including a root row/column of
   scores s(1,0), s(2,0), …, s(n,0), then:

     • Negate edge scores
     • Sum columns (children) onto the diagonal
     • Strike root row/col.
     • Take determinant

     ⎡ Σ_{j≠1} s(1,j)      −s(2,1)      ⋯       −s(n,1)     ⎤
     ⎢    −s(1,2)      Σ_{j≠2} s(2,j)   ⋯       −s(n,2)     ⎥
     ⎢       ⋮               ⋮          ⋱          ⋮        ⎥
     ⎣    −s(1,n)         −s(2,n)       ⋯   Σ_{j≠n} s(n,j)  ⎦
      N.B.: This allows multiple children of root, but see Koo et al. 2007.

                                                                         37
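A numpy sketch of the recipe above (my own illustration, with an explicit head/modifier indexing convention that may differ superficially from the slide's s(i,j) notation): build the Kirchhoff matrix from nonnegative edge scores and take its determinant to get Z, the total score of all spanning trees, with multiple root children allowed as the N.B. says.

```python
import numpy as np

def spanning_tree_Z(root_score, score):
    """root_score[m]: score of attaching word m directly to ROOT.
    score[h, m]:     score of the edge from head h to modifier m (diagonal ignored).
    Z = determinant of the Kirchhoff matrix with the root row/column struck out."""
    off = score.copy()
    np.fill_diagonal(off, 0.0)
    # diagonal: total score of edges entering each word (including its root edge);
    # off-diagonal: negated edge scores
    K = np.diag(root_score + off.sum(axis=0)) - off
    return np.linalg.det(K)

# toy check on two words: the trees are {root->1, root->2}, {root->1, 1->2}, {root->2, 2->1}
root_score = np.array([1.0, 1.0])
score = np.array([[0.0, 2.0],
                  [3.0, 0.0]])
print(spanning_tree_Z(root_score, score))   # 1*1 + 1*2 + 1*3 = 6.0
```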
            Why Should This Work?
Clear for 1x1 matrix; use induction
                                               Chu-Liu-Edmonds analogy:
   ⎡ Σ_{j≠1} s(1,j)      −s(2,1)      ⋯       −s(n,1)     ⎤   Every node selects best parent
   ⎢    −s(1,2)      Σ_{j≠2} s(2,j)   ⋯       −s(n,2)     ⎥   If cycles, contract and recur
   ⎢       ⋮               ⋮          ⋱          ⋮        ⎥
   ⎣    −s(1,n)         −s(2,n)       ⋯   Σ_{j≠n} s(n,j)  ⎦

   K′  = K with edge (1,2) contracted
   K′′ = K({1,2} | {1,2})
   |K| = s(1,2)·|K′| + |K′′|

                      Undirected case; special root cases for directed
                                                                         38
Outline

   Edge-factored parsing       Old
       Dependency parses
       Scoring the competing parses: Edge features
       Finding the best parse

   Higher-order parsing         New!
       Throwing in more features: Graphical models
       Finding the best parse: Belief propagation
       Experiments

   Conclusions

                                                      39
Exactly Finding the Best Parse
                                   but to allow fast dynamic
                                   programming or MST parsing,
        …find preferred links…     only use single-edge features

   With arbitrary features, runtime blows up
       Projective parsing: O(n³) by dynamic programming

         grandparents    grandp. + sibling bigrams    POS trigrams    sibling pairs (non-adjacent)
            O(n⁴)                O(n⁵)                  O(n³g⁶)     …      O(2ⁿ)

       Non-projective: O(n²) by minimum spanning tree
                                 • any of the above features
                                 • soft penalties for crossing links
                                 • pretty much anything else!
                      NP-hard

                                                                   40
    Let’s reclaim our freedom (again!)
    This paper in a nutshell

   Output probability is a product of local factors
       Throw in any factors we want! (log-linear model)
        (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ
   How could we find best parse?
       Integer linear programming (Riedel et al., 2006)
             doesn't give us probabilities when training or parsing
       MCMC
             Slow to mix? High rejection rate because of hard TREE constraint?
       Greedy hill-climbing (McDonald & Pereira 2006)
                                                  none of these exploit
                                                  tree structure of parses
                                                  as the first-order methods do

                                                                                  41
    Let’s reclaim our freedom (again!)
    This paper in a nutshell                certain global factors ok too

   Output probability is a product of local factors
       Throw in any factors we want! (log-linear model)
        (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ
   Let local factors negotiate via “belief propagation”
        Links (and tags) reinforce or suppress one another
       Each iteration takes total time O(n²) or O(n³)
                       each global factor can be handled fast via some
                     traditional parsing algorithm (e.g., inside-outside)

   Converges to a pretty good (but approx.) global parse

                                                                            42
Let’s reclaim our freedom (again!)
This paper in a nutshell
Training with many features Decoding with many features
   Iterative scaling               Belief propagation

   Each weight in turn is          Each variable in turn is
   influenced by others            influenced by others

   Iterate to achieve              Iterate to achieve
   globally optimal weights        locally consistent beliefs

   To train distrib. over trees,   To decode distrib. over trees,
   use dynamic programming         use dynamic programming
   to compute normalizer Z         to compute messages New!



                                                                43
Outline

   Edge-factored parsing       Old
       Dependency parses
       Scoring the competing parses: Edge features
       Finding the best parse

   Higher-order parsing        New!
       Throwing in more features: Graphical models
       Finding the best parse: Belief propagation
       Experiments

   Conclusions

                                                      44
Local factors in a graphical model
   First, a familiar example
       Conditional Random Field (CRF) for POS tagging

          Possible tagging (i.e., assignment to remaining variables)



         …        v                 v                v        …




                 find        preferred          tags
                   Observed input sentence (shaded)

                                                                       45
Local factors in a graphical model
   First, a familiar example
       Conditional Random Field (CRF) for POS tagging

          Possible tagging (i.e., assignment to remaining variables)
          Another possible tagging


         …        v                 a                n        …




                 find        preferred          tags
                   Observed input sentence (shaded)

                                                                       46
Local factors in a graphical model
   First, a familiar example
       Conditional Random Field (CRF) for POS tagging
  “Binary” factor        v   n   a        v   n   a       Model reuses
  that measures         v 0   2   1      v 0   2   1      same parameters
  compatibility of 2    n 2   1   0      n 2   1   0      at this position
  adjacent tags         a 0   3   1      a 0   3   1
         …                                                    …




                find             preferred            tags



                                                                     47
Local factors in a graphical model
   First, a familiar example
       Conditional Random Field (CRF) for POS tagging

                           “Unary” factor evaluates this tag
                           Its values depend on corresponding word

         …                                                    …
                                                      v 0.2
                                                      n 0.2
                                                      a 0



                find         preferred         tags      can't be adj



                                                                  48
Local factors in a graphical model
   First, a familiar example
       Conditional Random Field (CRF) for POS tagging

                           “Unary” factor evaluates this tag
                           Its values depend on corresponding word

         …                                                    …
                                                      v 0.2
                                                      n 0.2
                                                      a 0



                find         preferred         tags
                                         (could be made to depend on
                                          entire observed sentence)
                                                                  49
Local factors in a graphical model
   First, a familiar example
       Conditional Random Field (CRF) for POS tagging

                                “Unary” factor evaluates this tag
                                Different unary factor at each position

         …                                                          …
                       v 0.3              v 0.3             v 0.2
                       n 0.02             n 0               n 0.2
                       a 0                a 0.1             a 0



                find             preferred           tags



                                                                        50
   Local factors in a graphical model
      First, a familiar example
          Conditional Random Field (CRF) for POS tagging

p(v a n) is proportional      v     n   a          v    n   a
 to the product of all      v 0     2   1        v 0    2   1
                            n 2     1   0        n 2    1   0
factors' values on v a n    a 0     3   1        a 0    3   1
            …        v                      a                    n             …
                           v 0.3                v 0.3                  v 0.2
                           n 0.02               n 0                    n 0.2
                           a 0                  a 0.1                  a 0



                   find                 preferred               tags



                                                                                   51
   Local factors in a graphical model
      First, a familiar example
          Conditional Random Field (CRF) for POS tagging

p(v a n) is proportional      v     n   a          v    n   a
 to the product of all      v 0     2   1        v 0    2   1
                                                                = … 1*3*0.3*0.1*0.2 …
                            n 2     1   0        n 2    1   0
factors' values on v a n    a 0     3   1        a 0    3   1
            …        v                      a                    n             …
                           v 0.3                v 0.3                  v 0.2
                           n 0.02               n 0                    n 0.2
                           a 0                  a 0.1                  a 0



                   find                 preferred               tags



                                                                                   52
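A short sketch that hard-codes the toy factor tables from these slides (tags in v, n, a order) and multiplies them up for a given tagging, reproducing the 1 · 3 · 0.3 · 0.1 · 0.2 product shown above.

```python
import numpy as np

TAGS = ["v", "n", "a"]
pair = np.array([[0, 2, 1],        # rows: left tag, cols: right tag (v, n, a)
                 [2, 1, 0],
                 [0, 3, 1]], dtype=float)
unary = {"find":      np.array([0.3, 0.02, 0.0]),
         "preferred": np.array([0.3, 0.0,  0.1]),
         "tags":      np.array([0.2, 0.2,  0.0])}

def unnormalized_prob(words, tagging):
    """Product of all unary and adjacent-pair factor values for one tagging."""
    idx = [TAGS.index(t) for t in tagging]
    p = 1.0
    for w, i in zip(words, idx):
        p *= unary[w][i]                       # unary factor at each position
    for left, right in zip(idx, idx[1:]):
        p *= pair[left][right]                 # pairwise factor between neighbors
    return p

print(unnormalized_prob(["find", "preferred", "tags"], ["v", "a", "n"]))
# 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018
```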
Local factors in a graphical model
   First, a familiar example           v          a           n
       CRF for POS tagging
   Now let's do dependency parsing!
       O(n²) boolean variables for the possible links




    …      find               preferred                links   …


                                                                   53
Local factors in a graphical model
   First, a familiar example            v           a           n
       CRF for POS tagging
   Now let's do dependency parsing!
       O(n²) boolean variables for the possible links
           Possible parse — encoded as an assignment to these vars

                                     t
                                 f

                           f                     f
                       f                     t
    …      find                preferred                 links   …


                                                                     54
Local factors in a graphical model
   First, a familiar example           v           a           n
       CRF for POS tagging
   Now let's do dependency parsing!
       O(n²) boolean variables for the possible links
           Possible parse — encoded as an assignment to these vars
           Another possible parse
                                    f
                                 f

                           t                    t
                       f                    f
    …      find                preferred                links   …


                                                                     55
Local factors in a graphical model
   First, a familiar example              v           a           n
       CRF for POS tagging
   Now let's do dependency parsing!
       O(n²) boolean variables for the possible links
           Possible parse — encoded as an assignment to these vars
           Another possible parse
           An illegal parse         f
                                  t

                           t                       t
                       f         (cycle)       f
    …      find                preferred                   links   …


                                                                       56
Local factors in a graphical model
   First, a familiar example             v           a           n
       CRF for POS tagging
   Now let's do dependency parsing!
       O(n²) boolean variables for the possible links
           Possible parse — encoded as an assignment to these vars
           Another possible parse
           An illegal parse         f
           Another illegal parse t

                           t                      t
                       f        (cycle)       t
    …      find              preferred                    links   …
                           (multiple parents)


                                                                      57
Local factors for parsing
    So what factors shall we multiply to define parse probability?
        Unary factors to evaluate each link in isolation
         But what if the best
          assignment isn't a tree??             as before, goodness
                                               of this link can
                                          t 2
                                          f 1
                                               depend on entire
                                               observed input context

                       t 1                           t 1   some other links
                       f 8                           f 2     aren't as good
                                                            given this input
                                    t 1
                                    f 3
                                                               sentence


  …       find   t 1         preferred         t 1     links     …
                 f 6                           f 2


                                                                       58
Global factors for parsing
    So what factors shall we multiply to define parse probability?
        Unary factors to evaluate each link in isolation
        Global TREE factor to require that the links form a legal tree
            this is a “hard constraint”: factor is either 0 or 1


                                                                    ffffff   0
                                                                    ffffft   0
                                                                    fffftf   0
                                                                      …      …
                                                                    fftfft   1
                                                                      …      …
                                                                    tttttt   0


  …      find                     preferred                         links    …


                                                                                 59
Global factors for parsing                             optionally require the
                                                       tree to be projective
      So what factors shall we multiply to define      (no crossing links)
      parse probability?
        Unary factors to evaluate each link in isolation

        Global TREE factor to require that the links form a legal tree

             this is a “hard constraint”: factor is either 0 or 1
So far, this is equivalent to
edge-factored parsing
                                                                      ffffff     0
(McDonald et al. 2005).                                               ffffft     0
                                             t                        fffftf     0
                                         f                              …        …      we're
                                                                      fftfft     1      legal!
                                                                        …        …
                                f                           f         tttttt     0
                           f                           t             64 entries (0/1)

    …     find                      preferred                         links    …
               Note: McDonald et al. (2005) don't loop through this table
                  to consider exponentially many trees one at a time.
                   They use combinatorial algorithms; so should we! 60
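A sketch of what the factors above multiply out to for one assignment of the link variables: per-link unary factor values times a hard TREE factor that is 1 exactly when every word has one parent and there are no cycles. The specific factor values below are invented for illustration.

```python
def is_tree(parents):
    """Hard TREE factor: parents[m] = head of word m (0 = ROOT).
    Returns 1.0 for a legal tree (one parent each, no cycles), else 0.0."""
    for start in range(1, len(parents)):
        seen, node = set(), start
        while node != 0:
            if node in seen:                   # walked into a cycle
                return 0.0
            seen.add(node)
            node = parents[node]
    return 1.0

def parse_factor_product(parents, link_factor):
    """Unnormalized parse probability: hard TREE factor times unary link factors.
    link_factor[(h, m)] = factor value for the link h -> m being true."""
    p = is_tree(parents)
    for m, h in enumerate(parents):
        if m != 0:
            p *= link_factor.get((h, m), 1.0)
    return p

# toy sentence "find(1) preferred(2) links(3)", ROOT = 0, with made-up factor values
link_factor = {(0, 1): 2.0, (1, 3): 6.0, (3, 2): 2.0, (2, 3): 1.0}
print(parse_factor_product([0, 0, 3, 1], link_factor))   # 2.0 * 2.0 * 6.0 = 24.0
```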
Local factors for parsing
    So what factors shall we multiply to define parse probability?
        Unary factors to evaluate each link in isolation
        Global TREE factor to require that the links form a legal tree
            this is a “hard constraint”: factor is either 0 or 1
        Second-order effects: factors on 2 variables
            grandparent
                                                           f   t
                                            t
                                                       f   1   1
                                                       t   1 3



                                                      t
  …      find                     preferred                         links   …


                                                                                61
Local factors for parsing
    So what factors shall we multiply to define parse probability?
        Unary factors to evaluate each link in isolation
        Global TREE factor to require that the links form a legal tree
            this is a “hard constraint”: factor is either 0 or 1
        Second-order effects: factors on 2 variables
                                                                        f       t
            grandparent
                                                                    f   1       1
            no-cross
                                            t                       t   1 0.2



                                                                            t


  …      find                     preferred                         links           by    …


                                                                                         62
Local factors for parsing
    So what factors shall we multiply to define parse probability?
        Unary factors to evaluate each link in isolation
        Global TREE factor to require that the links form a legal tree
            this is a “hard constraint”: factor is either 0 or 1
        Second-order effects: factors on 2 variables
            grandparent
            no-cross
            siblings
            hidden POS tags
            subcategorization
            …


  …      find                     preferred                         links   by    …


                                                                                 63
Outline

   Edge-factored parsing       Old
       Dependency parses
       Scoring the competing parses: Edge features
       Finding the best parse

   Higher-order parsing        New!
       Throwing in more features: Graphical models
       Finding the best parse: Belief propagation
       Experiments

   Conclusions

                                                      64
Good to have lots of features, but …
   Nice model ☺
   Shame about the NP-hardness ☹

   Can we approximate?

   Machine learning to the rescue!
       ML community has given a lot to NLP
       In the 2000's, NLP has been giving back to ML
            Mainly techniques for joint prediction of structures
            Much earlier, speech recognition had HMMs, EM, smoothing …


                                                                    65
Great Ideas in ML: Message Passing
Count the soldiers
                         there's
                         1 of me

   [Figure: a line of soldiers; counts “1, 2, 3, 4, 5 before you” pass forward
    and “5, 4, 3, 2, 1 behind you” pass backward along the line.]




adapted from MacKay (2003) textbook
                                                               66
Great Ideas in ML: Message Passing
Count the soldiers
                          there's        Belief:
                          1 of me        Must be
                        2 before you     2 + 1 + 3 =
                                         6 of us



                       only see        3 behind you
                      my incoming
                       messages


adapted from MacKay (2003) textbook
                                                     67
Great Ideas in ML: Message Passing
Count the soldiers
             there's            Belief:        Belief:
             1 of me            Must be        Must be
          1 before you          1 + 1 + 4 =    2 + 1 + 3 =
                                6 of us        6 of us



       only see        4 behind you
      my incoming
       messages




adapted from MacKay (2003) textbook
                                                        68
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree

                                            3 here

                              7 here
                                                   1 of me

                                 11 here
                                (= 7+3+1)




adapted from MacKay (2003) textbook
                                                             69
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree

                                            3 here


                            7 here
                           (= 3+3+1)


                                       3 here




adapted from MacKay (2003) textbook
                                                          70
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
                                           11 here
                                          (= 7+3+1)


                              7 here



                                    3 here




adapted from MacKay (2003) textbook
                                                          71
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree

                                             3 here

                              7 here

                                                          Belief:
                                                          Must be
                                    3 here                14 of us




adapted from MacKay (2003) textbook
                                                                     72
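The soldier counting above is ordinary sum-product message passing on a tree. A minimal sketch on a made-up six-node chain: the message a node sends a neighbor is 1 (itself) plus what it heard from its other neighbors, and a node's belief is 1 plus everything it heard.

```python
from collections import defaultdict
from functools import lru_cache

# an undirected tree of soldiers, as an adjacency list (a made-up 6-node chain)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
nbrs = defaultdict(list)
for u, v in edges:
    nbrs[u].append(v)
    nbrs[v].append(u)

@lru_cache(maxsize=None)
def message(src, dst):
    """What src reports to dst: itself plus everything on its other branches."""
    return 1 + sum(message(other, src) for other in nbrs[src] if other != dst)

def belief(node):
    """Belief at a node: itself plus the counts reported by all its neighbors."""
    return 1 + sum(message(other, node) for other in nbrs[node])

print([belief(n) for n in sorted(nbrs)])   # every soldier concludes: 6 of us
```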
Great ideas in ML: Forward-Backward
   In the CRF, message passing = forward-backward
                                       belief at “preferred”
                                       v 1.8
                                       n 0
                                       a 4.2

   [Figure: forward messages α and backward messages β flow along the chain
    through the pairwise tag-compatibility factors. At “preferred”, the incoming
    α = (v 3, n 1, a 6) and β = (v 2, n 1, a 7) multiply pointwise with the
    unary factor (v 0.3, n 0, a 0.1) to give the belief above.]

              find               preferred                      tags



                                                                             74
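A tiny numeric check of the belief computation in the figure above, using the message and factor values from the slide: the belief at a variable is the pointwise product of everything arriving at it.

```python
import numpy as np

# messages arriving at the tag variable for "preferred", in (v, n, a) order
alpha = np.array([3.0, 1.0, 6.0])     # forward message from the left
beta  = np.array([2.0, 1.0, 7.0])     # backward message from the right
unary = np.array([0.3, 0.0, 0.1])     # unary factor at this position

belief = alpha * unary * beta
print(belief)                          # [1.8  0.   4.2]  -- as on the slide
```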
Great ideas in ML: Forward-Backward
   Extend CRF to “skip chain” to capture non-local factor
       More influences on belief ☺
                                      v 5.4
                                      n 0
                                      a 25.2

   [Figure: a skip-chain factor adds a non-local connection into the “preferred”
    variable; its extra message (v 3, n 1, a 6) multiplies into the old belief
    (1.8, 0, 4.2) to give the new belief above.]

            find              preferred          tags



                                                        75
Great ideas in ML: Forward-Backward
   Extend CRF to “skip chain” to capture non-local factor
       More influences on belief ☺
       Graph becomes loopy ☹          Red messages not independent?
                                      Pretend they are!

                                      v 5.4
                                      n 0
                                      a 25.2

   [Figure: same skip-chain model; the messages around the loop are no longer
    independent, but loopy BP multiplies them in anyway.]

            find              preferred               tags



                                                                76
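Continuing the same numbers, loopy BP simply multiplies the extra (not truly independent) skip-chain message into the belief:

```python
import numpy as np

old_belief = np.array([1.8, 0.0, 4.2])    # alpha * unary * beta from before
skip_msg   = np.array([3.0, 1.0, 6.0])    # message from the skip-chain factor
print(old_belief * skip_msg)              # [ 5.4  0.  25.2]  -- as on the slide
```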
Two great tastes that taste great together
   Upcoming attractions …
           You got belief
           propagation in       You got
            my dynamic         dynamic
           programming!     programming in
                               my belief
                             propagation!




                                             77
Loopy Belief Propagation for Parsing
   Sentence tells word 3, “Please be a verb”
   Word 3 tells the 3 → 7 link, “Sorry, then you probably don't exist”
   The 3 → 7 link tells the Tree factor, “You'll have to find another
    parent for 7”
   The Tree factor tells the 10 → 7 link, “You're on!”
   The 10 → 7 link tells 10, “Could you please be a noun?”
   …




     …     find                preferred                  links    …

                                                                          78
Loopy Belief Propagation for Parsing
    Higher-order factors (e.g., Grandparent) induce loops
         Let's watch a loop around one triangle …
         Strong links are suppressing or promoting other links …




  …       find               preferred                links    …

                                                                    79
Loopy Belief Propagation for Parsing
    Higher-order factors (e.g., Grandparent) induce loops
         Let's watch a loop around one triangle …
    How did we compute outgoing message to green link?
         “Does the TREE factor think that the green link is probably t,
          given the messages it receives from all the other links?”



                                   ?
                             TREE factor
                             ffffff  0
                             ffffft  0
                             fffftf  0
                               …     …
                             fftfft  1
                               …     …
                              tttttt  0
  …       find               preferred                  links     …

                                                                       80
Loopy Belief Propagation for Parsing

    How did we compute outgoing message to green link?
         “Does the TREE factor think that the green link is probably t,
          given the messages it receives from all the other links?”


 But this is the outside probability of the green link!
 TREE factor computes all outgoing messages at once
                           (given all incoming messages)
 Projective case: total O(n³) time by inside-outside
 Non-projective: total O(n³) time by inverting Kirchhoff
                            matrix (Smith & Smith, 2007)

                              TREE factor
                              ffffff  0
                              ffffft  0
                              fffftf  0
                                …     …
                              fftfft  1
                                …     …
                              tttttt  0

  …       find               preferred                 links     …

                                                                       81
   Loopy Belief Propagation for Parsing
         How did we compute outgoing message to green link?
              “Does the TREE factor think that the green link is probably t,
               given the messages it receives from all the other links?”


      But this is the outside probability of green link!
      TREE factor computes all outgoing messages at once
                             (given all incoming messages)
      Projective case: total O(n³) time by inside-outside
      Non-projective: total O(n³) time by inverting Kirchhoff
                               matrix (Smith & Smith, 2007)
Belief propagation assumes incoming messages to TREE are independent.
So outgoing messages can be computed with first-order parsing algorithms
                                            (fast, no grammar constant).
                                                                            82
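Here is a numpy sketch of the non-projective computation described above, as I understand the Smith & Smith (2007) construction (the paper's exact handling of the root may differ): treat each link's incoming-message odds as an edge weight, build the Kirchhoff matrix, and a single matrix inverse yields every edge's marginal at once; these marginals are the TREE factor's beliefs, from which its outgoing messages follow.

```python
import numpy as np

def tree_edge_marginals(w_root, w):
    """w_root[m]: weight (incoming-message odds) of ROOT -> word m.
    w[h, m]:     weight of word h -> word m (diagonal ignored).
    Returns (mu_root, mu): marginal probability of each edge under the
    distribution over spanning trees proportional to the product of edge weights."""
    off = w.copy()
    np.fill_diagonal(off, 0.0)
    L = np.diag(w_root + off.sum(axis=0)) - off   # Kirchhoff matrix
    Linv = np.linalg.inv(L)
    mu_root = w_root * np.diag(Linv)
    mu = off * (np.diag(Linv)[None, :] - Linv.T)
    return mu_root, mu

# two-word toy: the three trees have weights 1, 2, 3, so Z = 6,
# and each word's incoming marginals sum to 1
r = np.array([1.0, 1.0])
w = np.array([[0.0, 2.0],
              [3.0, 0.0]])
mu_root, mu = tree_edge_marginals(r, w)
print(mu_root, mu, mu_root + mu.sum(axis=0))   # last vector is all ones
```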
Some connections …

   Parser stacking (Nivre & McDonald 2008, Martins et al. 2008)

   Global constraints in arc consistency
       ALLDIFFERENT constraint (Régin 1994)

   Matching constraint in max-product BP
       For computer vision (Duchi et al., 2006)
       Could be used for machine translation

   As far as we know, our parser is the first use of
    global constraints in sum-product BP.


                                                                   83
Outline

   Edge-factored parsing       Old
       Dependency parses
       Scoring the competing parses: Edge features
       Finding the best parse

   Higher-order parsing        New!
       Throwing in more features: Graphical models
       Finding the best parse: Belief propagation
       Experiments

   Conclusions

                                                      84
Runtimes for each factor type (see paper)
  Factor type         degree   runtime    count    total
  Tree                O(n²)    O(n³)      1        O(n³)
  Proj. Tree          O(n²)    O(n³)      1        O(n³)
  Individual links    1        O(1)       O(n²)    O(n²)
  Grandparent         2        O(1)       O(n³)    O(n³)
  Sibling pairs       2        O(1)       O(n³)    O(n³)
  Sibling bigrams     O(n)     O(n²)      O(n)     O(n³)
  NoCross             O(n)     O(n)       O(n²)    O(n³)
  Tag                 1        O(g)       O(n)     O(n)
  TagLink             3        O(g²)      O(n²)    O(n²)
  TagTrigram          O(n)     O(ng³)     1      + O(n)
  TOTAL               Additive, not multiplicative!   = O(n³) per iteration


                                                                     85
Runtimes for each factor type (see paper)
  Factor type         degree   runtime    count    total
  Tree                O(n²)    O(n³)      1        O(n³)
  Proj. Tree          O(n²)    O(n³)      1        O(n³)
  Individual links    1        O(1)       O(n²)    O(n²)
  Grandparent         2        O(1)       O(n³)    O(n³)
  Sibling pairs       2        O(1)       O(n³)    O(n³)
  Sibling bigrams     O(n)     O(n²)      O(n)     O(n³)
  NoCross             O(n)     O(n)       O(n²)    O(n³)
  Tag                 1        O(g)       O(n)     O(n)
  TagLink             3        O(g²)      O(n²)    O(n²)
  TagTrigram          O(n)     O(ng³)     1      + O(n)
  TOTAL               Additive, not multiplicative!   = O(n³) per iteration

  Each “global” factor coordinates an unbounded # of variables
  Standard belief propagation would take exponential time
         to iterate over all configurations of those variables
  See paper for efficient propagators                                  86
Experimental Details
   Decoding
       Run several iterations of belief propagation
       Get final beliefs at link variables
       Feed them into first-order parser
       This gives the Min Bayes Risk tree (minimizes expected error)
   Training
       BP computes beliefs about each factor, too …
       … which gives us gradients for max conditional likelihood.
                               (as in forward-backward algorithm)
   Features used in experiments
       First-order: Individual links just as in McDonald et al. 2005
       Higher-order: Grandparent, Sibling bigrams, NoCross

                                                                    87
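A sketch of the decoding step above: minimizing expected attachment errors means maximizing the expected number of correct edges, so the final link beliefs can be fed directly, as additive edge scores, into any first-order parser (for the projective case, the eisner_best_score sketch from earlier would do). The belief values below are hypothetical.

```python
import numpy as np

# hypothetical final beliefs (marginal probabilities of each link) for a
# 3-token sentence; row = head (0 is ROOT), column = modifier
beliefs = np.array([[0.0, 0.7, 0.3],
                    [0.0, 0.0, 0.6],
                    [0.0, 0.4, 0.0]])

# The Min Bayes Risk tree maximizes the total belief of its edges,
# so reuse any first-order parser with the beliefs as additive scores:
NEG = float("-inf")
scores = np.where(beliefs > 0.0, beliefs, NEG)        # rule out impossible edges
# best = eisner_best_score(scores.tolist())           # projective MBR tree
```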
Dependency Accuracy
The extra, higher-order features help! (non-projective parsing)

                                   Danish        Dutch       English
            Tree+Link               85.5         87.3         88.6
            +NoCross                 86.1         88.3            89.1
            +Grandparent             86.1         88.6            89.4
            +ChildSeq                86.5         88.5            90.1




                                                                     88
  Dependency Accuracy
  The extra, higher-order features help! (non-projective parsing)

                                        Danish     Dutch       English
               Tree+Link                 85.5      87.3         88.6
               +NoCross                  86.1       88.3            89.1
               +Grandparent              86.1       88.6            89.4
               +ChildSeq                 86.5       88.5            90.1
  exact, slow   Best projective parse     86.0       84.5         90.2
                with all factors
  doesn't fix
  enough edges  +hill-climbing            86.1       87.6         90.2


                                                                       89
Time vs. Projective Search Error
   [Figure: projective search error vs. runtime; each curve traces increasing
    numbers of belief propagation iterations.]

   Compared with O(n⁴) DP      Compared with O(n⁵) DP


                                                     90
Outline

   Edge-factored parsing       Old
       Dependency parses
       Scoring the competing parses: Edge features
       Finding the best parse

   Higher-order parsing        New!
       Throwing in more features: Graphical models
       Finding the best parse: Belief propagation
       Experiments

   Conclusions

                                                      92
    Freedom Regained
    This paper in a nutshell

   Output probability defined as product of local and global factors
       Throw in any factors we want! (log-linear model)
       Each factor must be fast, but they run independently

   Let local factors negotiate via “belief propagation”
       Each bit of syntactic structure is influenced by others
       Some factors need combinatorial algorithms to compute messages fast
             e.g., existing parsing algorithms using dynamic programming
       Each iteration takes total time O(n³) or even O(n²); see paper
             Compare reranking or stacking

   Converges to a pretty good (but approximate) global parse
       Fast parsing for formerly intractable or slow models
       Extra features of these models really do help accuracy



                                                                            93
Future Opportunities

   Efficiently modeling more hidden structure
       POS tags, link roles, secondary links (DAG-shaped parses)
   Beyond dependencies
       Constituency parsing, traces, lattice parsing
   Beyond parsing
       Alignment, translation
       Bipartite matching and network flow
       Joint decoding of parsing and other tasks (IE, MT, reasoning ...)
   Beyond text
       Image tracking and retrieval
       Social networks


                                                                    94
thank   you


              95

				