
Dependency Parsing by Belief Propagation
David A. Smith (JHU; now UMass Amherst) and Jason Eisner (Johns Hopkins University) 1

Outline
Edge-factored parsing (old): dependency parses; scoring the competing parses: edge features; finding the best parse.
Higher-order parsing (new!): throwing in more features: graphical models; finding the best parse: belief propagation; experiments; conclusions. 2-3

Word Dependency Parsing
Raw sentence: He reckons the current account deficit will narrow to only 1.8 billion in September.
Part-of-speech tagging gives the POS-tagged sentence: He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Word dependency parsing then gives the dependency-parsed sentence, with arcs labeled SUBJ, MOD, SPEC, COMP, S-COMP, and ROOT.
slide adapted from Yuji Matsumoto 4

What does parsing have to do with belief propagation? Loopy belief propagation. 5

Outline (repeated) 6

Great ideas in NLP: log-linear models (Berger, della Pietra & della Pietra 1996; Darroch & Ratcliff 1972)
In the beginning, we used generative models:
p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …
Each choice depends on a limited part of the history, but which dependencies should we allow? What if they're all worthwhile? p(D | A,B,C)? … p(D | A,B) * p(C | A,B,D)?
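The generative chain-rule factorization above can be sketched in a few lines. This is a toy illustration only: the conditional probability tables are made-up numbers, and the variables A, B, C, D are abstract binary choices, not anything from the parsing model.

```python
from itertools import product

# Made-up conditional probability tables; each row sums to 1.
def p_A(a):          return [0.6, 0.4][a]
def p_B(b, a):       return [[0.7, 0.3], [0.2, 0.8]][a][b]
def p_C(c, a, b):    return [[[0.5, 0.5], [0.9, 0.1]], [[0.4, 0.6], [0.3, 0.7]]][a][b][c]
def p_D(d, a, b, c): return [0.5, 0.5][d]  # here D happens to ignore its history

def joint(a, b, c, d):
    """Chain rule: each choice conditions on (part of) the history before it."""
    return p_A(a) * p_B(b, a) * p_C(c, a, b) * p_D(d, a, b, c)

# Because every factor is a proper conditional distribution, the joint sums
# to 1 over all assignments: no normalizer Z is needed.
total = sum(joint(a, b, c, d) for a, b, c, d in product((0, 1), repeat=4))
print(round(total, 10))  # 1.0
```

That built-in normalization is exactly what the log-linear model on the next slide gives up: once arbitrary factors Φ are thrown in, a global 1/Z must be computed.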
Great ideas in NLP: log-linear models, continued. Solution: log-linear (max-entropy) modeling:
(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
Throw them all in! Features may interact in arbitrary ways. Iterative scaling keeps adjusting the feature weights until the model agrees with the training data. 7

How about structured outputs? Log-linear models are great for n-way classification. They are also good for predicting sequences (e.g., the tags v a n for the words "find preferred tags"), but to allow fast dynamic programming we only use n-gram features. And they are good for dependency parsing ("find preferred links"), but to allow fast dynamic programming or MST parsing we only use single-edge features. 8-9

Edge-Factored Parsers (McDonald et al. 2005)
Running example (Czech): Byl jasný studený dubnový den a hodiny odbíjely třináctou
POS tags: V A A A N J N V C
"It was a bright cold day in April and the clocks were striking thirteen"
Is jasný → den a good edge? Yes: lots of features vote for it ("lots of green"). 10-11
The word pair jasný den ("bright day"). 12
The word-tag pair jasný N ("bright NOUN"). 13
Is this a good edge?
More features vote yes: the tag pair A → N. 14
The tag pair A → N with a preceding conjunction. 15

How about the competing edge jasný → hodiny ("bright clocks")? Not as good: lots of features vote against it ("lots of red"), and the word pair is so rare that its weight is undertrained. 16
Back off to stems (byl jasn stud dubn den a hodi odbí třin): the stem pair jasn hodi ("bright clock," stems only), still undertrained. 17
The tag pair A(plural) → N(singular): the numbers disagree. 18
The tag pair A → N where N follows
a conjunction. 19

Which edge is better, "bright day" or "bright clocks"? 20-21
Score of an edge e = θ · features(e), where θ is our current weight vector. Standard algorithms then find the valid parse with maximum total score. 22
Validity imposes hard constraints: one parent per word and no crossing links each rule out certain pairs of edges ("can't have both"), and no cycles rules out certain triples ("can't have all three"). Thus, an edge may lose (or win) because of a consensus of other edges. 23

Outline (repeated: finding the best parse) 24

Finding Highest-Scoring Parse: convert to a context-free grammar (CFG), then use dynamic programming.
The cat in the hat wore a stovepipe.
Let's vertically stretch this graph drawing: ROOT → wore; wore → cat, stovepipe; cat → The, in; in → hat; hat → the; stovepipe → a. Each subtree is a linguistic constituent (here, a noun phrase). 25

The CKY algorithm for CFG parsing is O(n³), but unfortunately it is O(n⁵) in this case: to score the "cat → wore" link, it is not enough to know that the subtree is an NP; we must know it is rooted at "cat". So the nonterminal set expands by a factor of O(n), to {NPthe, NPcat, NPhat, ...}, and CKY's "grammar constant" is no longer constant. 26
Solution: use a different decomposition (Eisner 1996). Back to O(n³). 27

Spans vs. constituents: two kinds of substring of "The cat in the hat wore a stovepipe."
A constituent of the tree links to the rest only through its headword (root).
A span of the tree links to the rest only through its endwords. 28

Decomposing a tree into spans:
The cat + cat in the hat wore a stovepipe.
cat in the hat wore + wore a stovepipe.
cat in + in the hat wore; in the hat + hat wore. 29

We can play the usual tricks for dynamic-programming parsing: further refining the constituents or spans; allowing the probability model to keep track of even more internal information; A*, best-first, and coarse-to-fine search, which require "outside" probabilities of constituents, spans, or links; training by EM; etc.
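The O(n³) decomposition just described (Eisner 1996) can be sketched compactly. This is a score-only version under simplifying assumptions: `score[h][d]` is a hypothetical edge-factored score for attaching dependent d to head h, node 0 is ROOT, and backpointers for recovering the actual tree are omitted.

```python
# Score-only sketch of the O(n^3) projective dependency DP (Eisner-style).
# Spans are either "incomplete" (I: a head-dependent arc still being built)
# or "complete" (C: a head with all dependents on one side attached).
NEG = float("-inf")

def eisner_best_score(score):
    n = len(score)  # nodes 0..n-1; node 0 is ROOT
    # [direction][i][j]: 0 = head at left end, 1 = head at right end
    I = [[[NEG] * n for _ in range(n)] for _ in range(2)]
    C = [[[NEG] * n for _ in range(n)] for _ in range(2)]
    for i in range(n):
        C[0][i][i] = C[1][i][i] = 0.0
    for length in range(1, n):
        for i in range(n - length):
            j = i + length
            # Build the arc i->j or j->i over two back-to-back complete spans.
            best = max(C[0][i][k] + C[1][k + 1][j] for k in range(i, j))
            I[0][i][j] = best + score[i][j]
            I[1][i][j] = best + (score[j][i] if i > 0 else NEG)  # ROOT is never a dependent
            # Absorb a complete span into an incomplete one.
            C[0][i][j] = max(I[0][i][k] + C[0][k][j] for k in range(i + 1, j + 1))
            C[1][i][j] = max(C[1][i][k] + I[1][k][j] for k in range(i, j))
    return C[0][0][n - 1]

# Hypothetical scores for ROOT + two words: the best projective tree is
# ROOT -> w1 -> w2, with total score 5 + 4 = 9.
s = [[NEG, 5.0, 1.0], [NEG, NEG, 4.0], [NEG, 2.0, NEG]]
print(eisner_best_score(s))  # 9.0
```

Each cell is filled by a max over O(n) split points and there are O(n²) cells per chart, giving the O(n³) total the slides cite.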
Hard Constraints on Valid Trees (as above): score of an edge e = θ · features(e); standard algorithms find the valid parse with maximum total score, subject to one parent per word, no crossing links, and no cycles, so an edge may lose (or win) because of a consensus of other edges. 31

Non-Projective Parses
ROOT I 'll give a talk tomorrow on bootstrapping: the subtree rooted at "talk" is a discontiguous noun phrase, which violates the no-crossing-links constraint. That is the "projectivity" restriction. Do we really want it? 32
Non-projectivity is occasional in English but frequent in Latin and other languages:
ROOT ista meam norit gloria canitiem
that-NOM my-ACC may-know glory-NOM going-gray-ACC
"That glory may-know my going-gray" (i.e., it shall last till I go gray). 33

Finding the highest-scoring non-projective tree: consider the sentence "John saw Mary". The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree, which may be non-projective, in O(n²) time: every node selects its best parent; if cycles form, contract them and repeat. [Figure: a weighted graph over root, John, saw, Mary, and the spanning tree the algorithm extracts.] slide thanks to Dragomir Radev 34

Summing over all non-projective trees: how about the total weight Z of all trees? How about outside probabilities or gradients? These can be found in O(n³) time by matrix determinants and inverses (Smith & Smith, 2007). 35

Graph Theory to the Rescue! O(n3) time!
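The determinant computation just mentioned (Smith & Smith, 2007) can be sketched for a tiny example. The edge scores here are hypothetical, and the determinant is hand-rolled by cofactor expansion so the sketch stays dependency-free; a real implementation would use an O(n³) routine.

```python
# Sketch: total weight Z of all dependency trees via a matrix determinant.
# s[h][d] > 0 is the multiplicative score of edge h -> d; node 0 is ROOT.
def det(m):
    """Determinant by cofactor expansion (fine for tiny matrices only)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def tree_partition_function(s):
    n = len(s)  # nodes 0..n-1; node 0 is ROOT
    # Kirchhoff matrix over non-root nodes: diagonal K[j][j] sums the scores
    # of all edges into j; off-diagonal K[i][j] is -s(i,j).
    K = [[(sum(s[p][j] for p in range(n) if p != j) if i == j else -s[i][j])
          for j in range(1, n)] for i in range(1, n)]
    return det(K)

# Hypothetical 2-word sentence: the three possible trees have weights
# s(0,1)*s(0,2) = 5, s(0,1)*s(1,2) = 20, and s(2,1)*s(0,2) = 2, so Z = 27.
s = [[0, 5, 1], [0, 0, 4], [0, 2, 0]]
print(tree_partition_function(s))  # 27
```

Striking the root row and column is what the Matrix-Tree Theorem on the next slide requires; the 2x2 determinant here is 7*5 - (-4)(-2) = 27, matching the brute-force sum over the three trees.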
Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (a.k.a. Laplacian) adjacency matrix of a directed graph G, with row and column r struck out, equals the sum of scores of all directed spanning trees of G rooted at node r. Exactly the Z we need! 36

Building the Kirchhoff (Laplacian) matrix: negate the edge scores, so the off-diagonal entries are K(i,j) = -s(i,j); on the diagonal, sum each column's incoming scores (the children), K(j,j) = Σ over i ≠ j of s(i,j); strike out the root row and column; take the determinant. N.B.: this allows multiple children of the root, but see Koo et al. 2007. 37

Why should this work? It is clear for a 1x1 matrix; then use induction. There is also a Chu-Liu-Edmonds analogy: every node selects its best parent, and if cycles form, contract and recur; correspondingly, the determinant of K can be expanded in terms of determinants of K with an edge contracted, e.g. relating det K to s(1,2) and det K({1,2} | {1,2}). This is the undirected case; the directed case needs special handling at the root. 38

Outline (repeated: higher-order parsing) 39

Exactly Finding the Best Parse: with arbitrary features, the runtime blows up. Projective parsing is O(n³) by dynamic programming with single-edge features, but richer features (grandparents; grandparent + sibling; non-adjacent sibling pairs; POS bigrams and trigrams) push it to O(n⁴), O(n⁵), O(n³g⁶), ..., O(2ⁿ). Non-projective parsing is O(n²) by minimum spanning tree with single-edge features, but adding any of the above features, soft penalties for crossing links, or pretty much anything else makes it NP-hard. 40

Let's reclaim our freedom (again!) This paper in a nutshell: the output probability is a product of local factors. Throw in any factors we want!
(This is a log-linear model: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ…)
How could we find the best parse? Integer linear programming (Riedel et al., 2006)? It doesn't give us probabilities when training or parsing. MCMC? Perhaps slow to mix, with a high rejection rate because of the hard TREE constraint. Greedy hill-climbing (McDonald & Pereira 2006)? None of these exploit the tree structure of parses as the first-order methods do. 41

Let's reclaim our freedom (again!) This paper in a nutshell: the output probability is a product of local factors (certain global factors are OK too). Throw in any factors we want! Let the local factors negotiate via "belief propagation": links (and tags) reinforce or suppress one another. Each iteration takes total time O(n²) or O(n³); each global factor can be handled fast via some traditional parsing algorithm (e.g., inside-outside). The method converges to a pretty good (but approximate) global parse. 42

Training with many features vs. decoding with many features:
Training, by iterative scaling: each weight in turn is influenced by the others; iterate to achieve globally optimal weights; to train a distribution over trees, use dynamic programming to compute the normalizer Z.
Decoding, by belief propagation (new!): each variable in turn is influenced by the others; iterate to achieve locally consistent beliefs; to decode a distribution over trees, use dynamic programming to compute the messages. 43

Outline (repeated): edge-factored parsing (old); higher-order parsing (new!):
Throwing in more features: graphical models; finding the best parse: belief propagation; experiments; conclusions. 44

Local factors in a graphical model. First, a familiar example: a Conditional Random Field (CRF) for POS tagging. The observed input sentence "find preferred tags" is shaded; a possible tagging (i.e., an assignment to the remaining variables) is v v v; another possible tagging is v a n. 45-46
A "binary" factor measures the compatibility of two adjacent tags (rows = left tag, columns = right tag):
      v  n  a
  v   0  2  1
  n   2  1  0
  a   0  3  1
The model reuses the same parameters at each position. 47
A "unary" factor evaluates a single tag; its values depend on the corresponding word. For "tags": v 0.2, n 0.2, a 0 (it can't be an adjective). It could even be made to depend on the entire observed sentence. 48-49
There is a different unary factor at each position: for "find": v 0.3, n 0.02, a 0; for "preferred": v 0.3, n 0, a 0.1; for "tags": v 0.2, n 0.2, a 0. 50
p(v a n) is proportional to the product of all the factors' values on v a n
= 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018. 51

Now let's do dependency parsing! For the sentence "find preferred links" there are O(n²) boolean variables, one per possible link. 52
A possible parse is encoded as an assignment to these variables, e.g. t f f f f t; another possible parse is f f t t f f. 53-54
Some assignments encode illegal parses: one contains a cycle; another gives a word multiple parents. 55-56

Local factors for parsing: so what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation. But what if the best assignment isn't a tree??
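Returning to the tagging CRF above, the product-of-factors score can be computed directly. The tables below are the ones from the slides; only the variable names are mine.

```python
# Unnormalized CRF score of a tagging = product of unary and binary factors.
from itertools import product

TAGS = ("v", "n", "a")
binary = {("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,   # rows = left tag
          ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
          ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1}
unary = [  # one table per word of "find preferred tags"
    {"v": 0.3, "n": 0.02, "a": 0.0},
    {"v": 0.3, "n": 0.0,  "a": 0.1},
    {"v": 0.2, "n": 0.2,  "a": 0.0},
]

def unnormalized(tags):
    score = 1.0
    for i, t in enumerate(tags):
        score *= unary[i][t]
        if i > 0:
            score *= binary[tags[i - 1], tags[i]]
    return score

# Z sums the product over all 3^3 taggings; dividing by it gives p(v a n).
Z = sum(unnormalized(t) for t in product(TAGS, repeat=3))
print(round(unnormalized(("v", "a", "n")), 6))  # 0.018  (= 1 * 3 * 0.3 * 0.1 * 0.2)
print(round(unnormalized(("v", "a", "n")) / Z, 4))
```

Enumerating 27 taggings is fine here; forward-backward, discussed later in the talk, computes the same normalizer without enumeration.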
As before, the goodness of this link can depend on the entire observed input context; given this input sentence, some other links aren't as good. (Each link variable gets a small table of factor values for t and f, e.g. t 2 / f 1 for a good link and t 1 / f 8 for a bad one.) 58

Global factors for parsing: so what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation, plus a global TREE factor to require that the links form a legal tree. This is a "hard constraint": the factor is either 0 or 1 (ffffff → 0, ffffft → 0, fffftf → 0, …, fftfft → 1, …, tttttt → 0). 59
Optionally, require the tree to be projective (no crossing links). So far, this is equivalent to edge-factored parsing (McDonald et al. 2005): the factor is a table of 64 entries (0/1) over the six link variables. Note that McDonald et al. (2005) don't loop through this table to consider exponentially many trees one at a time; they use combinatorial algorithms, and so should we! 60

Second-order effects: factors on 2 variables, e.g. a grandparent factor that rewards having both links on: (f,f) 1, (f,t) 1, (t,f) 1, (t,t) 3. 61

Local factors for parsing: so what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation; the global TREE factor to require that the links form a legal tree (a hard constraint: 0 or 1); and second-order effects as factors on 2 variables. For example, a no-cross factor on a pair of potentially crossing links (sentence: "find preferred links by") softly penalizes crossing: (f,f) 1, (f,t) 1, (t,f) 1, (t,t) 0.2. 62
Other second-order effects: grandparent, no-cross, siblings, hidden POS tags, subcategorization, … 63

Outline (repeated: finding the best parse by belief propagation) 64

Good to have lots of features, but … nice model; shame about the NP-hardness. Can we approximate? Machine learning to the rescue!
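To see concretely why the global TREE factor defies brute force, we can enumerate link configurations for a tiny sentence. The sketch below is mine, not from the paper: for 3 words there are 9 possible links, hence 2⁹ = 512 joint configurations, of which only the 16 spanning arborescences get factor value 1, and the count of configurations doubles with every added link variable.

```python
# Brute-force evaluation of the 0/1 TREE factor for a 3-word sentence.
from itertools import product

WORDS = (1, 2, 3)                                   # node 0 is ROOT
LINKS = [(h, d) for h in (0,) + WORDS for d in WORDS if h != d]

def is_tree(on_links):
    # Every word needs exactly one parent ...
    parent = {}
    for (h, d) in on_links:
        if d in parent:
            return False
        parent[d] = h
    if set(parent) != set(WORDS):
        return False
    # ... and every word must reach ROOT (i.e., no cycles).
    for d in WORDS:
        seen, node = set(), d
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = parent[node]
    return True

legal = sum(is_tree([link for link, on in zip(LINKS, bits) if on])
            for bits in product((False, True), repeat=len(LINKS)))
print(len(LINKS), 2 ** len(LINKS), legal)  # 9 512 16
```

The 16 agrees with Cayley's formula (4² labeled trees on 4 nodes, each oriented away from ROOT); the exponential 2⁹ is what the combinatorial algorithms avoid.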
The ML community has given a lot to NLP; in the 2000s, NLP has been giving back to ML, mainly techniques for joint prediction of structures. (Much earlier, speech recognition had HMMs, EM, smoothing, ….) 65

Great Ideas in ML: Message Passing. Count the soldiers in a line: each soldier passes along "there's 1 of me" plus the count received from behind, so the messages read "1, 2, 3, 4, 5 before you" one way and "5, 4, 3, 2, 1 behind you" the other. A soldier who hears "2 before you" and "3 behind you" forms the belief that there must be 2 + 1 + 3 = 6 of us, while only ever seeing his incoming messages; another computes 1 + 1 + 4 = 6. 66-68
In a tree, each soldier receives reports from all branches: hearing "3 here" and "7 here" from two branches, plus "1 of me", he passes "11 here" (= 7 + 3 + 1) down the third branch; combining all branches gives the belief "must be 14 of us." 69-73
adapted from MacKay (2003) textbook

Great ideas in ML: Forward-Backward. In the CRF, message passing = forward-backward. The belief at a position is the pointwise product of the incoming forward message α, the unary factor, and the incoming backward message β: e.g. with α = (v 3, n 1, a 6), unary = (v 0.3, n 0, a 0.1), and β = (v 2, n 1, a 7) at "preferred", the belief is (v 1.8, n 0, a 4.2). (Sentence: find preferred tags.) 74
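The belief computation on the forward-backward slide above is a one-liner per tag. The numbers are the ones from the slide; which message is called α and which β is my labeling.

```python
# Belief at one position = alpha * unary * beta, pointwise over tags.
TAGS = ("v", "n", "a")
alpha = {"v": 3.0, "n": 1.0, "a": 6.0}   # message from the left neighbor
unary = {"v": 0.3, "n": 0.0, "a": 0.1}   # unary factor for the word "preferred"
beta  = {"v": 2.0, "n": 1.0, "a": 7.0}   # message from the right neighbor

belief = {t: round(alpha[t] * unary[t] * beta[t], 6) for t in TAGS}
print(belief)  # {'v': 1.8, 'n': 0.0, 'a': 4.2}

# A skip-chain factor (next slide) just contributes one more incoming
# message, multiplied in the same way:
skip = {"v": 3.0, "n": 1.0, "a": 6.0}
loopy_belief = {t: round(belief[t] * skip[t], 6) for t in TAGS}
print(loopy_belief)  # {'v': 5.4, 'n': 0.0, 'a': 25.2}
```

The second product reproduces the skip-chain beliefs (5.4, 0, 25.2) that appear on the following slide, which is exactly the "pretend the messages are independent" move of loopy BP.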
Great ideas in ML: Forward-Backward, continued. Extend the CRF to a "skip chain" to capture a non-local factor: now there are more influences on each belief. The same position also receives a message from the skip factor, e.g. an extra (v 3, n 1, a 6), giving the belief (v 5.4, n 0, a 25.2). 75
But the messages along the chain and along the skip edge are not independent: the graph has become loopy. Pretend they are independent anyway! 76

Two great tastes that taste great together. Upcoming attractions: "You got belief propagation in my dynamic programming!" "You got dynamic programming in my belief propagation!" 77

Loopy Belief Propagation for Parsing. The sentence tells word 3, "Please be a verb." Word 3 tells the 3 → 7 link, "Sorry, then you probably don't exist." The 3 → 7 link tells the TREE factor, "You'll have to find another parent for 7." The TREE factor tells the 10 → 7 link, "You're on!" The 10 → 7 link tells word 10, "Could you please be a noun?" … 78
Higher-order factors (e.g., Grandparent) induce loops. Let's watch a loop around one triangle: strong links suppress or promote other links. 79
How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?" 80
But this is the outside probability of the green link!
The TREE factor computes all of its outgoing messages at once (given all incoming messages). Projective case: total O(n³) time by inside-outside. Non-projective case: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007). 81
Belief propagation assumes that the incoming messages to TREE are independent. So the outgoing messages can be computed with first-order parsing algorithms (fast, with no grammar constant). 82

Some connections: parser stacking (Nivre & McDonald 2008; Martins et al. 2008); global constraints in arc consistency, e.g. the ALLDIFFERENT constraint (Régin 1994); the matching constraint in max-product BP for computer vision (Duchi et al., 2006), which could also be used for machine translation. As far as we know, our parser is the first use of global constraints in sum-product BP. 83

Outline (repeated: experiments) 84

Runtimes for each factor type (see paper):
  Factor type       degree   runtime   count    total
  Tree              O(n²)    O(n³)     1        O(n³)
  Proj. Tree        O(n²)    O(n³)     1        O(n³)
  Individual links  1        O(1)      O(n²)    O(n²)
  Grandparent       2        O(1)      O(n³)    O(n³)
  Sibling pairs     2        O(1)      O(n³)    O(n³)
  Sibling bigrams   O(n)     O(n²)     O(n)     O(n³)
  NoCross           O(n)     O(n)      O(n²)    O(n³)
  Tag               1        O(g)      O(n)     O(n)
  TagLink           3        O(g²)     O(n²)    O(n²)
  TagTrigram        O(n)     O(ng³)    1        O(ng³)
  TOTAL                                         O(n³) per iteration
Additive, not multiplicative!
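As a sanity check on the claim that the TREE factor's outgoing messages are outside probabilities: an edge's marginal under the distribution over trees is its share of Z. The sketch below gets this by brute-force enumeration over the three trees of a hypothetical 2-word sentence with made-up scores; inside-outside or the Kirchhoff-matrix inverse (Smith & Smith, 2007) computes the same quantities in O(n³) without enumeration.

```python
# Edge marginals over the distribution on dependency trees, by enumeration.
from fractions import Fraction

# ROOT is node 0; multiplicative edge scores (hypothetical).
s = {(0, 1): 5, (0, 2): 1, (1, 2): 4, (2, 1): 2}
# The three possible trees over ROOT + words 1, 2, as edge sets.
trees = [{(0, 1), (0, 2)}, {(0, 1), (1, 2)}, {(0, 2), (2, 1)}]

def weight(tree):
    w = Fraction(1)
    for e in tree:
        w *= s[e]
    return w

Z = sum(weight(t) for t in trees)  # 5*1 + 5*4 + 1*2 = 27
marginal = {e: sum(weight(t) for t in trees if e in t) / Z for e in s}
print(Z, marginal[(0, 1)])  # 27 25/27
```

Note that the marginals sum to 2 (= edges per tree), a handy check; each marginal is the belief that one link variable is t, which is exactly what gets fed back to the link's other factors.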
Runtimes for each factor type, continued: each "global" factor coordinates an unbounded number of variables. Standard belief propagation would take exponential time to iterate over all configurations of those variables; see the paper for efficient propagators. 86

Experimental Details. Decoding: run several iterations of belief propagation; take the final beliefs at the link variables; feed them into a first-order parser. This gives the minimum Bayes risk tree (it minimizes expected error). Training: BP computes beliefs about each factor, too, which gives us gradients for maximum conditional likelihood (as in the forward-backward algorithm). Features used in the experiments: first-order, individual links just as in McDonald et al. 2005; higher-order, Grandparent, Sibling bigrams, and NoCross. 87

Dependency accuracy (non-projective parsing): the extra, higher-order features help!
                  Danish   Dutch   English
  Tree+Link       85.5     87.3    88.6
  +NoCross        86.1     88.3    89.1
  +Grandparent    86.1     88.6    89.4
  +ChildSeq       86.5     88.5    90.1
  Best projective parse with all factors (exact, slow)
                  86.0     84.5    90.2
  +hill-climbing  86.1     87.6    90.2
(Hill-climbing doesn't fix enough edges.) 88-89

Time vs. projective search error: [Figure: search error against number of BP iterations (up to ~140), compared with the O(n⁴) and O(n⁵) dynamic programs.] 90

Outline (repeated): edge-factored parsing (old); higher-order parsing (new!):
Throwing in more features: Graphical models Finding the best parse: Belief propagation Experiments Conclusions 92 Freedom Regained This paper in a nutshell Output probability defined as product of local and global factors Throw in any factors we want! (log-linear model) Each factor must be fast, but they run independently Let local factors negotiate via “belief propagation” Each bit of syntactic structure is influenced by others Some factors need combinatorial algorithms to compute messages fast e.g., existing parsing algorithms using dynamic programming Each iteration takes total time O(n3) or even O(n2); see paper Compare reranking or stacking Converges to a pretty good (but approximate) global parse Fast parsing for formerly intractable or slow models Extra features of these models really do help accuracy 93 Future Opportunities Efficiently modeling more hidden structure POS tags, link roles, secondary links (DAG-shaped parses) Beyond dependencies Constituency parsing, traces, lattice parsing Beyond parsing Alignment, translation Bipartite matching and network flow Joint decoding of parsing and other tasks (IE, MT, reasoning ...) Beyond text Image tracking and retrieval Social networks 94 thank you 95
