Posted on: 6/2/2011. Public Domain.
Coarse-to-Fine Efficient Viterbi Parsing

Nathan Bodenstab
Center for Spoken Language Understanding
OGI School of Science & Engineering
Oregon Health & Science University
Beaverton, Oregon 97006
bodenstab@cslu.ogi.edu

Abstract

We present Coarse-to-Fine (CTF), a probabilistic parsing algorithm that performs exact inference on complex, high-accuracy parsing models. Exact inference on these models is computationally expensive, while solving simpler models is efficient and produces respectable results. CTF solves a series of increasingly complex models by using the solutions of simpler models to guide optimal parse tree search in a more complex model space. We compare our results with the CYK and A* algorithms and find that our current implementation of CTF traverses the smallest amount of complex model space, but fails to gain a significant advantage in computational efficiency.

1 Introduction

The task of natural language parsing is to assign syntactic information to a sentence by hierarchically clustering and labeling its constituents. A constituent is a group of one or more words that function together as a unit, defined by a grammar. The hierarchy of constituents for a sentence is called a parse tree. Sentences often have multiple grammatically valid parse trees. For example, two parse trees for “The aged bottle flies fast” are given in Figure 1. Differences in syntactic structure often produce variations on sentence meaning. The semantic interpretation of parse tree (a) is “The old bottle moves through the air quickly,” while (b) is understood as “The elderly put flies into bottles at a furious pace.” Furthermore, many parse trees are grammatically valid but have no meaningful interpretation. Parsing algorithms attempt to find the most likely, meaningful parse tree for a sentence among all possibilities.
(a) (b)
Figure 1: Two grammatically valid parse trees for the sentence “The aged bottle flies fast.”

Syntactic information obtained by parsing a sentence is critical for many computational approaches to language understanding. In speech recognition, parsing acts as a language model and produces a distribution over upcoming words, biasing recognition results towards grammatically well-formed utterances. Parse structure is also an effective resource in word-sense disambiguation, the task of correctly choosing a word’s intended meaning. Syntactic parse structure and word-sense information are used in natural language applications such as machine translation, question-answering, and document summarization.

All parsing algorithms must balance task accuracy with computational time. Improvements in parse tree accuracy often require a larger, more contextually-sensitive grammar. Since a grammar defines the search space of possible parse trees, an increase in grammar size often corresponds to an increase in search space. To find the optimal parse tree efficiently, either the grammar must be small or the search must be well-guided.

A great deal of parsing research is based on the Cocke-Younger-Kasami (CYK) dynamic programming algorithm (Kasami 1965; Younger 1967). The CYK algorithm exhaustively traverses the search space of potential parse trees and finds the globally optimal parse tree with O(R n³) complexity bounds, where R is the number of grammar rules (a constant) and n is the sentence length. This exhaustive approach is sufficient when using small grammars, but becomes intractable for the large grammars needed to attain state-of-the-art accuracy. We discuss context-free grammars and methods to improve parse accuracy in Section 2.

Strategies to improve search efficiency are prevalent. Charniak et al. (1998) and Collins (1999) use a best-first search, which guides incremental parse tree construction using local heuristic scores to rank each partial subtree.
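To make the CYK recurrence described above concrete, here is a minimal Viterbi-CYK sketch. The grammar encoding — dicts mapping terminals and child pairs to lists of (parent, probability) rules — is our own illustrative choice, not an implementation from the literature, and the grammar is assumed to be in Chomsky normal form.

```python
import math
from collections import defaultdict

def cyk(words, unary_rules, binary_rules):
    """Viterbi CYK over a PCFG in Chomsky normal form.

    unary_rules:  {terminal: [(parent, prob), ...]}
    binary_rules: {(left_child, right_child): [(parent, prob), ...]}
    Returns {(start, end): {label: best_log_prob}} for all spans.
    """
    n = len(words)
    chart = defaultdict(dict)
    # Base case: spans of length 1 are covered by lexical rules.
    for i, w in enumerate(words):
        for parent, p in unary_rules.get(w, []):
            chart[(i, i + 1)][parent] = math.log(p)
    # Fill longer spans bottom-up: O(n^2) spans, O(n) midpoints each,
    # which gives the cubic bound discussed in the text.
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            end = start + length
            for mid in range(start + 1, end):
                for lc, lp in chart[(start, mid)].items():
                    for rc, rp in chart[(mid, end)].items():
                        for parent, p in binary_rules.get((lc, rc), []):
                            score = lp + rp + math.log(p)
                            if score > chart[(start, end)].get(parent, -math.inf):
                                chart[(start, end)][parent] = score
    return chart
```

A tiny grammar suffices to exercise the chart: with Det → the, N → dog, and NP → [Det, N] (probability 0.5), the span (0, 2) receives NP with log probability log 0.5.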
Roark (2001) and Ratnaparkhi (1999) implement a beam search that pursues only the top b full parse tree candidates at any time, pruning potential candidates when the candidate’s heuristic score falls below a local threshold. Both best-first search and beam search are greedy strategies that cannot guarantee a globally optimal solution. More recently, Klein and Manning applied the A* graph traversal algorithm to the parsing problem and were the first to do exact inference over long sentences with large, high-accuracy grammars (Klein and Manning, 2003). The A* search is similar to the best-first search in many ways, but guarantees that the heuristic score will always be an underestimate of the true score, therefore never pruning the globally optimal solution (Hart et al., 1968). Although the average computational time required to perform an A* search is significantly less than the exhaustive CYK algorithm, the A* search is still much slower than its greedy-search competitors.

One common characteristic among all current efficient search strategies is the attempt to solve the high-accuracy parsing problem directly: the search for a good parse tree is done using one very large grammar. Although models using small grammars produce lower-accuracy parse trees, these parse trees often share a significant amount of structure with their high-accuracy counterparts. Solving the parsing problem directly does not take advantage of lower-accuracy model solutions, which are much faster to compute and can help direct the search for high-accuracy parse trees in a large grammar space. We present Coarse-to-Fine (CTF), a probabilistic parsing algorithm that finds the globally optimal solution of high-accuracy parsing models. CTF improves search efficiency by first solving a series of simpler models and using these results to guide parse tree search in a large grammar space.
To our knowledge, leveraging the similarity between incremental model solutions has not been seriously exploited in previous parsing algorithms. CTF uses this information to guide optimal parse tree selection better than the CYK or A* algorithms.

2 Probabilistic Parsing and Hypergraphs

2.1 Probabilistic Parsing

The parsing community abandoned hand-crafted grammars circa 1990 for more robust, data-driven approaches. Today, most parsing systems use a probabilistic context-free grammar (PCFG) induced from human-annotated training data. Formally, a probabilistic context-free grammar G is defined (Jurafsky and Martin, 2000) as a tuple (N, M, Φ, R, D) where

• N is a set of non-terminal symbols. Example: {Proper_Noun, Noun_Phrase, Verb_Phrase, Sentence, …}
• M is a set of terminal symbols (M ∩ N = ∅). Example: {the, and, KMart, housekeeping, final, …}
• Φ is a distinguished start symbol (Φ ∈ N). Example: Sentence
• R is a finite set of grammar rules, each of the form A → β, where A is a singleton and β is an ordered list (A ∈ N, βᵢ ∈ (N ∪ M), 0 ≤ |β| < ∞). Example: {Sentence → [Noun_Phrase, Verb_Phrase], Noun_Phrase → [Preposition, Determiner, Noun], Determiner → [the], …}
• D is a function assigning probability to all (A → β) ∈ R, where grammar rule probabilities are interpreted as P(β | A) and form proper distributions. Example: D(Determiner → [the]) ≡ P(β = [the] | A = Determiner) = 0.5023

The members of N, M, Φ, and R are extracted from training data¹; D is often learned using maximum likelihood estimation (relative frequency counts). We define Τ(S, G) to be the set of all parse trees for S permitted by grammar G. Every parse tree τS,G ∈ Τ(S, G) must have root Φ, leaf nodes sᵢ (where sᵢ is the ith word of sentence S), and be constructed of grammar rules in R. When context permits, we will simplify notation of a parse tree to τ.
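The relative-frequency estimation of D mentioned above can be sketched in a few lines. The rule representation — a left-hand side A paired with a tuple of children β — is a hypothetical encoding, not the paper's data format.

```python
from collections import Counter

def estimate_pcfg(observed_rules):
    """Maximum likelihood (relative frequency) estimation of D.

    observed_rules: list of (A, beta) pairs read off the training
    parse trees, where A is a non-terminal and beta is a tuple of
    children. Returns D as {(A, beta): P(beta | A)}.
    """
    rule_counts = Counter(observed_rules)
    lhs_counts = Counter(A for A, _ in observed_rules)
    return {(A, beta): count / lhs_counts[A]
            for (A, beta), count in rule_counts.items()}
```

Because each rule count is divided by the total count of its left-hand side, the children of every non-terminal form a proper distribution, as the definition of D requires.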
The joint probability of parse tree τS,G and sentence S is the product of all grammar rules in τS,G:

P(τ, S) = P(S | τ) P(τ) = P(τ) = ∏_{(A→β)∈τ} D(A → β).   (1)

Note that P(S | τ) is unity since all valid parse trees necessarily include the words of sentence S. The maximum likelihood parse tree τ̂ is the parse tree in Τ(S, G) with highest probability. Using Equation 1 and ignoring P(S), since it remains constant,

τ̂ = argmax_{τ∈Τ(S,G)} P(τ | S) = argmax_{τ∈Τ(S,G)} P(τ, S) / P(S) = argmax_{τ∈Τ(S,G)} P(τ, S) = argmax_{τ∈Τ(S,G)} P(τ).   (2)

The number of parse trees in Τ(S, G) increases exponentially with the length of S. Finding the optimal tree in this set becomes intractable for simple brute-force algorithms. To accomplish this task efficiently, we must exploit the fact that many parse trees in Τ(S, G) contain identical substructure. For example, both parse trees in Figure 1 use the grammar rule Determiner → [the] to connect the first word of the sentence. A parse forest or hypergraph can be used to eliminate redundant data and facilitate efficient computation of the maximum likelihood parse tree. Many parsing algorithms use this compact representation to find the best parse tree from an exponential number of trees in cubic-bound time.

¹ The Wall Street Journal (WSJ) Penn treebank corpus (Marcus et al., 1993) has become the standard data set to train and test parsing algorithms. It consists of approximately 50,000 sentences that have been hand-labeled with parsing structure, similar to Figure 1.

2.2 Hypergraphs

Probabilistic parsing can be viewed as a search over directed hypergraphs (Klein and Manning, 2001). Directed hypergraphs are similar to directed graphs with the following exception: while an edge in a directed graph connects a single tail node to a single head node, a hyperedge connects a set of tail nodes to a set of head nodes. A hyperedge is only traversable after all tail nodes have been visited.
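Before turning to hypergraph search, Equation 1 can be made concrete: the probability of a parse tree is simply the product of its rule probabilities, computed here in log space for numerical stability. The nested-tuple tree encoding (a label followed by children, with string leaves for words) is an illustrative assumption.

```python
import math

def tree_log_prob(tree, D):
    """Log of Equation 1: P(tau) = product over rules (A -> beta) in tau
    of D(A -> beta). tree is a nested tuple (label, child, child, ...);
    leaves are word strings. D maps (A, beta) to P(beta | A)."""
    if isinstance(tree, str):
        return 0.0                      # a word contributes no rule
    label, *children = tree
    # beta is the ordered list of child labels (or words, for leaves)
    beta = tuple(c if isinstance(c, str) else c[0] for c in children)
    return math.log(D[(label, beta)]) + sum(tree_log_prob(c, D)
                                            for c in children)
```

Equation 2 then amounts to comparing this quantity across candidate trees; since log is monotone, the argmax is unchanged.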
In parsing, hyperedges represent PCFG grammar rules. All hyperedges, therefore, have exactly one head node (A) and possibly multiple tail nodes (βᵢ). Figure 2 shows the hypergraph representation of both parse solutions from Figure 1.

Figure 2: A partial hypergraph for the sentence “The aged bottle flies fast” representing two parse trees. Numbers in parentheses are for reference only and not actually used in the grammar.

To represent parse trees in a hypergraph, we must constrain the hypergraph by requiring each hyperedge to have complete, non-overlapping word coverage. These requirements ensure every word in S participates once, and only once, in every valid parse tree of S. We define the yield of a hyperedge to be the words connected to that hyperedge. In Figure 2, for example, the yield of Noun Phrase (2) is [the, aged, bottle]. A hyperedge has complete word coverage when its yield is contiguous. A hyperedge without complete word coverage (not drawn) would be Sentence → [Noun Phrase (1), Verb Phrase (2)] since the word “bottle” would not be included in the yield of Sentence. For a hyperedge to have non-overlapping word coverage, the yield of every tail node in the hyperedge must not intersect. A hyperedge with overlapping word coverage (not drawn) would be Sentence → [Noun Phrase (1), Noun Phrase (2)] since the words “the” and “aged” would be in both tail node yields. Using this hypergraph framework, parsing sentence S now becomes a search problem starting at S and ending at Φ. Reachability corresponds to parsability and shortest paths correspond to best parses (Klein and Manning, 2001).
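The two coverage constraints above can be sketched as a check over tail-node yields, here represented (as an assumption) by sets of word indices rather than the words themselves.

```python
def valid_hyperedge(tail_yields):
    """Check the complete (contiguous) and non-overlapping word-coverage
    constraints from Section 2.2. tail_yields is a non-empty list of
    sets of word indices, one set per tail node."""
    combined = set()
    for y in tail_yields:
        if combined & y:          # overlapping coverage: a word used twice
            return False
        combined |= y
    # Complete coverage: the combined yield must form one contiguous span.
    return combined == set(range(min(combined), max(combined) + 1))
```

With the indices of “The aged bottle flies fast” as 0–4, a hyperedge over yields {0, 1, 2} and {3, 4} is valid, while {0, 1} with {1, 2} overlaps and {0, 1} with {3, 4} leaves a gap.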
Note that the S-to-Φ traversal described in this paper is a “bottom-up” traversal strategy, and a grammar rule A → β is interpreted as “once all nodes in β are traversed, A can be traversed.” A “top-down” strategy (Φ-to-S) is also possible, where A → β means “once A is traversed, all nodes in β can be traversed.” We assume a bottom-up traversal strategy for all algorithms to facilitate comparison.

We define H(S, G) to be the hypergraph for S containing all hyperpaths permitted by grammar G. This hypergraph space may be infinite if G contains rules with |β| ≤ 1 (for example, an infinite chain of hyperedges Noun_Phrase → [Noun_Phrase]). In practice, we can neither store nor rank an infinite number of hyperpaths, and these unary and empty rules are dealt with on a special-case basis. The hypergraph H(S, G) necessarily contains all valid parse trees in Τ(S, G), as well as many other hyperpaths not ending at Φ.

2.3 High-Accuracy Parsing

Modeling the context-dependent syntax of language with a context-free grammar is obviously not the optimal choice (Allen, 1995). For the time being, however, the efficiency of context-free parsing outweighs its deficiencies. In order to stay within the context-free framework, yet include more context-dependent information, a number of modifications to the Wall Street Journal (WSJ) Penn treebank corpus have been suggested. These modifications aid PCFG rule-probability estimation by refining the conditional probability distributions over grammar rules to be more context-sensitive. Modifying the training data and inducing a new PCFG creates a new parsing model with a (potentially) more accurate optimal parse tree. We define the PCFG Gwsj to be induced from the original WSJ treebank, and subsequent models Gx to be derived from some modification, X, of this treebank. One popular modification is parent annotation (Johnson, 1998).
Johnson reports, for example, that the distribution of children (β) under a prepositional phrase is highly influenced by the parent node of the prepositional phrase. Standard PCFG induction from the WSJ treebank disregards parent information; only one distribution over children is estimated for each non-terminal. To address this problem, Johnson suggests the following:

1. Replace all non-terminals in the training data with new non-terminals that explicitly include the direct parent non-terminal,
2. Induce a PCFG Gparent from this context-enriched data; probabilities are now conditioned on additional parent information: Dparent(A(P=Parent) → β) = P(β | A, Parent),
3. Find the optimal parse tree τ̂S,Gparent using the new model Gparent,
4. For evaluation, map the parent-specific non-terminals in τ̂S,Gparent (A(P=Parent)) back to the original non-terminal set from Gwsj (A), and calculate parse accuracy.

Two parse trees are shown in Figure 3 with (a) non-terminals from Gwsj and (b) non-terminals from Gparent. Adding the context of a parent label to each non-terminal potentially squares the number of non-terminals in the new grammar and significantly increases the number of grammar rules.

(a) (b)
Figure 3: A parse tree labeled with (a) the original WSJ treebank non-terminal set, and (b) an augmented Gparent non-terminal set. The non-terminal notation (X p=Y) in (b) means the grammar rule headed by X has a parent grammar rule headed by Y in this parse tree.

In addition to parent annotation, many other empirically useful annotations are used. Figure 4 is a summary of commonly used parsing models and their respective statistics when evaluated on the WSJ treebank. The general trend of Figure 4 is that relatively modest gains in accuracy require an increase in grammar size and, consequently, a decrease in parsing efficiency. The most dramatic increase in grammar size is seen in the Glexical model.
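Step 1 of Johnson's transformation can be sketched as a recursive relabeling of the training trees. The nested-tuple tree encoding and the "X^parent" label notation are illustrative choices, not Johnson's notation.

```python
def annotate_parents(tree, parent=None):
    """Johnson-style parent annotation: rewrite each non-terminal X
    as 'X^parent'. Trees are (label, child, ...) tuples with string
    leaves; the root keeps its original label (it has no parent)."""
    if isinstance(tree, str):
        return tree                              # words are untouched
    label, *children = tree
    new_label = f"{label}^{parent}" if parent is not None else label
    return (new_label, *(annotate_parents(c, label) for c in children))
```

Reading rules off the annotated trees and re-estimating probabilities (step 2) then yields Dparent automatically; counting distinct annotated labels also makes the grammar-size growth in Figure 4 easy to see.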
Model                        Grammar Size   Accuracy
WSJ                          14,526         74.3
Parent                       29,829         78.6
Head POS + Parent            55,989         80.8
Klein and Manning (2003b)    51,766         86.3
Lexical (Collins, 1999)      500,000+       88.6

Figure 4: Popular grammar models, each with a corresponding grammar size and parsing accuracy on the WSJ treebank with sentences of length 40 words and shorter.

Lexical annotation modifies non-terminals to include the actual word of the constituent that most influences the constituent’s syntactic behavior. Almost every state-of-the-art parsing model includes lexical information. The motivation is that, for example, the verb “bought” prefers [Noun_Phrase, bought, Noun_Phrase] as in “Bill bought a new car,” while the verb “slept” prefers [Noun_Phrase, slept, Adverb_Phrase] as in “Bill slept very well.” With a 10,000 word vocabulary, it is clear why the size of a lexicalized grammar is very large.

The model of Klein and Manning (2003b) appears to be an outlier since high accuracy is obtained by using a relatively small grammar. To achieve this result, Klein and Manning hand-tune a grammar by supplementing non-terminals with 17 different linguistically-motivated annotations. The grammar size remains small because each annotation is only added to non-terminals that seem to benefit (based on observation), as opposed to the universal annotation seen in models such as Gparent. Replication of their grammar is very difficult and rarely done. The general trend of accuracy improvements requiring an increase in grammar size is still seen for each additional annotation in their results, only on a smaller scale.

Large grammars not only increase computational search requirements, but are also prone to sparse data problems as probability estimates quickly become unreliable. If a grammar rule is unobserved in the training data, it receives zero probability. As a result, many well-formed testing sentences may fail to find even one valid parse tree since a required grammar rule may not be available.
Sparse data problems are typically addressed by “smoothing” the distribution of children for each non-terminal (Chen and Goodman, 1998). Smoothing robs a small amount of probability from observed grammar rules and uniformly distributes that probability over unobserved rules, allowing no rule to have zero probability. This technique allows every sentence to be given a parse structure (even ill-formed sentences) at the cost of an enormous increase in grammar size. A smoothed lexicalized grammar can easily exceed one hundred million grammar rules, making efficient hypergraph search an extremely difficult problem.

3 Coarse-to-Fine Parsing

Our Coarse-to-Fine (CTF) parsing algorithm efficiently finds the globally optimal parse tree of large, high-accuracy grammar models. CTF exploits the similarity between optimal parse trees of increasingly complex models to find high-accuracy model solutions quickly. For example, parse tree accuracy using Gwsj is only slightly lower than using Gparent, and the two optimal parse trees from these models, on average, share a considerable amount of structure (after removing contextual annotation). CTF searches for the optimal parse tree in the high-accuracy model space by following high probability hyperpaths suggested by simpler models, which are very efficient to compute.

We initialize our algorithm by building a complete hypergraph for sentence S with some simple grammar GX. Building this complete hypergraph is efficient because the number of grammar rules in GX is small. We then incrementally replace hyperedges from high probability parse trees in this hypergraph with corresponding hyperedges from a more complex model. Conditioning hyperedges on additional information modifies their probability, affecting the total probability of parse trees in which they participate. We continue this process until we construct a full parse tree composed only of hyperedges from the more complex model.
Any number of intermediate models can participate in the CTF algorithm to intelligently guide the search from a coarse to a fine model solution.

We evaluate CTF by implementing a two-tier algorithm to find the optimal parse tree τ̂ ∈ Τ(S, Gparent), guided by high probability parse trees from the hypergraph H(S, Gcoarse), where Gcoarse is a slight modification of Gwsj (discussed in Section 3.1). As described above, hyperedges in H(S, Gcoarse) are systematically replaced by corresponding parent-conditioned hyperedges from Gparent. This process creates a new hybrid hypergraph H(S, Gcoarse, Gparent, X), with a heterogeneous collection of hyperedges from both Gcoarse and Gparent, where X denotes the current state of the hypergraph (which grammar each hyperedge is from). Figure 5 shows two possible states of H(S, Gcoarse, Gparent) for the sentence “The aged bottle flies fast” (a) before and (b) after the hyperedge Noun_Phrase → [Determiner, Noun] from Gcoarse has been replaced with the hyperedge from Gparent conditioned on the parent non-terminal Sentence.

Unlike a single-PCFG hypergraph (e.g. H(S, Gwsj)), which has only one optimal parse tree τ̂ ∈ H(S, Gwsj), the optimal hybrid parse tree is dependent on the state of hypergraph H(S, Gcoarse, Gparent, X). Each time a hyperedge is replaced in H(S, Gcoarse, Gparent, X), the optimal parse tree may also change. We define τ̂X ∈ H(S, Gcoarse, Gparent, X) to be the hybrid parse tree with highest probability when the hypergraph is in state X. The order of hyperedge replacement is important. If we replace every hyperedge in H(S, Gcoarse) with hyperedges from Gparent, the entire hypergraph H(S, Gparent) will be constructed and we gain nothing over an exhaustive CYK search using the Gparent grammar.
Instead, the CTF algorithm chooses to replace hyperedges from high probability parse trees in H(S, Gcoarse) to lead us quickly to the highest probability parse tree in H(S, Gparent). We continue to replace hyperedges in H(S, Gcoarse, Gparent, X) until we have constructed a full hyperpath composed only of hyperedges from Gparent.

(a) (b)
Figure 5: A possible state of hypergraph H(S, Gcoarse, Gparent) where S = “The aged bottle flies fast” (a) before and (b) after the hyperedge Noun_Phrase → [Determiner, Noun] is replaced by the appropriate hyperedge from Gparent, Noun_Phrase(P=Sentence) → [Determiner(P=Noun_Phrase), Noun(P=Noun_Phrase)].

3.1 A Coarse Grammar

How can we be sure we have found the optimal parse tree τ̂ ∈ Τ(S, Gparent) when a potential Gparent parse tree is constructed in our hybrid hypergraph? Suppose we use Gwsj to construct our initial hypergraph. Refining the probability of a hyperedge from Gwsj by conditioning on a particular parent non-terminal can potentially increase or decrease the hyperedge probability. For example, the probability P(β = [Determiner, Noun] | A = Noun_Phrase, Parent = Sentence) may be higher or lower than the probability P(β = [Determiner, Noun] | A = Noun_Phrase) averaged over all parent non-terminals. Since a probability increase is possible, replacing a hyperedge from Gwsj in H(S, Gwsj, Gparent, X) with a hyperedge from Gparent could cause a previously non-optimal parse tree τ̂X to become the optimal parse tree τ̂X′ in this new state. Therefore, to find the optimal parse tree τ̂ ∈ Τ(S, Gparent) guided by the hypergraph H(S, Gwsj), it would be necessary to replace every hyperedge from Gwsj in H(S, Gwsj, Gparent, X) to guarantee no potential hybrid parse tree would increase in probability and become the maximum likelihood solution. As mentioned earlier, if we replace every hyperedge from Gwsj, we gain no efficiency benefit over the baseline CYK algorithm.
To avoid exhaustive replacement, we modify the grammar rule probabilities of Gwsj such that any additional parent-conditioning information will never increase the probability of a hyperedge. That is, when a hyperedge from Gwsj is replaced with a hyperedge from Gparent, the new hyperedge probability will either decrease or remain the same. We define this new grammar Gcoarse to be identical to Gwsj modulo the modification of all grammar rule probabilities to be

Dcoarse(A → β) = max_{n∈N} Dparent(A(P=n) → β) = max_{n∈N} P(β | A, Parent = n).   (3)

Note that using the max probability over parent-annotated grammar rules creates improper distributions over the grammar rules of Gcoarse because the total probability ∑_β P(β | A) often exceeds one. Because we use these probabilities as an upper-bound on the actual Gparent probability of a potential parse tree, a proper distribution is not necessary.

Guiding parse tree selection towards τ̂ ∈ Τ(S, Gparent) with Gcoarse instead of Gwsj guarantees that hyperedges (and consequently potential parse trees) will never increase in probability when parent information is supplied. Once the maximum likelihood parse tree of H(S, Gwsj, Gparent, X) is composed only of hyperedges from Gparent, no additional search is required – all other potential parse trees in H(S, Gwsj, Gparent, X) necessarily have an upper-bound of lower probability. One drawback of Gcoarse is the decrease in parse accuracy to 49.6%, compared to the 74.3% parse accuracy of Gwsj. Even though traversal suggestions from Gcoarse are, on average, worse than Gwsj, empirical results show that Gcoarse still guides the search towards the optimal Gparent parse tree very quickly.

3.2 Implementation Decisions of Coarse-To-Fine

Pseudocode for a two-tier CTF algorithm can be seen in Figure 6, where Ex is defined as a hyperedge from grammar Gx.
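Equation 3 can be sketched directly: for each unannotated rule, take the maximum probability over all of its parent annotations. The keying of Dparent by ((A, parent), β) pairs is a hypothetical encoding of the parent-annotated grammar.

```python
def make_coarse(D_parent):
    """Equation 3: D_coarse(A -> beta) = max over parents n of
    D_parent(A^(P=n) -> beta). D_parent maps ((A, parent), beta)
    to P(beta | A, Parent = parent)."""
    D_coarse = {}
    for ((A, parent), beta), p in D_parent.items():
        key = (A, beta)
        # Keep the largest parent-conditioned probability seen so far,
        # so the coarse score upper-bounds every refined score.
        D_coarse[key] = max(D_coarse.get(key, 0.0), p)
    return D_coarse
```

As the text notes, the resulting scores need not sum to one over β for a fixed A; they are admissible upper bounds, not a proper distribution.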
The three most significant components of CTF – choosing a hyperedge (line 6), splitting the hyperedge (line 7), and updating probabilities (line 8) – each have a range of possible implementation strategies. In the following section, we will discuss these options.

1  Induce PCFGs Gwsj and Gparent from training data
2  Construct Gcoarse using grammar rules from Gwsj with max probabilities from Gparent
3  For each sentence S in the testing set
4      Build the complete hypergraph H(S, Gcoarse) and find τ̂X
5      While ∃ Ecoarse ∈ τ̂X
6          Choose any Ecoarse ∈ τ̂X
7          Replace Ecoarse with the appropriate parent hyperedge Eparent, modifying the state X of the hybrid hypergraph
8          Propagate the probability difference between Ecoarse and Eparent through parse trees previously containing Ecoarse
9          Find the new optimal parse tree τ̂X ∈ H(S, Gcoarse, Gparent, X)

Figure 6: Pseudocode for the Coarse-to-Fine algorithm. Ex is a hyperedge from grammar Gx.

Methods to choose a hyperedge from the current maximum likelihood parse tree (line 6) range between two extremes: top-down and bottom-up. A top-down strategy chooses the Ecoarse closest to the root, which is necessarily connected to the root by Eparent hyperedges. A bottom-up strategy chooses the Ecoarse furthest from the root, which is necessarily connected to its yield by Eparent hyperedges. Any alternative method to choose a hyperedge falls between these two strategies.

A top-down strategy first focuses on hyperedges near the root, shared by a small number of parse trees. Finding the true high probability parse trees in H(S, Gparent) requires more total hyperedge replacements than using a bottom-up approach, because each hyperedge replacement moves few parse trees towards their true Gparent probability. The advantage of a top-down approach is that less computational time is required in the propagation step (line 8) since the probability difference is percolated up the tree, and these hyperedges are already near the root.
A bottom-up strategy initially updates hyperedges shared by a large number of parse trees at the bottom of the hypergraph. This strategy requires more computational time during the probability update step (line 8) because many more trees in the hypergraph share hyperedges near the leaves than hyperedges near the root. For every hyperedge replaced, this propagation moves the probability of a large number of trees closer to the actual probability of the trees in H(S, Gparent). As a result, fewer hyperedges need to be replaced before the globally optimal parse tree is found when using the bottom-up approach. We report empirical results on both top-down and bottom-up Ecoarse selection strategies in Section 4.

We also investigate two options when replacing an Ecoarse hyperedge with its parent-annotated version (line 7): (1) replace Ecoarse with all potential parent-annotated hyperedges, or (2) add only the one Eparent specified by Ecoarse and the parent non-terminal from the maximum likelihood parse tree, leaving Ecoarse in the hypergraph for other parse trees with different parent non-terminals. Empirical results show that replacing a hyperedge with all potential parent hyperedges reduces the total number of CTF iterations; we also find that most instantiated Eparent hyperedges never participate in a maximum likelihood parse tree before the optimal parse tree of H(S, Gparent) is discovered. These unnecessary Eparent hyperedges require space and time to propagate probability differences, and the computational requirements of method (1) outweigh its benefits. All CTF experiments we report in this paper use method (2) when replacing Ecoarse hyperedges.

The replacement of Ecoarse with Eparent potentially reduces the probability of all parse trees containing Ecoarse.
If we think of the probability assigned to a parse tree in H(S, Gwsj, Gparent, X) as an upper-bound on the actual probability of the tree given model Gparent, then modifying the probability of a hyperedge in this tree simply tightens that bound. Although propagating probability differences (line 8) through all parse trees containing a replaced hyperedge is optimal (in terms of tightening the bound on the largest number of parse trees), it is not necessary. The CTF algorithm only requires the probability of the current optimal parse tree to change – the unmodified probability of all other parse trees remains a poor, but correct, upper-bound. Nevertheless, each time a hyperedge is replaced, we propagate the probability difference through all affected parse trees. This strategy of propagation reduces the amount of Gparent search-space most effectively.

The propagation of one probability difference to all participating parse trees is bounded by ¼ Rparent n² operations and can dominate CTF computational time. Caraballo and Charniak (1998) also fall prey to this problem in their best-first parser and suggest delaying propagation until probability differences “become significant.” This significance threshold neither guarantees a maximum likelihood parse tree, nor ensures cubic-time parsing bounds. Klein and Manning’s A* algorithm avoids propagation entirely by ordering the hypergraph traversal such that all tail node probabilities are fixed once a hyperedge composed of these nodes is traversed. Consequently, the A* traversal order causes many more hyperedges to be processed than search methods using propagation. Efficient probability propagation through hypergraphs has not been well studied in the parsing community and clearly warrants more research.

4 Results

We compare CTF with both Klein and Manning’s A* algorithm and the baseline exhaustive CYK search. All algorithms find the same globally optimal parse tree τ̂ ∈ Τ(S, Gparent).
All experiments use sections 2 through 21 of the Wall Street Journal Penn Treebank corpus for training, and section 24 for testing. As is commonly done, function tags and empty nodes are removed from all data (Collins, 1999); we also remove quotes and map auxiliary verbs to a new non-terminal AUX. Furthermore, to avoid the complexities of lexical input, we use one-best part-of-speech tags instead of words, generated by the CSLU tagger (Hollingshead et al., 2005).

The computational performance of each algorithm can be broken into two pieces: the number of hyperedges processed, and the computational requirement of processing each hyperedge. The three algorithms we evaluate attempt to optimize algorithmic efficiency by balancing these two factors. Figure 7 summarizes the computational bounds of each performance component; Figure 8 reports empirical results for total computational time and number of hyperedges traversed.

Algorithm   Hyperedges Processed   Processing Cost of one Hyperedge
CYK         O(R n³)                O(1)
A*          O(R n³)                O(lg n)
CTF         O(R n³)                O(n²)

Figure 7: Computational complexity for the two components of efficient hypergraph search. A complete bound on each algorithm is the product of both. R is the size of the grammar and n is the length of the sentence.

Computational time is an obvious metric when comparing algorithms designed for efficiency, while tallying hyperedges gives intuition on how well the search is guided. We know that the complete hypergraph space is bounded by O(R n³) and the CYK algorithm exhaustively traverses every hyperedge. Both the A* and CTF algorithms efficiently search a constant reduction of this space, as seen in Figure 8 (d), but the search-space growth is still cubic. The complexity bound of processing one hyperedge, on the other hand, differs between each algorithm.
Figure 7 shows that the CYK algorithm will perform best as sentence length approaches infinity, but we are only interested in optimizing efficiency over sensible values of n. If an algorithm can prune the search space without increasing the processing cost of one hyperedge too much, the overall computational time of finding the optimal parse tree can improve.

Figure 8: Comparison between CTF: top-down, CTF: bottom-up, A*, and CYK, measuring (a) computational time and (b) hyperedges traversed, as a function of sentence length. Plots (c) and (d) are log-scale versions of (a) and (b), respectively.

Comparing CTF bottom-up and top-down hyperedge selection strategies, Figure 8 (a) shows top-down to perform more efficiently overall than bottom-up, yet in Figure 8 (b) we see that the number of hyperedges visited by a top-down strategy is greater. Although a top-down strategy traverses more hypergraph space, the number of parse trees affected by each hyperedge replacement is much lower, requiring less computational time for propagation. On average, a top-down strategy requires propagation through half as many hyperedges as a bottom-up strategy. This is because replacing hyperedges at the bottom of the graph requires many more propagations than replacing hyperedges near the root.
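A back-of-the-envelope count makes this asymmetry concrete. The helper below is a hypothetical illustration: it counts how many chart spans contain a given span, a rough proxy for how many constituents a probability change at that span can propagate into. A narrow span near the bottom of the chart sits inside many larger spans; the root span sits inside only itself.

```python
def containing_spans(n, i, j):
    """Count chart spans (a, b), 0 <= a < b <= n, that contain span (i, j).
    A rough proxy for the propagation work triggered by replacing a
    hyperedge over (i, j) in a sentence of length n."""
    return sum(1 for a in range(n) for b in range(a + 1, n + 1)
               if a <= i and j <= b)

n = 20
near_bottom = containing_spans(n, 9, 10)  # a width-1 span mid-sentence
near_root = containing_spans(n, 0, n)     # the root span contains only itself
```

For a 20-word sentence, the mid-sentence width-1 span lies inside over a hundred chart spans, while the root span lies inside exactly one, matching the observation that bottom-of-chart replacements are far more expensive to propagate.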
A bottom-up strategy suggests these expensive hyperedges at the beginning of the search, when less is known about the true distribution of parse trees in Τ(S, Gparent). A top-down strategy, on the other hand, spends more time splitting less expensive hyperedges near the root, and only traverses to the bottom of the current maximum likelihood parse tree for high-probability candidates composed primarily of Gparent hyperedges.

The analysis of total computational time in Figure 8 (a) shows the A* algorithm to perform better than CTF or CYK on sentences of length 10 to 25. As sentence length increases, the difference between the A* and CYK algorithms diminishes because the overhead of maintaining a sorted agenda of Rparent·n³ hyperedges (required by A* but not CYK) erodes the savings gained by exploring less search space. Both top-down and bottom-up implementations of CTF have a computational advantage over A* and CYK when the sentence length n is very small, as seen in Figure 8 (c), but quickly become inefficient as n grows. Since modifying the probability of one hyperedge can theoretically affect (1/4)·Rparent·n² other hyperedges in the hypergraph, probability propagation severely hinders performance on longer sentences.

In Figure 8 (b), on the other hand, we clearly see the advantage CTF has over both the CYK and A* algorithms when we compare hyperedges traversed in the Gparent space. This result confirms our hypothesis that following hyperedges suggested by a simpler model to find the optimal parse tree in a larger, more context-dependent grammar space is very effective. Both implementations of CTF prune the exhaustive Gparent search space by an order of magnitude. We are optimistic that future research will lead to alternative probability propagation methods, and that searching over this significantly reduced space will produce an efficient parsing algorithm.
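The sorted-agenda overhead discussed above is easy to see in miniature. The snippet below is a toy illustration of a best-first agenda (the rule strings and scores are made up, and this is not the A* parser itself): each insertion and removal from a binary-heap agenda of N entries costs O(lg N), which is exactly the per-hyperedge overhead A* pays relative to CYK's constant-time chart writes.

```python
import heapq

# Toy best-first agenda of (priority, hyperedge) pairs. Log-probabilities
# are negated so that the highest-probability edge is popped first.
agenda = []
for logprob, edge in [(-1.2, "NP -> DT NN"),
                      (-0.4, "S -> NP VP"),
                      (-2.3, "VP -> VB")]:
    heapq.heappush(agenda, (-logprob, edge))  # O(lg N) per push

# Pop edges in best-first order; each pop is also O(lg N).
best_first_order = [heapq.heappop(agenda)[1] for _ in range(len(agenda))]
```

With millions of hyperedges on the agenda for long sentences, these logarithmic factors accumulate and can outweigh the benefit of skipping part of the search space.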
5 Conclusion

We have presented a new algorithm that efficiently traverses a large hypergraph space by using the solutions of simpler models. To our knowledge, Coarse-to-Fine is the only exact inference parsing algorithm to leverage these simpler model solutions. We have discussed the advantages and disadvantages of CTF compared to standard parsing algorithms, as well as many possible variations on the CTF implementation. Although no computational efficiency is gained by parsing long sentences with our current algorithm, we have shown how the CTF approach dramatically reduces the number of hyperedges traversed.

Many avenues of future research exist. First, we believe a coarse grammar with probabilities closer to Gwsj would lead to an even greater search-space reduction. High-probability parse trees in H(S, Gwsj) are more similar to the optimal Gparent parse tree than high-probability parse trees in H(S, Gcoarse). Preserving the probabilities of Gwsj, while maintaining a guarantee of finding the globally optimal solution, is a significant challenge. Second, we plan to extend CTF to models with much larger grammars that include lexical information and probability smoothing. The CYK and A* algorithms have previously been shown to exhibit significant performance degradation when using the large grammars required for high-accuracy results. Since a Coarse-to-Fine search is guided by full parse trees of simpler models, we suspect that our algorithm will be relatively unaffected by this explosion in potential search space. Last, efficient probability propagation through a hypergraph is a neglected problem in natural language parsing. Reducing the computational complexity of this task would improve not only the overall performance of CTF but that of many other parsing algorithms as well.

References

Allen, J. Natural Language Understanding. The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA. 1995.

Charniak, Eugene, Sharon Goldwater, and Mark Johnson. Edge-Based Best-First Chart Parsing. In Proceedings of the Sixth Workshop on Very Large Corpora. 1998.

Chen, Stan and Joshua Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. Harvard University Technical Report TR-10-98. 1998.

Collins, Michael. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania. 1999.

Hart, Peter, Nils Nilsson, and Bertram Raphael. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics, SSC-4(2):100-107. 1968.

Hollingshead, Kristy, Seeger Fisher, and Brian Roark. Comparing and Combining Finite-State and Context-Free Parsers. In Proceedings of Human Language Technologies / Empirical Methods in Natural Language Processing (HLT/EMNLP-2005). 2005.

Johnson, Mark. PCFG Models of Linguistic Tree Representations. Computational Linguistics, 24:613-632. 1998.

Jurafsky, Daniel and James H. Martin. Speech and Language Processing. Prentice-Hall. 2000.

Kasami, T. An Efficient Recognition and Syntax Analysis Algorithm for Context-Free Languages. Technical Report AFCRL-65-758, Air Force Cambridge Research Laboratory, Bedford, MA. 1965.

Klein, Dan and Chris Manning. Parsing and Hypergraphs. In Proceedings of the International Workshop on Parsing Technologies (IWPT-2001). 2001.

Klein, Dan and Chris Manning. Factored A* Search for Models over Sequences and Trees. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03). 2003.

Klein, Dan and Chris Manning. Accurate Unlexicalized Parsing. In Proceedings of the Association for Computational Linguistics (ACL-03). 2003b.

Marcus, Mitchell, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330. 1993.

Ratnaparkhi, Adwait. Learning to Parse Natural Language with Maximum Entropy Models. Machine Learning, 34:151-175. 1999.

Roark, Brian. Probabilistic Top-Down Parsing and Language Modeling. Computational Linguistics, 27:249-276. 2001.

Younger, D.H. Recognition and Parsing of Context-Free Languages in Time n³. Information and Control, 10:189-208. 1967.