                              Coarse-to-Fine Efficient Viterbi Parsing
                                     Nathan Bodenstab
                         Center for Spoken Language Understanding
                           OGI School of Science & Engineering
                            Oregon Health & Science University
                                  Beaverton, Oregon 97006
                                   bodenstab@cslu.ogi.edu


Abstract

We present Coarse-to-Fine (CTF), a probabilistic parsing algorithm that performs exact inference

on complex, high-accuracy parsing models. Exact inference on these models is computationally

expensive, while solving simpler models is efficient and produces respectable results. CTF solves

a series of increasingly complex models by using the solutions of simpler models to guide

optimal parse tree search in a more complex model space. We compare our results with the CYK

and A* algorithms and find that our current implementation of CTF traverses the smallest amount

of complex model space, but fails to gain a significant advantage in computational efficiency.



1 Introduction

The task of natural language parsing is to assign syntactic information to a sentence by

hierarchically clustering and labeling its constituents. A constituent is a group of one or more

words that function together as a unit, defined by a grammar. The hierarchy of constituents for a

sentence is called a parse tree. Sentences often have multiple grammatically valid parse trees.

For example, two parse trees for “The aged bottle flies fast” are given in Figure 1. Differences in

syntactic structure often produce variations on sentence meaning. The semantic interpretation of

parse tree (a) is “The old bottle moves through the air quickly,” while (b) is understood as “The

elderly put flies into bottles at a furious pace.” Furthermore, many parse trees are grammatically

valid but have no meaningful interpretation. Parsing algorithms attempt to find the most likely,

meaningful parse tree for a sentence among all possibilities.


           [Parse tree diagrams for panels (a) and (b) not reproduced.]

           Figure 1: Two grammatically valid parse trees for the sentence “The aged bottle flies fast.”



        Syntactic information obtained by parsing a sentence is critical for many computational

approaches to language understanding. In speech recognition, parsing acts as a language model

and produces a distribution over upcoming words, biasing recognition results towards

grammatically well-formed utterances.            Parse structure is also an effective resource in word-

sense disambiguation, the task of correctly choosing a word’s intended meaning. Syntactic parse

structure and word-sense information are used in natural language applications such as machine

translation, question-answering, and document summarization.

        All parsing algorithms must balance task accuracy with computational time.

Improvements in parse tree accuracy often require a larger, more contextually-sensitive grammar.

Since a grammar defines the search space of possible parse trees, an increase in grammar size

often corresponds to an increase in search space. To find the optimal parse tree efficiently, either

the grammar must be small or the search must be well-guided. A great deal of parsing research is

based on the Cocke-Younger-Kasami (CYK) dynamic programming algorithm (Kasami 1965;

Younger 1967). The CYK algorithm exhaustively traverses the search space of potential parse

trees and finds the globally optimal parse tree with O(|R| n³) complexity bounds, where |R| is the

number of grammar rules (a constant) and n is the sentence length. This exhaustive approach is



sufficient when using small grammars, but becomes intractable for large grammars needed to

attain state-of-the-art accuracy. We discuss context-free grammars and methods to improve parse

accuracy in Section 2.
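To make the CYK recurrence concrete, the sketch below is a minimal Viterbi-CYK chart filler. It assumes a PCFG already converted to Chomsky normal form; the rule containers, names, and log-probability representation are illustrative and not taken from this paper.

```python
import math
from collections import defaultdict

def viterbi_cyk(words, lexical_rules, binary_rules):
    """Minimal Viterbi-CYK sketch for a PCFG in Chomsky normal form.

    lexical_rules: dict  word -> list of (A, log_prob) for rules A -> word
    binary_rules:  list of (A, B, C, log_prob) for rules A -> B C
    Returns a chart mapping (start, end, non-terminal) -> best log-probability.
    """
    n = len(words)
    best = defaultdict(lambda: -math.inf)

    # Width-1 spans: apply the lexical (pre-terminal) rules.
    for i, w in enumerate(words):
        for A, logp in lexical_rules.get(w, []):
            best[(i, i + 1, A)] = max(best[(i, i + 1, A)], logp)

    # Wider spans: combine two adjacent sub-spans with every binary rule,
    # which gives the O(|R| n^3) bound discussed above.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):                  # split point
                for A, B, C, logp in binary_rules:
                    score = best[(i, k, B)] + best[(k, j, C)] + logp
                    if score > best[(i, j, A)]:
                        best[(i, j, A)] = score
    return best
```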

        Strategies to improve search efficiency are prevalent. Charniak et al. (1998) and Collins

(1999) use a best-first search, which guides incremental parse tree construction using local

heuristic scores to rank each partial subtree. Roark (2001) and Ratnaparkhi (1999) implement a

beam search that pursues only the top b full parse tree candidates at any time, pruning potential

candidates when the candidate’s heuristic score falls below a local threshold. Both best-first

search and beam search are greedy strategies that cannot guarantee a globally optimal solution.

        More recently, Klein and Manning applied the A* graph traversal algorithm to the

parsing problem and were the first to do exact inference over long sentences with large, high-

accuracy grammars (Klein and Manning, 2003). The A* search is similar to the best-first search

in many ways, but guarantees that the heuristic score will always be an underestimate of the true

score, therefore never pruning the globally optimal solution (Hart et al., 1968). Although the

average computational time required to perform an A* search is significantly less than the

exhaustive CYK algorithm, the A* search is still much slower than its greedy-search competitors.

        One common characteristic among all current efficient search strategies is the attempt to

solve the high-accuracy parsing problem directly: the search for a good parse tree is done using

one very large grammar. Although models using small grammars produce lower-accuracy parse

trees, these parse trees often share a significant amount of structure with their high-accuracy

counterparts. Solving the parsing problem directly does not take advantage of lower-accuracy

model solutions, which are much faster to compute and can help direct the search for high-

accuracy parse trees in a large grammar space.

        We present Coarse-to-Fine (CTF), a probabilistic parsing algorithm that finds the

globally optimal solution of high-accuracy parsing models. CTF improves search efficiency by

first solving a series of simpler models and using these results to guide parse tree search in a large


grammar space.     To our knowledge, leveraging the similarity between incremental model

solutions has not been seriously exploited in previous parsing algorithms.        CTF uses this

information to guide optimal parse tree selection better than the CYK or A* algorithms.



2 Probabilistic Parsing and Hypergraphs

2.1 Probabilistic Parsing

The parsing community abandoned hand-crafted grammars circa 1990 for more robust, data-

driven approaches.    Today, most parsing systems use a probabilistic context-free grammar

(PCFG) induced from human-annotated training data. Formally, a probabilistic context-free

grammar G is defined (Jurafsky and Martin, 2000) as a tuple (N, M, Φ , R, D) where


    •   N is a set of non-terminal symbols
            Example: {Proper_Noun, Noun_Phrase, Verb_Phrase, Sentence,…}

    •   M is a set of terminal symbols (M ∩ N = ∅)
            Example: {the, and, KMart, housekeeping, final,…}

    •   Φ is a distinguished start symbol ( Φ ∈ N )
            Example: Sentence

    •   R is a finite set of grammar rules, each of the form A → β where A is a singleton and
        β is an ordered list (A ∈ N, βᵢ ∈ (N ∪ M), 0 ≤ |β| < ∞)
            Example: {Sentence → [Noun_Phrase, Verb_Phrase],
                        Noun_Phrase → [Preposition, Determiner, Noun],
                        Determiner → [the],
                        …}

    •   D is a function assigning probability to all ( A → β ) ∈ R where grammar rule
        probabilities are interpreted as P( β | A ) and form proper distributions.
            Example: D(Determiner → [the])
                                   ≡ P( β = [the] | A = Determiner) = 0.5023




The members of N, M, Φ, and R are extracted from training data¹; D is often learned using

maximum likelihood estimation (relative frequency counts).
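As a minimal sketch of relative-frequency estimation of D (the rule encoding below is illustrative, not the paper's implementation):

```python
from collections import Counter

def estimate_rule_probabilities(observed_rules):
    """Maximum likelihood (relative frequency) estimate of D.

    observed_rules: iterable of (A, beta) pairs read off the training trees,
    where A is a non-terminal and beta is a tuple of children.
    Returns D as a dict: (A, beta) -> P(beta | A).
    """
    observed_rules = list(observed_rules)
    rule_counts = Counter(observed_rules)
    lhs_counts = Counter(A for A, _ in observed_rules)
    return {(A, beta): count / lhs_counts[A]
            for (A, beta), count in rule_counts.items()}

# Illustrative rules from a tiny treebank fragment:
rules = [("Determiner", ("the",)), ("Determiner", ("the",)), ("Determiner", ("a",)),
         ("Noun_Phrase", ("Determiner", "Noun"))]
D = estimate_rule_probabilities(rules)
# D[("Determiner", ("the",))] == 2/3
```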

         We define Τ(S, G) to be the set of all parse trees for S permitted by grammar G. Every
parse tree τ_{S,G} ∈ Τ(S, G) must have root Φ, leaf nodes s_i (where s_i is the i-th word of sentence S),
and be constructed of grammar rules in R. When context permits, we will simplify notation of a
parse tree to τ. The joint probability of parse tree τ_{S,G} and sentence S is the product of all
grammar rules in τ_{S,G}:

         P(τ, S) = P(S | τ) P(τ) = P(τ) = ∏_{(A→β) ∈ τ} D(A → β).                              (1)



         Note that P(S | τ) is unity since all valid parse trees necessarily include the words of
sentence S. The maximum likelihood parse tree τ̂ is the parse tree in Τ(S, G) with highest
probability. Using Equation 1 and ignoring P(S), since it remains constant,

         τ̂ = argmax_{τ ∈ Τ(S,G)} P(τ | S) = argmax_{τ ∈ Τ(S,G)} P(τ, S) / P(S)
            = argmax_{τ ∈ Τ(S,G)} P(τ, S) = argmax_{τ ∈ Τ(S,G)} P(τ).                          (2)
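Equation 1 can be read directly off a parse tree: the (log) probability is the sum of log rule probabilities over every rule used. A small sketch with an illustrative nested-tuple tree encoding:

```python
import math

def tree_log_prob(tree, D):
    """Log of Equation 1: sum of log D(A -> beta) over all rules in the tree.

    tree is (label, children), where each child is either a terminal string or
    another (label, children) pair; D maps (A, beta) -> probability.
    """
    label, children = tree
    beta = tuple(c if isinstance(c, str) else c[0] for c in children)
    logp = math.log(D[(label, beta)])
    for child in children:
        if not isinstance(child, str):
            logp += tree_log_prob(child, D)   # recurse into sub-constituents
    return logp
```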



         The number of parse trees in Τ(S, G ) increases exponentially with the length of S.

Finding the optimal tree in this set becomes intractable for simple brute-force algorithms. To

accomplish this task efficiently, we must exploit the fact that many parse trees in Τ(S, G ) contain

identical substructure.       For example, both parse trees in Figure 1 use the grammar rule

Determiner → [the] to connect the first word of the sentence. A parse forest or hypergraph can

be used to eliminate redundant data and facilitate efficient computation of the maximum

likelihood parse tree. Many parsing algorithms use this compact representation to find the best

parse tree from an exponential number of trees in cubic-bound time.



¹ The Wall Street Journal (WSJ) Penn treebank corpus (Marcus et al., 1993) has become the standard data
set to train and test parsing algorithms. It consists of approximately 50,000 sentences that have been hand-
labeled with parsing structure, similar to Figure 1.


2.2 Hypergraphs

Probabilistic parsing can be viewed as a search over directed hypergraphs (Klein and Manning,

2001). Directed hypergraphs are similar to directed graphs with the following exception: while

an edge in a directed graph connects a single tail node to a single head node, a hyperedge

connects a set of tail nodes to a set of head nodes. A hyperedge is only traversable after all tail

nodes have been visited. In parsing, hyperedges represent PCFG grammar rules. All hyperedges,

therefore, have exactly one head node (A) and possibly multiple tail nodes (βᵢ). Figure 2 shows

the hypergraph representation of both parse solutions from Figure 1.




           Figure 2: A partial hypergraph for the sentence “The aged bottle flies fast” representing two
           parse trees. Numbers in parentheses are for reference only and not actually used in the
           grammar.



        To represent parse trees in a hypergraph, we must constrain the hypergraph by requiring

each hyperedge to have complete, non-overlapping word coverage. These requirements ensure

every word in S participates once, and only once, in every valid parse tree of S. We define the

yield of a hyperedge to be the words connected to that hyperedge. In Figure 2, for example, the

yield of Noun Phrase (2) is [the, aged, bottle]. A hyperedge has complete word coverage when



its yield is contiguous. A hyperedge without complete word coverage (not drawn) would be

Sentence → [Noun Phrase (1), Verb Phrase (2)] since the word “bottle” would not be included

in the yield of Sentence. For a hyperedge to have non-overlapping word coverage, the yield of

every tail node in the hyperedge must not intersect.        A hyperedge with overlapping word

coverage (not drawn) would be Sentence → [Noun Phrase (1), Noun Phrase (2)] since the words

“the” and “aged” would be in both tail node yields.

        Using this hypergraph framework, parsing sentence S now becomes a search problem

starting at S and ending at Φ .      Reachability corresponds to parsability and shortest paths

correspond to best parses (Klein and Manning, 2001).      Note that the S-to- Φ traversal described

in this paper is a “bottom-up” traversal strategy, and a grammar rule A → β is interpreted as

“once all nodes in β are traversed, A can be traversed.” A “top-down” strategy ( Φ -to- S) is also

possible, where A → β means “once A is traversed, all nodes in β can be traversed.” We

assume a bottom-up traversal strategy for all algorithms to facilitate comparison.

        We define Η (S, G ) to be the hypergraph for S containing all hyperpaths permitted by

grammar G. This hypergraph space may be infinite if G contains rules with |β| ≤ 1 (for example,

an infinite chain of hyperedges Noun_Phrase → [Noun_Phrase]). In practice, we can neither

store nor rank an infinite number of hyperpaths, and these unary and empty rules are dealt with on

a special-case basis. The hypergraph Η (S, G ) necessarily contains all valid parse trees in

Τ(S, G ) , as well as many other hyperpaths not ending at Φ .
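A minimal sketch of a hyperedge record and the coverage constraints above; the span-based encoding is an assumption made for illustration, not the paper's data structure.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Hyperedge:
    """One application of a grammar rule A -> beta over a span of the sentence.

    head:  (non-terminal, start, end)     -- the single head node A
    tails: ((symbol, start, end), ...)    -- one entry per beta_i
    logp:  log D(A -> beta)
    """
    head: Tuple[str, int, int]
    tails: Tuple[Tuple[str, int, int], ...]
    logp: float

def has_valid_coverage(edge: Hyperedge) -> bool:
    """Check the Section 2.2 constraints: tail yields must not overlap and must
    contiguously tile the head span (complete, non-overlapping word coverage)."""
    _, head_start, head_end = edge.head
    position = head_start
    for start, end in sorted((s, e) for _, s, e in edge.tails):
        if start != position:          # a gap (incomplete) or an overlap
            return False
        position = end
    return position == head_end
```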



2.3 High-Accuracy Parsing

Modeling the context-dependent syntax of language with a context-free grammar is obviously not

the optimal choice (Allen, 1995). For the time being, however, the efficiency of context-free

parsing outweighs its deficiencies. In order to stay within the context-free framework, yet include




more context-dependent information, a number of modifications to the Wall Street Journal (WSJ)

Penn treebank corpus have been suggested. These modifications aid PCFG rule-probability

estimation by refining the conditional probability distributions over grammar rules to be more

context-sensitive. Modifying the training data and inducing a new PCFG creates a new parsing

model with a (potentially) more accurate optimal parse tree. We define the PCFG Gwsj to be

induced from the original WSJ treebank, and subsequent models Gx to be derived from some

modification, X, of this treebank.

         One popular modification is parent annotation (Johnson, 1998). Johnson reports, for

example, that the distribution of children ( β ) under a prepositional phrase is highly influenced

by the parent node of the prepositional phrase. Standard PCFG induction from the WSJ Treebank

disregards parent information; only one distribution over children is estimated for each non-

terminal. To address this problem, Johnson suggests the following:


         1. Replace all non-terminals in the training data with new non-terminals that explicitly
            include the direct parent non-terminal,
         2. Induce a PCFG Gparent from this context-enriched data; probabilities are now
            conditioned on additional parent information:
                         Dparent(A^{P=Parent} → β) = P(β | A, Parent),
         3. Find the optimal parse tree τ̂_{S,Gparent} using the new model Gparent,
         4. For evaluation, map the parent-specific non-terminals in τ̂_{S,Gparent} (A^{P=Parent}) back to
            the original non-terminal set from Gwsj (A), and calculate parse accuracy.


Two parse trees are shown in Figure 3 with (a) non-terminals from Gwsj and (b) non-terminals

from Gparent. Adding the context of a parent label to each non-terminal potentially squares the

number of non-terminals in the new grammar and significantly increases the number of grammar

rules.
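A minimal sketch of step 1 of the recipe above; the nested-tuple tree encoding and the '^' separator are illustrative choices, not the paper's.

```python
def annotate_parents(tree, parent_label="ROOT"):
    """Rewrite every non-terminal X as 'X^parent' before re-estimating the PCFG.

    tree is (label, children), where each child is either a terminal string or
    another (label, children) pair; the root's pseudo-parent is 'ROOT'.
    """
    label, children = tree
    annotated_children = [child if isinstance(child, str)
                          else annotate_parents(child, label)
                          for child in children]
    return (f"{label}^{parent_label}", annotated_children)
```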




           [Parse tree diagrams for panels (a) and (b) not reproduced.]

           Figure 3: A parse tree labeled with (a) the original WSJ treebank non-terminal set, and (b)
           an augmented Gparent non-terminal set. The non-terminal notation X^{P=Y} in (b) means the
           grammar rule headed by X has a parent grammar rule headed by Y in this parse tree.



        In addition to parent annotation, many other empirically useful annotations are used.

Figure 4 is a summary of commonly used parsing models and their respective statistics when

evaluated on the WSJ treebank. The general trend of Figure 4 is that relatively modest gains in

accuracy require an increase in grammar size and, consequently, a decrease in parsing efficiency.

The most dramatic increase in grammar size is seen in the Glexical model. Lexical annotation



                      Model                       Grammar Size    Accuracy (%)
            WSJ                                       14,526            74.3
            Parent                                    29,829            78.6
            Head POS + Parent                         55,989            80.8
            Klein and Manning (2003b)                 51,766            86.3
            Lexical (Collins, 1999)                 500,000+            88.6

           Figure 4: Popular grammar models, each with a corresponding grammar size and parsing
           accuracy on the WSJ treebank with sentences of length 40 words and shorter.




modifies non-terminals to include the actual word of the constituent that most influences the

constituent’s syntactic behavior. Almost every state-of-the-art parsing model includes lexical

information. The motivation is that, for example, the verb “bought” prefers [Noun_Phrase,

bought, Noun_Phrase] as in “Bill bought a new car,” while the verb “slept” prefers

[Noun_Phrase, slept, Adverb_Phrase] as in “Bill slept very well.”                 With a 10,000 word

vocabulary, it is clear why the size of a lexicalized grammar is very large.

        The model of Klein and Manning (2003b) appears to be an outlier since high accuracy is

obtained by using a relatively small grammar. To achieve this result, Klein and Manning hand-

tune a grammar by supplementing non-terminals with 17 different linguistically-motivated

annotations. The grammar size remains small because each annotation is only added to non-

terminals that seem to benefit (based on observation), as opposed to the universal annotation seen

in models such as Gparent. Replication of their grammar is very difficult and rarely done. The

general trend of accuracy improvements requiring an increase in grammar size is still seen for

each additional annotation in their results, only on a smaller scale.

        Large grammars not only increase computational search requirements, but are also prone

to sparse data problems as probability estimates quickly become unreliable. If a grammar rule is

unobserved in the training data, it receives zero probability. As a result, many well-formed

testing sentences may fail to find even one valid parse tree since a required grammar rule may not

be available.




        Sparse data problems are typically addressed by “smoothing” the distribution of children

for each non-terminal (Chen and Goodman, 1998). Smoothing robs a small amount of probability

from observed grammar rules, and uniformly distributes that probability over unobserved rules,

allowing no rule to have zero probability. This technique allows every sentence to be given a

parse structure (even ill-formed sentences) at the cost of an enormous increase in grammar size.

A smoothed lexicalized grammar can easily exceed one hundred million grammar rules, making

efficient hypergraph search an extremely difficult problem.
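The paper does not say which smoothing method is used; purely as an illustration of the idea, here is a minimal additive (add-lambda) sketch for one non-terminal's distribution over children:

```python
def smooth_children_distribution(counts, all_betas, lam=0.1):
    """Additive smoothing of P(beta | A) for a single non-terminal A.

    counts:    dict beta -> observed count of A -> beta in training
    all_betas: every beta (observed or not) that should receive probability
    Every rule gets non-zero probability, and the result sums to one.
    """
    total = sum(counts.values()) + lam * len(all_betas)
    return {beta: (counts.get(beta, 0) + lam) / total for beta in all_betas}
```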



3 Coarse-to-Fine Parsing

Our Coarse-to-Fine (CTF) parsing algorithm efficiently finds the globally optimal parse tree of

large, high-accuracy grammar models. CTF exploits the similarity between optimal parse trees of

increasingly complex models to find high-accuracy model solutions quickly. For example, parse

tree accuracy using Gwsj is only slightly lower than using Gparent and the two optimal parse trees

from these models, on average, share a considerable amount of structure (after removing

contextual annotation). CTF searches for the optimal parse tree in the high-accuracy model space

by following high probability hyperpaths suggested by simpler models, which are very efficient

to compute.

        We initialize our algorithm by building a complete hypergraph for sentence S with some

simple grammar GX. Building this complete hypergraph is efficient because the number of

grammar rules in GX is small. We then incrementally replace hyperedges from high probability

parse trees in this hypergraph with corresponding hyperedges from a more complex model.

Conditioning hyperedges on additional information modifies their probability, affecting the total

probability of parse trees in which they participate. We continue this process until we construct a

full parse tree composed only of hyperedges from the more complex model. Any number of

intermediate models can participate in the CTF algorithm to intelligently guide the search from a

coarse to a fine model solution.


         We evaluate CTF by implementing a two-tier algorithm to find the optimal parse tree
τ̂ ∈ Τ(S, Gparent), guided by high probability parse trees from the hypergraph H(S, Gcoarse),
where Gcoarse is a slight modification of Gwsj (discussed in Section 3.1). As described above,
hyperedges in H(S, Gcoarse) are systematically replaced by corresponding parent-conditioned
hyperedges from Gparent. This process creates a new hybrid hypergraph H(S, Gcoarse, Gparent, X),
with a heterogeneous collection of hyperedges from both Gcoarse and Gparent, where X denotes the
current state of the hypergraph (which grammar each hyperedge is from). Figure 5 shows two
possible states of H(S, Gcoarse, Gparent) for the sentence “The aged bottle flies fast” (a) before and
(b) after the hyperedge Noun_Phrase → [Determiner, Noun] from Gcoarse has been replaced with
the hyperedge from Gparent conditioned on the parent non-terminal Sentence.

         Unlike a single-PCFG hypergraph (e.g. H(S, Gwsj)), which has only one optimal parse
tree τ̂ ∈ H(S, Gwsj), the optimal hybrid parse tree is dependent on the state of hypergraph
H(S, Gcoarse, Gparent, X). Each time a hyperedge is replaced in H(S, Gcoarse, Gparent, X), the
optimal parse tree may also change. We define τ̂_X ∈ H(S, Gcoarse, Gparent, X) to be the hybrid
parse tree with highest probability when the hypergraph is in state X.

         The order of hyperedge replacement is important. If we replace every hyperedge in
H(S, Gcoarse) with hyperedges from Gparent, the entire hypergraph H(S, Gparent) will be
constructed and we gain nothing over an exhaustive CYK search using the Gparent grammar.
Instead, the CTF algorithm chooses to replace hyperedges from high probability parse trees in
H(S, Gcoarse) to lead us quickly to the highest probability parse tree in H(S, Gparent). We
continue to replace hyperedges in H(S, Gcoarse, Gparent, X) until we have constructed a full
hyperpath composed only of hyperedges from Gparent.




[Hypergraph diagrams for states (a) and (b) not reproduced.]

Figure 5: A possible state of hypergraph H(S, Gcoarse, Gparent) where S = “The aged bottle
flies fast” (a) before and (b) after the hyperedge Noun_Phrase → [Determiner, Noun] is
replaced by the appropriate hyperedge from Gparent,
Noun_Phrase^{P=Sentence} → [Determiner^{P=Noun_Phrase}, Noun^{P=Noun_Phrase}].




3.1 A Coarse Grammar

How can we be sure we have found the optimal parse tree τ̂ ∈ Τ(S, Gparent) when a potential
Gparent parse tree is constructed in our hybrid hypergraph? Suppose we use Gwsj to construct our
initial hypergraph. Refining the probability of a hyperedge from Gwsj by conditioning on a
particular parent non-terminal can potentially increase or decrease the hyperedge probability. For
example, the probability P(β = [Determiner, Noun] | A = Noun_Phrase, Parent = Sentence)
may be higher or lower than the probability P(β = [Determiner, Noun] | A = Noun_Phrase)
averaged over all parent non-terminals. Since a probability increase is possible, replacing a
hyperedge from Gwsj in H(S, Gwsj, Gparent, X) with a hyperedge from Gparent could cause a
previously non-optimal parse tree τ_X to become the optimal parse tree τ̂_{X′} in this new state.
Therefore, to find the optimal parse tree τ̂ ∈ Τ(S, Gparent) guided by the hypergraph H(S, Gwsj),
it would be necessary to replace every hyperedge from Gwsj in H(S, Gwsj, Gparent, X) to
guarantee no potential hybrid parse tree would increase in probability and become the maximum
likelihood solution.

        As mentioned earlier, if we replace every hyperedge from Gwsj, we gain no efficiency

benefit over the baseline CYK algorithm. To avoid exhaustive replacement, we modify the

grammar rule probabilities of Gwsj such that any additional parent-conditioning information will

never increase the probability of a hyperedge. That is, when a hyperedge from Gwsj is replaced

with a hyperedge from Gparent, the new hyperedge probability will either decrease or remain the

same. We define this new grammar Gcoarse to be identical to Gwsj modulo the modification of all

grammar rule probabilities to be

         Dcoarse(A → β) = max_{n∈N} Dparent(A^{P=n} → β) = max_{n∈N} P(β | A, Parent = n).      (3)




Note that using the max probability over parent-annotated grammar rules creates improper
distributions over the grammar rules of Gcoarse, because for a given non-terminal A the total
probability ∑_β Dcoarse(A → β) often exceeds one. Because we use these probabilities as an
upper bound on the actual Gparent probability of a potential parse tree, a proper distribution is
not necessary.
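A minimal sketch of Equation 3, assuming parent-annotated rule probabilities are stored keyed by (non-terminal, parent) and children tuple (an illustrative encoding):

```python
def build_coarse_probabilities(D_parent):
    """Equation 3: Dcoarse(A -> beta) is the max over parents of P(beta | A, parent).

    D_parent: dict ((A, parent), beta) -> P(beta | A, parent)
    Returns a dict (A, beta) -> upper bound on any parent-conditioned probability;
    the result is deliberately not a proper distribution.
    """
    D_coarse = {}
    for ((A, parent), beta), p in D_parent.items():
        key = (A, beta)
        D_coarse[key] = max(D_coarse.get(key, 0.0), p)
    return D_coarse
```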

           Guiding parse tree selection towards τ̂ ∈ Τ(S, Gparent) with Gcoarse instead of Gwsj
guarantees that hyperedges (and consequently potential parse trees) will never increase in
probability when parent information is supplied. Once the maximum likelihood parse tree of
H(S, Gwsj, Gparent, X) is composed only of hyperedges from Gparent, no additional search is
required – all other potential parse trees in H(S, Gwsj, Gparent, X) necessarily have an upper
bound of lower probability. One drawback of Gcoarse is the decrease in parse accuracy to 49.6%
compared to the 74.3% parse accuracy of Gwsj. Even though traversal suggestions from Gcoarse
are, on average, worse than Gwsj, empirical results show that Gcoarse still guides the search towards
the optimal Gparent parse tree very quickly.



3.2 Implementation Decisions of Coarse-To-Fine

Pseudocode for a two-tier CTF algorithm can be seen in Figure 6, where Ex is defined as a

hyperedge from grammar Gx. The three most significant components of CTF – choosing a

hyperedge (line 6), splitting the hyperedge (line 7), and updating probabilities (line 8) – each have

a range of possible implementation strategies. In the following section, we will discuss these

options.
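As a rough illustration of the control flow in Figure 6, the toy sketch below works over an explicitly enumerated list of candidate trees rather than the packed hypergraph the paper uses, so it ignores structure sharing and probability propagation entirely. It only demonstrates why refining coarse upper bounds one rule at a time, always on the current best candidate, is safe: because Gcoarse scores never increase under refinement (Section 3.1), the first fully refined candidate popped is globally optimal. All names and encodings are illustrative.

```python
import heapq

def coarse_to_fine_best_tree(candidate_trees, coarse_logp, parent_logp):
    """Toy coarse-to-fine search over an enumerated candidate list (illustration only).

    candidate_trees: list of trees, each a list of (rule, parent) pairs
    coarse_logp:     dict rule -> coarse (upper-bound) log-probability
    parent_logp:     dict (rule, parent) -> refined log-probability, never larger
                     than coarse_logp[rule] by the Equation 3 construction
    Returns the optimal fully refined tree and its log-probability.
    """
    heap = []
    for index, tree in enumerate(candidate_trees):
        upper_bound = sum(coarse_logp[rule] for rule, _ in tree)
        heapq.heappush(heap, (-upper_bound, 0, index))   # (neg. score, rules refined, tree)

    while heap:
        neg_score, refined, index = heapq.heappop(heap)
        tree = candidate_trees[index]
        if refined == len(tree):                         # every rule now uses the fine model
            return tree, -neg_score
        rule, parent = tree[refined]                     # refine one more coarse rule
        new_score = -neg_score - coarse_logp[rule] + parent_logp[(rule, parent)]
        heapq.heappush(heap, (-new_score, refined + 1, index))
    return None, float("-inf")
```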

           Methods to choose a hyperedge from the current maximum likelihood parse tree (line 6)

range between two extremes: top-down and bottom-up. A top-down strategy chooses Ecoarse

closest to the root, and is necessarily connected to the root by Eparent hyperedges. A bottom-up




      1    Induce PCFGs Gwsj and Gparent from training data
      2    Construct Gcoarse using grammar rules from Gwsj with max probabilities from Gparent
      3    For each sentence S in the testing set
      4            Build the complete hypergraph H(S, Gcoarse) and find τ̂_X
      5            While ∃ Ecoarse ∈ τ̂_X
      6                      Choose any Ecoarse ∈ τ̂_X
      7                      Replace Ecoarse with the appropriate parent hyperedge Eparent, modifying the
                             state X of the hybrid hypergraph
      8                      Propagate the probability difference between Ecoarse and Eparent through
                             parse trees previously containing Ecoarse
      9                      Find the new optimal parse tree τ̂_X ∈ H(S, Gcoarse, Gparent, X)

           Figure 6: Pseudocode for the Coarse-to-Fine algorithm. Ex is a hyperedge from grammar Gx.



strategy chooses Ecoarse furthest from the root, and is necessarily connected to its yield by Eparent

hyperedges. Any alternative method to choose a hyperedge falls between these two strategies.

         A top-down strategy first focuses on hyperedges near the root, shared by a small number

of parse trees. Finding the true high probability parse trees in H(S, Gparent) requires more total
hyperedge replacements than using a bottom-up approach, because each hyperedge replacement

moves few parse trees towards their true Gparent probability. The advantage of a top-down

approach is that less computational time is required in the propagation step (line 8) since

probability difference is percolated up the tree, and these hyperedges are already near the root.

         A bottom-up strategy initially updates hyperedges shared by a large number of parse trees

at the bottom of the hypergraph. This strategy requires more computational time during the

probability update step (line 8) because many more trees in the hypergraph share hyperedges near

the leaves than hyperedges near the root. For every hyperedge replaced, this propagation moves

the probability of a large number of trees closer to the actual probability of the trees in

H (S , G parent ) . As a result, fewer hyperedges need to be replaced before the globally optimal



parse tree is found when using the bottom-up approach. We report empirical results on both top-

down and bottom-up Ecoarse selection strategies in Section 4.

         We also investigate two options when replacing an Ecoarse hyperedge with its parent

annotated version (line 7): (1) replace Ecoarse with all potential parent annotated hyperedges, or (2)

add only one Eparent specified by Ecoarse and the parent non-terminal from the maximum likelihood

parse tree, leaving Ecoarse in the hypergraph for other parse trees with different parent non-

terminals. Empirical results show that replacing a hyperedge with all potential parent hyperedges

reduces the total number of CTF iterations; we also find that most instantiated Eparent hyperedges

never participate in a maximum likelihood parse tree before the optimal parse tree of

H (S , G parent ) is discovered. These unnecessary Eparent hyperedges require space and time to

propagate probability differences, and the computational requirements of method (1) outweigh its

benefits. All CTF experiments we report in this paper use method (2) when replacing Ecoarse

hyperedges.

        The replacement of Ecoarse with Eparent potentially reduces the probability of all parse trees

containing Ecoarse. If we think of the probability assigned to a parse tree in H(S, Gwsj, Gparent, X)
as an upper-bound to the actual probability of the tree given model Gparent, then modifying the

probability of a hyperedge in this tree simply tightens that bound.               Although propagating

probability differences (line 8) through all parse trees containing a replaced hyperedge is optimal

(in terms of tightening the bound on the most number of parse trees), it is not necessary. The

CTF algorithm only requires the probability of the current optimal parse tree to change – the

unmodified probability of all other parse trees remains a poor, but correct, upper-bound.

Nevertheless, each time a hyperedge is replaced, we propagate the probability difference through

all affected parse trees. This strategy of propagation reduces the amount of Gparent search-space

most effectively.




           The propagation of one probability difference to all participating parse trees is bounded
by (1/4)|Rparent| n² operations and can dominate CTF computational time. Caraballo and Charniak

(1998) also fall prey to this problem in their best-first parser and suggest delaying propagation

until probability differences “become significant.” This significance threshold neither guarantees

a maximum likelihood parse tree, nor ensures cubic-time parsing bounds. Klein and Manning’s

A* algorithm avoids propagation entirely by ordering the hypergraph traversal such that all tail

node probabilities are fixed once a hyperedge composed of these nodes is traversed.

Consequently, the A* traversal order causes many more hyperedges to be processed than search

methods using propagation. Efficient probability propagation through hypergraphs has not been

well studied in the parsing community and clearly warrants more research.



4 Results

We compare CTF with both Klein and Manning’s A* algorithm and the baseline exhaustive CYK

search. All algorithms find the same globally optimal parse tree τ̂ ∈ Τ(S, Gparent). All

experiments use sections 2 through 21 of the Wall Street Journal Penn Treebank corpus for

training, and section 24 for testing. As is commonly done, function tags and empty nodes are

removed from all data (Collins, 1999); we also remove quotes and map auxiliary verbs to a new

non-terminal AUX. Furthermore, to avoid the complexities of lexical input, we use one-best part-

of-speech tags instead of words, generated by the CSLU tagger (Hollingshead et al., 2005).

          The computational performance of each algorithm can be broken into two pieces: the

number of hyperedges processed, and the computational requirement of processing each

hyperedge. The three algorithms we evaluate attempt to optimize algorithmic efficiency by

balancing these two factors. Figure 7 summarizes the computational bounds of each performance

component; Figure 8 reports empirical results for total computational time and number of

hyperedges traversed. Computational time is an obvious metric when comparing algorithms


               Algorithm          Hyperedges Processed          Processing Cost of One Hyperedge
              CYK                 O(|R| n³)                     O(1)
              A*                  O(|R| n³)                     O(lg n)
              CTF                 O(|R| n³)                     O(n²)

             Figure 7: Computational complexity for the two components of efficient hypergraph
             search. A complete bound on each algorithm is the product of both. |R| is the size of the
             grammar and n is the length of the sentence.



designed for efficiency, while tallying hyperedges gives intuition on how well the search is

guided.

           We know that the complete hypergraph space is bounded by O(|R| n³) and the CYK
algorithm exhaustively traverses every hyperedge. Both the A* and CTF algorithms efficiently
search a constant reduction of this space, as seen in Figure 8 (d), but the search-space growth is still

cubic. The complexity bound of processing one hyperedge, on the other hand, differs between

each algorithm. Figure 7 shows that the CYK algorithm will perform best when sentences

approach infinite length, but we are only interested in optimizing efficiency over sensible values

of n. If an algorithm can prune the search space and not increase the processing cost of one

hyperedge too much, the overall computational time of finding the optimal parse tree can

improve.

          Comparing CTF bottom-up and top-down hyperedge selection strategies, Figure 8 (a)

shows top-down to perform, overall, more efficiently than bottom-up, yet in Figure 8 (b) we see

that the number of hyperedges visited by a top-down strategy is greater. Although a top-down

strategy traverses more hypergraph space, the number of parse trees affected by each hyperedge

replacement is much lower, requiring less computational time for propagation. On average, a

top-down strategy requires propagation through half as many hyperedges as a bottom-up strategy.

This is because replacing hyperedges at the bottom of the graph requires many more propagations



[Plots not reproduced. Panels: (a) Computational Time Comparison (time in seconds vs. sentence
length, 5–25) and (b) Search Guidance Comparison (hyperedges traversed vs. sentence length,
5–25), each with series for A*, CYK, CTF Top-Down, and CTF Bottom-Up; panels (c) and (d)
show the same data on a log scale.]

             Figure 8: Comparison between CTF: top-down, CTF: bottom-up, A*, and CYK measuring
             (a) computational time, and (b) hyperedges traversed. Plots (c) and (d) are a log scale of (a)
             and (b), respectively.



than replacing hyperedges near the root.                                                                                      A bottom-up strategy suggests these expensive

hyperedges at the beginning of the search when less is known about the true distribution of parse

trees in Τ(S, Gparent). On the other hand, a top-down strategy spends more time splitting less

expensive hyperedges near the root, and only traverses to the bottom of the current maximum

likelihood parse tree for high probability candidates composed primarily of Gparent hyperedges.

                                          The analysis of total computational time in Figure 8 (a) shows the A* algorithm to

perform better than CTF or CYK on sentences of length 10 to 25. As sentence length increases, a

diminishing difference is seen between the A* and CYK algorithms because the overhead of



maintaining a sorted agenda of |Rparent| n³ hyperedges (required by A* but not CYK) offsets the

savings gained by exploring less search space. Both top-down and bottom-up implementations of

CTF have a computational advantage over A* and CYK when the sentence length (n) is very
small, as seen in Figure 8 (c), but quickly become inefficient as n grows. Since modifying the
probability of one hyperedge can theoretically affect (1/4)|Rparent| n² other hyperedges in the
hypergraph, probability propagation severely hinders performance on longer sentences.

        In Figure 8 (b), on the other hand, we clearly see the advantage CTF has over both the

CYK and A* algorithms when we compare hyperedges traversed in the Gparent space. This result

confirms our hypothesis that following hyperedges suggested by a simpler model to find the

optimal parse tree in a larger, more context-dependent grammar space, is very effective. Both

implementations of CTF prune the Gparent exhaustive search space by an order of magnitude. We

are optimistic that future research will lead to alternative probability propagation methods, and

searching over this significantly reduced space will produce an efficient parsing algorithm.



5 Conclusion

We have presented a new algorithm to efficiently traverse a large hypergraph space by using the

solutions of simpler models. To our knowledge, Coarse-to-Fine is the only exact inference

parsing algorithm to leverage these simpler model solutions. We have discussed the advantages

and disadvantages of CTF compared to standard parsing algorithms, as well as many possible

variations on CTF implementation. Although no computational efficiency is gained by parsing

long sentences with our current algorithm, we have shown how the CTF approach dramatically

reduces the number of hyperedges traversed.

        Many avenues of future research exist.        First, we believe a coarse grammar with

probabilities closer to Gwsj would lead to an even greater search-space reduction. High probability




parse trees in H(S, Gwsj) are more similar to the optimal Gparent parse tree than high probability
parse trees in H(S, Gcoarse). Preserving the probabilities of Gwsj, while maintaining a guarantee

to find the globally optimal solution, is a significant challenge. Second, we plan to extend CTF to

models with much larger grammars that include lexical information and probability smoothing.

The CYK and A* algorithms have been previously shown to exhibit significant performance

degradation when using these large grammars required for high-accuracy results. Since a Coarse-

to-Fine search is guided by full parse trees of simpler models, we suspect that our algorithm will

be relatively unaffected by this explosion in potential search space. Last, efficient probability

propagation through a hypergraph is a neglected problem in natural language parsing. Reduction

in the computational complexity of this task would not only improve the overall performance of

CTF, but many other parsing algorithms as well.



References

Allen, J. Natural Language Understanding. The Benjamin/Cummings Publishing Company, Inc,
        Redwood City, CA. 1995.

Charniak, Eugene, Sharon Goldwater, and Mark Johnson. Edge-Based Best-First Chart Parsing.
       In Proceedings of the Sixth Workshop on Very Large Corpora. 1998.

Chen, Stan and Joshua Goodman. An Empirical Study of Smoothing Techniques for Language
       Modeling. Harvard University Technical Report TR-10-98. 1998.

Collins, Michael. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis,
        University of Pennsylvania. 1999.

Hart, Peter, Nils Nilsson, and Bertram Raphael. A Formal Basis for the Heuristic Determination
        of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics,
        SSC-4(2):100–107. 1968.

Hollingshead, Kristy, Seeger Fisher, and Brian Roark. Comparing and Combining Finite-State
        and Context-Free Parsers. In Proceedings of Human Language Technologies / Empirical
        Methods for Natural Language Processing (HLT/EMNLP-2005). 2005.

Johnson, Mark. PCFG Models of Linguistic Tree Representations. Computational Linguistics,
       24:613-632. 1998.

Jurafsky, Daniel and James H. Martin. Speech and Language Processing. Prentice-Hall. 2000.


Kasami, T. An Efficient Recognition and Syntax Analysis Algorithm for Context-Free
       Languages.    Technical Report AFCRL-65-758, Air Force Cambridge Research
       Laboratory, Bedford, MA. 1965.

Klein, Dan and Chris Manning. Parsing and Hypergraphs. International Workshop on Parsing
        Technologies (IWPT-2001). 2001.

Klein, Dan and Chris Manning. Factored A* Search for Models over Sequences and Trees. In
        Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence
        (IJCAI-03). 2003.

Klein, Dan and Chris Manning. Accurate Unlexicalized Parsing. In Proceedings of Association
        of Computational Linguistics (ACL-03). 2003b.

Marcus, Mitchell, Beatrice Santorini and Mary Ann Marcinkiewicz. Building a Large Annotated
       Corpus of English: the Penn treebank. Computational Linguistics,19(2):313-330. 1993.

Ratnaparkhi, Adwait. Learning to Parse Natural Language with Maximum Entropy Models.
       Machine Learning, 34:151-175. 1999.

Roark, Brian. Probabilistic Top-Down Parsing and Language Modeling.             Computational
       Linguistics, 27:249-276. 2001.

Younger, D.H. Recognition and Parsing of Context-Free Languages in Time n³. Information and
       Control, 10:189-208. 1967.



