Two Paradigms for Natural-Language Processing
Robert C. Moore, Senior Researcher, Microsoft Research

Why is Microsoft interested in natural-language processing?
- Make computers/software easier to use.
- Long-term goal: just talk to your computer (the Star Trek scenario).

Some of Microsoft's near(er)-term goals in NLP
- Better search
  - Help find things on your computer.
  - Help find information on the Internet.
- Document summarization
  - Help deal with information overload.
- Machine translation

Why is Microsoft interested in machine translation?
- Internal: Microsoft is the world's largest user of translation services. MT can help Microsoft
  - translate documents that would otherwise not be translated, e.g., the PSS knowledge base (http://support.microsoft.com/default.aspx?scid=fh;ES-ES;faqtraduccion);
  - save money on human translation by providing machine translations as a starting point.
- External: sell similar software/services to other large companies.

Knowledge engineering vs. machine learning in NLP
- The biggest debate in NLP over the last 15 years has been knowledge engineering (KE) vs. machine learning (ML).
- The KE approach to NLP usually involves hand-coding of grammars and lexicons by linguistic experts.
- The ML approach to NLP usually involves training statistical models on large amounts of annotated or un-annotated text.

Central problems in KE-based NLP
- Parsing: determining the syntactic structure of a sentence.
- Interpretation: deriving a formal representation of the meaning of a sentence.
- Generation: deriving a sentence that expresses a given meaning representation.

Simple examples of KE-based NLP notations
- Phrase-structure grammar: S → Np Vp; Np → Sue; Np → Mary; Vp → V Np; V → sees
- Syntactic structure: [[Sue]Np [[sees]V [Mary]Np]Vp]S
- Meaning representation: [see(E), agt(E,sue), pat(E,mary)]

Unification Grammar: the pinnacle of the NLP KE paradigm
- Provides a uniform declarative formalism.
- Can be used to specify both syntactic and semantic analyses.
- A single grammar can be used for both parsing and generation.
- Supports a variety of efficient parsing and generation algorithms.

Background: question formation in English
- To construct a yes/no question:
  - Place the tensed auxiliary verb from the corresponding statement at the front of the clause:
    - John can see Mary. → Can John see Mary?
  - If there is no tensed auxiliary, add the appropriate form of the semantically empty auxiliary "do":
    - John sees Mary. → John does see Mary. → Does John see Mary?

Question formation in English (continued)
- To construct a who/what question:
  - For a non-subject who/what question, form the corresponding yes/no question:
    - Does John see Mary?
    then replace the noun phrase in the position being questioned with a question noun phrase, moved to the front of the clause:
    - Who does John see __?
  - For a subject who/what question, simply replace the subject with a question noun phrase:
    - Who sees Mary?

Example of a UG grammar rule involved in who/what questions

    S1/S_sem ---> [NP/NP_sem, S2/S_sem] :-
        S1::(cat=s, stype=whq, whgap_in=SL, whgap_out=SL, vgap=),
        NP::(cat=np, wh=y, whgap_in=, whgap_out=),
        S2::(cat=s, stype=ynq, whgap_in=NP/NP_sem, whgap_out=, vgap=).

Context-free backbone of rule
- The same rule; its CFG skeleton is the production S1 ---> [NP, S2].

Category subtype features
- The same rule; the subtype features are stype=whq, wh=y, and stype=ynq.

Features for tracking long-distance dependencies
- The same rule; the gap-threading features are whgap_in, whgap_out, and vgap.
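The unification machinery such rules rely on can be sketched in a few lines. The following toy (flat feature structures as Python dicts, logic variables as `Var` objects; the representation and names are invented for illustration and are not those of any actual UG system) shows how a shared variable like SL picks up a binding when two categories are unified:

```python
# Toy unification of flat feature structures, as used to match category
# annotations like (cat=s, stype=whq, whgap_in=SL, ...). Illustrative
# sketch only; real UG systems use richer recursive feature structures.

class Var:
    """A logic variable; its bindings live in the substitution dict."""
    pass

def walk(term, subst):
    # Follow variable bindings until a non-variable or an unbound variable.
    while isinstance(term, Var) and term in subst:
        term = subst[term]
    return term

def unify(t1, t2, subst):
    """Return an extended substitution if t1 and t2 unify, else None."""
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 is t2 or t1 == t2:
        return subst
    if isinstance(t1, Var):
        return {**subst, t1: t2}
    if isinstance(t2, Var):
        return {**subst, t2: t1}
    if isinstance(t1, dict) and isinstance(t2, dict):
        # Feature structures: features present in both must unify.
        for feat in set(t1) & set(t2):
            subst = unify(t1[feat], t2[feat], subst)
            if subst is None:
                return None
        return subst
    return None  # atom clash, e.g. cat=s vs. cat=np

# A rule category with a shared variable SL, matched against an input
# category whose whgap_in feature has the (made-up) value "none":
SL = Var()
rule_cat = {"cat": "s", "stype": "whq", "whgap_in": SL, "whgap_out": SL}
input_cat = {"cat": "s", "stype": "whq", "whgap_in": "none"}
s = unify(rule_cat, input_cat, {})
print(walk(SL, s))  # -> none  (SL is now bound, so whgap_out=none too)

# Incompatible categories fail to unify:
print(unify({"cat": "np"}, {"cat": "s"}, {}))  # -> None
```

This is exactly the replacement of "identity tests on nonterminals" by unification: two categories match if their features are compatible, and matching may bind variables that other parts of the rule then see.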
Semantic features
- The same rule; the semantic features are the S_sem and NP_sem values paired with each category.

Parsing algorithms for UG
- Virtually any CFG parsing algorithm can be applied to UG by replacing identity tests on nonterminals with unification of nonterminals.
- UG grammars are Turing-complete, so grammars have to be written appropriately for parsing to terminate.
- "Reasonable" grammars can generally be parsed in polynomial time, often O(n^3).

Generation algorithms for UG
- Since the grammar is purely declarative, generation can be done by "running the parser backwards."
- Efficient generation algorithms are more complicated than that, but still polynomial for "reasonable" grammars and "exact generation."
- Generation taking semantic equivalence into account is worst-case NP-hard, but can still be efficient in practice.

A Prolog-based UG system to play with
- Go to http://www.research.microsoft.com/research/downloads/
- Download "Unification Grammar Sentence Realization Algorithms," which includes
  - a simple bottom-up parser,
  - two sophisticated generation algorithms,
  - a small sample grammar and lexicon, and
  - a paraphrase demo that
    - parses sentences covered by the grammar into a semantic representation, and
    - generates all sentences that have that semantic representation according to the grammar.

A paraphrase example

    ?- paraphrase(s(_,'CAT'(),'CAT'(),'CAT'()),
                  [what,direction,was,the,cat,chased,by,the,dog,in]).
    in what direction did the dog __ chase the cat __
    in what direction was the cat __ chased __ by the dog
    in what direction was the cat __ chased by the dog __
    what direction did the dog __ chase the cat in __
    what direction was the cat __ chased in __ by the dog
    what direction was the cat __ chased by the dog in __
    generation_elapsed_seconds(0.0625)

Whatever happened to UG-based NLP?
- UG-based NLP is elegant, but lacks robustness for broad-coverage tasks.
- It is hard for human experts to incorporate enough detail for broad coverage, unless the grammar/lexicon are very permissive.
- Too many possible ambiguities arise as coverage increases.

How machine-learning-based NLP addresses these problems
- Details are learned by processing very large corpora.
- Ambiguities are resolved by choosing the most likely answer according to a statistical model.

Increase in stat/ML papers at ACL conferences over 15 years
[Chart: percentage of stat/ML papers at ACL conferences by year, 1985-2005; the share rises from near zero in the late 1980s to a large majority by 2003, with marked points for 1988, 1993, 1998, and 2003.]

Characteristics of the ML approach to NLP compared to the KE approach
- Model-driven rather than theory-driven.
- Uses shallower analyses and representations.
- More opportunistic and more diverse in the range of problems addressed.
- Often driven by the availability of training data.

Differences in approaches to stat/ML NLP
- Type of training data
  - Annotated: supervised training
  - Un-annotated: unsupervised training
- Type of model
  - Joint model, e.g., generative probabilistic
  - Conditional model, e.g., conditional maximum entropy
- Type of training
  - Joint: maximum-likelihood training
  - Conditional: discriminative training

Statistical parsing models
- Most are
  - generative probabilistic models,
  - trained on annotated data (e.g., the Penn Treebank),
  - using maximum-likelihood training.
- The simplest such model would be a probabilistic context-free grammar.

Probabilistic context-free grammars (PCFGs)
- A PCFG is a CFG that assigns to each production a conditional probability of the right-hand side given the left-hand side.
- The probability of a derivation is simply the product of the conditional probabilities of all the productions used in the derivation.
- PCFG-based parsing chooses, as the parse of a sentence, the derivation of the sentence having the highest probability.

Problems with simple generative probabilistic models
- Incorporating more features into the model splits the data, resulting in sparse-data problems.
- Joint maximum-likelihood training "wastes" probability mass predicting the given part of the input data.

A currently popular technique: conditional maximum-entropy models
- Basic models are of the form

      p(y | x) = (1 / Z(x)) · exp( Σ_i λ_i f_i(x, y) )

- Advantages:
  - Using more features does not require splitting the data.
  - Training maximizes conditional probability rather than joint probability.

Unsupervised learning in NLP
- Tries to infer unknown parameters and alignments of data to "hidden" states that best explain (i.e., assign the highest probability to) un-annotated NL data.
- The most common training method is Expectation Maximization (EM):
  1. Assume initial distributions for the joint probability of alignments of hidden states to observable data.
  2. Compute joint probabilities for the observed training data and all possible alignments.
  3. Re-estimate the probability distributions based on probabilistically weighted counts from the previous step.
  4. Iterate the last two steps until the desired convergence is reached.

Statistical machine translation
- A leading example of unsupervised learning in NLP.
- Models are trained from parallel bilingual, but otherwise un-annotated, corpora.
- Models usually assume that a sequence of words in one language is produced by a generative probabilistic process from a sequence of words in another language.

Structure of stat MT models
- Often a noisy-channel framework is assumed:

      p(e | f) ∝ p(e) · p(f | e)

- In basic models, each target word is assumed to be generated by one source word.

A simple model: IBM Model 1
- A sentence e produces a sentence f assuming
  - the length m of f is independent of the length l of e;
  - each word of f is generated by one word of e (including an empty word e_0);
  - each word in e is equally likely to generate the word at any position in f, independently of how any other words are generated.
- Mathematically:

      p(f | e) = (l + 1)^(-m) · Π_{j=1..m} Σ_{i=0..l} t(f_j | e_i)

More advanced models
- Most approaches
  - model how words are ordered (but crudely), and
  - model how many words a given word is likely to translate into.
- The best-performing approaches model word-sequence-to-word-sequence translations.
- Some initial work has been done on incorporating syntactic structure into the models.

Examples of machine-learned English/Italian word translations

    PROCESSOR → PROCESSORE      THAT → CHE              APPLICATIONS → APPLICAZIONI
    FUNCTIONALITY → FUNZIONALITÀ SPECIFY → SPECIFICARE  PHASE → FASE
    NODE → NODO                 SEGMENT → SEGMENTO      DATA → DATI
    CUBES → CUBI                SERVICE → SERVIZIO      VERIFICATION → VERIFICA
    THREE → TRE                 ALLOWS → CONSENTE       IF → SE
    TABLE → TABELLA             SITES → SITI            BETWEEN → TRA
    TARGET → DESTINAZIONE       DOMAINS → DOMINI        RESTORATION → RIPRISTINO
    MULTIPLE → PIÙ              ATTENDANT → SUPERVISORE NETWORKS → RETI
    GROUPS → GRUPPI             A → UN                  MESSAGING → MESSAGGISTICA
    PHYSICALLY → FISICAMENTE    MONITORING → MONITORAGGIO FUNCTIONS → FUNZIONI

How do the KE and ML approaches to NLP compare today?
- ML has become the dominant paradigm in NLP. ("Today's students know everything about maxent modeling, but not what a noun phrase is.")
- ML results are easier to transfer than KE results.
- We probably now have enough computing power and data to learn more by ML than a linguistic expert could encode in a lifetime.
- In almost every independent evaluation, ML methods outperform KE methods in practice.

Do we still need linguistics in computational linguistics?
- There are still many things we are not good at modeling statistically.
- For example, stat MT models based on single words or strings are good at getting the right words, but poor at getting them in the right order. Consider:
  - La profesora le gusta a tu hermano.
  - Your brother likes the teacher. (the correct translation)
  - The teacher likes your brother. (what the surface word order suggests)

Concluding thoughts
- If forced to choose between a pure ML approach and a pure KE approach, ML almost always wins.
- Statistical models still seem to need a lot more linguistic features for really high performance.
- A lot of KE is actually hidden in ML approaches, in the form of annotated data, which is usually expensive to obtain.
- The way forward may be to find methods for experts to give advice to otherwise unsupervised ML methods, which may be cheaper than annotating enough data to learn the content of the advice.
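As a concrete footnote to the statistical MT slides, the IBM Model 1 likelihood p(f|e) = (l+1)^(-m) · Π_j Σ_i t(f_j|e_i) and the EM loop described earlier can be sketched as follows. The two-sentence English/Spanish corpus and all names here are invented for illustration; real systems train on millions of sentence pairs:

```python
# Toy sketch of IBM Model 1: the translation table t(f|e), the sentence
# likelihood p(f|e), and EM re-estimation from un-annotated parallel text.
from collections import defaultdict

NULL = "<null>"  # the empty word e_0

# Invented two-sentence parallel corpus: (English e, Spanish f) pairs.
corpus = [
    (["the", "house"], ["la", "casa"]),
    (["the", "book"], ["el", "libro"]),
]

def sentence_prob(f_words, e_words, t):
    """p(f|e) = (l+1)^(-m) * prod_j sum_i t(f_j | e_i), with e_0 = NULL."""
    e_ext = [NULL] + e_words
    p = (1.0 / len(e_ext)) ** len(f_words)
    for f in f_words:
        p *= sum(t[(f, e)] for e in e_ext)
    return p

def em_step(corpus, t):
    """One EM iteration: expected alignment counts, then re-normalization."""
    count = defaultdict(float)   # expected count of (f, e) co-generations
    total = defaultdict(float)   # expected count of e generating anything
    for e_words, f_words in corpus:
        e_ext = [NULL] + e_words
        for f in f_words:
            norm = sum(t[(f, e)] for e in e_ext)
            for e in e_ext:
                frac = t[(f, e)] / norm   # E-step: P(f aligned to e)
                count[(f, e)] += frac
                total[e] += frac
    return {(f, e): count[(f, e)] / total[e] for (f, e) in count}  # M-step

# Initialize t(f|e) uniformly over co-occurring word pairs.
f_vocab = {f for _, fw in corpus for f in fw}
t = defaultdict(float)
for e_words, f_words in corpus:
    for e in [NULL] + e_words:
        for f in f_words:
            t[(f, e)] = 1.0 / len(f_vocab)

for _ in range(10):
    t = em_step(corpus, t)

# EM has learned that "house" explains "casa" better than "the" does,
# because "the" also co-occurs with "el" and "libro".
print(round(t[("casa", "house")], 3))  # -> 0.5
print(t[("casa", "house")] > t[("casa", "the")])  # -> True
```

Even this tiny example shows the key EM behavior the slides describe: no alignments are annotated, yet the probabilistically weighted counts concentrate mass on the word pairs that best explain the parallel data.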