Two Paradigms for Natural-Language Processing

Robert C. Moore
Senior Researcher
Microsoft Research
Why is Microsoft interested in
natural-language processing?
   Make computers/software easier to use.
   Long-term goal: just talk to your
    computer (Star Trek scenario).
Some of Microsoft’s near(er)
term goals in NLP
   Better search
       Help find things on your computer.
       Help find information on the Internet.
   Document summarization
       Help deal with information overload.
   Machine translation
Why is Microsoft interested in
machine translation?
   Internal: Microsoft is the world’s largest user
    of translation services. MT can help Microsoft
       Translate documents that would otherwise not be
        translated – e.g., PSS knowledge base
        (http://support.microsoft.com/default.aspx?scid=fh;ES-ES;faqtraduccion).
       Save money on human translation by providing
        machine translations as a starting point.
   External: Sell similar software/services to
    other large companies.
Knowledge engineering vs.
machine learning in NLP
   Biggest debate over the last 15 years in NLP
    has been knowledge engineering vs. machine
    learning.
   KE approach to NLP usually involves hand-
    coding of grammars and lexicons by linguistic
    experts.
   ML approach to NLP usually involves training
    statistical models on large amounts of
    annotated or un-annotated text.
Central problems in KE-based
NLP
   Parsing – determining the syntactic
    structure of a sentence.
   Interpretation – deriving formal
    representation of the meaning of a
    sentence.
   Generation – deriving a sentence that
    expresses a given meaning
    representation.
Simple examples of KE-based
NLP notations
   Phrase-structure grammar:
       S → Np Vp, Np → Sue, Np → Mary
       Vp → V Np, V → sees
   Syntactic structure:
       [[Sue]Np [[sees]V [Mary]Np]Vp]S
   Meaning representation:
       [see(E), agt(E,sue), pat(E,mary)]
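As a concrete illustration (not part of the original slides; the grammar encoding and function names are my own), here is a minimal Python sketch that parses "Sue sees Mary" with the toy grammar above using a naive CKY-style algorithm and reproduces the bracketed syntactic structure:

# Minimal sketch: the toy phrase-structure grammar above, parsed with a
# naive CKY-style algorithm.  Grammar encoding and names are illustrative.

BINARY = {("Np", "Vp"): "S", ("V", "Np"): "Vp"}     # parent <- (left, right)
LEXICON = {"Sue": "Np", "Mary": "Np", "sees": "V"}  # category <- word

def parse(words):
    n = len(words)
    # chart[i][j] holds (category, bracketed_tree) analyses of words[i:j]
    chart = [[[] for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1].append((LEXICON[w], f"[{w}]{LEXICON[w]}"))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lcat, ltree in chart[i][k]:
                    for rcat, rtree in chart[k][j]:
                        parent = BINARY.get((lcat, rcat))
                        if parent:
                            chart[i][j].append((parent, f"[{ltree} {rtree}]{parent}"))
    return [tree for cat, tree in chart[0][n] if cat == "S"]

print(parse(["Sue", "sees", "Mary"]))
# ['[[Sue]Np [[sees]V [Mary]Np]Vp]S']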
Unification Grammar: the
pinnacle of the NLP KE paradigm
   Provides a uniform declarative
    formalism.
   Can be used to specify both syntactic
    and semantic analyses.
   A single grammar can be used for both
    parsing and generation.
   Supports a variety of efficient parsing
    and generation algorithms.
Background: Question
formation in English
To construct a yes/no question:
 Place the tensed auxiliary verb from the
  corresponding statement at the front of the clause.
       John can see Mary.
       Can John     see Mary?
   If there is no tensed auxiliary, add the appropriate
    form of the semantically empty auxiliary do.
       John sees Mary.
       John does see Mary.
       Does John     see Mary?
Question formation in English
(continued)
To construct a who/what question:
 For a non-subject who/what question, form a
  corresponding yes/no question.
       Does John   see Mary?
   Replace the noun phrase in the position being
    questioned with a question noun phrase and move to
    the front of the clause.
       Who does John    see    ?
   For a subject who/what question, simply replace the
    subject with a question noun phrase.
       Who sees Mary?
Example of a UG grammar rule
involved in who/what questions
S1/S_sem ---> [NP/NP_sem, S2/S_sem] :-
    S1::(cat=s, stype=whq, whgap_in=SL,
          whgap_out=SL, vgap=[]),
    NP::(cat=np, wh=y, whgap_in=[],
          whgap_out=[]),
    S2::(cat=s, stype=ynq,
          whgap_in=NP/NP_sem,
          whgap_out=[], vgap=[]).
Context-free backbone of rule
S1/S_sem ---> [NP/NP_sem, S2/S_sem] :-
   S1::(cat=s, stype=whq, whgap_in=SL,
        whgap_out=SL, vgap=[]),
   NP::(cat=np, wh=y, whgap_in=[],
        whgap_out=[]),
   S2::(cat=s, stype=ynq,
        whgap_in=NP/NP_sem,
        whgap_out=[], vgap=[]).
Category subtype features
S1/S_sem ---> [NP/NP_sem, S2/S_sem] :-
    S1::(cat=s, stype=whq, whgap_in=SL,
          whgap_out=SL, vgap=[]),
    NP::(cat=np, wh=y, whgap_in=[],
          whgap_out=[]),
    S2::(cat=s, stype=ynq,
          whgap_in=NP/NP_sem,
          whgap_out=[], vgap=[]).
Features for tracking long
distance dependencies
S1/S_sem ---> [NP/NP_sem, S2/S_sem] :-
    S1::(cat=s, stype=whq, whgap_in=SL,
          whgap_out=SL, vgap=[]),
    NP::(cat=np, wh=y, whgap_in=[],
          whgap_out=[]),
    S2::(cat=s, stype=ynq,
         whgap_in=NP/NP_sem,
         whgap_out=[], vgap=[] ).
Semantic features
S1/S_sem ---> [NP/NP_sem, S2/S_sem] :-
    S1::(cat=s, stype=whq, whgap_in=SL,
          whgap_out=SL, vgap=[]),
    NP::(cat=np, wh=y, whgap_in=[],
          whgap_out=[]),
    S2::(cat=s, stype=ynq,
          whgap_in=NP/NP_sem,
          whgap_out=[], vgap=[]).
Parsing algorithms for UG
   Virtually any CFG parsing algorithm can be
    applied to UG by replacing identity tests on
    nonterminals with unification of nonterminals.
   UG grammars are Turing complete, so
    grammars have to be written appropriately
    for parsing to terminate.
   “Reasonable” grammars generally can be
    parsed in polynomial time, often n³.
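To make the "identity test vs. unification" point concrete, here is a minimal Python sketch (my own, not part of the UG system described below) of unifying two flat feature bundles like those in the rule above; a shared unbound variable plays the role of SL:

# Minimal sketch: unification of flat feature bundles (cat, stype, ...),
# standing in for the identity test on atomic CFG nonterminals.

class Var:
    """An unbound logic variable, e.g. the shared SL in the rule above."""

def walk(v, subst):
    # Follow variable bindings to their current value.
    while isinstance(v, Var) and v in subst:
        v = subst[v]
    return v

def unify(a, b, subst):
    """Unify two feature dicts; return an extended substitution or None."""
    subst = dict(subst)
    for feat in set(a) & set(b):   # features missing from one side stay unconstrained
        x, y = walk(a[feat], subst), walk(b[feat], subst)
        if x is y:
            continue
        if isinstance(x, Var):
            subst[x] = y
        elif isinstance(y, Var):
            subst[y] = x
        elif x != y:
            return None            # feature clash: unification fails
    return subst

SL = Var()
s1 = {"cat": "s", "stype": "whq", "whgap_in": SL, "whgap_out": SL, "vgap": ()}
print(unify(s1, {"cat": "s", "whgap_in": ()}, {}) is not None)   # True: SL gets bound to ()
print(unify(s1, {"cat": "s", "stype": "ynq"}, {}) is not None)   # False: whq clashes with ynq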
Generation algorithms for UG
   Since grammar is purely declarative,
    generation can be done by “running the
    parser backwards.”
   Efficient generation algorithms are more
    complicated than that, but still polynomial for
    “reasonable” grammars and “exact
    generation.”
   Generation taking into account semantic
    equivalence is worst-case NP-hard, but still
    can be efficient in practice.
A Prolog-based UG system to
play with
   Go to
    http://www.research.microsoft.com/research/downloads/
   Download “Unification Grammar Sentence Realization
    Algorithms,” which includes
       A simple bottom-up parser,
       Two sophisticated generation algorithms,
       A small sample grammar and lexicon,
       A paraphrase demo that
            Parses sentences covered by the grammar into a semantic
             representation.
            Generates all sentences that have that semantic representation
             according to the grammar.
A paraphrase example
?- paraphrase(s(_,'CAT'([]),'CAT'([]),'CAT'([])),
              [what,direction,was,the,cat,chased,by,the,dog,in]).

in what direction did the dog __ chase the cat __

in what direction was the cat __ chased __ by the dog

in what direction was the cat __ chased by the dog __

what direction did the dog __ chase the cat in __

what direction was the cat __ chased in __ by the dog

what direction was the cat __ chased by the dog in __

generation_elapsed_seconds(0.0625)
Whatever happened to UG-
based NLP?
   UG-based NLP is elegant, but lacks
    robustness for broad-coverage tasks.
   Hard for human experts to incorporate
    enough details for broad coverage,
    unless grammar/lexicon are very
    permissive.
   Too many possible ambiguities arise as
    coverage increases.
How machine-learning-based
NLP addresses these problems
   Details are learned by processing very
    large corpora.
   Ambiguities are resolved by choosing
    most likely answer according to a
    statistical model.
Increase in stat/ML papers at
ACL conferences over 15 years
   [Figure: line chart of the percentage of statistical/ML papers at ACL
    conferences by year, 1985–2005 ("Percent Stat/ML" vs. "Year"); roughly
    near zero in 1988, about 30% in 1993, about 50% in 1998, and over 80%
    by 2003.]
Characteristics of ML approach to
NLP compared to KE approach
   Model-driven rather than theory-driven.
   Uses shallower analyses and
    representations.
   More opportunistic and more diverse in
    range of problems addressed.
   Often driven by availability of training
    data.
Differences in approaches to
stat/ML NLP
   Type of training data
       Annotated – supervised training
       Un-annotated – unsupervised training
   Type of model
       Joint model – e.g., generative probabilistic
       Conditional model – e.g., conditional maximum
        entropy
   Type of training
       Joint – maximum likelihood training
       Conditional – discriminative training
Statistical parsing models
   Most are:
       Generative probabilistic models,
       Trained on annotated data (e.g., Penn
        Treebank),
       Using maximum likelihood training.
   The simplest such model would be a
    probabilistic context-free grammar.
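As a small illustration (toy counts, my own), maximum likelihood training of such a model reduces to relative-frequency estimation of production probabilities:

# Minimal sketch: maximum likelihood estimation of PCFG rule probabilities
# from counted treebank productions, i.e. relative frequency given the
# left-hand side.  The toy production list is invented.
from collections import Counter, defaultdict

productions = [                      # (left-hand side, right-hand side)
    ("S", ("Np", "Vp")), ("S", ("Np", "Vp")),
    ("Vp", ("V", "Np")), ("Vp", ("V",)),
    ("Np", ("Sue",)), ("Np", ("Mary",)), ("Np", ("Mary",)),
]

rule_counts = Counter(productions)
lhs_counts = Counter(lhs for lhs, _ in productions)

pcfg = defaultdict(dict)
for (lhs, rhs), count in rule_counts.items():
    pcfg[lhs][rhs] = count / lhs_counts[lhs]   # P(rhs | lhs)

print(dict(pcfg["Vp"]))   # {('V', 'Np'): 0.5, ('V',): 0.5}
print(dict(pcfg["Np"]))   # {('Sue',): 0.333..., ('Mary',): 0.666...}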
Probabilistic context-free
grammars (PCFGs)
   A PCFG is a CFG that assigns to each
    production a conditional probability of the
    right-hand side given the left-hand side.
   The probability of a derivation is simply the
    product of the conditional probabilities of all
    the productions used in the derivation.
   PCFG-based parsing chooses, as the parse of
    a sentence, the derivation of the sentence
    having the highest probability.
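To make this concrete, a minimal sketch (with invented rule probabilities) of scoring one derivation of "Sue sees Mary" as the product of its rule probabilities:

import math

# Minimal sketch: the probability of a derivation under a PCFG is the
# product of the conditional probabilities of the productions it uses
# (the probabilities here are invented).
rule_prob = {
    ("S", ("Np", "Vp")): 1.0,
    ("Np", ("Sue",)): 0.5,
    ("Np", ("Mary",)): 0.5,
    ("Vp", ("V", "Np")): 1.0,
    ("V", ("sees",)): 1.0,
}

derivation = [                       # productions used for "Sue sees Mary"
    ("S", ("Np", "Vp")), ("Np", ("Sue",)),
    ("Vp", ("V", "Np")), ("V", ("sees",)), ("Np", ("Mary",)),
]

print(math.prod(rule_prob[p] for p in derivation))   # 0.25 = 1.0 * 0.5 * 1.0 * 1.0 * 0.5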
Problems with simple generative
probabilistic models
   Incorporating more features into the
    model splits data, resulting in sparse
    data problems.
   Joint maximum likelihood training
    “wastes” probability mass predicting the
    given part of the input data.
A currently popular technique:
conditional maximum entropy models
   Basic models are of the form:
        p(y | x) = (1 / Z(x)) · exp( Σ_i λ_i f_i(x, y) )

   Advantages:
       Using more features does not require
        splitting data.
       Training maximizes conditional probability
        rather than joint probability.
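A tiny numeric sketch (features, weights, and labels all invented for illustration) of evaluating p(y | x) under the model form shown above:

import math

# Minimal sketch of the conditional maxent form above:
#   p(y | x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x)
# with two invented binary features and invented weights.

def features(x, y):
    return [
        1.0 if x[0].isupper() and y == "NAME" else 0.0,      # f_1
        1.0 if x.endswith("ly") and y == "ADVERB" else 0.0,  # f_2
    ]

weights = [2.0, 1.5]                  # the lambda_i
labels = ["NAME", "ADVERB", "OTHER"]

def p(y, x):
    def score(label):
        return math.exp(sum(w * f for w, f in zip(weights, features(x, label))))
    z = sum(score(label) for label in labels)    # Z(x): normalizes over all labels
    return score(y) / z

print(p("Mary", "NAME"))        # high: the capitalization feature fires
print(p("quickly", "ADVERB"))   # high: the "-ly" feature fires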
Unsupervised learning in NLP
   Tries to infer unknown parameters and alignments of
    data to “hidden” states that best explain (i.e., assign
    highest probability to) un-annotated NL data.
   Most common training method is Expectation
    Maximization (EM):
       Assume initial distributions for joint probability of alignments
        of hidden states to observable data.
       Compute joint probabilities for observed training data and all
        possible alignments.
       Re-estimate probability distributions based on
        probabilistically weighted counts from previous step.
       Iterate last two steps until desired convergence is reached.
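A minimal sketch of this EM recipe on a classic toy problem (two coins of unknown bias, where the coin used in each session of flips is the hidden state; the data, initial guesses, and iteration count are all illustrative):

# Minimal sketch of the EM recipe above: two coins with unknown biases;
# each session of flips was produced by one coin, but which coin is hidden.

data = [(9, 10), (8, 10), (2, 10), (1, 10), (7, 10)]   # (heads, flips) per session
theta = [0.6, 0.5]                                     # initial bias guesses

for _ in range(20):
    weighted_heads = [0.0, 0.0]
    weighted_flips = [0.0, 0.0]
    for heads, flips in data:
        # E-step: posterior probability that each coin produced this session
        # (uniform prior over coins; the binomial coefficient cancels out).
        like = [t ** heads * (1 - t) ** (flips - heads) for t in theta]
        post = [l / sum(like) for l in like]
        # Accumulate probabilistically weighted counts.
        for c in (0, 1):
            weighted_heads[c] += post[c] * heads
            weighted_flips[c] += post[c] * flips
    # M-step: re-estimate each coin's bias from its weighted counts.
    theta = [weighted_heads[c] / weighted_flips[c] for c in (0, 1)]

print(theta)   # converges to roughly [0.80, 0.15]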
Statistical machine translation
   A leading example of unsupervised learning in
    NLP.
   Models are trained from parallel bilingual, but
    otherwise un-annotated corpora.
   Models usually assume a sequence of words
    in one language is produced by a generative
    probabilistic process from a sequence of
    words in another language.
Structure of stat MT models
   Often a noisy-channel framework is
    assumed:
           p(e | f) ∝ p(e) · p(f | e)
   In basic models, each target word is
    assumed to be generated by one source
    word.
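A toy sketch (all probabilities invented) of how the noisy-channel factorization ranks candidate translations e of a fixed observed sentence f:

# Minimal sketch: noisy-channel ranking of candidate translations e of a
# fixed observed sentence f, using p(e | f) proportional to p(e) * p(f | e).
# Both probability tables are invented toy numbers.

candidates = ["your brother likes the teacher", "the teacher likes your brother"]

lm = {   # toy language model p(e): fluency of the English candidate
    "your brother likes the teacher": 0.005,
    "the teacher likes your brother": 0.004,
}
tm = {   # toy translation model p(f | e): how well e explains the observed f
    "your brother likes the teacher": 0.012,
    "the teacher likes your brother": 0.010,
}

for e in sorted(candidates, key=lambda e: lm[e] * tm[e], reverse=True):
    print(f"{lm[e] * tm[e]:.6f}  {e}")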
A simple model: IBM Model 1
   A sentence e produces a sentence f assuming
       The length m of f is independent of the length l of e.
       Each word of f is generated by one word of e
        (including an empty word e0).
       Each word in e is equally likely to generate the word
        at any position in f, independently of how any other
        words are generated.
   Mathematically:

               p(f | e) = ε / (l + 1)^m · Π_{j=1..m} Σ_{i=0..l} t(f_j | e_i)
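A small sketch (with an invented translation table) of evaluating the Model 1 formula above for a toy sentence pair:

# Minimal sketch: IBM Model 1 likelihood p(f | e) for a toy sentence pair,
# following the formula above.  The translation table t(f | e) is invented.

EPSILON = 1.0                          # constant sentence-length term

t = {                                  # t(f_j | e_i)
    ("NULL", "il"): 0.1, ("the", "il"): 0.7, ("service", "il"): 0.05,
    ("NULL", "servizio"): 0.05, ("the", "servizio"): 0.1, ("service", "servizio"): 0.6,
}

def model1_prob(f_words, e_words):
    e_words = ["NULL"] + e_words       # e_0: the empty word
    l, m = len(e_words) - 1, len(f_words)
    prob = EPSILON / (l + 1) ** m
    for f in f_words:                  # product over target positions j
        prob *= sum(t.get((e, f), 0.0) for e in e_words)   # sum over source positions i
    return prob

print(model1_prob(["il", "servizio"], ["the", "service"]))   # about 0.071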
More advanced models
   Most approaches
       Model how words are ordered (but crudely).
       Model how many words a given word is likely to
        translate into.
   Best performing approaches model word-
    sequence-to-word-sequence translations.
   Some initial work has been done on
    incorporating syntactic structure into models.
Examples of machine learned
English/Italian word translations
   PROCESSOR      PROCESSORE         THAT          CHE
   APPLICATIONS   APPLICAZIONI       FUNCTIONALITY FUNZIONALITÀ
   SPECIFY        SPECIFICARE        PHASE         FASE
   NODE           NODO               SEGMENT       SEGMENTO
   DATA           DATI               CUBES         CUBI
   SERVICE        SERVIZIO           VERIFICATION VERIFICA
   THREE          TRE                ALLOWS        CONSENTE
   IF             SE                 TABLE         TABELLA
   SITES          SITI               BETWEEN       TRA
   TARGET         DESTINAZIONE       DOMAINS       DOMINI
   RESTORATION    RIPRISTINO         MULTIPLE      PIÙ
   ATTENDANT      SUPERVISORE        NETWORKS      RETI
   GROUPS         GRUPPI             A             UN
   MESSAGING      MESSAGGISTICA      PHYSICALLY    FISICAMENTE
   MONITORING     MONITORAGGIO       FUNCTIONS     FUNZIONI
How do KE and ML approaches to
NLP compare today?
   ML has become the dominant paradigm in
    NLP. (“Today’s students know everything
    about maxent modeling, but not what a noun
    phrase is.”)
   ML results are easier to transfer than KE
    results.
   We probably now have enough computer
    power and data to learn more by ML than a
    linguistic expert could encode in a lifetime.
   In almost every independent evaluation, ML
    methods outperform KE methods in practice.
Do we still need linguistics in
computational linguistics?
   There are still many things we are not good
    at modeling statistically.
   For example, stat MT models based on single
    words or word strings are good at getting the
    right words, but poor at getting them in the
    right order.
   Consider:
       La profesora le gusta a tu hermano.
       Your brother likes the teacher. (the correct reading)
       The teacher likes your brother. (the word-order-preserving, incorrect reading)
Concluding thoughts
   If forced to choose between a pure ML approach and
    a pure KE approach, ML almost always wins.
   Statistical models still seem to need a lot more
    linguistic features for really high performance.
   A lot of KE is actually hidden in ML approaches, in
    the form of annotated data, which is usually
    expensive to obtain.
   The way forward may be to find methods for experts
    to give advice to otherwise unsupervised ML
    methods, which may be cheaper than annotating
    enough data to learn the content of the advice.