
               Announcements
• Main CSE file server went down last night
   – Hand in your homework using 'submit_cse467' as soon
     as you can – no penalty if handed in today.
• Friday (10/6) each team must e-mail me:
   – who is on the team
   – ideas for project, with scale up/down plans
• Get together as a team to work when I am away:
   – 10/12–10/13: Thursday – Friday next week
   – 10/23–10/27: Monday – Friday
   Part of speech (POS) tagging
• Tagging of words in a corpus with the
  correct part of speech, drawn from some
  tagset.
• Early automatic POS taggers were rule-
  based.
• Stochastic POS taggers are reasonably
  accurate.
   Applications of POS tagging
• Parsing
  – recovering syntactic structure requires correct
     POS tags
  – partial parsing refers to any syntactic analysis
     which does not result in a full syntactic parse
     (e.g. finding noun phrases)
  – "parsing by chunks"
   Applications of POS tagging
• Information extraction
  – fill slots in predefined templates with
    information
  – full parse is not needed for this task, but partial
    parsing results (phrases) can be very helpful
  – information extraction tags with grammatical
    categories to find semantic categories
   Applications of POS tagging
• Question answering
  – system responds to a user question with a noun
    phrase
     • Who shot JR? (Kristen Shepard)
     • Where is Starbucks? (UB Commons)
     • What is good to eat here? (pizza)
    Background on POS tagging
• How hard is tagging?
  – most words have just a single tag: easy
  – some words have more than one possible tag:
    harder
  – many common words are ambiguous
• Brown corpus:
  – 10.4% of word types are ambiguous
  – 40%+ of word tokens are ambiguous
    Disambiguation approaches
• Rule-based
   – rely on large set of rules to disambiguate in context
   – rules are mostly hand-written
• Stochastic
   – rely on probabilities of words having certain tags in
     context
   – probabilities derived from training corpus
• Combined
   – transformation-based tagger: uses stochastic approach
     to determine initial tagging, then uses a rule-based
     approach to “clean up” the tags
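A minimal Python sketch of the "clean up" idea (the rule format and names here are invented for illustration, not Brill's actual tagger): start from an initial tagging and apply hand-written, context-triggered rules that rewrite tags.

    # Hypothetical transformation rules: (from_tag, to_tag, test on previous tag).
    rules = [
        ("NN", "VB", lambda prev: prev == "TO"),   # e.g. "to race": NN -> VB after TO
    ]

    def clean_up(tagged, rules):
        """Apply each rule once, left to right, to an initial (word, tag) list."""
        out = list(tagged)
        for i, (word, tag) in enumerate(out):
            prev = out[i - 1][1] if i > 0 else None
            for from_tag, to_tag, test in rules:
                if tag == from_tag and test(prev):
                    out[i] = (word, to_tag)
        return out

    initial = [("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
    print(clean_up(initial, rules))
    # [('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]
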
 Determining the appropriate tag for
         an untagged word
Two types of information can be used:
• syntagmatic information
  – consider the tags of other words in the
    surrounding context
  – tagger using such information correctly tagged
    approx. 77% of words
  – problem: content words (which are the ones
    most likely to be ambiguous) typically have
    many parts of speech, via productive rules (e.g.
    N → V)
 Determining the appropriate tag for
         an untagged word
• use information about word (e.g. usage
  probability)
  – baseline for tagger performance is given by a
    tagger that simply assigns the most common tag
    to ambiguous words
  – correctly tags 90% of words
• modern taggers use a variety of information
  sources
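The most-frequent-tag baseline mentioned above can be sketched in a few lines of Python (the miniature training list is invented for the example): count how often each word carries each tag, then always assign the most frequent one.

    from collections import Counter, defaultdict

    # Invented miniature training corpus of (word, tag) pairs.
    training = [
        ("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("long", "JJ"),
        ("to", "TO"), ("race", "VB"), ("the", "DT"), ("race", "NN"),
    ]

    # Count how often each word carries each tag.
    counts = defaultdict(Counter)
    for word, tag in training:
        counts[word][tag] += 1

    # Baseline: every word gets its single most frequent tag.
    most_frequent_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def baseline_tag(words, default="NN"):
        return [(w, most_frequent_tag.get(w, default)) for w in words]

    print(baseline_tag(["the", "race", "is", "long"]))
    # [('the', 'DT'), ('race', 'NN'), ('is', 'VBZ'), ('long', 'JJ')]
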
  Note about accuracy measures
• Modern taggers claim accuracy rates of around
  96% to 97%.
• This sounds impressive, but how good are they
  really?
• This is a measure of correctness at the level of
  individual words, not whole corpora.
• With a 96% accuracy, 1 word out of 25 is tagged
  incorrectly. This represents roughly one tagging
  error per sentence.
      Rule-based POS tagging
• Two-stage design:
  – first stage looks up individual words in a
    dictionary and tags words with sets of possible
    tags
  – second stage uses rules to disambiguate,
    resulting in singleton tag sets
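A minimal Python sketch of the two-stage design (the lexicon and the single rule are invented for illustration): stage one assigns each word its set of possible tags, stage two applies hand-written rules until the sets are singletons.

    # Stage 1: dictionary lookup assigns each word its set of possible tags.
    lexicon = {"the": {"DT"}, "to": {"TO"}, "race": {"NN", "VB"}}

    def stage1(words):
        return [(w, set(lexicon.get(w, {"NN"}))) for w in words]

    # Stage 2: hand-written rules shrink ambiguous sets to singletons.
    def stage2(tagged):
        for i, (word, tags) in enumerate(tagged):
            prev_tags = tagged[i - 1][1] if i > 0 else set()
            # Example rule: after an unambiguous TO, keep only the verb reading.
            if len(tags) > 1 and prev_tags == {"TO"} and "VB" in tags:
                tagged[i] = (word, {"VB"})
        return tagged

    print(stage2(stage1(["to", "race"])))
    # [('to', {'TO'}), ('race', {'VB'})]
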
       Stochastic POS tagging
• Stochastic taggers choose tags that result in
  the highest probability:
    P(word | tag) * P(tag | previous n tags)
• Stochastic taggers generally maximize
  probabilities for tag sequences for
  sentences.
         Bigram stochastic tagger
• This kind of tagger "…chooses tag t_i for word w_i that is
  most probable given the previous tag t_{i-1} and the current
  word w_i:
      t_i = argmax_j P(t_j | t_{i-1}, w_i)                        (8.2)"
                                                                  [page 303]
• Bayes' law says: P(T|W) = P(T) P(W|T) / P(W), so
      P(t_j | t_{i-1}, w_i) = P(t_j) P(t_{i-1}, w_i | t_j) / P(t_{i-1}, w_i)

• Since we take the argmax of this over the t_j, the denominator
  is the same for every tag, so the result is the same as maximizing:
      P(t_j) P(t_{i-1}, w_i | t_j)

• Rewriting (with the standard independence assumptions):
      t_i = argmax_j P(t_j | t_{i-1}) P(w_i | t_j)
           Example (page 304)
• What tag do we assign to race?
   – to/TO race/??
   – the/DT race/??
• In the first case, if we are choosing between NN
  and VB as tags for race, the equations are:
   – P(VB|TO)P(race|VB)
   – P(NN|TO)P(race|NN)
• The tagger will choose the tag which maximizes
  this probability.
                    Example
• For the first part – look at the tag sequence probability:
   – P(NN|TO) = 0.021
   – P(VB|TO) = 0.34
• For the second part – look at the lexical likelihood:
   – P(race|NN) = 0.00041
   – P(race|VB) = 0.00003
• Combining these:
   – P(VB|TO)P(race|VB) = 0.00001
   – P(NN|TO)P(race|NN) = 0.000007
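A tiny Python check of this comparison using the probabilities above (the exact products differ slightly from the slide's rounded figures, but the ranking is the same):

    # P(tag | TO) and P(race | tag) from the example above.
    p_tag_given_TO = {"NN": 0.021, "VB": 0.34}
    p_race_given_tag = {"NN": 0.00041, "VB": 0.00003}

    scores = {t: p_tag_given_TO[t] * p_race_given_tag[t] for t in ("NN", "VB")}
    best = max(scores, key=scores.get)

    print(scores)   # NN: ~0.0000086, VB: ~0.0000102
    print(best)     # VB  -> "to race" gets the verb tag
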
             English syntax
• What are some properties of English syntax
  we might want our formalism to capture?
• This depends on our goal:
  – processing written or spoken language?
  – modeling human behavior or not?
• Context-free grammar formalism
Things a grammar should capture
• As we have mentioned repeatedly, human
  language is an amazingly complex system
  of communication. Some properties of
  language which a (computational) grammar
  should reflect include:
  – Constituency
  – Agreement
  – Subcategorization / selectional restrictions
               Constituency
• Phrases are syntactic equivalence classes:
  – they can appear in the same contexts
  – they are not semantic equivalence classes: they
    can clearly mean different things
• Ex (noun phrases)
  – Clifford the big red dog
  – the man from the city
  – a lovable little kitten
            Constituency tests

• Can appear before a verb:
   – a lovable little kitten eats food
   – the man from the city arrived yesterday
• Other arbitrary word groupings cannot:
   – *from the arrived yesterday

A string of words which is starred, like the one above, is
  considered ill-formed. Various gradations can occur,
  such as '?', '?*', '*', '**'. Judgements are subjective.
      More tests of constituency
• They also function as a unit with respect to syntactic
  processes:
   – On September seventeenth, I'd like to fly from Atlanta to Denver.
   – I'd like to fly on September seventeenth from Atlanta to Denver.
   – I'd like to fly from Atlanta to Denver on September seventeenth.
• Other groupings of words don't behave the same:
   – * On September, I'd like to fly seventeenth from Atlanta to
     Denver.
   – * On I'd like to fly September seventeenth from Atlanta to Denver.
   – * I'd like to fly on September from Atlanta to Denver seventeenth.
   – * I'd like to fly on from Atlanta to Denver September seventeenth.
                  Agreement
• English has subject-verb agreement:
  –   The cats chase that dog all day long.
  –   * The cats chases that dog all day long.
  –   The dog is chased by the cats all day long.
  –   * The dog are chased by the cats all day long.
• Many languages exhibit much more
  agreement than English.
              Subcategorization
• Verbs (predicates) require arguments of
  different types:
  –   (none)      The mirage disappears daily.
  –   NP          I prefer ice cream.
  –   NP PP       I leave Boston in the morning.
  –   NP NP       I gave Mary a ticket.
  –   PP          I leave on Thursday.
              Alternations

• want can take either an NP or an infinitival
  VP:
  – I want a flight …
  – I want to fly …
• find cannot take an infinitival VP:
  – I found a flight …
  – * I found to fly …
     How can we encode rules of
            language?
• There are many grammar formalisms. Most
  are variations on context-free grammars.
• Context-free grammars are of interest
  because they
  – have well-known properties (e.g. can be parsed
    in polynomial time)
  – can capture many aspects of language
       Basic context-free grammar
               formalism
• A CFG is a 4-tuple (N, Σ, P, S) where
   – N is a set of non-terminal symbols
   – Σ is a set of terminal symbols
   – P is a set of productions, P ⊆ N × (Σ ∪ N)*
   – S is a start symbol
• and Σ ∩ N = ∅
• Each production is of the form A → α, where A is
  a non-terminal and α is drawn from (Σ ∪ N)*
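One way to make the definition concrete is to write a toy CFG down as Python data (the grammar itself is invented, not from the slides) and check that it satisfies the conditions above:

    # N: non-terminals, Sigma: terminals, P: productions, S: start symbol.
    N = {"S", "NP", "VP", "Det", "Noun", "Verb"}
    Sigma = {"the", "dog", "cats", "chase"}
    P = {
        "S":    [["NP", "VP"]],
        "NP":   [["Det", "Noun"]],
        "VP":   [["Verb", "NP"]],
        "Det":  [["the"]],
        "Noun": [["dog"], ["cats"]],
        "Verb": [["chase"]],
    }
    S = "S"

    # Sanity checks mirroring the definition above.
    assert not (N & Sigma)                       # Sigma and N are disjoint
    assert all(lhs in N for lhs in P)            # each production rewrites a non-terminal
    assert all(sym in N | Sigma                  # right-hand sides drawn from (Sigma U N)*
               for alternatives in P.values()
               for rhs in alternatives
               for sym in rhs)
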
  Problems with basic formalism
• Consider a grammar rule like
      S → Aux NP VP
• To handle agreement between subject and verb,
  we could replace that rule with two new ones:
      S → 3SgAux 3SgNP VP
      S → Non3SgAux Non3SgNP VP
• Need rules like the following too:
      3SgAux → does | has | can | …
      Non3SgAux → do | have | can | …
        Extensions to formalism
• Feature structures and unification
   – feature structures are of the form
       [ f1=v1, f2=v2, … , fn=vn ]
   – feature structures can be partially specified:
       (a) [ Number = Sg, Person = 3, Category = NP ]
       (b) [ Number = Sg, Category = NP ]
       (c) [ Person = 3, Category = NP ]
   – (b) unified with (c) is (a)
• Feature structures can be used to express feature-
  value constraints across constituents without rule
  multiplication.
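For flat (non-nested) feature structures like (a), (b), (c) above, unification can be sketched in a few lines of Python, with dicts standing in for feature structures:

    def unify(fs1, fs2):
        """Return the merged feature structure, or None if any shared feature clashes."""
        for f in fs1.keys() & fs2.keys():
            if fs1[f] != fs2[f]:
                return None
        return {**fs1, **fs2}

    b = {"Number": "Sg", "Category": "NP"}
    c = {"Person": 3, "Category": "NP"}
    print(unify(b, c))        # {'Number': 'Sg', 'Category': 'NP', 'Person': 3} -- i.e. (a)

    plural = {"Number": "Pl", "Category": "NP"}
    print(unify(b, plural))   # None: Number = Sg clashes with Number = Pl
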
           Other formalisms
• More powerful: tree adjoining grammars
  – trees, not rules, are fundamental
  – trees are either initial or auxiliary
  – two operations: substitution and adjunction
• Less powerful: finite-state grammars
  – cannot handle general recursion
  – can be sufficient to handle real-world data
  – recursion spelled out explicitly to some level
    (large grammar)
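As an illustration of spelling recursion out to a fixed level, here is a sketch using a regular expression over POS-tag strings (a toy pattern, invented for the example): a CFG with NP → NP PP and PP → P NP allows unbounded nesting, while a finite-state pattern must enumerate each additional PP explicitly, which is why such grammars get large.

    import re

    # Toy noun-phrase pattern over space-separated tags, with PP modifiers
    # written out to a fixed depth of two.
    basic_np = r"(DT )?(JJ )*NN"
    pp = rf"IN {basic_np}"
    np_up_to_2_pps = rf"{basic_np}( {pp}){{0,2}}"

    print(bool(re.fullmatch(np_up_to_2_pps, "DT JJ NN IN DT NN")))                  # True
    print(bool(re.fullmatch(np_up_to_2_pps, "DT NN IN DT NN IN DT NN IN DT NN")))   # False (3 PPs)
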

								