					Computational Linguistics

               James Pustejovsky
               Brandeis University




  Boston Computational Linguistics Olympiad Team
                    Fall, 2007
 What is Computational Linguistics?

Computational Linguistics is the computational analysis of natural
languages.
   Process information contained in natural language.
Can machines understand human language?
   Define "understand"
   Understanding is the ultimate goal. However, one doesn't
   need to fully understand to be useful.
         Goals of this Lecture
Learn about the problems and possibilities of natural
language analysis:
   What are the major issues?
   What are the major solutions?

At the end you should:
   Agree that language is subtle and interesting!
   Know about some of the algorithms.
   Know how difficult it can be!
           It's 2007,
but we're not anywhere close
    to realizing the dream
  (or nightmare …) of 2001
   Dave Bowman: "Open the pod bay doors."

   Dave Bowman: "Open the pod bay doors, please, HAL."




HAL 9000: "I'm sorry Dave. I'm afraid I can't do that."
        Why is NLP difficult?

Computers are not brains
  There is evidence that much of language
  understanding is built into the human brain
Computers do not socialize
  Much of language is about communicating with people
Key problems:
  Representation of meaning
  Language presupposes knowledge about the world
  Language only reflects the surface of meaning
  Language presupposes communication between people
            Hidden Structure

English plural pronunciation
   Toy      + s  →  toyz         ;   add   z
   Book     + s  →  books        ;   add   s
   Church   + s  →  churchiz     ;   add   iz
   Box      + s  →  boxiz        ;   add   iz
   Sheep    + s  →  sheep        ;   add   nothing


What about new words?
   Bach     + s  →  Bachs        ;   why not Bachiz?
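Below is a minimal sketch (not from the slides) of the three-way plural rule, approximating the sound classes by spelling; memorized exceptions like "sheep" would override the rule, as the later slides on rules and memorization discuss.

```python
def plural(word):
    """Rough, spelling-based approximation of the English plural rule.
    Real speakers use the final *sound*, and memorized irregulars
    (sheep, children, ...) take precedence over the rule."""
    if word.endswith(("s", "x", "z", "ch", "sh")):   # sibilant-like endings
        return word + "iz"       # church -> churchiz, box -> boxiz
    if word.endswith(("p", "t", "k", "f")):          # voiceless endings
        return word + "s"        # book -> books
    return word + "z"            # toy -> toyz

for w in ["toy", "book", "church", "box"]:
    print(w, "->", plural(w))
```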
         Language subtleties
Adjective order and placement
  A   big black dog
  A   big black scary dog
  A   big scary dog
  A   scary big dog
  A   black big dog


Antonyms
  Which sizes go together?
    – Big and little
    – Big and small
    – Large and small
    – Large and little
   World Knowledge is subtle

He arrived at the lecture.
He chuckled at the lecture.

He arrived drunk.
He chuckled drunk.

He chuckled his way through the lecture.
He arrived his way through the lecture.
     Words are ambiguous
   (have multiple meanings)

I know that.

I know that block.

I know that blocks the sun.

I know that block blocks the sun.
           Headline Ambiguity
Iraqi Head Seeks Arms
Juvenile Court to Try Shooting Defendant
Teacher Strikes Idle Kids
Kids Make Nutritious Snacks
British Left Waffles on Falkland Islands
Red Tape Holds Up New Bridges
Bush Wins on Budget, but More Lies Ahead
Hospitals are Sued by 7 Foot Doctors
Ban on nude dancing on Governor's desk
Local high school dropouts cut in half
   The Role of Memorization

Children learn words quickly
  As many as 9 words/day
  Often only need one exposure to associate meaning
  with word
    – Can make mistakes, e.g., overgeneralization
        “I goed to the store.”
  Exactly how they do this is still under study
    The Role of Memorization

Dogs can do word association too!
  Rico, a border collie in Germany
  Knows the names of each of 100 toys
  Can retrieve items called out to him with over 90%
  accuracy.
  Can also learn and remember the names of
  unfamiliar toys after just one encounter, putting
  him on a par with a three-year-old child.




                      http://www.nature.com/news/2004/040607/pf/040607-8_pf.html
But there is too much to memorize!

establish
establishment
     the Church of England as the official state church.
disestablishment
antidisestablishment
antidisestablishmentarian
antidisestablishmentarianism
     is a political philosophy that is opposed to the
     separation of church and state.
        Rules and Memorization
Current thinking in psycholinguistics is that we use a
combination of rules and memorization
    However, this is very controversial
Mechanism:
    If there is an applicable rule, apply it
    However, if there is a memorized version, that takes
    precedence. (Important for irregular words.)
     – Artists paint “still lifes”
          Not “still lives”
     – Past tense of
          think  →  thought
          blink  →  blinked

This is a simplification; for more on this, see Pinker's "Words and
Rules" and "The Language Instinct".
  Representation of Meaning

I know that block blocks the sun.
  How do we represent the meanings of "block"?
  How do we represent "I know"?
  How does that differ from "I know that."?
  Who is "I"?
  How do we indicate that we are talking about earth's sun
  vs. some other planet's sun?
  When did this take place? What if I move the block?
  What if I move my viewpoint? How do we represent this?
How to tackle these problems?

The field was stuck for quite some time.
A new approach started around 1990
   Well, not really new, but the first time around, in the
   50's, they didn't have the text, disk space, or GHz
Main idea: combine memorizing and rules
How to do it:
   Get large text collections (corpora)
   Compute statistics over the words in those
   collections
Surprisingly effective
   Even better now with the Web
    Corpus-based Example:
Pre-Nominal Adjective Ordering
 Important for translation and generation
 Examples:
     big fat Greek wedding
     fat Greek big wedding
 Some approaches try to characterize this as semantic
 rules, e.g.:
     Age < color, value < dimension
 Data-intensive approaches
     Assume adjective ordering is independent of the
     noun they modify
     Compare how often you see {a, b} vs {b, a}


Keller & Lapata, "The Web as Baseline", HLT-NAACL'04
     Corpus-based Example:
 Pre-Nominal Adjective Ordering
Data-intensive approaches
  Compare how often you see {a, b} vs {b, a}
  What happens when you encounter an unseen pair?
    – Shaw and Hatzivassiloglou '99 use transitive closures
    – Malouf '00 uses a back-off bigram model
        P(<a,b>|{a,b}) vs. P(<b,a>|{a,b})
        He also uses morphological analysis, semantic similarity
         calculations and positional probabilities
  Keller and Lapata '04 use just the very simple algorithm
    – But they use the web as their training set
    – Gets 90% accuracy on 1000 sequences
    – As good as or better than the complex algorithms



 Keller & Lapata, "The Web as Baseline", HLT-NAACL'04
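A minimal sketch of the count-comparison idea, using the Brown corpus through NLTK as a small stand-in for the web counts Keller & Lapata actually used; the corpus choice and the helper name are illustrative, not from the slides.

```python
from collections import Counter
import nltk                      # assumes nltk is installed
from nltk.corpus import brown    # and nltk.download('brown') has been run

# Count adjacent lowercased word pairs once over the whole corpus.
bigram_counts = Counter(nltk.bigrams(w.lower() for w in brown.words()))

def preferred_order(a, b):
    """Return whichever ordering of the two adjectives is seen more often."""
    return (a, b) if bigram_counts[(a, b)] >= bigram_counts[(b, a)] else (b, a)

print(preferred_order("big", "fat"))   # ('big', 'fat'); ties fall back to the given order
```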
Real-World Applications of NLP
Spelling Suggestions/Corrections
Grammar Checking
Synonym Generation
Information Extraction
Text Categorization
Automated Customer Service
Speech Recognition (limited)
Machine Translation
In the (near?) future:
   Question Answering
   Improving Web Search Engine results
   Automated Metadata Assignment
   Online Dialogs
Synonym Generation
            Levels of Language
Sound Structure (Phonetics and Phonology)
   The sounds of speech and their production
   The systematic way that sounds are differently realized in
   different environments.
Word Structure (Morphology)
   From morphos = shape (not transform, as in morph)
   Analyzes how words are formed from minimal units of meaning;
   also derivational rules
     – dog + s = dogs; eat, eats, ate
Phrase Structure (Syntax)
   From the Greek syntaxis, arrange together
   Describes grammatical arrangements of words into hierarchical
   structure
          Levels of Language
Thematic Structure
   Getting closer to meaning
   Who did what to whom
    – Subject, object, predicate
Semantic Structure
   How the lower levels combine to convey meaning
Pragmatics and Discourse Structure
   How language is used across sentences.
       Parsing at Every Level

Transforming from a surface representation to an underlying
representation
   It's not straightforward to do any of these mappings!
   Ambiguity at every level
    – Word: is “saw” a verb or noun?
    – Phrase: “I saw the guy on the hill with the telescope.”
         Who is on the hill?
    – Semantic: which hill?
         Tokens and Types
 The term word can be used in two different ways:
1.   To refer to an individual occurrence of a word
2.   To refer to an abstract vocabulary item
 For example, the sentence "my dog likes his dog" contains
 five occurrences of words, but four vocabulary items.
 To avoid confusion use more precise terminology:
1.   Word token: an occurrence of a word
2.   Word Type: a vocabulary item
        Tokenization (continued)

Tokenization is harder than it seems
     I'll see you in New York.
     The aluminum-export ban.
The simplest approach is to use "graphic words" (i.e., separate
words using whitespace)
Another approach is to use regular expressions to specify which
substrings are valid words.
NLTK provides a generic tokenization interface: TokenizerI
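A small illustration of the two approaches just listed: whitespace "graphic words" versus a regular-expression tokenizer. The pattern here is only an example, not NLTK's own TokenizerI rules.

```python
import re

text = "I'll see you in New York. The aluminum-export ban."

graphic_words = text.split()        # "graphic words": split on whitespace

# Example pattern: words with internal apostrophes or hyphens, plus punctuation.
pattern = r"\w+(?:[-']\w+)*|[.,!?;]"
regex_tokens = re.findall(pattern, text)

print(graphic_words)   # ["I'll", 'see', ..., 'York.', ...]   (punctuation stuck on)
print(regex_tokens)    # ["I'll", 'see', ..., 'York', '.', 'aluminum-export', ...]
```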
               Terminology

Tagging
  The process of associating labels with each token in a text
Tags
  The labels
Tag Set
  The collection of tags used for a particular task
                         Example
Typically a tagged text is a sequence of white-space
  separated base/tag tokens:

  The/at Pantheon's/np interior/nn ,/, still/rb in/in its/pp original/jj
  form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at
  architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at
  perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in
  the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn
  ./.
            What does Tagging do?

1.       Collapses Distinctions
     •     Lexical identity may be discarded
     •     e.g. all personal pronouns tagged with PRP
2.       Introduces Distinctions
     •     Ambiguities may be removed
     •     e.g. deal tagged with NN or VB
     •     e.g. deal tagged with DEAL1 or DEAL2
3.       Helps classification and prediction
Significance of Parts of Speech

A word’s POS tells us a lot about the word and its
neighbors:
   Limits the range of meanings (deal), pronunciation (OBject
   vs. obJECT) or both (wind)
   Helps in stemming
   Limits the range of following words for Speech Recognition
   Can help select nouns from a document for IR
   Basis for partial parsing (chunked parsing)
   Parsers can build trees directly on the POS tags instead of
   maintaining a lexicon
           Choosing a tagset

The choice of tagset greatly affects the difficulty of
the problem
Need to strike a balance between
   Getting better information about context (best:
   introduce more distinctions)
   Make it possible for classifiers to do their job (need
   to minimize distinctions)
Some of the best-known Tagsets

 Brown corpus: 87 tags
 Penn Treebank: 45 tags
 Lancaster UCREL C5 (used to tag the BNC): 61 tags
 Lancaster C7: 145 tags
        The Brown Corpus
The first digital corpus (1961)
  Francis and Kucera, Brown University
Contents: 500 texts, each 2000 words long
  From American books, newspapers, magazines
  Representing genres:
    – Science fiction, romance fiction, press reportage,
      scientific writing, popular lore
            Penn Treebank

First syntactically annotated corpus
1 million words from Wall Street Journal
Part of speech tags and syntax trees
 How hard is POS tagging?

In the Brown corpus,
- 11.5% of word types are ambiguous
- 40% of word tokens are ambiguous


   Number of tags         1       2      3     4    5    6   7
   Number of word types   35340   3760   264   61   12   2   1
Important Penn Treebank tags
Verb inflection tags
The entire Penn Treebank tagset
          Tagging methods

Hand-coded
Statistical taggers
Brill (transformation-based) tagger
               Default Tagger

We need something to use for unseen words
  E.g., guess NNP for a word with an initial capital
How to do this?
  Apply a sequence of regular expression tests
  Assign the word to a suitable tag
If there are no matches…
  Assign the most frequent tag for unknown words, NN
   – Other common choices are verb, proper noun, adjective
  Note the role of closed-class words in English
   – Prepositions, auxiliaries, etc.
   – New ones do not tend to appear.
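A sketch of this idea with NLTK's RegexpTagger backed off to a DefaultTagger; the particular patterns are illustrative, and the backoff keyword assumes a reasonably recent NLTK release.

```python
import nltk

# Regular-expression tests, tried in order; anything unmatched falls back to NN.
patterns = [
    (r'^[A-Z].*$',       'NNP'),   # initial capital -> guess proper noun
    (r'.*ing$',          'VBG'),   # gerunds
    (r'.*ed$',           'VBD'),   # simple past
    (r'^-?\d+(\.\d+)?$', 'CD'),    # numbers
]
tagger = nltk.RegexpTagger(patterns, backoff=nltk.DefaultTagger('NN'))

print(tagger.tag("Atlanta reported 35 flooding incidents".split()))
```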
            Training vs. Testing
A fundamental idea in computational linguistics
Start with a collection labeled with the right answers
    Supervised learning
    Usually the labels are done by hand
"Train" or "teach" the algorithm on a subset of the labeled text.
Test the algorithm on a different set of data.
    Why?
     – If memorization worked, we'd be done.
     – Need to generalize so the algorithm works on examples that you haven't
       seen yet.
     – Thus testing only makes sense on examples you didn't train on.
        Evaluating a Tagger

Tagged tokens – the original data
Untag (exclude) the data
Tag the data with your own tagger
Compare the original and new tags
  Iterate over the two lists checking for identity and
  counting
  Accuracy = fraction correct
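A minimal accuracy computation matching the recipe above (iterate over the two lists, check identity, count); the toy gold/predicted pairs are made up for illustration.

```python
def tagger_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = 0
    for (word_g, tag_g), (word_p, tag_p) in zip(gold, predicted):
        assert word_g == word_p        # same tokens, possibly different tags
        correct += (tag_g == tag_p)
    return correct / len(gold)

gold = [("the", "DT"), ("race", "NN"), ("ended", "VBD")]
pred = [("the", "DT"), ("race", "VB"), ("ended", "VBD")]
print(tagger_accuracy(gold, pred))     # 0.666...
```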
          Language Modeling

Another fundamental concept in NLP
Main idea:
   For a given language, some words are more likely than others
   to follow each other, or
   You can predict (with some degree of accuracy) the
   probability that a given word will follow another word.
                          N-Grams
The N stands for how many terms are used
    Unigram: 1 term
    Bigram: 2 terms
    Trigrams: 3 terms
     – Usually don't go beyond this
You can use different kinds of terms, e.g.:
    Character based n-grams
    Word-based n-grams
    POS-based n-grams
Ordering
    Often adjacent, but not required
We use n-grams to help determine the context in which some linguistic
phenomenon happens.
    E.g., look at the words before and after the period to see if it is the end of a
    sentence or not.
         Features and Contexts
(figure: the word/tag pair at position n, w_n and t_n, is the FEATURE;
 the surrounding words and tags -- w_n-2, w_n-1, w_n+1 and
 t_n-2, t_n-1, t_n+1 -- are the CONTEXT)
                  Unigram Tagger
Trained using a tagged corpus to determine which tags are most
common for each word.

   E.g. in tagged WSJ sample, “deal” is tagged with NN 11 times, with VB 1
   time, and with VBP 1 time


Performance is highly dependent on the quality of its training set.
   Can't be too small
   Can't be too different from texts we actually want to tag
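A sketch of training and scoring a unigram tagger with NLTK on the Brown news section; the 90/10 split and the backoff-to-NN choice are illustrative, and the scoring method is evaluate() in older NLTK releases (accuracy() in newer ones).

```python
import nltk
from nltk.corpus import brown          # assumes nltk.download('brown') has been run

tagged_sents = brown.tagged_sents(categories='news')
split = int(len(tagged_sents) * 0.9)
train, test = tagged_sents[:split], tagged_sents[split:]

# Most-frequent-tag-per-word tagger, guessing NN for unseen words.
unigram = nltk.UnigramTagger(train, backoff=nltk.DefaultTagger('NN'))

print(unigram.tag("the deal was announced".split()))
print(unigram.evaluate(test))          # accuracy on held-out sentences
```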
          Nth Order Tagging
Order refers to how much context
  It's one less than the N in N-gram here because we use the
  target word itself as part of the context.
   – 0th order = unigram tagger
   – 1st order = bigrams
   – 2nd order = trigrams
Bigram tagger
  For tagging, in addition to considering the token's type, the
  context also considers the tags of the n preceding tokens
  What is the most likely tag for w_n, given w_n-1 and t_n-1?
  The tagger picks the tag which is most likely for that
  context.
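A sketch of a 1st-order (bigram) tagger in NLTK, backed off to the unigram and default taggers so that unseen contexts still get a tag; the corpus choice and split sizes are arbitrary.

```python
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
train, test = tagged_sents[:4000], tagged_sents[4000:]

default = nltk.DefaultTagger('NN')
unigram = nltk.UnigramTagger(train, backoff=default)
bigram  = nltk.BigramTagger(train, backoff=unigram)   # context = previous tag + current word

print(bigram.evaluate(test))
```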
Tagging with lexical frequencies
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN
the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to race given its lexical frequency
Solution: we choose the tag t with the greater P(race|t):
    P(race|VB)
    P(race|NN)
Actual estimate from the Switchboard corpus:
    P(race|NN) = .00041
    P(race|VB) = .00003
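The numbers above come from Switchboard; as a stand-in, the sketch below estimates P(race | tag) from the Brown corpus with an NLTK ConditionalFreqDist (using the universal tagset names NOUN/VERB instead of NN/VB).

```python
import nltk
from nltk.corpus import brown   # also needs nltk.download('universal_tagset')

# P(word | tag) estimated as count(word, tag) / count(tag).
cfd = nltk.ConditionalFreqDist(
    (tag, word.lower()) for word, tag in brown.tagged_words(tagset='universal'))

for tag in ('NOUN', 'VERB'):
    print(tag, cfd[tag].freq('race'))   # relative frequency of "race" under each tag
```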
        Rule-Based Tagger

The Linguistic Complaint
  Where is the linguistic knowledge of a tagger?
  Just a massive table of numbers
  Aren't there any linguistic insights that could emerge
  from the data?
  Could we instead use handcrafted sets of rules to tag input
  sentences? For example: if a word follows a determiner,
  tag it as a noun.
           The Brill tagger

An example of TRANSFORMATION-BASED LEARNING
Very popular (freely available, works fairly well)
A SUPERVISED method: requires a tagged corpus
Basic idea: do a quick job first (using frequency),
then revise it using contextual rules
 Brill Tagging: In more detail

Start with simple (less accurate) rules…learn better ones from
tagged corpus
   Tag each word initially with most likely POS
   Examine set of transformations to see which improves
   tagging decisions compared to tagged corpus
   Re-tag corpus using best transformation
   Repeat until, e.g., performance doesn't improve
   Result: tagging procedure (ordered list of transformations)
   which can be applied to new, untagged text
                     An example
Examples:
     It is expected to race tomorrow.
     The race for outer space.
Tagging algorithm:
1.   Tag all uses of “race” as NN (most likely tag in the Brown
     corpus)
     •   It is expected to race/NN tomorrow
     •   the race/NN for outer space
2.   Use a transformation rule to replace the tag NN with VB for all
     uses of “race” preceded by the tag TO:
     •   It is expected to race/VB tomorrow
     •   the race/NN for outer space
 Transformation-based learning
       in the Brill tagger
1. Tag the corpus with the most likely tag for each
   word
2. Choose a TRANSFORMATION that deterministically
   replaces an existing tag with a new one such that
   the resulting tagged corpus has the lowest error
   rate
3. Apply that transformation to the training corpus
4. Repeat
5. Return a tagger that
   a. first tags using unigrams
   b. then applies the learned transformations in order
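A sketch of this loop using NLTK's Brill implementation: a unigram tagger does the quick job first, then transformations are learned against the tagged corpus. The module paths and the brill24 template set assume NLTK 3; treat the details as illustrative.

```python
import nltk
from nltk.tag import brill, brill_trainer
from nltk.corpus import brown

tagged = brown.tagged_sents(categories='news')
train, test = tagged[:3000], tagged[3000:3500]

# Step 1: quick job first -- most likely tag per word, NN for unknowns.
initial = nltk.UnigramTagger(train, backoff=nltk.DefaultTagger('NN'))

# Step 2: learn contextual transformations that reduce errors on the training data.
trainer = brill_trainer.BrillTaggerTrainer(initial, brill.brill24(), trace=0)
brill_tagger = trainer.train(train, max_rules=20)

print(brill_tagger.rules()[:5])        # the first learned transformations
print(brill_tagger.evaluate(test))
```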
Examples of learned
  transformations
Templates
Probabilities in Language Modeling

 A fundamental concept in NLP
 Main idea:
    For a given language, some words are more likely than others
    to follow each other, or
    You can predict (with some degree of accuracy) the
    probability that a given word will follow another word.
        Next Word Prediction
From a NY Times story...
   Stocks ...
   Stocks plunged this ….
   Stocks plunged this morning, despite a cut in interest rates
   Stocks plunged this morning, despite a cut in interest rates
   by the Federal Reserve, as Wall ...
   Stocks plunged this morning, despite a cut in interest rates
   by the Federal Reserve, as Wall Street began
Stocks plunged this morning, despite a cut in
interest rates by the Federal Reserve, as Wall Street
began trading for the first time since last …
Stocks plunged this morning, despite a cut in
interest rates by the Federal Reserve, as Wall Street
began trading for the first time since last Tuesday's
terrorist attacks.
     Human Word Prediction

Clearly, at least some of us have the ability to predict
future words in an utterance.
How?
   Domain knowledge
   Syntactic knowledge
   Lexical knowledge
                    Claim

A useful part of the knowledge needed to allow word
prediction can be captured using simple statistical
techniques
In particular, we'll rely on the notion of the
probability of a sequence (a phrase, a sentence)
                  Applications
Why do we want to predict a word, given some preceding
words?
   Rank the likelihood of sequences containing various
   alternative hypotheses, e.g. for ASR
   Theatre owners say popcorn/unicorn sales have doubled...
   Assess the likelihood/goodness of a sentence
    – for text generation or machine translation.
   The doctor recommended a cat scan.
   El doctor recommendó una exploración del gato.
 N-Gram Models of Language

Use the previous N-1 words in a sequence to predict
the next word
Language Model (LM)
  unigrams, bigrams, trigrams,…
How do we train these models?
  Very large corpora
              Simple N-Grams
Assume a language has V word types in its lexicon; how likely is
word x to follow word y?
   Simplest model of word probability: 1/V
   Alternative 1: estimate likelihood of x occurring in new text
   based on its general frequency of occurrence estimated from a
   corpus (unigram probability)
     popcorn is more likely to occur than unicorn
   Alternative 2: condition the likelihood of x occurring in the
   context of previous words (bigrams, trigrams,…)
     mythical unicorn is more likely than mythical popcorn
             A Word on Notation
P(unicorn)
   Read this as "the probability of seeing the token unicorn"
   Unigram tagger uses this.
P(unicorn|mythical)
   Called the conditional probability.
   Read this as "the probability of seeing the token unicorn
   given that you've seen the token mythical"
   Bigram tagger uses this.
   Related to the conditional frequency distributions that
   we've been working with.
Computing the Probability of a
     Word Sequence
Compute the product of component conditional probabilities?
   P(the mythical unicorn) =
   P(the) P(mythical|the) P(unicorn|the mythical)
The longer the sequence, the less likely we are to find it in a
training corpus
     P(Most biologists and folklore specialists believe that in fact the
       mythical unicorn horns derived from the narwhal)
Solution: approximate using n-grams
                  Bigram Model
Approximate P(w_n | w_1 … w_n-1) by P(w_n | w_n-1)
   P(unicorn|the mythical) by P(unicorn|mythical)
Markov assumption:
   The probability of a word depends only on the probability
   of a limited history
Generalization:
   The probability of a word depends only on the probability
   of the n previous words
     – trigrams, 4-grams, …
     – the higher n is, the more data needed to train
     – backoff models
              Using N-Grams
For N-gram models

   P(w_n-1, w_n) = P(w_n | w_n-1) P(w_n-1)
   By the Chain Rule we can decompose a joint probability,
   e.g. P(w1,w2,w3):
    P(w1,w2,...,wn) = P(w1|w2,w3,...,wn) P(w2|w3,...,wn) … P(w_n-1|w_n) P(w_n)
    For bigrams then, the probability of a sequence is just the
      product of the conditional probabilities of its bigrams:
    P(the,mythical,unicorn) = P(unicorn|mythical) P(mythical|the)
      P(the|<start>)


                     P(w_1 … w_n)  ≈  ∏_{k=1..n} P(w_k | w_k-1)
         Training and Testing
N-Gram probabilities come from a training corpus
   overly narrow corpus: probabilities don't generalize
   overly general corpus: probabilities don't reflect task or
   domain
A separate test corpus is used to evaluate the model, typically
using standard metrics
   held out test set; development test set
   cross validation
   results tested for statistical significance
       Shallow (Chunk) Parsing

Goal: divide a sentence into a sequence of chunks.
 Chunks are non-overlapping regions of a text
     [I] saw [a tall man] in [the park].
 Chunks are non-recursive
       A chunk cannot contain other chunks
 Chunks are non-exhaustive
       Not all words are included in chunks
    Chunk Parsing Examples
Noun-phrase chunking:
 [I] saw [a tall man] in [the park].
Verb-phrase chunking:
 The man who [was in the park] [saw me].
Prosodic chunking:

 [I saw] [a tall man] [in the park].
Question answering:
   What [Spanish explorer] discovered [the
   Mississippi River]?
 Shallow Parsing: Motivation

Locating information
   e.g., text retrieval
    – Index a document collection on its noun phrases
Ignoring information
   Generalize in order to study higher-level patterns
    – e.g. phrases involving “gave” in Penn treebank:
         gave NP; gave up NP in NP; gave NP up; gave NP
          help; gave NP to NP
   Sometimes a full parse has too much structure
    – Too nested
    – Chunks usually are not recursive
           Representation
BIO (or IOB)




Trees
Comparison with Full Syntactic Parsing

     Parsing is usually an intermediate stage
         Builds structures that are used by later stages of processing
     Full parsing is a sufficient but not necessary intermediate stage for
     many NLP tasks
         Parsing often provides more information than we need
     Shallow parsing is an easier problem
         Less word-order flexibility within chunks than between chunks
         More locality:
           – Fewer long-range dependencies
           – Less context-dependence
           – Less ambiguity
     Chunks and Constituency
   Constituents: [[a tall man] [ in [the park]]].
   Chunks:         [a tall man] in [the park].

A constituent is part of some higher unit in the
hierarchical syntactic parse
Chunks are not constituents
     Constituents are recursive
But, chunks are typically subsequences of constituents
     Chunks do not cross major constituent boundaries
                              Chunking
Define a regular expression that matches the sequences of tags
in a chunk

  A simple noun phrase chunk regexp
   (note that <NN.*> matches any tag starting with NN):
 <DT>? <JJ>* <NN.*>

Chunk all matching subsequences:
    the/DT little/JJ cat/NN   sat/VBD on/IN the/DT mat/NN
   [the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]

If matching subsequences overlap, the first one gets priority
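The same chunk rule expressed with NLTK's RegexpParser; the toy tagged sentence is the one from the slide.

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
            ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# NP chunk: optional determiner, any adjectives, then a noun-like tag.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
print(chunker.parse(sentence))
# (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))
```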
                  Unchunking
Remove any chunk with a given pattern
  e.g., unChunkRule('<NN|DT>+', 'Unchunk NNDT')
  Combine with chunk rule <NN|DT|JJ>+
Chunk all matching subsequences:
  Input:
    the/DT little/JJ cat/NN   sat/VBD on/IN the/DT mat/NN
  Apply chunk rule
   [the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
  Apply unchunk rule
   [the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN
                              Chinking
A chink is a subsequence of the text that is not a chunk.
Define a regular expression that matches the sequences of tags in a
chink
 A simple chink regexp for finding NP chunks:
 (<VB.?>|<IN>)+
First apply chunk rule to chunk everything
   Input:
    the/DT little/JJ cat/NN   sat/VBD on/IN the/DT mat/NN
   ChunkRule('<.*>+', 'Chunk everything')
   [the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN]
   Apply Chink rule above:
   [the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
            Chunk                 Chink         Chunk
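A sketch of chinking with NLTK's chunk-grammar syntax, where }...{ cuts material back out of a chunk; the VBD|IN chink below corresponds to the chink regexp on this slide, restricted to the tags that actually occur in the example.

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
            ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

grammar = r"""
  NP:
    {<.*>+}          # chunk everything
    }<VBD|IN>+{      # then chink sequences of verbs and prepositions
"""
print(nltk.RegexpParser(grammar).parse(sentence))
# (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))
```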
                     Merging
Combine adjacent chunks into a single chunk
   Define a regular expression that matches the sequences of
   tags on both sides of the point to be merged
Example:
   Merge a chunk ending in JJ with a chunk starting with NN
   MergeRule('<JJ>', '<NN>', 'Merge adjs and nouns')

    [the/DT little/JJ] [cat/NN] sat/VBD on/IN the/DT mat/NN
    [the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN
Splitting is the opposite of merging
Applying Chunking to Treebank Data
Classifying at Different Granularities

Text Categorization:
   Classify an entire document
Information Extraction (IE):
   Identify and classify small units within documents
Named Entity Extraction (NE):
   A subset of IE
   Identify and classify proper names
    – People, locations, organizations
Example: The Problem:
   Looking for a Job




                  (screenshots: Martin Baker, a person; a Genomics job posting;
                   an employer's job posting form)
          What is Information Extraction
       As a task:                Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-
source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.

"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."

Richard Stallman, founder of the Free
Software Foundation, countered saying…

   (target database table, initially empty:  NAME | TITLE | ORGANIZATION)
         What is Information Extraction
       As a task:                Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-
source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.

"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."

Richard Stallman, founder of the Free
Software Foundation, countered saying…

   IE output:
   NAME               TITLE     ORGANIZATION
   Bill Gates         CEO       Microsoft
   Bill Veghte        VP        Microsoft
   Richard Stallman   founder   Free Soft..
       What is Information Extraction
    As a family                    Information Extraction =
of techniques:                      segmentation + classification + association

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-
source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…

   Segmented mentions (aka "named entity extraction"):
   Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft,
   Bill Veghte, Microsoft, VP, Richard Stallman, founder,
   Free Software Foundation
                                   IE in Context
(pipeline diagram, components: create ontology; spider; filter by relevance;
 IE = segment / classify / associate / cluster; load DB; database;
 query, search; data mine; document collection; label training data;
 train extraction models)
                Landscape of IE Tasks:
                       Degree of Formatting
        Text paragraphs                               Grammatical sentences
       without formatting                           and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.


  Non-grammatical snippets,                                    Tables
    rich formatting & links
            Landscape of IE Tasks:
        Intended Breadth of Coverage
   Web site specific    Genre specific   Wide, non-specific
   Formatting                Layout           Language
Amazon.com Book Pages       Resumes         University Names
       Landscape of IE Tasks:
                        Complexity
   Closed set                          Regular set
U.S. states                     U.S. phone numbers

He was born in Alabama…         Phone: (413) 545-1323

The big Wyoming sky…            The CALD main office can be
                                reached at 412-268-1299


   Complex pattern                Ambiguous patterns,
                                  needing context and
U.S. postal addresses             many sources of evidence
University of Arkansas         Person names
P.O. Box 140                    …was among the six houses
Hope, AR 71802                  sold by Hope Feldman that year.

Headquarters:                   Pawel Opalinski, Software
1128 Main Street, 4th Floor     Engineer at WhizBang Labs.
Cincinnati, Ohio 45210
             Landscape of IE Tasks:
                     Single Field/Record
  Jack Welch will retire as CEO of General Electric tomorrow. The top role
  at the Connecticut company will be filled by Jeffrey Immelt.



     Single entity ("named entity" extraction):
        Person:   Jack Welch
        Person:   Jeffrey Immelt
        Location: Connecticut

     Binary relationship:
        Relation: Person-Title
          Person: Jack Welch
          Title:  CEO
        Relation: Company-Location
          Company:  General Electric
          Location: Connecticut

     N-ary record:
        Relation: Succession
          Company: General Electric
          Title:   CEO
          Out:     Jack Welch
          In:      Jeffrey Immelt
State of the Art Performance: a
             sample
Named entity recognition from newswire text
   Person, Location, Organization, …
   F1 in high 80's or low- to mid-90's
Binary relation extraction
   Contained-in (Location1, Location2)
   Member-of (Person1, Organization1)
   F1 in 60's or 70's or 80's
Web site structure recognition
   Extremely accurate performance obtainable
   Human effort (~10min?) required on each site
  Three generations of IE systems
Hand-Built Systems – Knowledge Engineering [1980s– ]
   Rules written by hand
   Require experts who understand both the systems and the
   domain
   Iterative guess-test-tweak-repeat cycle
Automatic, Trainable Rule-Extraction Systems [1990s– ]
   Rules discovered automatically using predefined templates,
   using automated rule learners
   Require huge, labeled corpora (effort is just moved!)
Statistical Models [1997 – ]
   Use machine learning to learn which features indicate
   boundaries and types of entities.
   Learning usually supervised; may be partially unsupervised
                    Landscape of IE Techniques
(diagram: six families of models, each illustrated on the sentence
 "Abraham Lincoln was born in Kentucky.")

   Lexicons                  – is a token a member of a list
                               (Alabama, Alaska, …, Wisconsin, Wyoming)?
   Classify Pre-segmented
   Candidates                – a classifier assigns a class to each candidate segment
   Sliding Window            – a classifier labels windows of text, trying alternate window sizes
   Boundary Models           – classifiers predict BEGIN and END boundaries
   Finite State Machines     – find the most likely state sequence
   Context Free Grammars     – parse into constituents (NNP NNP V V P NP; NP, PP, VP, S)

 Any of these models can be used to capture words, formatting or both.
         Trainable IE systems

Pros:
   Annotating text is simpler & faster than writing rules.
   Domain independent
   Domain experts don't need to be linguists or programmers.
   Learning algorithms ensure full coverage of examples.

Cons:
   Hand-crafted systems perform better, especially at hard tasks
   (but this is changing).
   Training data might be expensive to acquire.
   May need a huge amount of training data.
   Hand-writing rules isn't that hard!!
       MUC: the genesis of IE
DARPA funded significant efforts in IE in the early to mid
1990's.
Message Understanding Conference (MUC) was an
annual event/competition where results were presented.
Focused on extracting information from news articles:
   Terrorist events
   Industrial joint ventures
   Company management changes
Information extraction of particular interest to the
intelligence community (CIA, NSA). (Note: early '90s)
     Message Understanding
       Conference (MUC)
Named entity
    – Person, Organization, Location
Co-reference
    – Clinton → President Bill Clinton
Template element
    – Perpetrator, Target
Template relation
    – Incident
Multilingual
             MUC Typical Text

Bridgestone Sports Co. said Friday it has set up a joint
   venture in Taiwan with a local concern and a
   Japanese trading house to produce golf clubs to be
   shipped to Japan. The joint venture, Bridgestone
   Sports Taiwan Co., capitalized at 20 million new
   Taiwan dollars, will start production of 20,000 iron
   and "metal wood" clubs a month
               MUC Templates

Relationship
    – tie-up
Entities:
    – Bridgestone Sports Co, a local concern, a Japanese
      trading house
Joint venture company
    – Bridgestone Sports Taiwan Co
Activity
     – ACTIVITY-1
Amount
     – NT$20,000,000
             MUC Templates

ACTIVITY-1
  Activity
   – Production
  Company
   – Bridgestone Sports Taiwan Co
  Product
   – Iron and “metal wood” clubs
  Start Date
   – January 1990
                   Example of IE from FASTUS (1993)

  Bridgestone Sports Co. said Friday it had set up a joint venture
  in Taiwan with a local concern and a Japanese trading house to
  produce golf clubs to be supplied to Japan.

  The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20
  million new Taiwan dollars, will start production in January 1990
  with production of 20,000 iron and “metal wood” clubs a month.


TIE-UP-1
Relationship: TIE-UP
Entities: "Bridgestone Sports Co."
          "a local concern"
          "a Japanese trading house"
Joint Venture Company:
         "Bridgestone Sports Taiwan Co."
Activity:       ACTIVITY-1
Amount:         NT$20000000
                      Example of IE: FASTUS(1993)

  Bridgestone Sports Co. said Friday it had set up a joint venture
  in Taiwan with a local concern and a Japanese trading house to
  produce golf clubs to be supplied to Japan.

  The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20
  million new Taiwan dollars, will start production in January 1990
  with production of 20,000 iron and “metal wood” clubs a month.


TIE-UP-1                                     ACTIVITY-1
Relationship: TIE-UP                         Activity: PRODUCTION
Entities: "Bridgestone Sports Co."           Company:
          "a local concern"                      "Bridgestone Sports Taiwan Co."
          "a Japanese trading house"         Product:
Joint Venture Company:                           "iron and 'metal wood' clubs"
         "Bridgestone Sports Taiwan Co."     Start Date:
Activity:       ACTIVITY-1                        DURING: January 1990
Amount:         NT$20000000
       Example of IE: FASTUS(1993): Resolving anaphora
 Bridgestone Sports Co. said Friday it had set up a joint venture
 in Taiwan with a local concern and a Japanese trading house to
 produce golf clubs to be supplied to Japan.

 The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20
 million new Taiwan dollars, will start production in January 1990
 with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1                                   ACTIVITY-1
Relationship: TIE-UP                       Activity: PRODUCTION
Entities: "Bridgestone Sports Co."         Company:
          "a local concern"                    "Bridgestone Sports Taiwan Co."
          "a Japanese trading house"       Product:
Joint Venture Company:                         "iron and 'metal wood' clubs"
         "Bridgestone Sports Taiwan Co."   Start Date:
Activity:       ACTIVITY-1                      DURING: January 1990
Amount:         NT$20000000
         Evaluating IE Accuracy
Always evaluate performance on independent, manually-annotated test
data not used during system development.
Measure for each test document:
   Total number of correct extractions in the solution template: N
   Total number of slot/value pairs extracted by the system: E
   Number of extracted slot/value pairs that are correct (i.e. in the
   solution template): C
Compute average value of metrics adapted from IR:
   Recall = C/N
   Precision = C/E
   F-Measure = Harmonic mean of recall and precision
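The three metrics, computed directly from the counts N, E, C defined above; the numbers in the example call are made up.

```python
def ie_scores(n_solution, n_extracted, n_correct):
    """Recall, precision and F-measure from the counts N, E, C above."""
    recall = n_correct / n_solution
    precision = n_correct / n_extracted
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure

# e.g. 10 slots in the answer key, 8 extracted, 6 of them correct
print(ie_scores(10, 8, 6))   # (0.6, 0.75, 0.666...)
```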
MUC Information Extraction:
  State of the Art c. 1997




                NE – named entity recognition
                CO – coreference resolution
                TE – template element construction
                TR – template relation construction
                ST – scenario template production
Finite State Transducers for IE
 Basic method for extracting relevant information
 IE systems generally use a collection of specialized
 FSTs
     – Company Name detection
     – Person Name detection
     – Relationship detection
  Three Equivalent Representations
(diagram: regular expressions, finite automata, and regular languages --
 each can describe the others)
Theorem:
  For every regular expression, there is a deterministic
  finite-state automaton that defines the same
  language, and vice versa.
         Question Answering
Today:
   Introduction to QA
   A typical full-fledged QA system
   A very simple system, in response to this
   An intermediate approach
Wednesday:
   Using external resources
    – WordNet
    – Encyclopedias, Gazetteers
   Incorporating a reasoning system
   Machine Learning of mappings
   Other question types (e.g., biography, definitions)
           A                   of Search Types

Question/Answer    What is the typical height of a giraffe?

                   What are some good ideas for landscaping my
Browse and Build
                   client‘s yard?

                   What are some promising untried treatments
Text Data Mining
                   for Raynaud‘s disease?
     Beyond Document Retrieval

Document Retrieval
   Users submit queries corresponding to their information
   needs.
   System returns (voluminous) list of full-length documents.
   It is the responsibility of the users to find information of
   interest within the returned documents.
Open-Domain Question Answering (QA)
   Users ask questions in natural language.
       What is the highest volcano in Europe?
   System returns list of short answers.
       … Under Mount Etna, the highest volcano in Europe,
   perches the fabulous town …
   A real use for NLP
      Questions and Answers

What is the height of a typical giraffe?

   The result can be a simple answer, extracted from
   existing web pages.
   Can specify with keywords or a natural language
   query
    – However, most web search engines are not set up to
      handle questions properly.
    – Get different results using a question vs. keywords
   The Problem of Question Answering
When was the San Francisco fire?
    … were driven over it. After the ceremonial tie was removed - it burned in the San
   Francisco fire of 1906 – historians believe an unknown Chinese worker probably drove
   the last steel spike into a wooden tie. If so, it was only…



     What is the nationality of Pope John Paul II?
         … stabilize the country with its help, the Catholic hierarchy stoutly held out for pluralism,
         in large part at the urging of Polish-born Pope John Paul II. When the Pope
         emphatically defended the Solidarity trade union during a 1987 tour of the…




Where is the Taj Mahal?
   … list of more than 360 cities around the world includes the Great Reef in
   Australia, the Taj Mahal in India, Chartre‘s Cathedral in France, and
   Serengeti National Park in Tanzania. The four sites Japan has listed include…
The Problem of Question Answering
                                        Natural language question,
                                           not keyword queries


 What is the nationality of Pope John Paul II?
    … stabilize the country with its help, the Catholic hierarchy stoutly held out for pluralism,
    in large part at the urging of Polish-born Pope John Paul II. When the Pope
    emphatically defended the Solidarity trade union during a 1987 tour of the…




       Short text fragment,
           not URL list
Question Answering from text

With massive collections of full-text documents,
simply finding relevant documents is of limited
use: we want answers
QA: give the user a (short) answer to their
question, perhaps supported by evidence.
An alternative to standard IR
  The first problem area in IR where NLP is really
  making a difference.
 People want to ask questions…
Examples from AltaVista query log
  who invented surf music?
  how to make stink bombs
  where are the snowdens of yesteryear?
  which english translation of the bible is used in official catholic
  liturgies?
  how to do clayart
  how to copy psx
  how tall is the sears tower?
Examples from Excite query log (12/1999)
  how can i find someone in texas
  where can i find information on puritan religion?
  what are the 7 wonders of the world
  how can i eliminate stress
  What vacuum cleaner does Consumers Guide recommend
    A Brief (Academic) History
In some sense question answering is not a new research area
Question answering systems can be found in many areas of NLP
research, including:
   Natural language database systems
     – A lot of early NLP work on these
   Problem-solving systems
     – STUDENT (Winograd '77)
     – LUNAR   (Woods & Kaplan '77)
   Spoken dialog systems
     – Currently very active and commercially relevant
What is new is the focus on open-domain QA
   First modern system: MURAX (Kupiec, SIGIR'93):
     – Trivial Pursuit questions
     – Encyclopedia answers
   FAQFinder (Burke et al. '97)
   TREC QA competition (NIST, 1999–present)
                 AskJeeves

AskJeeves is probably the most hyped example of
"question answering"
How it used to work:
   Do pattern matching to match a question to their
   own knowledge base of questions
   If a match is found, returns a human-curated answer
   to that known question
   If that fails, it falls back to regular web search
   (Seems to be more of a meta-search engine now)
A potentially interesting middle ground, but a fairly
weak shadow of real QA
 Question Answering at TREC
Question answering competition at TREC consists of answering a set of
500 fact-based questions, e.g.,
    “When was Mozart born?”.
Has really pushed the field forward.
The document set
    Newswire textual documents from LA Times, San Jose Mercury News, Wall
    Street Journal, NY Times etcetera: over 1M documents now.
    Well-formed lexically, syntactically and semantically (were reviewed by
    professional editors).
The questions
    Hundreds of new questions every year, the total is ~2400
Task
    Initially extract at most 5 answers: long (250B) and short (50B).
    Now extract only one exact answer.
    Several other sub-tasks added later: definition, list, biography.
         Sample TREC questions
1. Who is the author of the book, "The Iron Lady: A Biography of
Margaret Thatcher"?

2. What was the monetary value of the Nobel Peace Prize in 1989?

3. What does the Peugeot company manufacture?

4. How much did Mercury spend on advertising in 1993?

5. What is the name of the managing director of Apricot Computer?

6. Why did David Koresh ask the FBI for a word processor?

7. What is the name of the rare neurological disease with
   symptoms such as: involuntary movements (tics), swearing,
   and incoherent vocalizations (grunts, shouts, etc.)?
                    TREC Scoring
For the first three years systems were allowed to return 5 ranked
answer snippets (50/250 bytes) to each question.

   Mean Reciprocal Rank Scoring (MRR):
     – Each question assigned the reciprocal rank of the first correct answer.
       If correct answer at position k, the score is 1/k.
       1, 0.5, 0.33, 0.25, 0.2, 0 for 1, 2, 3, 4, 5, 6+ position


   Mainly Named Entity answers (person, place, date, …)


From 2002 on, the systems are only allowed to return a single exact
answer and the notion of confidence has been introduced.
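Mean reciprocal rank as described above, written out as a short function; the example ranks are invented (None marks a question with no correct answer in the top five).

```python
def mean_reciprocal_rank(first_correct_ranks):
    """first_correct_ranks: 1-based rank of the first correct answer per
    question, or None if no correct answer was returned."""
    total = sum(1.0 / r for r in first_correct_ranks if r is not None)
    return total / len(first_correct_ranks)

print(mean_reciprocal_rank([1, 3, None, 2]))   # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
```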
      Top Performing Systems
In 2003, the best performing systems at TREC could answer approximately
60-70% of the questions
Approaches and successes have varied a fair deal
    Knowledge-rich approaches, using a vast array of NLP techniques
    stole the show in 2000-2003
     – Notably Harabagiu, Moldovan et al. ( SMU/UTD/LCC )
Statistical systems starting to catch up
    AskMSR system stressed how much could be achieved by very simple
    methods with enough text (and now various copycats)
    People are experimenting with machine learning methods
Middle ground is to use large collection of surface matching patterns (ISI)
          Example QA System
This system contains many components used by other systems, but more
complex in some ways
Most work completed in 2001; there have been advances by this group
and others since then.
Next slides based mainly on:
   Paşca and Harabagiu, High-Performance Question Answering from
   Large Text Collections, SIGIR'01.
   Paşca and Harabagiu, Answer Mining from Online Documents,
   ACL'01.
   Harabagiu, Paşca, Maiorano: Experiments with Open-Domain
   Textual Question Answering. COLING'00
                  QA Block Architecture
(block diagram:
  Q → Question Processing → [keywords, question semantics]
    → Passage Retrieval (on top of Document Retrieval) → [passages]
    → Answer Extraction → A;
  Question Processing and Answer Extraction each use WordNet, a parser,
  and a named-entity recognizer)

Question Processing: captures the semantics of the question and selects
keywords for passage retrieval
Passage Retrieval: extracts and ranks passages using surface-text techniques
Answer Extraction: extracts and ranks answers using NL techniques
      Question Processing Flow
(flow: Q → question parsing → construction of the question representation
   → question semantic representation;
   answer type detection → AT category;
   keyword selection → keywords)
           Question Stems and Answer Types

Identify the semantic category of expected answers

Question                                        Question stem   Answer type

Q555: What was the name of Titanic‘s captain?   What            Person



Q654: What U.S. Government agency registers     What            Organization
trademarks?

Q162: What is the capital of Kosovo?            What            City

Q661: How much does one ton of cement cost?     How much        Quantity



     Other question stems: Who, Which, Name, How hot...
     Other answer types: Country, Number, Product...
  Detecting the Expected Answer Type
In some cases, the question stem is sufficient to indicate the answer
type (AT)
   Why → REASON
   When → DATE
In many cases, the question stem is ambiguous
   Examples
     – What was the name of Titanic's captain?
     – What U.S. Government agency registers trademarks?
     – What is the capital of Kosovo?
   Solution: select additional question concepts (AT words) that help
   disambiguate the expected answer type
   Examples
     – captain
     – agency
     – capital
      Answer Type Taxonomy
Encodes 8707 English concepts to help recognize expected answer type
Mapping to parts of WordNet done by hand
   Can connect to Noun, Adj, and/or Verb subhierarchies
Answer Type Detection Algorithm
 Select the answer type word from the question representation.
    Select the word(s) connected to the question. Some “content-
    free” words are skipped (e.g. “name”).
    From the previous set select the word with the highest
    connectivity in the question representation.


 Map the AT word into the previously built AT hierarchy
    The AT hierarchy is based on WordNet, with some concepts
    associated with semantic categories, e.g. "writer" → PERSON.
 Select the AT(s) from the first hypernym(s) associated with a semantic
 category.
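
Below is a minimal Python sketch of this algorithm (not the original system's code) using NLTK's WordNet interface. The SEMANTIC_CATEGORIES table is a tiny hypothetical stand-in for the hand-built 8,707-concept hierarchy, and the breadth-first walk over hypernyms plays the role of "first hypernym associated with a semantic category".

# A minimal sketch, assuming a tiny hand-made category table in place of
# the real 8,707-concept AT hierarchy.
from nltk.corpus import wordnet as wn

SEMANTIC_CATEGORIES = {          # hypothetical stand-in for the AT hierarchy
    "person.n.01": "PERSON",
    "organization.n.01": "ORGANIZATION",
    "location.n.01": "LOCATION",
}

def answer_type(at_word):
    """Walk up the hypernym chain of the AT word's first noun sense and
    return the first semantic category encountered."""
    synsets = wn.synsets(at_word, pos=wn.NOUN)
    if not synsets:
        return "UNKNOWN"
    frontier = [synsets[0]]
    while frontier:
        syn = frontier.pop(0)
        if syn.name() in SEMANTIC_CATEGORIES:
            return SEMANTIC_CATEGORIES[syn.name()]
        frontier.extend(syn.hypernyms())
    return "UNKNOWN"

print(answer_type("captain"))   # expected: PERSON
print(answer_type("agency"))    # expected: ORGANIZATION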
                 Answer Type Hierarchy
[Hierarchy fragment] PERSON subsumes, among others: scientist / man of
science (researcher, chemist, oceanographer), inhabitant / dweller / denizen
(American, westerner, islander / island-dweller), and performer / performing
artist (actor, actress, tragedian, dancer, ballet dancer).
[Example question representations] For "What researcher discovered the
vaccine against Hepatitis-B?" the AT word is researcher, mapping to PERSON;
for "What is the name of the French oceanographer who owned Calypso?" the
AT word is oceanographer, also mapping to PERSON.
        Understanding a Simple Narrative
Feb. 18, 2004
Yesterday Holly was running a marathon when she twisted her ankle. David had pushed her.

           Temporal Awareness of the Narrative
        • Time
        • Events, Activities, and States
        • Anchoring the Events
        • Ordering the Events
      Temporal Aspects of Narrative
Text (Feb. 18, 2004):
Yesterday Holly was running a marathon when she twisted her ankle. David had
   pushed her.

[Timeline diagram: 02/17/2004 (yesterday) is before 02/18/2004, the document
date; the running takes place during yesterday; the twisting finishes the
running; the pushing is before the twisting.]

1. When did the running occur?  Yesterday.
2. When did the twisting occur?  Yesterday, during the running.
3. Did the pushing occur before the twisting?  Yes.
4. Did Holly keep running after twisting her ankle?  Probably not.
                  Temporal Assumptions



• Time primitives are temporal intervals.

• No branching into the future or the past.

• 13 basic (binary) interval relations:
    [b, a, eq, o, oi, s, si, f, fi, d, di, m, mi]
    (six are inverses of the other six)

• Supported by a transitivity table that defines the conjunction
of any two relations.

• All 13 relations can be expressed using meets:
    Before(X, Y) ⟺ ∃Z (meets(X, Z) ∧ meets(Z, Y))
Allen's 13 Temporal Relations
[Interval diagram; each relation is shown with its inverse:]
    A is EQUAL to B      /  B is EQUAL to A
    A is BEFORE B        /  B is AFTER A
    A MEETS B            /  B is MET by A
    A OVERLAPS B         /  B is OVERLAPPED by A
    A STARTS B           /  B is STARTED by A
    A FINISHES B         /  B is FINISHED by A
    A is DURING B        /  B CONTAINS A
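
As a concrete illustration (not from the lecture materials), the sketch below classifies which of Allen's 13 relations holds between two intervals given as (start, end) number pairs; it assumes strict intervals with start < end.

# Sketch: classify the Allen relation between two intervals
# given as (start, end) pairs with start < end.
def allen_relation(a, b):
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:  return "before"          # inverse: after
    if a2 == b1: return "meets"           # inverse: met-by
    if a1 == b1 and a2 == b2: return "equal"
    if a1 == b1: return "starts" if a2 < b2 else "started-by"
    if a2 == b2: return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2: return "during"     # inverse: contains
    if a1 < b1 and b2 < a2: return "contains"
    if a1 < b1 < a2 < b2: return "overlaps"     # inverse: overlapped-by
    if b1 < a1 < b2 < a2: return "overlapped-by"
    if b2 < a1:  return "after"
    if b2 == a1: return "met-by"
    return "unknown"

# e.g. the push is before the twist, and the twist is during the run:
print(allen_relation((1, 2), (3, 6)))   # before
print(allen_relation((4, 5), (3, 6)))   # during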
              Allen’s Temporal Ontology

Properties hold over every subinterval of an interval
               → Holds(p, T)    e.g., "John was sick for a day."

Events hold only over an interval and not over any subinterval of it.
           → Occurs(e, T)    e.g., "Mary wrote a letter this afternoon."

Processes hold over some subintervals of the interval they occur in.
           → Occurring(p, T)    e.g., "Mary is writing a letter today."
          Desiderata for Temporal
          Specification Language


Linguistic Expressiveness:
    Tense and Aspect
    Aspectual Classes
    Anchoring relations
    Ordering relations
    Temporal reference and reasoning
Design Features
    XML
    Open architecture
    Compliant with other annotation schemes
    Transparent Semantics




        Temporal Expression Types
1.   Times
        3 o’clock
        mid-morning
2.   Dates
        Fully Specified
        –    June 11, 1989
        –    Summer, 2002
        Underspecified
        –    Monday
        –    Next month
        –    Last year
        –    Two days ago
3.   Durations
        Three months
        Two years
4.   Sets
        Every month
        Every Tuesday
                      Events
Can have temporal a/o spatial locations
Can have types
   assassinations, bombings, joint ventures, etc.
Can have people a/o other objects as participants
Can be hypothetical
May not have happened…
   Different Notions of Events
Topic: ―well-defined subject‖ for searching
   document- or collection-level
Template: structure with slots for participant named entities
   document-level
Mention: linguistic expression that expresses an underlying
event
   phrase-level (verb/noun)
       TIMEX3 Annotation Schema
Time Points
  <TIMEX3 tid="t1" type="TIME" value="T24:00">midnight</TIMEX3>
  <TIMEX3 tid="t2" type="DATE" value="2005-02-15" temporalFunction="TRUE"
    anchorTimeID="t0">tomorrow</TIMEX3>

Durations
  <TIMEX3 tid="t6" type="DURATION" value="P2W" beginPoint="t61" endPoint="t62">two
    weeks</TIMEX3> from <TIMEX3 tid="t61" type="DATE" value="2003-06-07">June 7,
    2003</TIMEX3>
  <TIMEX3 tid="t62" type="DATE" value="2003-06-21" temporalFunction="true"
    anchorTimeID="t6"/>

Sets
  <TIMEX3 tid="t1" type="SET" value="P1M" quant="EVERY" freq="P3D">
    three days every month</TIMEX3>
  <TIMEX3 tid="t1" type="SET" value="P1M" freq="P2X">
    twice a month</TIMEX3>
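
Since TIMEX3 is plain XML, annotated text can be inspected with standard tools. The sketch below (illustrative only, with quotes normalized) pulls the attributes out of the duration example above using Python's ElementTree.

# Sketch: extract TIMEX3 attributes from a small TimeML fragment
# using the standard library; the fragment mirrors the example above.
import xml.etree.ElementTree as ET

fragment = """<s>
  <TIMEX3 tid="t6" type="DURATION" value="P2W"
          beginPoint="t61" endPoint="t62">two weeks</TIMEX3> from
  <TIMEX3 tid="t61" type="DATE" value="2003-06-07">June 7, 2003</TIMEX3>
</s>"""

root = ET.fromstring(fragment)
for timex in root.iter("TIMEX3"):
    print(timex.get("tid"), timex.get("type"), timex.get("value"), "->", timex.text)
# t6 DURATION P2W -> two weeks
# t61 DATE 2003-06-07 -> June 7, 2003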
Example TIMEX3 Markup
  <TIMEX3 tid="t1" type="TIME" value="T24:00">midnight</TIMEX3>
Example TIMEX3 Markup
  <TIMEX3 tid="t1" type="SET" freq="P2X" value="P1M">twice a month</TIMEX3>
             Events and Times
Event expressions:
   tensed verbs; has left, was captured, will resign;
   stative adjectives; sunken, stalled, on board;
   event nominals; merger, Military Operation, lecture;
Dependencies between events and times:
   Anchoring; John left on Monday.
   Orderings; The party happened after midnight.
   Embedding; John said Mary left.
            Features of TimeML
Extends TIMEX2 annotation;
    Temporal Functions: three years ago
    Anchors to events and other temporal expressions: three years after
    the Gulf War
Identifies signals determining interpretation of temporal expressions;
    Temporal Prepositions: for, during, on, at;
    Temporal Connectives: before, after, while.
Identifies event expressions;
    tensed verbs; has left, was captured, will resign;
    stative adjectives; sunken, stalled, on board;
    event nominals; merger, Military Operation, Gulf War;
Creates dependencies between events and times:
    Anchoring; John left on Monday.
    Orderings; The party happened after midnight.
    Embedding; John said Mary left.




  Representing Time and Events in Text

AP-NR-08-15-90 1337EDT

Iraq's Saddam Hussein, facing U.S. and Arab troops at the Saudi
border, today sought peace on another front by promising to
withdraw from Iranian territory and release soldiers captured
during the Iran-Iraq war. Also today, King Hussein of Jordan arrived in
Washington seeking to mediate the Persian Gulf crisis. President Bush on
Tuesday said the United States may extend its naval quarantine to Jordan's
Red Sea port of Aqaba to shut off Iraq's last unhindered trade route.

In another mediation effort, the Soviet Union said today it had
sent an envoy to the Middle East on a series of stops to include
Baghdad. Soviet officials also said Soviet women, children and
invalids would be allowed to leave Iraq.
               Temporal Expressions
[Same AP passage as above, with the temporal expressions highlighted in the
original slide.]
              TimeML Event Classes
Occurrence:
    die, crash, build, merge, sell, take advantage of, ..
State:
    Be on board, kidnapped, recovering, love, ..
Reporting:
    say, report, announce, ...
I-Action:
    attempt, try, promise, offer
I-State:
    Believe, intend, want, …
Aspectual:
    begin, start, finish, stop, continue.
Perception:
    See, hear, watch, feel.
                         Event Expressions
[Same AP passage as above, with the event expressions highlighted in the
original slide.]
                        Links
Temporal Relations
   Anchoring to a Time
   Ordering between Time and Events
Aspectual Relations
   Phases of an event
Subordinating Relations
   Events that syntactically subordinate other events
                             TLINK
            Temporal Anchoring/Ordering
Simultaneous (happening at the same time)
Identical (referring to the same event):
        John drove to Boston. During his drive he ate a donut.
Before the other:
        In six of the cases suspects have already been arrested.
After the other:
Immediately before the other:
        All passengers died when the plane crashed into the mountain.
Immediately after the other:
Including the other:
        John arrived in Boston last Thursday.
Being included in the other:
Exhaustively during the duration of the other:
        John taught for 20 minutes.
Beginning of the other:
        John was in the gym between 6:00 p.m. and 7:00 p.m.
Begun by the other:
Ending of the other:
        John was in the gym between 6:00 p.m. and 7:00 p.m.
Ended by the other:
                     Linking Timex to Timex
AP-NR-08-15-90 1337EDT

Iraq's Saddam Hussein, facing U.S. and Arab troops at the Saudi
border, today sought peace on another front by promising to
withdraw from Iranian territory and release soldiers captured
during the Iran-Iraq war. Also today, King Hussein of Jordan arrived in
Washington seeking to mediate the Persian Gulf crisis. President Bush on
Tuesday said the United States may extend its naval quarantine to Jordan's
Red Sea port of Aqaba to shut off Iraq's last unhindered trade route.

In another mediation effort, the Soviet Union said today it had
sent an envoy to the Middle East on a series of stops to include
Baghdad. Soviet officials also said Soviet women, children and
invalids would be allowed to leave Iraq.
                 Anchoring Event to Timex
AP-NR-08-15-90 1337EDT

Iraq's Saddam Hussein, facing U.S. and Arab troops at the Saudi
border, today sought peace on another front by promising to
withdraw from Iranian territory and release soldiers captured
during the Iran-Iraq war. Also today, King Hussein of Jordan arrived in
Washington seeking to mediate the Persian Gulf crisis. President Bush on
Tuesday said the United States may extend its naval quarantine to Jordan's
Red Sea port of Aqaba to shut off Iraq's last unhindered trade route.

In another mediation effort, the Soviet Union said today it had
sent an envoy to the Middle East on a series of stops to include
Baghdad. Soviet officials also said Soviet women, children and
invalids would be allowed to leave Iraq.
                                    Ordering Events
[Same AP passage as above; the original slide placed its events on a
timeline:]

Past:              war(Iran, Iraq), captured(soldiers)
Tuesday:           say(Bush, ...)
Today:             arrive(King Hussein, DC), seek(Saddam, peace)
Indefinite future: withdraw(Saddam), release(Saddam, soldiers),
                   extend(US, quarantine)
(ordered as Past < Tuesday < Today < Indefinite future)
                            ALINK

ALINK, or Aspectual Link, represents the relationship between an
   aspectual event and its argument event. Examples of the possible
   aspectual relations encoded are:

1. Initiation:
    John started to read.
2. Culmination:
    John finished assembling the table.
3. Termination:
    John stopped talking.
4. Continuation:
    John kept talking.
                              Aspectual Links
President Bush today denounced Saddam's ``ruinous policies of war,'' and
said the United States is ``striking a blow for the principle that might does
not make right.''

In a speech delivered at the Pentagon, Bush seemed to suggest
that American forces could be in the gulf region for some time.
``No one should doubt our staying power or determination,'' he said.

The U.S. military buildup in Saudi Arabia continued at fever pace, with Syrian
troops now part of a multinational force camped out in the desert to guard the
 Saudi kingdom from any new thrust by Iraq.

In a letter to President Hashemi Rafsanjani of Iran, read by a broadcaster over
 Baghdad radio, Saddam said he will begin withdrawing troops from Iranian
territory a week from tomorrow and release Iranian prisoners of war.

Iran said an Iraqi diplomatic delegation was en route to Tehran to deliver
 Saddam's message, which it said it would review ``with optimism.''
                               SLINK
SLINK or Subordination Link is used for contexts introducing relations
between two events, or an event and a signal, of the following sort:

Factive: Certain verbs introduce an entailment (or presupposition) of the
argument's veracity. They include forget in the tensed complement, regret,
manage:
         John forgot that he was in Boston last year.
         Mary regrets that she didn't marry John.
Counterfactive: The event introduces a presupposition about the non-
veracity of its argument: forget (to), unable to (in past tense), prevent,
cancel, avoid, decline, etc.
         John forgot to buy some wine.
         John prevented the divorce.
Evidential: Evidential relations are introduced by REPORTING or
PERCEPTION:
         John said he bought some wine.
         Mary saw John carrying only beer.
Negative evidential: Introduced by REPORTING (and PERCEPTION?) events
conveying negative polarity:
         John denied he bought only beer.
Conditional: Introduced when a conditional is present:
         If Mary leaves today, John will leave tomorrow.
                       Subordinated Links
[Same AP passage as above, with the subordinating contexts highlighted in
the original slide.]
                    SLINK: Reported Speech
[Same AP passage as above, with the reported-speech contexts ("said ...")
highlighted in the original slide.]
   Evaluation of Answer Type Hierarchy

This evaluation was done in 2001
Controlled the variation of the number of WordNet synsets
included in the answer type hierarchy.
Test on 800 TREC questions.


           Hierarchy         Precision score
           coverage          (50-byte answers)

             0%                     0.296
             3%                     0.404
            10%                     0.437
            25%                     0.451
            50%                     0.461




 The derivation of the answer type is the main source of
 unrecoverable errors in the QA system
         Keyword Selection

The Answer Type indicates what the question is looking
for, but it provides insufficient context to locate the
answer in a very large document collection.
Lexical terms (keywords) from the question, possibly
expanded with lexical/semantic variations, provide the
required context.
         Lexical Terms Extraction
Questions are approximated by sets of unrelated words
(lexical terms)
Similar to bag-of-words IR models

Question (from TREC QA track)                    Lexical terms
Q002: What was the monetary value of the Nobel   monetary, value, Nobel,
      Peace Prize in 1989?                       Peace, Prize
Q003: What does the Peugeot company              Peugeot, company,
      manufacture?                               manufacture
Q004: How much did Mercury spend on              Mercury, spend, advertising,
      advertising in 1993?                       1993
Q005: What is the name of the managing           name, managing, director,
      director of Apricot Computer?              Apricot, Computer
     Keyword Selection Algorithm
1.   Select   all non-stopwords in quotations
2.   Select   all NNP words in recognized named entities
3.   Select   all complex nominals with their adjectival modifiers
4.   Select   all other complex nominals
5.   Select   all nouns with adjectival modifiers
6.   Select   all other nouns
7.   Select   all verbs
8.   Select   the AT word (which was skipped in all previous steps)
 Keyword Selection Examples
What researcher discovered the vaccine against Hepatitis-B?
   Hepatitis-B, vaccine, discover, researcher


What is the name of the French oceanographer who owned Calypso?
   Calypso, French, own, oceanographer


What U.S. government agency registers trademarks?
   U.S., government, trademarks, register, agency


What is the capital of Kosovo?
   Kosovo, capital
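
As a rough illustration (not the original implementation), here is a Python sketch of the heuristic ordering above, run over a toy POS-tagged question; the tag set, stopword list, and collapsing of heuristics 2-6 are simplifying assumptions.

# Sketch of the keyword-selection priorities, over a pre-tagged question.
STOPWORDS = {"what", "is", "the", "of", "a", "an"}

def select_keywords(tagged_question, at_word=None):
    """tagged_question: list of (word, POS) pairs; at_word: the answer-type
    word detected earlier, which the first heuristics skip."""
    keywords = []
    # heuristics 2-6 (collapsed here): proper nouns and common nouns
    keywords += [w for w, t in tagged_question
                 if t in ("NNP", "NN") and w.lower() not in STOPWORDS
                 and w != at_word]
    # heuristic 7: verbs
    keywords += [w for w, t in tagged_question if t.startswith("VB")]
    # heuristic 8: finally add the AT word itself
    if at_word:
        keywords.append(at_word)
    return keywords

q = [("What", "WP"), ("researcher", "NN"), ("discovered", "VBD"),
     ("the", "DT"), ("vaccine", "NN"), ("against", "IN"),
     ("Hepatitis-B", "NNP")]
print(select_keywords(q, at_word="researcher"))
# ['vaccine', 'Hepatitis-B', 'discovered', 'researcher']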
                      Passage Retrieval
[Same pipeline diagram as before, with the Passage Retrieval stage
highlighted: keywords from Question Processing drive retrieval over the
document collection, and the resulting passages are extracted and ranked
using surface-text techniques before being passed to Answer Extraction.]
    Passage Extraction Loop
Passage Extraction Component
  Extracts passages that contain all selected keywords
  Passage size dynamic
  Start position dynamic
Passage quality and keyword adjustment
  In the first iteration, use the first 6 keyword selection
  heuristics
  If the number of passages is lower than a threshold:
    the query is too strict → drop a keyword
  If the number of passages is higher than a threshold:
    the query is too relaxed → add a keyword
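
A minimal Python sketch of this add/drop loop, assuming a retrieve(keywords) function that stands in for the boolean passage-retrieval call; the thresholds and round limit are illustrative, not the system's values.

# Sketch of the keyword adjustment loop.
MIN_PASSAGES, MAX_PASSAGES = 10, 500     # illustrative thresholds

def retrieve_passages(keywords, retrieve, max_rounds=10):
    keywords = list(keywords)
    dropped = []
    passages = retrieve(keywords)
    for _ in range(max_rounds):
        if len(passages) < MIN_PASSAGES and len(keywords) > 1:
            dropped.append(keywords.pop())    # too strict -> drop a keyword
        elif len(passages) > MAX_PASSAGES and dropped:
            keywords.append(dropped.pop())    # too relaxed -> add one back
        else:
            break
        passages = retrieve(keywords)
    return passages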
        Passage Retrieval Architecture
[Architecture diagram] Keywords feed a Keyword Adjustment module; Passage
Extraction pulls passages from the documents returned by Document Retrieval;
a Passage Quality check then decides: if quality is insufficient (No), the
keywords are adjusted and extraction is repeated; if sufficient (Yes), the
passages go through Passage Scoring and Passage Ordering, producing ranked
passages.
               Passage Scoring
Passages are scored based on keyword windows.
For example, if a question has the set of keywords {k1, k2, k3, k4}, and in
a passage k1 and k2 are matched twice, k3 is matched once, and k4 is not
matched, four windows are built, one for each combination of the matched
occurrences of k1 and k2 (each also covering the single match of k3).
[Diagram: Windows 1-4 over the passage.]
            Passage Scoring

Passage ordering is performed using a radix sort that
involves three scores:
  SameWordSequenceScore (largest)
    – Computes the number of words from the question that
      are recognized in the same sequence in the window
  DistanceScore (largest)
    – The number of words that separate the most distant
      keywords in the window
  MissingKeywordScore (smallest)
    – The number of unmatched keywords in the window
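
The sketch below (an approximation, not the system's code) computes the three window scores and orders windows accordingly; each window is assumed to carry its matched words and their token positions.

# Sketch of window scoring and radix-style ordering.
def same_word_sequence_score(question_kws, window_words):
    """Greedy count of question keywords appearing in the window in the
    same relative order (approximation of SameWordSequenceScore)."""
    score, qi = 0, 0
    for w in window_words:
        if qi < len(question_kws) and w == question_kws[qi]:
            score += 1
            qi += 1
    return score

def window_scores(question_kws, window_words, positions):
    sws = same_word_sequence_score(question_kws, window_words)
    distance = max(positions) - min(positions)   # span of matched keywords
    missing = len(set(question_kws) - set(window_words))
    return sws, distance, missing

def rank_windows(question_kws, windows):
    """windows: list of dicts with 'words' (matched keywords, in passage
    order) and 'positions' (their token offsets)."""
    def key(w):
        sws, dist, missing = window_scores(question_kws, w["words"], w["positions"])
        # ordering as on the slide: largest SameWordSequenceScore, then
        # largest DistanceScore, then smallest MissingKeywordScore
        return (-sws, -dist, missing)
    return sorted(windows, key=key)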
                    Answer Extraction
[Same pipeline diagram as before, with the Answer Extraction stage
highlighted: the ranked passages from Passage Retrieval are analyzed with
NL techniques (WordNet, a parser, and NER) to extract and rank candidate
answers.]
        Ranking Candidate Answers

    Q066: Name the first private citizen to fly in space.


   Answer type: Person
   Text passage:
       “Among them was Christa McAuliffe, the first private
       citizen to fly in space. Karen Allen, best known for her
       starring role in “Raiders of the Lost Ark”, plays McAuliffe.
       Brian Kerwin is featured as shuttle pilot Mike Smith...”


   Best candidate answer: Christa McAuliffe
            Features for Answer Ranking

       relNMW number of question terms matched in the answer passage
       relSP number of question terms matched in the same phrase as the
              candidate answer
       relSS number of question terms matched in the same sentence as the
              candidate answer
       relFP flag set to 1 if the candidate answer is followed by a punctuation
              sign
       relOCTW number of question terms matched, separated from the candidate
               answer by at most three words and one comma
       relSWS number of terms occurring in the same order in the answer
               passage as in the question
       relDTW average distance from candidate answer to question term matches




SIGIR '01
       Answer Ranking based on
          Machine Learning
Relative relevance score computed for each pair of
candidates (answer windows)
relPAIR = wSWS × relSWS + wFP × relFP + wOCTW × relOCTW + wSP × relSP
         + wSS × relSS + wNMW × relNMW + wDTW × relDTW + threshold
   If relPAIR is positive, the first candidate of the pair is more
   relevant
Perceptron model used to learn the weights
Scores in the 50% MRR for short answers, in the 60%
MRR for long answers
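
One plausible reading of this pairwise model, sketched in Python: the rel* features of the two candidates are compared as differences, and a standard perceptron update learns the weights and threshold. The feature names follow the previous slide; everything else (learning rate, exact update rule) is an assumption, not the published implementation.

# Sketch of pairwise candidate ranking with a perceptron.
FEATURES = ["SWS", "FP", "OCTW", "SP", "SS", "NMW", "DTW"]

def rel_pair(cand_a, cand_b, weights, threshold):
    """Positive score -> candidate A judged more relevant than B.
    cand_a / cand_b map feature names to their rel* values."""
    score = threshold
    for f in FEATURES:
        score += weights[f] * (cand_a[f] - cand_b[f])
    return score

def perceptron_update(weights, threshold, cand_a, cand_b, a_is_better, lr=0.1):
    """One perceptron step on a labelled candidate pair (sketch)."""
    target = 1.0 if a_is_better else -1.0
    if rel_pair(cand_a, cand_b, weights, threshold) * target <= 0:
        for f in FEATURES:                       # misclassified pair
            weights[f] += lr * target * (cand_a[f] - cand_b[f])
        threshold += lr * target
    return weights, threshold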
                  Evaluation on the Web
   - test on 350 questions from TREC (Q250-Q600)
   - extract 250-byte answers

                               Google   Answer extraction   AltaVista   Answer extraction
                                        from Google                     from AltaVista
Precision score                 0.29         0.44              0.15          0.37
Questions with a correct        0.44         0.57              0.27          0.45
answer among top 5
returned answers
   Can we make this simpler?
One reason systems became so complex is that they have to pick out
one sentence within a small collection
    The answer is likely to be stated in a hard-to-recognize manner.
Alternative Idea:
    What happens with a much larger collection?
    The web is so huge that you're likely to see the answer stated in a
    form similar to the question
Goal: make the simplest possible QA system by exploiting this
redundancy in the web
    Use this as a baseline against which to compare more elaborate
    systems.
    The next slides based on:
     – Web Question Answering: Is More Always Better? Dumais, Banko, Brill, Lin, Ng, SIGIR'02
     – An Analysis of the AskMSR Question-Answering System, Brill, Dumais, and Banko,
       EMNLP'02.
AskMSR System Architecture
[System diagram: the five numbered steps described on the following slides:
(1) rewrite the question, (2) query the search engine, (3) gather n-grams,
(4) filter n-grams, (5) tile the answers.]
Step 1: Rewrite the questions

Intuition: The user‘s question is often syntactically
quite close to sentences that contain the answer

   Where is the Louvre Museum located?

   The Louvre Museum is located in Paris

   Who created the character of Scrooge?

   Charles Dickens created the character of Scrooge.
                    Query rewriting
Classify question into seven categories

          Who is/was/are/were…?
          When is/did/will/are/were …?
          Where is/are/were …?

a. Hand-crafted category-specific transformation rules
      e.g.: For Where questions, move 'is' to all possible locations and
      look to the right of the query terms for the answer.
         "Where is the Louvre Museum located?"
          → "is the Louvre Museum located"
          → "the is Louvre Museum located"
          → "the Louvre is Museum located"
          → "the Louvre Museum is located"
          → "the Louvre Museum located is"
      (Some rewrites are nonsense, but that's OK: it's only a few more
      queries to the search engine.)

b. Expected answer "Datatype" (e.g., Date, Person, Location, ...)
      When was the French Revolution? → DATE
         Query Rewriting - weighting

    Some query rewrites are more reliable than others.

       Where is the Louvre Museum located?

       Weight 5:  +"the Louvre Museum is located"
                  (if a match, probably right)
       Weight 1:  +Louvre +Museum +located
                  (lots of non-answers could come back too)
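
A small Python sketch of the Where-question rewrite with per-rewrite weights; the weight values and the back-off conjunction query mirror the example above but are otherwise illustrative.

# Sketch: generate weighted rewrites for a "Where is ...?" question.
def where_rewrites(question):
    words = question.rstrip("?").split()
    assert words[0].lower() == "where" and words[1] in ("is", "are", "was", "were")
    verb, rest = words[1], words[2:]
    rewrites = []
    # move the copula into every position of the remaining words
    for i in range(len(rest) + 1):
        candidate = rest[:i] + [verb] + rest[i:]
        rewrites.append(('"' + " ".join(candidate) + '"', 5))  # exact phrase, high weight
    # back-off: plain conjunction of the content words, low weight
    rewrites.append(("+" + " +".join(rest), 1))
    return rewrites

for query, weight in where_rewrites("Where is the Louvre Museum located?"):
    print(weight, query)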
 Step 2: Query search engine

Send all rewrites to a Web search engine
Retrieve top N answers (100-200)
For speed, rely just on search engine‘s ―snippets‖,
not the full text of the actual document
  Step 3: Gathering N-Grams

Enumerate all N-grams (N=1,2,3) in all retrieved snippets
Weight of an n-gram: occurrence count, each weighted by
―reliability‖ (weight) of rewrite rule that fetched the document
    Example: “Who created the character of Scrooge?”

     Dickens                  117
     Christmas Carol           78
     Charles Dickens           75
     Disney                    72
     Carl Banks                54
     A Christmas               41
     Christmas Carol           45
     Uncle                     31
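
A minimal sketch of this weighted n-gram mining, assuming each snippet arrives paired with the weight of the rewrite rule that fetched it; the sample snippets are invented for illustration.

# Sketch: count 1/2/3-grams over snippets, weighted by rewrite reliability.
from collections import Counter

def mine_ngrams(weighted_snippets, max_n=3):
    scores = Counter()
    for snippet, weight in weighted_snippets:
        tokens = snippet.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                scores[" ".join(tokens[i:i + n])] += weight
    return scores

snippets = [("Charles Dickens created the character of Scrooge", 5),
            ("Scrooge appears in A Christmas Carol by Charles Dickens", 1)]
print(mine_ngrams(snippets).most_common(3))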
   Step 4: Filtering N-Grams

Each question type is associated with one or more
"data-type filters" (regular expressions), e.g.:
   When...  → Date
   Where... → Location
   Who...   → Person
   What...  → (several possible types)
Boost the score of n-grams that match the regexp
Lower the score of n-grams that don't match the regexp
Details omitted from the paper…
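
A sketch of such filters; the regular expressions and the boost/penalty factors below are illustrative stand-ins, not the values used in AskMSR.

# Sketch: boost n-grams matching the expected data type, penalize the rest.
import re

FILTERS = {
    "When": re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b"),   # looks like a year
    "How much": re.compile(r"\$?\d[\d,.]*"),             # looks like a quantity
}

def apply_filter(question_type, ngram_scores, boost=2.0, penalty=0.5):
    pattern = FILTERS.get(question_type)
    if pattern is None:
        return ngram_scores
    return {ng: s * (boost if pattern.search(ng) else penalty)
            for ng, s in ngram_scores.items()}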
       Step 5: Tiling the Answers
[Tiling diagram] Tile the highest-scoring n-gram with any overlapping
lower-scoring n-gram; the merged n-gram gets the combined score and the old
n-grams are discarded. Repeat until no more overlap.
   Scores: "Charles Dickens" 20, "Dickens" 15, "Mr Charles" 10
   → merged answer "Mr Charles Dickens" with score 45
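
A greedy Python sketch of the tiling step, reproducing the example above; the overlap test and the score combination are simplifications, not the exact AskMSR procedure.

# Sketch: greedily merge overlapping n-grams, starting from the best one.
def overlap_merge(a, b):
    """Return the tiled string if a's suffix overlaps b's prefix (or a
    already contains b), else None."""
    ta, tb = a.split(), b.split()
    if " ".join(tb) in " ".join(ta):
        return a
    for k in range(min(len(ta), len(tb)), 0, -1):
        if ta[-k:] == tb[:k]:
            return " ".join(ta + tb[k:])
    return None

def tile(candidates):
    """candidates: dict mapping n-gram -> score."""
    cands = dict(candidates)
    merged = True
    while merged and len(cands) > 1:
        merged = False
        best = max(cands, key=cands.get)
        for other in list(cands):
            if other == best:
                continue
            tiled = overlap_merge(best, other) or overlap_merge(other, best)
            if tiled:
                score = cands.pop(best) + cands.pop(other)
                cands[tiled] = max(score, cands.get(tiled, 0))
                merged = True
                break
    return max(cands, key=cands.get)

print(tile({"Charles Dickens": 20, "Dickens": 15, "Mr Charles": 10}))
# Mr Charles Dickens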
                    Results
Standard TREC contest test-bed (TREC 2001):
    ~1M documents; 900 questions
  Technique doesn't do too well (though it would have
  placed in the top 9 of ~30 participants)
   – MRR: strict: .34
   – MRR: lenient: .43
   – 9th place
                  Results
From the EMNLP'02 paper
  MRR of 0.577; answers 61% of questions correctly
  Would be near the top of TREC-9 runs
Breakdown of feature contribution: [table not reproduced]
                     Issues

Works best/only for "Trivial Pursuit"-style fact-based
questions
Limited/brittle repertoire of
   question categories
   answer data types/filters
   query rewriting rules
     Intermediate Approach:
    Surface pattern discovery
Based on:
   Ravichandran, D. and Hovy E.H. Learning Surface Text Patterns for a Question
   Answering System, ACL’02
   Hovy, et al., Question Answering in Webclopedia, TREC-9, 2000.


Use of Characteristic Phrases
"When was <person> born‖
   Typical answers
     – "Mozart was born in 1756.”
     – "Gandhi (1869-1948)...”
   Suggests regular expressions to help locate correct answer
     – "<NAME> was born in <BIRTHDATE>”
     – "<NAME> ( <BIRTHDATE>-”
        Use Pattern Learning
Examples:
  “The great composer Mozart (1756-1791) achieved
  fame at a young age”
  “Mozart (1756-1791) was a genius”
  “The whole world would always be indebted to the
  great music of Mozart (1756-1791)”
Longest matching substring for all 3 sentences is
"Mozart (1756-1791)"
  Suffix tree would extract "Mozart (1756-1791)" as an
  output, with score of 3
Reminiscent of IE pattern learning
    Pattern Learning (cont.)

Repeat with different examples of same question type
  “Gandhi 1869”, “Newton 1642”, etc.


Some patterns learned for BIRTHDATE
  a. born in <ANSWER>, <NAME>
  b. <NAME> was born on <ANSWER> ,
  c. <NAME> ( <ANSWER> -
  d. <NAME> ( <ANSWER> - )
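
A rough sketch of the pattern-learning idea, using difflib's longest-common-substring in place of a suffix tree; the three Mozart sentences are the examples above, and replacing the known name/answer with placeholders is a simplification of the real abstraction step.

# Sketch: learn a BIRTHDATE surface pattern from example sentences.
from difflib import SequenceMatcher

def abstract(sentence, name, answer):
    return sentence.replace(name, "<NAME>").replace(answer, "<ANSWER>")

def common_pattern(s1, s2):
    m = SequenceMatcher(None, s1, s2).find_longest_match(0, len(s1), 0, len(s2))
    return s1[m.a:m.a + m.size]

examples = [
    ("The great composer Mozart (1756-1791) achieved fame at a young age", "Mozart", "1756-1791"),
    ("Mozart (1756-1791) was a genius", "Mozart", "1756-1791"),
    ("The whole world would always be indebted to the great music of Mozart (1756-1791)", "Mozart", "1756-1791"),
]

abstracted = [abstract(s, n, a) for s, n, a in examples]
pattern = abstracted[0]
for s in abstracted[1:]:
    pattern = common_pattern(pattern, s)
print(pattern)   # '<NAME> (<ANSWER>)' -- shared by all three examples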
            QA Typology from ISI
Typology of typical question forms: 94 nodes (47 leaf nodes)
Analyzed 17,384 questions (from answers.com)

[Excerpt of the typology, flattened from a two-column slide:
 THING: AGENT (NAME: female/male first names, company names, ...),
   ANIMAL-HUMAN (ANIMAL, PERSON), ORGANIZATION, GROUP-OF-PEOPLE,
   STATE-DISTRICT, CITY, COUNTRY
 PLACE: STATE-DISTRICT, CITY, COUNTRY, GEOLOGICAL-FORMATION, AIRPORT,
   COLLEGE, CAPITOL, ...
 ABSTRACT: LANGUAGE, QUANTITY (NUMERICAL-, INFORMATION-, MASS-, MONETARY-,
   TEMPORAL-, ENERGY-, TEMPERATURE-, ILLUMINATION-, SPATIAL-QUANTITY,
   PERCENTAGE)
 UNIT: INFORMATION-, MASS-, ENERGY-, CURRENCY-, TEMPORAL-, TEMPERATURE-,
   ILLUMINATION-, SPATIAL-UNIT, PERCENT
 TANGIBLE-OBJECT: FOOD, SUBSTANCE (LIQUID, SOLID, GAS), INSTRUMENT,
   BODY-PART, MUSICAL-INSTRUMENT, GARMENT, PLANT, DISEASE]
               Experiments

6 different question types
   from Webclopedia QA Typology
    –   BIRTHDATE
    –   LOCATION
    –   INVENTOR
    –   DISCOVERER
    –   DEFINITION
    –   WHY-FAMOUS
Experiments: pattern precision

BIRTHDATE:
   1.0       <NAME> ( <ANSWER> - )
   0.85      <NAME> was born on <ANSWER>,
   0.6       <NAME> was born in <ANSWER>
   0.59      <NAME> was born <ANSWER>
   0.53      <ANSWER> <NAME> was born
   0.50      - <NAME> ( <ANSWER>
   0.36      <NAME> ( <ANSWER> -
INVENTOR
   1.0       <ANSWER> invents <NAME>
   1.0       the <NAME> was invented by <ANSWER>
   1.0       <ANSWER> invented the <NAME> in
         Experiments (cont.)

DISCOVERER
  1.0     when <ANSWER> discovered <NAME>
  1.0     <ANSWER>'s discovery of <NAME>
  0.9     <NAME> was discovered by <ANSWER> in
DEFINITION
  1.0     <NAME> and related <ANSWER>
  1.0     form of <ANSWER>, <NAME>
  0.94    as <NAME>, <ANSWER> and
          Experiments (cont.)
WHY-FAMOUS
   1.0      <ANSWER> <NAME> called
   1.0      laureate <ANSWER> <NAME>
   0.71     <NAME> is the <ANSWER> of
LOCATION
   1.0      <ANSWER>'s <NAME>
   1.0      regional : <ANSWER> : <NAME>
   0.92     near <NAME> in <ANSWER>
Depending on question type, get high MRR (0.6–0.9), with
higher results from use of Web than TREC QA collection
  Shortcomings & Extensions

Need for POS &/or semantic types
    – "Where are the Rocky Mountains?”
    – "Denver's new airport, topped with white fiberglass
      cones in imitation of the Rocky Mountains in the
      background , continues to lie empty”
    – <NAME> in <ANSWER>
NE tagger &/or ontology could enable system to
determine "background" is not a location
      Shortcomings... (cont.)

Long distance dependencies
   "Where is London?”
     "London, which has one of the busiest airports in the world,
      lies on the banks of the river Thames”
   would require pattern like:
   <QUESTION>, (<any_word>)*, lies on <ANSWER>
   Abundance & variety of Web data helps system to find an
   instance of patterns w/o losing answers to long distance
   dependencies
      Shortcomings... (cont.)
System currently has only one anchor word
   Doesn't work for Q types requiring multiple words from
   question to be in answer
     – "In which county does the city of Long Beach lie?”
     – "Long Beach is situated in Los Angeles County”
     – required pattern:
       <Q_TERM_1> is situated in <ANSWER> <Q_TERM_2>
Does not use case
     – "What is a micron?”
     – "...a spokesman for Micron, a maker of semiconductors, said
       SIMMs are..."
If "Micron" had been capitalized in the question, this would have been a
perfect answer
         Question Answering
Today:
   Introduction to QA
   A typical full-fledged QA system
   A very simple system, in response to this
   An intermediate approach
Wednesday:
   Using external resources
    – WordNet
     – Encyclopedias, Gazetteers
   Incorporating a reasoning system
   Machine Learning of mappings
   Alternative question types

				