					 CSE 574 – Artificial Intelligence II (NLP)
EE 517 – Statistical Language Processing

      Prof. Luke Zettlemoyer (CSE)
        Prof. Mari Ostendorf (EE)

       [Numerous slides adapted from Regina Barzilay]
            3 Jan -- Overview
•   Course structure
•   Natural language processing (NLP)
•   Syllabus overview
•   Focus on statistical (data-driven) methods
•   Issues in corpus-based work
  CSE/EE Course Combination
• Why the course merger?
  – We’re accidentally teaching the same topics
  – Want to develop one that can be cross-listed
• Complication:
  – EE course has 4 units, CSE has 3
  – Solution: extra 1 hr per week paper discussion
    (required for EE, optional for CSE)
• Grading and project advising will be
  handled by faculty member in your dept
                     Course Info
• Web page:
• Schedule
   – MW 1:30-2:50 lecture, T 4:30-5:20 discussion
   – Finals week project presentation
• Goals:
   – Understand theoretical foundation of key algorithms
   – Gain practical experience with system & experiment design
   – Build technical communication skills related to NLP
• Book – several resources provided, not required but
  highly recommended
          Course Info (cont.)
• Expectations:
  – Computer labs: 40% CSE, 35% EE
    • 2 competitive labs on common data (language
      modeling, text classification)
    • 1 project-related lab (demonstrate feasibility)
  – Project: 60% CSE, 55% EE
    • Project proposal – week 4
    • Written report – week 10
    • Presentation – finals week
  – Paper discussions: 10% EE
   What is NLP? (from Google)
• Natural Language Processing
   – the branch of information science that deals with natural
     language information
   – instead of using Boolean logic, the user simply can type in a
     question as a query….
   – A range of computational techniques for analyzing and
     representing naturally occurring text (free text) at one or more
      levels of linguistic analysis (e.g., morphological, syntactic,
     semantic, pragmatic) for the purpose of achieving human-like
     language processing for knowledge-intensive ...
• Ignoring… Neuro-Linguistic Programming, National
  Labor Party, Nonlinear Programming, No-Longer
  Polymers, …
       What is NLP? (cont)
Computer processing of human language

   • NL Info Extraction & Understanding (language → computer)
   • NL Generation (computer → language)
   • NL Transformation (translation, paraphrasing; language → language)
        What is NLP? (cont)
• May be for a variety of needs
  – Human-computer interaction
  – Computer-mediated human-human interaction
  – Information management & mining
  – Computer-based education and training
• Different levels of work
  – Core sub-problems (linguistic analysis)
  – Applications (summarization, question
    answering, translation, dialog systems, …)
             NLP is AI-complete
• To solve every possible NLP problem
  – need to solve all of artificial intelligence
  – basis for the Turing test:
 Turing (1950): “I believe that in
 about fifty years' time it will be
 possible to programme computers,
 with a storage capacity of about
 10^9, to make them play the
 imitation game so well that an
 average interrogator will not have
 more than 70 per cent chance of
 making the right identification
 after five minutes of questioning.”
   – luckily, we don’t need to
     solve it all in one step...
              Information Extraction
Goal: Build database entries from text
      10TH DEGREE is a full service advertising agency specializing in
      direct and interactive marketing. Located in Irvine CA, 10TH DEGREE
      is looking for an Assistant Account Manager to help manage and
      coordinate interactive marketing initiatives for a marquee automotive
      account. Experience in online marketing, automotive and/or the
      advertising field is a plus. Assistant Account Manager Responsibilities
      Ensures smooth implementation of programs and initiatives Helps
      manage the delivery of projects and key client deliverables ...
      Compensation: $50,000-$80,000

               INDUSTRY           Advertising
               POSITION           Assistant Account Manager
               LOCATION           Irvine, CA
               COMPANY            10th Degree
          Question Answering
Goal: Provide structured answers to user queries
          Machine Translation

One of the oldest NLP problems, started with code breaking
techniques in the 1950s
           Dialog Systems
Goal: Participate in (goal-driven) conversations


One of the early NLP applications for AI
researchers (SHRDLU: Winograd, 72)
Deciphering lost languages: Ugaritic
  Knowledge engineering bottleneck

We need:
 • Knowledge about language
 • Knowledge about the world
Possible solutions:
 • Manual engineering approach:
   encode all the required
   information into computer
 • Statistical / ML approach: infer
   language properties from
   examples of language use
         NLP History: pre-stat. / ML
“(1) Colorless green ideas sleep furiously.
  (2) Furiously sleep ideas green colorless.
It is fair to assume that neither sentence (1) nor (2) (nor indeed any
part of these sentences) had ever occurred in an English discourse.
Hence, in any statistical model for grammaticalness, these
sentences will be ruled out on identical grounds as equally "remote"
from English. Yet (1), though nonsensical, is grammatical, while (2)
is not.” (Chomsky 1957)

1970s and 1980s: non-statistical NLP
  • emphasis on deeper models, syntax
  • toy domains / manual grammars (SHRDLU, etc.)
  • weak empirical evaluation
     NLP: statistical / ML approaches
“Whenever I fire a linguist our system performance
improves. ” (Jelinek 1988)

1990s: The Empirical Revolution
 • Corpus-based methods yield the first generation of widely
     used NL tools (syntax, MT, ASR)
 •   Deep analysis is often traded for robust approximations
 •   Empirical evaluation is crucial

2000s: Richer linguistic representations
embedded in the statistical framework
            Topics Covered
• High level:
  – Foundational material
  – Important general statistical models (no
    linguistics required)
  – Core NLP sub-problems
  – Selected NLP applications
        Foundational Material
• Issues in corpus-based work (this week)
• Mathematical background: on your own!
  – Basic probability (pre-req, see also M&S 2.1)
  – Estimation & detection theory (key results
    reviewed in general model section)
  – Information theory (key concepts in M&S 2.2)
• Basic linguistics (later)
   Important General Methods
• Handling variable-length sequences
  – N-grams & extensions
  – Bag-of-words & extensions
  – HMMs, CRFs, …
• Models for vector observations
  – SVMs, log-linear models
  – Classification vs. reranking
• Intro to different learning strategies
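To give a concrete taste of the n-gram models listed above, here is a minimal maximum-likelihood bigram model in Python; the toy corpus is invented for illustration, and a real model would also need smoothing for unseen pairs:

```python
from collections import Counter

# Toy corpus; a real system would train on millions of words.
corpus = "the cat sat on the mat . the cat ate".split()

# Count unigrams and adjacent word pairs (bigrams).
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = c(w1, w2) / c(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigram_prob("the", "cat"))  # 2 of the 3 occurrences of "the" precede "cat"
```

Note that any bigram absent from training gets probability zero, which is exactly the sparsity problem the "extensions" (smoothing, backoff) address.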
         Core NLP Sub-Problems
•   Part of speech tagging
•   Sentiment Classification
•   Grammars and Parsing
•   Formal Semantics

         Selected NLP Applications
• Dialog systems
• Translation
 Why the Statistical Approach?
• Ambiguity in language (one text can have
  different meanings)
• Variability in language (one meaning can
  be expressed with different words)

 Need for “Ignorance Modeling”
           Why are these funny?
•   Iraqi Head Seeks Arms
•   Ban on Nude Dancing on Governor’s Desk
•   Juvenile Court to Try Shooting Defendant
•   Teacher Strikes Idle Kids
•   Stolen Painting Found by Tree
•   Kids Make Nutritious Snacks
•   Local HS Dropout Cut in Half
•   Hospitals Are Sued by 7 Foot Doctors
            (Example from Jurafsky & Martin)
• I made her duck.
• Possible interpretations of the text out of context:
   – I cooked waterfowl for her.
   – I cooked the waterfowl that is on the plate in front of her.
   – I created a toy (or decorative) waterfowl for her.
   – I caused her to quickly lower her head.
• Possible variations in spoken forms:
   –   I made HER duck. vs. I made her DUCK.
   –   I made her duck? (doubt, disbelief) vs. statement form
   –   Ai made her duck. (where “Ai” is a name of a person)
   –   A maid heard “uck”.
            (Example from Lillian Lee)
• “Finally a computer that understands you like
  your mother”
   – Possible interpretations?
   – Different syntax
     • understands [(that) you like your mother]
     • understands [you] [like your mother (does)]
  – Different word senses
     • Female parent; a source or origin; slimy substance
       added to cider or wine to make vinegar
  – Overall statement is vague / requires
    knowledge to understand
     • Does your mother understand you well, or poorly?
• Different ways of saying the same thing
  – The chicken crossed the road.
  – The road was crossed by the chicken.
  – The chicken has traversed the road.
  – Across the road went the chicken.
  – The daughter of the rooster made it to the other side
    of the street.
  – A chick- uh I mean the chicken you know like crossed
    the the road.
• Variation can involve syntactic or word choices
• Depends on modality, genre, topic, author, …
        Variation: Reading Level
• While the Portuguese Man o' War resembles a jellyfish, it is in
  fact a siphonophore - a colony of four kinds of minute, highly
  modified individuals, which are specialized polyps and
  medusoids. Each such zooid in these pelagic colonial
  hydroids or hydrozoans has a high degree of specialization
  and, although structurally similar to other cnidarians, are all
  attached to each other and physiologically integrated rather
  than living independently.
• The Portuguese Man o' War looks like a jellyfish, but it is
  really not. It is a siphonophore. This is a colony of four kinds
  of zooids. Zooids are very small, highly modified individuals.
  These zooids are structurally similar to other solitary animals,
  but the zooids do not live by themselves. Instead, they are
  attached to each other.
              Variation – Translation
•   Mr. Chang Jun Hsung, chairman of the Executive branch, expressed
    during a ceremony to celebrate the forming of the multi-party Association
    For … at the Legislative branch, that his idea is to use cooperation
    instead of confrontation because the political culture of confrontation and
    opposition of the past has cost a lot of people dearly.

•   Mr. Chang Jun Hsung, chairman of the Executive branch, announced
    today, during the celebration of the inaugural meeting of the multi-party
    Association For … at the Legislative branch, that he is replacing
    confrontation with cooperation because in the past the political culture of
    confrontation has cost a lot people.

•   Chairman Chang Jun Hsung of the Executive branch, while attending a
    ceremony to celebrate the forming of the multi-party Association For …
    at the Legislative branch, expressed that his idea is to replace
    confrontation with cooperation because the past political culture of
    confrontation and opposition has cost a lot people dearly.
• Variation impacts performance evaluation
  as well as system design
• For many problems, there may be more
  than one “correct” answer.
            Ignorance Modeling
• The basic idea:
   – acknowledge that you don’t yet have rules that account for all
     sources of variability/ambiguity
   – Allow for different alternatives in the model; use data-driven
     estimates to weigh the alternatives
• Examples:
   – From speech recognition: Gaussian mixture models for
     observation distributions can represent a range of pronunciations
     for a given word.
   – From language processing:
      • Probabilistic grammars do better than deterministic grammars at
        handling disfluencies
      • Grammar checking -- consider determiner case study, next
   Case Study: Determiner Placement
   Automatically place determiners: “a”, “the”, or null

 Scientists in United States have found way of turning lazy monkeys
 into workaholics using gene therapy. Usually monkeys work hard
 only when they know reward is coming, but animals given this
 treatment did their best all time. Researchers at National Institute of
 Mental Health near Washington DC, led by Dr Barry Richmond, have
 now developed genetic treatment which changes their work ethic
 markedly. "Monkeys under influence of treatment don't procrastinate,"
 Dr Richmond says. Treatment consists of anti-sense DNA - mirror
 image of piece of one of our genes - and basically prevents that gene
 from working. But for rest of us, day when such treatments fall into
 hands of our bosses may be one we would prefer to put off.
     How do we choose a determiner?
Largely determined by:
– Type of noun (countable, uncountable)
– Uniqueness of reference
– Information value (given, new)
– Number (singular, plural)

However, there are many exceptions and special cases:
– The definite article is used with newspaper titles (The Times),
but zero article in names of magazines and journals (Time)
– Highway names vary by region: I-5 vs. the I-5

Hard to manually encode this information!
        A Simple Statistical Approach

•   Collect a large collection of texts relevant to your
    domain (e.g. newspaper text)

•   For each noun seen during training, compute its
    probability to take a certain determiner

•   Given a new noun, select the determiner with the
    highest likelihood as estimated on the training corpus
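The counting recipe above can be sketched in a few lines; the (noun, determiner) training pairs here are invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical (noun, determiner) pairs extracted from a training corpus.
training_pairs = [
    ("FBI", "the"), ("FBI", "the"), ("defendant", "the"),
    ("defendant", "a"), ("cars", "null"), ("cars", "null"),
]

# For each noun, count how often it takes each determiner.
counts = defaultdict(Counter)
for noun, det in training_pairs:
    counts[noun][det] += 1

def predict(noun, default="the"):
    """Pick the determiner most frequently seen with this noun in training."""
    if noun in counts:
        return counts[noun].most_common(1)[0][0]
    return default  # back off to a default for unseen nouns

print(predict("FBI"))   # "the"
print(predict("cars"))  # "null"
```

The back-off for unseen nouns is one design choice among many; it matters because any finite training corpus misses some test nouns.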
               A Classification Approach
•   Predict: {“the”, “a”, null}
•   Define a problem representation (features):
     - plural? (yes/no)
     - first appearance? (yes/no)
     - head word token
        Plural?        First?       Word      Determiner
           N             Y        defendant       a
           Y             N          cars         null
           N             N          FBI          the

Goal: Learn a classification function that can predict the
determiner for unseen nouns
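One way to realize such a classifier, sketched here with a bare-bones multiclass perceptron trained on just the three rows of the table (a real system would use far more data and features):

```python
from collections import defaultdict

# Training examples mirroring the table:
# (plural?, first appearance?, head word) -> determiner
train = [
    ((0, 1, "defendant"), "a"),
    ((1, 0, "cars"), "null"),
    ((0, 0, "FBI"), "the"),
]

def features(plural, first, word):
    """Sparse binary feature map for one noun phrase."""
    return {f"plural={plural}": 1, f"first={first}": 1, f"word={word}": 1}

# One weight vector per class.
weights = {c: defaultdict(float) for c in ("a", "the", "null")}

def score(c, feats):
    return sum(weights[c][f] * v for f, v in feats.items())

def predict(feats):
    return max(weights, key=lambda c: score(c, feats))

for _ in range(5):  # a few passes over the tiny training set
    for (plural, first, word), gold in train:
        feats = features(plural, first, word)
        guess = predict(feats)
        if guess != gold:  # mistake-driven update
            for f, v in feats.items():
                weights[gold][f] += v
                weights[guess][f] -= v

print(predict(features(0, 0, "FBI")))  # "the"
```

Unlike the pure counting approach, the learned weights let the `plural` and `first` features generalize to head words never seen in training.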
             How well does it work?
• Implementation details:
 - Training: first 21 sections of the Wall Street
     Journal corpus; testing: the 23rd section
 -   Prediction accuracy: 71.5%

• The results are not great, but surprisingly
 high for such a simple method
 - A large fraction of nouns in this corpus always
     appear with the same determiner
        - for example: ``the FBI'', ``the defendant''
  Limitations of Data Alone
Example: 3 online data-driven MT systems, given
“How come it’s always you?”, produce:
   • How old are you?
   • How to keep your
   • How always you

Too many possible combinations to rely simply
on counts in a corpus
  e.g. V = 100k  →  V^5 = 10^25
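The arithmetic behind that figure, with a hypothetical trillion-token corpus thrown in for comparison:

```python
V = 100_000        # vocabulary size
n = 5              # length of word tuple
combos = V ** n    # number of distinct 5-word sequences: (10^5)^5 = 10^25

# Even an (optimistic) trillion-token corpus can exhibit at most
# ~10^12 of those sequences, a vanishing fraction of the total.
corpus_tokens = 10 ** 12
print(combos)                    # 10**25
print(corpus_tokens / combos)    # 1e-13
```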
       Never Enough Data…
• Language has “lopsided sparsity” –
  infrequent events happen frequently
• Larger units (e.g. word tuples vs. words)
  require more data
• Counts of one corpus may not generalize
  to another domain, e.g.
  – Web is not representative of children’s articles
  – Newswire has very few “uh” and “um” and
    relatively few “I” and “you”, compared to
    conversational speech
               The NLP Cycle
• Gather / find a corpus
• Build a baseline
• Repeat:
- Analyze most common errors
- Think of ways to fix
- Modify the model
    ‣   Add new features
    ‣   Change the structure of the model
    ‣   Use a new learning method
Issues for Corpus-Based Methods
• Different types of data
• Honest estimates of performance
• Text pre-processing
                  Types of Data
• Different sources:
  – Documents, audio recordings, dictionaries, the Web, …
• Different forms
  – Text: newswire, blogs, email, chat
  – Speech: talk shows, speeches, call centers, hearings
  – Multimodal: speech&video, text&images, …
• Different units on which to base quantity
  –   words for language modeling
  –   sentences for parsing
  –   sentence pairs for translation
  –   articles for text classification
  –   article collections for multi-doc summarization
                     Text Corpora
Antique corpus:
      •   Rosetta Stone
Examples of corpora used today:
      •   Penn Treebank: 1M words of
          parsed text
      •   Brown Corpus: 1M words of
          tagged text
      •   North American News: 300M words
      •   English Gigaword: 3.5B words
      •   The Web
                       Corpus for MT
Pairs of parallel sentences. For example, one sentence
from the Europarl corpus (Koehn, 2005):

   Danish: det er næsten en personlig rekord for mig dette efterår .
   German: das ist für mich fast persönlicher rekord in diesem herbst .
   English: that is almost a personal record for me this autumn !
   Spanish: es la mejor marca que he alcanzado este otoño .
   Finnish: se on melkein minun ennätykseni tänä syksynä !
   French: c ’ est pratiquement un record personnel pour moi , cet automne !
   Italian: e ’ quasi il mio record personale dell ’ autunno .
   Dutch: dit is haast een persoonlijk record deze herfst .
   Portuguese: e quase o meu recorde pessoal deste semestre !
   Swedish: det är nästan personligt rekord för mig denna höst !

                             Figure 2: One sentence aligned across 11 languages
                   Corpus for Parsing
From the Penn Treebank:
Canadian Utilities had 1988 revenue of $ 1.16 billion , mainly from its natural
gas and electric utility businesses in Alberta , where the company serves about
800,000 customers .
          Honest Evaluation
• Data is usually divided into 3 or more sets:
  – Training/learning set
  – Development test/tuning/validation set
  – Evaluation test set
• To have an unbiased performance
  estimate, need to test on data not used in
  training
• Most models have parameters that can be
  tuned → requires more independent data
Never Enough Data… Revisited
• Language has “lopsided sparsity”…
• Larger units (e.g. sentences vs. tuples)
  require more data
• Hand annotated data can be expensive,
  especially for larger units
• Workarounds:
  – Cross-validation in learning/tuning (in lab 2)
  – Learning with some unlabeled data (later)
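The cross-validation workaround can be sketched as follows: split the data into k folds, and repeatedly train on k-1 of them while holding the last one out, so every example is used for both training and evaluation (in different rounds):

```python
def k_fold_splits(data, k=5):
    """Yield (train, held_out) pairs for k-fold cross-validation."""
    fold_size = len(data) // k
    for i in range(k):
        held_out = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, held_out

data = list(range(100))
for train, held_out in k_fold_splits(data, k=5):
    # Each round: 80 examples to train on, 20 held out for evaluation.
    assert len(train) == 80 and len(held_out) == 20
```

Averaging the evaluation score over the k rounds gives a more stable estimate than a single small held-out set, at the cost of training k models.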
