


									EECS 595 / LING 541 / SI 661&761

Natural Language Processing

           Fall 2005
        Lecture Notes #1
             Course logistics
• Instructor: Prof. Dragomir Radev
      Ph.D., Computer Science, Columbia University
      Formerly at IBM TJ Watson Research Center
• Times: Thursdays 2:40-5:25 PM, in 411, West Hall
• Office hours: TBA, 3080 West Hall Connector

  Course home page:
Example (from a famous movie)

  Dave Bowman: Open the pod bay doors, HAL.
  HAL: I’m sorry Dave. I’m afraid I can’t do that.

                I saw her fall

• How many different interpretations does the
  above sentence have? How many of them
  are reasonable/grammatical?
                  Example 1
The Standard and Poor's 500 and the Nasdaq
composite index both reached four-year highs
Thursday as investors, unfazed by oil prices nearing
$70 per barrel, welcomed a raft of strong earnings
                             Example 2
Accenture posts higher earnings

Consulting and technology services firm beats estimates; stock gains
in after-hours trading.
July 7, 2005: 4:35 PM EDT

NEW YORK (Reuters) - Accenture Ltd., one of the world's largest
consulting and technology services firms, posted a higher quarterly
profit Thursday boosted by a rebound in consulting demand.

Fiscal third-quarter net income more than doubled to about $484 million, or
51 cents a share, from $210 million, or 37 cents a share, a year earlier, the
company said.

Analysts had expected earnings of 43 cents a share, according to First Call.
Accenture stock rose about 2 percent in after-hours trading after falling nearly
6 percent in regular New York Stock Exchange trading.
• Gary Larson (“The Far Side”) cartoon:
• What we say to dogs:
  – “Okay Ginger! I’ve had it! You stay out of the
    garbage! Understand, Ginger?“
• What they hear:
  – “Blah Ginger! blah blah blah blah blah blah
    blah blah blah blah blah Ginger?"
Time Warner to hold off on Cablevision

But top Time Warner execs said it may eventually be interested in the cable assets.
July 8, 2005: 7:20 PM EDT

SUN VALLEY, Idaho (Reuters) - A top Time Warner Inc. executive said Friday it could not bid for
Cablevision until it completes a deal to buy Adelphia Communications Corp., splashing cold water on
early buyout speculation.
Time Warner is in a joint deal with Comcast Corp. to buy bankrupt cable provider Adelphia Communications
"We can't do anything else until we get it (Adelphia) integrated," said Don Logan, chairman of Time Warner's
media and communications group.
But he added, "We've always said we are interested in Cablevision. ... Anything is possible."
In June, the Dolan family offered Cablevision shareholders about $33.50 per share in a $7.9 billion deal to
take the company private.
Analysts and one of Cablevision's top investors have said the offer is too low and could put the cable system,
which serves 3 million customers in the New York area, into play for other suitors, including Time Warner
Cable and Comcast.
Wall Street analysts said in June that Time Warner, if it were to bid, could top the offer with a $35 to $40 per
share bid. Time Warner is the parent company of this Web site.
Time Warner chief executive Dick Parsons said on Friday his company's decision about whether to buy
Cablevision Corp. rests on whether the Dolan family decides to put it up for sale.
"Chuck (Dolan) controls it and it's not as if we could take it away from him," Parsons said during a break at
the Allen & Co. conference in Sun Valley, Idaho. "When he's ready to bring that asset to market he knows
we're here."
Parsons would not comment on whether he has had recent conversations with Dolan about buying
Parsons said he and Dolan agree that cable assets are undervalued and that now is a good time to buy them.
Time Warner is the parent company of CNN/Money.
Stocks edge up
Major gauges make tentative gains at Friday's open after steep Fed-inspired selloff.
July 1, 2005: 9:46 AM EDT

NEW YORK (CNN/Money) - Stocks inched higher early Friday, recovering some from the big selloff after the Federal
Reserve boosted interest rates again, and signaled it didn't intend to pause anytime soon.
The Dow Jones industrial average (down 99.51 to 10,274.97, Charts), the broader Standard & Poor's 500 (up 2.50 to
1,193.83, Charts) index and the Nasdaq composite (up 4.84 to 2,061.80, Charts) all added a few points in the early going, with
the Nasdaq lagging the blue chip indicators a bit.
Stocks ended a mixed quarter on a down note Thursday, with the Dow losing more than 100 points after the Fed raised the
target for its fed funds rate, an overnight bank lending rate, another quarter point to 3.25 percent.
In the closely watched statement, the central bankers acknowledged the impact of higher energy prices and other negatives,
but said the economic expansion remains on track. They also pledged to keep raising rates at a "measured" pace, all of which
suggested that they don't plan to pause in the near term.
Gains early Friday were broad based, with 27 out of 30 Dow issues rising.
In corporate news, Microsoft (up $0.02 to $24.86, Research) has settled antitrust claims made by IBM (unchanged at $74.20,
Research), the companies said Friday. The software leader will pay IBM $775 million as part of the deal.
A number of economic reports were due around 10 a.m. ET.
The Institute for Supply Management's manufacturing index for June was expected to have risen to 51.5 in the month from
51.4 in May, according to a consensus of economists surveyed by
The revised read on June consumer sentiment from the University of Michigan was also due, as was the May read on
construction spending.
Treasury prices slipped after Thursday's big rally. The fall raised the yield on the 10-year note to 3.94 percent from 3.92
percent late Thursday. Treasury prices and yields move in opposite directions.
In currency trading, the dollar jumped versus the euro and the yen.
U.S. light crude oil for August delivery rose 32 cents to trade at $56.82 a barrel in electronic trading. Crude set a record closing
price for a nearby futures contract at $60.54 on Monday.
COMEX gold fell $1.20 to $435.90 an ounce.
In global trade, Asian-Pacific markets ended mostly lower, and European markets rose at midday.
Google cracks $300
Shares of the popular search engine pass $300 for the first time and are now up 260% since IPO.
June 27, 2005: 5:52 PM EDT
By Paul R. La Monica, CNN/Money senior writer
NEW YORK (CNN/Money) - Shares of Google, the popular search-engine company, surpassed the $300 level for the first time on Monday,
sparking memories of the dot-com stock craze of the late 1990s.
Google gained 2.3 percent to finish at $304.10, slightly below its high for the day of $304.30. The stock has now gained nearly 260 percent since it went
public last August at $85 a share.
Much of the optimism surrounding Google comes from the fact that it is the leader in the white-hot online advertising industry. The company reported
much better than expected sales and earnings for the first quarter, thanks to a booming market for online advertising, particularly ads tied to specific
keyword searches.
And during the past few weeks, Google has released several new features -- including a desktop search function for businesses and a test version of a
personalized home page tool -- that should help the company remain competitive against rivals Yahoo! and Microsoft.
Several analysts have also speculated that Google will soon launch an online payment service that could compete against eBay's PayPal. In addition,
many investors have been betting that the company, which now has a market value of nearly $85 billion, will soon be added to the benchmark S&P 500
But the stock's meteoric rise as of late -- shares have surged more than 50 percent since the company reported first-quarter results in mid-April -- has
some analysts thinking that the stock could take a hit in the near future.
"You might see the stock pause temporarily," said Marianne Wolk, an analyst with Susquehanna Financial Group. "For the longer term, we're still very
bullish but in the very short term it wouldn't be a surprise to see the stock stabilize or pull back."
The key for Google will be how strong its second quarter results are. Google is set to report these numbers on July 21. Analysts expect Google's sales,
excluding revenues it shares with affiliates, a figure known as traffic acquisition costs or TAC, to come in at $840 million, nearly double last year's levels.
Earnings, excluding certain one-time charges, are forecast at $1.21, an increase of 121 percent from a year ago.
Wolk thinks that Google should meet these targets but does not believe the company will report results that are significantly better than consensus
projections. And if Google does not continue to beat estimates, the stock could take a bath.
"For Google to keep heading higher, it's absolutely critical that they keep hitting numbers. Everyone now believes the story," said John Tinker, an
analyst with ThinkEquity Partners.
Still, many investors are finding it hard to bet against Google because it has been posting extremely strong levels of sales growth and healthy profit
margins as a public company. So the comparisons to the late 1990s, when shares of many unprofitable Internet companies soared solely due to hype,
may not be apt.
To that end, Google is expected to generate nearly $3.6 billion in sales, excluding TAC and revenue of $5 billion next year as the company continues to
benefit from a shift of advertising dollars from more mainstream media sources such as television, radio, and newspapers, to the Web.
In addition to its ubiquitous search engine, Google has branched out into related areas in order to capitalize on the boom in online advertising. The
company has a comparison shopping site, Froogle, a free e-mail service called Gmail which features ads embedded in e-mails, and a local search site
that operates as kind of a Web version of the Yellow Pages.
Google also has expanded rapidly abroad, with sales from outside the U.S. accounting for nearly 40 percent of total sales in the first quarter.
What's more, some argue that Google is not overvalued, since it continues to trade at a discount to its top rival, Yahoo. However, this gap has narrowed
significantly as of late. Google's price-to-earnings ratio, based on 2005 earnings estimates, is 58. Yahoo trades at 61.5 times earnings estimates for this
"Google is not an undiscovered stock any more," said Tinker. "It's no longer inefficiently priced."
And Google also potentially faces the issue of the summer sluggishness that typically affects Internet stocks. Last year, shares of several Internet
companies plunged in July as results did not live up to lofty expectations.
                  Silly sentences
•   Children make delicious snacks
•   Stolen painting found by tree
•   I saw the Grand Canyon flying to New York
•   Court to try shooting defendant
•   Ban on nude dancing on Governor’s desk
•   Red tape holds up new bridges
•   Iraqi head seeks arms
•   Blair wins on budget, more lies ahead
•   Local high school dropouts cut in half
•   Hospitals are sued by seven foot doctors
•   In America a woman has a baby every 15 minutes. How does
    she do that?
     Main problems in language
• Novel words and usages
   – Blogs, little “r” me,7342.67
   – Spam as verb, email
• Inconsistencies
   – Beverly Hills, Beverly Sills
   – junior college, college junior
   – pet spray, pet llama
• Parsing problems
   – Cup holder
   – Federal Reserve Board Chairman
• Implicature/reasoning
• World knowledge
• Subjectivity, scoping, negation
                    Types of ambiguity
•   Morphological: Joe is quite impossible. Joe is quite important.
•   Phonetic: Joe’s finger got number.
•   Part of speech: Joe won the first round.
•   Syntactic: Call Joe a taxi.
•   Pp attachment: Joe ate pizza with a fork. Joe ate pizza with meatballs. Joe ate pizza
    with Mike. Joe ate pizza with pleasure.
•   Sense: Joe took the bar exam.
•   Modality: Joe may win the lottery.
•   Subjectivity: Joe believes that stocks will rise.
•   Scoping: Joe likes ripe apples and pears.
•   Negation: Joe likes his pizza with no cheese and tomatoes.
•   Referential: Joe yelled at Mike. He had broken the bike.
                 Joe yelled at Mike. He was angry at him.
•   Reflexive: John bought him a present. John bought himself a present.
•   Ellipsis and parallelism: Joe gave Mike a beer and Jeremy a glass of wine.
•   Metonymy: Boston called and left a message for Joe.

The S&P 500 climbed 6.93, or 0.56 percent, to 1,243.72,     its best close      since June 12, 2001.

 The Nasdaq gained 12.22, or 0.56 percent, to 2,198.44 for its best showing since June 8, 2001.

   The DJIA    rose 68.46, or 0.64 percent, to 10,705.55,   its highest level   since March 15.
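
The three aligned sentences above share one template (index, verb, point change, percent change, closing level), which is what makes them a natural target for information extraction. A minimal sketch of such an extractor follows. The regular expression and the field names (index, verb, change, pct, level) are assumptions made for illustration and are not taken from the lecture.

```python
import re

# Hypothetical template for the three index sentences; the field names are
# illustrative only.
PATTERN = re.compile(
    r"The\s+(?P<index>.+?)\s+(?P<verb>climbed|gained|rose)\s+"
    r"(?P<change>\d[\d,]*\.\d+),\s+or\s+(?P<pct>\d+\.\d+)\s+percent,\s+"
    r"to\s+(?P<level>\d[\d,]*\.\d+)"
)

def extract(sentence):
    """Return a dict of fields for one sentence, or None if it does not match."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return {
        "index": m.group("index"),
        "verb": m.group("verb"),
        "change": float(m.group("change").replace(",", "")),
        "pct": float(m.group("pct")),
        "level": float(m.group("level").replace(",", "")),
    }

SENTENCES = [
    "The S&P 500 climbed 6.93, or 0.56 percent, to 1,243.72, its best close since June 12, 2001.",
    "The Nasdaq gained 12.22, or 0.56 percent, to 2,198.44 for its best showing since June 8, 2001.",
    "The DJIA rose 68.46, or 0.64 percent, to 10,705.55, its highest level since March 15.",
]

records = [extract(s) for s in SENTENCES]
for record in records:
    print(record)
```

A hand-written pattern like this breaks as soon as the wording varies (a new verb, a different clause order), which is one reason to learn such templates from corpora rather than write them manually.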
    What is Natural Language Processing?
• Natural Language Processing (NLP) is the
  study of the computational treatment of
  natural language.
• NLP draws on research in Linguistics,
  Theoretical Computer Science,
  Mathematics and Statistics, Artificial
  Intelligence, Psychology, etc.
•   Information extraction
•   Named entity recognition
•   Trend analysis
•   Subjectivity analysis
•   Text classification
•   Anaphora resolution, alias resolution
•   Cross-document crossreference
•   Parsing
•   Semantic analysis
•   Word sense disambiguation
•   Word clustering
•   Question answering
•   Summarization
•   Document retrieval (filtering, routing)
•   Structured text (relational tables)
•   Paraphrasing and paraphrasing/entailment ID
•   Text generation
•   Machine translation
    What is needed: (1) linguistic knowledge

• Examples:
    – Zipf’s law: rank(w_i) * freq(w_i) = const
    – Collocations:
        • Strong beer but *powerful beer
        • Big sister but *large sister
        • Stocks rise but ?stocks ascend (225,000 hits on Google vs. 47 hits)
    – Constituents:
        •   Children eat pizza.
        •   They eat pizza.
        •   My cousin’s neighbor’s children eat pizza.
        •   _ Eat pizza!
    – Burstiness
        • P(ct=2|ct>=1)
• How to get it:
    – Manual rules
    – Automatically acquired from large text collections (corpora)
• Knowledge about language:
  –   Phonetics and phonology - the study of sounds
  –   Morphology - the study of word components
  –   Syntax - the study of sentence and phrase structure
  –   Lexical semantics - the study of the meanings of words
  –   Compositional semantics - how to combine words
  –   Pragmatics - how to accomplish goals
  –   Discourse conventions - how to deal with units larger
      than utterances
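
Two of the examples above, Zipf's law and burstiness, can be checked directly on a corpus. The sketch below is a minimal illustration under stated assumptions: the toy text and documents are invented, the function names are hypothetical, and "ct" is read as the number of times a term occurs in one document. A text this small cannot confirm Zipf's law; on a large corpus the rank × frequency product is only roughly constant.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, freq, rank * freq) rows, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

def p_two_given_present(docs, term):
    """Estimate P(ct = 2 | ct >= 1): among documents containing the term,
    the fraction in which it occurs exactly twice."""
    counts = [doc.count(term) for doc in docs]  # each doc is a list of tokens
    present = [c for c in counts if c >= 1]
    if not present:
        return None  # the term never occurs, so the estimate is undefined
    return sum(1 for c in present if c == 2) / len(present)

# Toy data, invented for illustration.
text = "the cat sat on the mat and the dog sat on the log"
rows = rank_frequency(text.split())
for row in rows:
    print(row)

docs = [["stocks", "rise", "stocks"], ["stocks", "fall"], ["prices", "rise"]]
print(p_two_given_present(docs, "stocks"))
```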
       What is needed: (2) mathematical and
                computational tools
•   Language models
•   Estimation methods
•   Hidden Markov Models (HMM): for sequences
•   Context-free grammars (CFG): for trees
•   Conditional Random Fields (CRF)
•   Generative/discriminative models
•   Maximum entropy models
•   Random walks
•   Latent semantic indexing (LSI)
•   + Representation issues
•   + Feature engineering
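
As a concrete instance of the first two items in the list above (language models and estimation methods), here is a minimal sketch of a bigram model estimated from counts. The three-sentence corpus, the function names, and the choice of add-k smoothing are assumptions made for illustration; the slides do not prescribe a particular estimator.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over sentences padded with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])  # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def prob(contexts, bigrams, prev, word, vocab_size=0, k=0.0):
    """P(word | prev). k = 0 gives the maximum likelihood estimate;
    k > 0 gives add-k smoothing over a vocabulary of vocab_size words."""
    denom = contexts[prev] + k * vocab_size
    if denom == 0:
        return 0.0
    return (bigrams[(prev, word)] + k) / denom

# Toy corpus, invented for illustration.
corpus = ["stocks rise", "stocks fall", "prices rise"]
contexts, bigrams = train_bigram(corpus)

print(prob(contexts, bigrams, "stocks", "rise"))    # MLE
print(prob(contexts, bigrams, "stocks", "ascend"))  # unseen bigram under MLE
print(prob(contexts, bigrams, "stocks", "ascend", vocab_size=6, k=1.0))
```

The maximum likelihood estimate assigns zero probability to any unseen bigram such as "stocks ascend". Smoothing moves some probability mass to unseen events, at the cost of lowering the estimates for observed ones.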
  Theoretical Computer Science
• Automata
   – Deterministic and non-deterministic finite-state automata
   – Push-down automata
• Grammars
   – Regular grammars
   – Context-free grammars
   – Context-sensitive grammars
• Complexity
• Algorithms
   – Dynamic programming
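
Context-free grammars and dynamic programming meet in chart parsing. The sketch below uses the CYK algorithm to count the parse trees a grammar assigns to a sentence. The grammar and lexicon are toy assumptions in Chomsky normal form, written only for "I saw her fall". They cover two readings (seeing her fall down, and seeing the fall that belongs to her) and leave out others, such as "saw" as a cutting action.

```python
from collections import defaultdict

# Toy lexicon and binary rules (parent, left child, right child).
# Both are assumptions made for illustration.
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "her": ["NP", "Det"],   # pronoun or possessive determiner
    "fall": ["N", "VP"],    # noun or intransitive verb phrase
}
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "V", "S"),       # "saw [her fall]" with a clausal complement
    ("NP", "Det", "N"),
]

def count_parses(words, start="S"):
    """Return the number of distinct parse trees rooted in `start`."""
    n = len(words)
    # chart[i][j][A] = number of trees rooted in A that cover words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for tag in LEXICON.get(word, []):
            chart[i][i + 1][tag] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in RULES:
                    ways = chart[i][k].get(left, 0) * chart[k][j].get(right, 0)
                    if ways:
                        chart[i][j][parent] += ways
    return chart[0][n].get(start, 0)

print(count_parses("I saw her fall".split()))
```

Each chart cell is filled once from smaller cells, so the running time is cubic in sentence length even when the number of parses grows much faster than that.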
      Mathematics and Statistics
•   Probabilities
•   Statistical models
•   Hypothesis testing
•   Linear algebra
•   Optimization
•   Numerical methods
          Artificial Intelligence
• Logic
  – First-order logic
  – Predicate calculus
• Agents
  – Speech acts
• Planning
• Constraint satisfaction
• Machine learning
          Existing applications
•   Web search
•   Natural language interfaces to databases
•   Parsing job postings
•   Military intelligence
•   Summarizing medical records
•   Information extraction for databases
•   Wrapper induction
         Potential applications
• Trend recognition
• Db conversion + named entity extraction +
  classification + relation extraction
• Detecting change
• Summarization
• Social network analysis
• Assigning subjectivity scores (stars)
• Sentiment classification
• Alignment of text w/ other signal (time series)
• Record linkage
         Current work at CLAIR
•   Semi-supervised entity and relation extraction
•   Subjectivity analysis + factuality extraction
•   Protein interaction recognition
•   Text summarization
•   Text mining from the Web
•   Lexical network models of the Web
•   Syntactic alignment
•   Chronology recovery
•   Classification
                         Final remarks
•   Language is not adversarial
•   It is used to convey useful information
•   Hard to extract this information automatically
•   Need to use NLP
    –   Inference: mathematics, statistics, machine learning
    –   Networks/fields
    –   Graph theory
    –   Differential equations
    –   Statistics/optimization
    –   Linguistics/KR/AI
    –   Sequence alignment
    –   Linear algebra/vector analysis
                  I saw her fall.

• The categories of knowledge of language can be
  thought of as ambiguity-resolving components
• How many different interpretations does the above
  sentence have?
• How can each ambiguous piece be resolved?
• Does speech input make the sentence even more
  ambiguous?
                Time flies like an arrow.
             The alphabet soup
    (NLP vs. CL vs. SP vs. HLT vs. NLE)
• NLP (Natural Language Processing)
• CL (Computational Linguistics)
• SP (Speech Processing)
• HLT (Human Language Technology)
• NLE (Natural Language Engineering)
• Other areas of research: Speech and Text Generation,
  Speech and Text Understanding, Information Extraction,
  Information Retrieval, Dialogue Processing, Inference
• Related areas: Spelling Correction, Grammar Correction,
  Text Summarization
                           Some demos
•   AT&T Labs Text to Speech
•   Babelfish
•   OneAcross
•   AskJeeves
•   IONaut – seems to be down
•   NSIR
•   AnswerBus
•   NewsInEssence
                 The Turing Test
• Alan Turing: the Turing test (language as test for intelligence)
• Three participants: a computer and two humans (one is an
  interrogator)
• Interrogator’s goal: to tell the machine and human apart
• Machine’s goal: to fool the interrogator into believing that a
  person is responding
• Other human’s goal: to help the interrogator reach his goal

   Q: Please write me a sonnet on the topic of the Forth Bridge.
   A: Count me out on this one. I never could write poetry.
   Q: Add 34957 to 70764.
   A: 105621 (after a pause)
             Some brief history
• Foundational insights (40’s and 50’s): automaton (Turing),
  probabilities, information theory (Shannon), formal
  languages (Backus and Naur), noisy channel and decoding
  (Shannon), first systems (Davis et al., Bell Labs)
• Two camps (57-70): symbolic and stochastic.
  Transformation grammar (Harris, Chomsky), artificial
  intelligence (Minsky, McCarthy, Shannon, Rochester),
  automated theorem proving and problem solving (Newell
  and Simon)
  Bayesian reasoning (Mosteller and Wallace)
  Corpus work (Kučera and Francis)
            Some brief history
• Four paradigms (70-83): stochastic (IBM), logic-
  based (Colmerauer, Pereira and Warren, Kay,
  Bresnan), NLU (Winograd, Schank, Fillmore),
  discourse modelling (Grosz and Sidner)
• Empiricism and finite-state models redux (83-93):
  Kaplan and Kay (phonology and morphology),
  Church (syntax)
• Late years (94-03): strong integration of different
  techniques, different areas (including speech and
  IR), probabilistic models, machine learning
 The state of the art and the near-
            term future
• World-Wide Web (WWW)
• Sample scenarios:
  –   generate weather reports in two languages
  –   teaching deaf people to speak
  –   translate Web pages into different languages
  –   speak to your appliances
  –   find restaurants
  –   answer questions
  –   grade essays (?)
  –   closed-captioning in many languages
  –   automatic description of a soccer game
            Structure of the course
• Three major parts:
    – Linguistic, mathematical, and computational background
    – Computational models of morphology, syntax, semantics, discourse,
    – Applications: text generation, machine translation, information extraction,
• Three major goals:
    – Learn the basic principles and theoretical issues underlying natural
      language processing
    – Learn techniques and tools used to develop practical, robust systems that
      can communicate with users in one or more languages
    – Gain insight into many open research problems in natural language
     • Speech and Language Processing
        (Daniel Jurafsky and James H. Martin)
        Prentice-Hall, 2000
        ISBN: 0-13-095069-6
     • Handouts given in class
     • 1-2 chapters per week
Optional readings:
 Natural Language Understanding by Allen
 Foundations of Statistical Natural Language Processing by Manning and Schütze.
•   Four homework assignments (40%)
•   Midterm (15%)
•   Final project (20%)
•   Final exam (25%)
•   Additional requirements for SI761
• (subject to change)
   – Finite-state modeling, part of speech tagging, and
     information extraction
      • Fsmtools/lextools/JMX (Bell Labs, Penn)
   – Tagging and parsing
      • Brill tagger/Charniak parser (JHU, Brown)
   – Machine translation
      • GIZA++/Rewrite decoder (Aachen, JHU, ISI)
   – Text generation
      • FUF/Surge (Columbia)
Introduction (JM1)
Linguistic Fundamentals
Regular Expressions and Automata (JM2)
Morphology and Finite-State Transducers (JM3)
Word Classes and Part of Speech Tagging (JM8)
Context-Free Grammars for English (JM9)
Parsing with Context-Free Grammars (JM10)
Features and Unification (JM11)
Lexicalized and Probabilistic Parsing (JM12)
Natural Language Generation (JM20) (Cont’d)
The Functional Unification Formalism (Handout)
Language and Complexity (JM13)
Representing Meaning (JM14)
Semantic Analysis (JM15)
Discourse (JM18)
Rhetorical Analysis (Handout)
Dialogue and Conversational Agents (JM19)
              Other meetings
• CLAIR meeting
• Artificial Intelligence Seminar
  (Tuesdays 4-5:30)
  (Thursdays 4-5:30)

Each student will be responsible for designing and completing a research project that
demonstrates the ability to use concepts from the class in addressing a practical
problem. A significant part of the final grade will depend on the project assignment.
Students can elect to do a project on an assigned topic, or to select a topic of their own.

The final version of the project will be put on the World Wide Web, and will be
defended in front of the class at the end of the semester (procedure TBA).
In some cases (and only with instructor’s approval), students may be allowed to work
in pairs when the project’s scope is significant.
                   Sample projects
•   Noun phrase parser
•   Paraphrase identification
•   Question answering
•   NL access to databases
•   Named entity tagging
•   Rhetorical parsing
•   Anaphora resolution, entity crossreference
•   Document and sentence alignment
•   Using bioinformatics methods
•   Encyclopedia
•   Information extraction
•   Speech processing
•   Sentence normalization
•   Text summarization
•   Sentence compression
•   Definition extraction
•   Crossword puzzle generation
•   Prepositional phrase attachment
•   Machine translation
•   Generation
•   Semi-structured document parsing
•   Semantic analysis of short queries
•   User-friendly summarization
•   Number classification
•   Domain-specific PP attachment
•   Time-dependent fact extraction
    Main research forums and other
• Conferences: ACL/NAACL, SIGIR, AAAI/IJCAI, ANLP, Coling,
  HLT, EACL/NAACL, AMTA/MT Summit, ICSLP/Eurospeech
• Journals: Computational Linguistics, Natural Language Engineering,
  Information Retrieval, Information Processing and Management, ACM
  Transactions on Information Systems, ACM TALIP, ACM TSLP
• University centers: Columbia, CMU, JHU, Brown, UMass, MIT,
  UPenn, USC/ISI, NMSU, Michigan, Maryland, Edinburgh,
  Cambridge, Saarland, Sheffield, and many others
• Industrial research sites: IBM, SRI, BBN, MITRE, MSR, (AT&T, Bell
  Labs, PARC)
• Startups: Language Weaver,, LCC
• The Anthology:
           What this course is NOT
•   EECS 597 / LING 792 / SI 661 “Language and Information”, last taught in
    Winter 2005, essentially an introduction to corpus-based and statistical NLP.
     – Topics covered: introduction to computational linguistics, information theory, data
       compression and coding, N-gram models, clustering, lexicography, collocations,
       text summarization, information extraction, question answering, word sense
       disambiguation, analysis of style, and other topics.
•   SI 760 “Information Retrieval”, last taught Winter 2005.
     – Topics covered: information need, IR models, documents, queries, query
       languages, relevance, retrieval evaluation, reference collections, query expansion
       and relevance feedback, indexing and searching, XML retrieval, language modeling
       approaches, crawling the Web, hyperlink analysis, measuring the Web, similarity
       and clustering, social network analysis for IR, hubs and authorities, PageRank and
       HITS, focused crawling, relevance transfer, question answering
•   The new advanced NLP/IR course, to be offered Winter 2006.
•   An undergraduate Linguistics course such as Ling 212 “Intro to the Symbolic
    Analysis of Language” or Ling 320 “Programming for Linguistics and
    Language Studies”
                         Other sites
• Johns Hopkins University (Jason

• Cornell University (Lillian Lee)

• Stanford University (Chris Manning)

• JHU Summer workshop
• J&M Chapters 1, 2
• “What is Computational Linguistics” by
  Hans Uszkoreit
• Lecture notes #1
