It is the best of times (and the worst of times)
Kenneth Church Microsoft church@microsoft.com
Responsibility; Attribute Dangerous Positions to Others
Wow!
Interesting & Controversial Preaching to Choir
(What a difference a decade makes)
• Empiricism has come of age
– Radical Fringe → Mainstream
• 1993: Workshop on Very Large Corpora (WVLC)
– Intended to be a 1-time event
– But so successful that it evolved into a series of EMNLP conferences
• EMNLP-2004 received so many submissions that the program committee had to be expanded at the last minute
– Success/Catastrophe
[Figure: % Statistical Papers at ACL Meeting (0% to 100%) by year (1985 to 2005); annotations: Bob Moore, Fred Jelinek, Lonely]
July 25, 2004
EMNLP-2004 & Senseval-2004
The Structure of Scientific Revolutions (1962) – Kuhn (p.10)
• Paradigms: Examples from Physics
– Aristotle’s Physica
– Ptolemy’s Almagest
– Newton’s Principia and Opticks
– Franklin’s Electricity
– Lavoisier’s Chemistry
– Lyell’s Geology
• Two characteristics:
1. Sufficiently unprecedented to attract an enduring group of adherents from competing modes of scientific activity
2. Simultaneously, sufficiently open-ended to leave all sorts of problems for the redefined group of practitioners to resolve
Organizational Innovations
(Radical → Mainstream)
• Late Submission Deadline
– Immediately after ACL notifications
• ACL was rejecting good papers for bad reasons
– Short review cycles → Freshness
• Invest in the Future: Encourage Innovation
– Chair (Energetic, Promising, Source of new ideas)
– Co-chair (Established, Knows how it has been done)
• Avoid incremental papers
– Reviewers prefer boring papers over radical ones
– Reviewers do what reviewers do; chairs correction
• Inclusiveness: Diversity → Growth (Sales)
– Thankless chores → Marketing carrots
– 1/3 promising, 1/3 stability, 1/3 outreach
– Hold conferences in Europe, Asia & America
(Callouts: Innovation; Checks & Balances; Short term ≠ Long term)
What Worked and What Didn’t?
• Stay on msg: It is data, stupid!
– WVLC (Very Large) >> EMNLP (Empirical Methods)
– If you have a lot of data, then you don’t need a lot of methodology
• Empiricism means diff things to diff people
1. Machine Learning (Self-organizing Methods)
2. Exploratory Data Analysis (EDA)
3. Corpus-Based Lexicography
• Lots of papers on 1
– EMNLP-2004 theme (error analysis) → 2
– Senseval grew out of 3
(Callouts: Data; Methodology; Kucera & Francis gave great invited talk, but they couldn’t follow submitted talks)
Word Sense Disambiguation (WSD) History
• Bar-Hillel (1960):
– Abandoned Machine Translation (MT)
– Couldn’t see how to make progress on WSD (pen)
– Can’t translate without disambiguating
• bank (money) → banque
• bank (river) → banc
• Yarowsky:
– Parallel corpus → encyclopedia + thesaurus
– Bilingual ≠ Monolingual
• interest
• wear
– ML: Co-training
• Supervised → Unsupervised
• 1990s
– Parallel text ≈ Labeled corpus for supervised training and testing
– Isn’t it great the translators have WSD labeled all this data for us!
• Lexicography: Hector
– Joint collaboration: Oxford University Press & DEC
– flagging flogging
• Senseval
A Road Rarely Taken:
Tukey’s Exploratory Data Analysis (EDA)
• Linear Regression
– Standard practice:
• Plug data into off-the-shelf package
• Publish (if “significant”)
– Better:
• Check for outliers
• Bowed residuals
– Evidence of a positive or negative derivative
• Deviations from assumptions (normality)
– Fanout
• Slocum’s Thesis (1981)
– “Proof” that CKY takes linear time
– “Standard texts (e.g., Aho)… consider … worst case… This assumption clearly fails to apply to natural language… Our experiments have shown that average-case time performance… is approximately linear” (p. 102)
[Figure: Time (0 to 50000) vs. Sentence Length (0 to 30), shown twice; callout: No Result]
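The "Better" column above can be sketched in a few lines. This is a minimal illustration, not Slocum's data or any package's API: the timings are synthetic, and `fit_line` and `residual_bow` are hypothetical helper names. The idea is to fit the straight line, then check whether the residuals bow (positive at both ends, negative in the middle), which a "significant" slope alone would never reveal.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def residual_bow(xs, ys):
    """Mean residual in the outer thirds minus mean residual in the middle third.

    Near zero when a straight line fits; clearly non-zero when the
    residuals are bowed (evidence of curvature the line has missed).
    """
    a, b = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))
    residuals = [y - (a + b * x) for x, y in pairs]
    k = len(residuals) // 3
    outer = residuals[:k] + residuals[-k:]
    middle = residuals[k:-k]
    return sum(outer) / len(outer) - sum(middle) / len(middle)


# Synthetic timings (assumed for illustration): one truly linear,
# one cubic in sentence length.
lengths = list(range(1, 31))
linear_time = [100 * n for n in lengths]
cubic_time = [n ** 3 for n in lengths]

print("bow, linear data:", residual_bow(lengths, linear_time))
print("bow, cubic data: ", residual_bow(lengths, cubic_time))
```

A straight line can still fit the cubic data with a large, "significant" slope; the bowed residuals are what show the fit is wrong.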
Many Machine Learning (ML) Techniques (SVMs, Perceptrons) are Similar to (Logistic) Regression; Rarely see EDA (Robust Statistical) Methods in ML
The Elements of Statistical Learning – Hastie, Tibshirani, Friedman (2001), p 380
Historical Context
• 1950s:
– Rigorous methodology
• Information theory
• Behaviorism
• Unfulfilled unrealistic expectations (video)
– ALPAC report
– Whither Speech Recognition?
• 1970s:
– Let it all hang out
• Artificial Intelligence
• Cognitive Psychology
• 1990s:
– Revival of empiricism
[Figure: % Statistical Papers at ACL Meeting (0% to 100%) by year (1985 to 2005); annotations: Empiricists feel lonely, Rationalists feel lonely, Kuhn Crisis (twice), Bob Moore, Fred Jelinek]
Borrowed Slide: Jelinek (LREC)
“Whither Speech Recognition?”
Also, ALPAC (chair) & Bell Labs exec
Pierce, JASA 1969
…ASR is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, or going to the moon.
Most recognizers behave not like scientists, but like mad inventors or untrustworthy engineers.
…performance will continue to be very limited unless the recognizing device understands what is being said with something of the facility of a native speaker (that is, better than a foreigner fluent in the language).
Any application of the foregoing discussion to work in the general area of pattern recognition is left as an exercise for the reader.
ALPAC (1966): the (in)famous report
John Hutchins
• The best known event in the history of MT is …
– Automatic Language Processing Advisory Committee (ALPAC)
• Its effect was to bring to an end the substantial funding of MT research in US for some twenty years.
– More significantly was the clear message to the general public and the rest of the scientific community that MT was hopeless.
– For years afterwards, an interest in MT was something to keep quiet about; it was almost shameful.
– To this day, the 'failure' of MT is still repeated by many as an indisputable fact.
• The impact of ALPAC is undeniable
– While the fame or notoriety of ALPAC is familiar,
– What the report actually said is now becoming less familiar and often forgotten or misunderstood…
ALPAC Recommendations
The committee recommends expenditures in two distinct areas
• Computational linguistics as part of linguistics (Theory)
– Studies of parsing, generation… including experiments in translation…
– Linguistics should be supported as science, and should not be judged by any immediate or foreseeable contribution to practical translation
• Improvement of translation (Practice):
1. practical methods for evaluation of translations;
2. means for speeding up the human translation process;
3. evaluation of quality and cost of various sources of translations;
4. investigation of the utilization of translations, to guard against production of translations that are never read;
5. study of delays in the over-all translation process, and means for eliminating them, both in journals and in individual items;
6. evaluation of the relative speed and cost of various sorts of machine-aided translation;
7. adaptation of existing mechanized editing and production processes in translation;
8. the over-all translation process; and
9. production of adequate reference works for the translator, including the adaptation of glossaries that now exist primarily for automatic dictionary lookup in machine translation
Best of Times
Outline
• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…
Where have we been and where are we going?
Moore’s Law: Ideal Answer
Moores: Bob ≠ Gordon ≠ Roger
Borrowed Slide Audrey Le (NIST)
Moore’s Law Time Constant:
• 10x improvement per decade
[Figure: Error Rate vs. Date (15 years)]
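A quick worked example of what that time constant implies, under the assumption (not stated on the slide) that "improvement" means a constant-rate reduction in error rate. The function names are illustrative only.

```python
def annual_factor(factor_per_decade=10.0):
    """Yearly improvement factor implied by a per-decade factor."""
    return factor_per_decade ** (1 / 10)


def projected_error(initial_error, years, factor_per_decade=10.0):
    """Error rate after `years`, assuming a constant exponential rate."""
    return initial_error / factor_per_decade ** (years / 10)


# 10x per decade is roughly a 1.26x improvement every year,
# i.e. about a 21% relative error reduction per year.
yearly = annual_factor()
print(f"yearly factor: {yearly:.3f}")
print(f"relative reduction per year: {1 - 1 / yearly:.1%}")

# Over the 15-year span of the chart, that compounds to about 32x.
print(f"15-year factor: {1 / projected_error(1.0, 15):.1f}x")
```

The point of a time constant is that it supports extrapolation: given two dated measurements, the third can be predicted and later checked.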
Charles Wayne’s Challenge:
Demonstrate Consistent Progress Over Time
Managing Expectations
• Controversial in 1980s
– But not in 1990s
– Though, grumbling
• Benefits
1. Agreement on what to do
2. Limits endless discussion
3. Helps sell the field
• Manage expectations
• Fund raising
• Risks (similar to benefits)
1. All our eggs are in one basket (lack of diversity)
2. Not enough discussion
• Hard to change course
3. Methodology Burden
Hockey Stick Business Case
[Figure: $ vs. t; 2003 (Last Year), 2004 (This Year), 2005 (Next Year)]
Where have we been and where are we going?
Consistent Progress over Time; Manage Expectations
• Extrapolation/Prediction is Applicable
• Extrapolation/Prediction is Not Applicable
[Figure: $ vs. t (2003, 2004, 2005)]
When will we see the last nonstatistical paper? 2010?
[Figure: % Statistical Papers at ACL Meeting (0% to 100%) by year (1985 to 2005); annotations: Bob Moore, Fred Jelinek]
Top Ten Metrics of Success
1. Value Creation (Reality)
2. Stock Prices (Belief)
3. Startup Companies Raise Venture Capital (Excitement)
4. Prototype Applications (Plausibility)
5. Grand-Students (Survive the Test of Time)
6. Students Get Good Jobs
7. Students Finish PhD Theses
8. Citations
9. Conference Registrations
10. Publications (Quantity)
(Callouts: Speech; Search; Senseval wants to be here; We are here)
Outline
• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…
Best of Times (Not!) Been there; Done that
It has been claimed that
Recent progress made possible by Empiricism
Progress (or Oscillating Fads)?
• 1950s: Empiricism was at its peak
– Dominating a broad set of fields
• Ranging from psychology (Behaviorism)
• To electrical engineering (Information Theory)
– Psycholinguistics: Word frequency norms (correlated with reaction time, errors)
• Word association norms (priming): bread and butter, doctor / nurse
– Linguistics/psycholinguistics: focus on distribution (correlate of meaning)
• Firth: “You shall know a word by the company it keeps”
• Collocations: Strong tea v. powerful computers
• 1970s: Rationalism was at its peak
– with Chomsky’s criticism of ngrams in Syntactic Structures (1957)
– and Minsky and Papert’s criticism of neural networks in Perceptrons (1969).
• 1990s: Revival of Empiricism
– Availability of massive amounts of data (popular arg, even before the web)
• “More data is better data”
• Quantity >> Quality (balance)
– Pragmatic focus:
• What can we do with all this data?
• Better to do something than nothing at all
– Empirical methods (and focus on evaluation): Speech → Language
• 2010s: Revival of Rationalism (?)
(Callouts: Periodic signals are continuous; Support extrapolation/prediction; Progress? Consistent progress?; Extrapolation/Prediction: Applicable?)
Speech → Language: Has the pendulum swung too far?
• What happened between TMI-1992 and TMI-2002 (if anything)?
• Have empirical methods become too popular?
– Has too much happened since TMI-1992?
• I worry that the pendulum has swung so far that
– We are no longer training students for the possibility
• that the pendulum might swing the other way
• We ought to be preparing students with a broad education including:
– Statistics and Machine Learning
– as well as Linguistic Theory
• History repeats itself: Mark Twain; bad idea then and still a bad idea now
– 1950s: empiricism
– 1970s: rationalism (empiricist methodology became too burdensome)
– 1990s: empiricism
– 2010s: rationalism (empiricist methodology is burdensome, again)
Speech → Language: Has the pendulum swung too far?
• What happened between TMI-1992 and TMI-2002 (if anything)?
• Have empirical methods become too popular?
– Has too much happened since TMI-1992?
• I worry that the pendulum has swung so far that
– We are no longer training students for the possibility
• that the pendulum might swing the other way
• We ought to be preparing students with a broad education including:
– Statistics and Machine Learning
– as well as Linguistic Theory
• History repeats itself:
– 1950s: empiricism
– 1970s: rationalism (empiricist methodology became too burdensome)
– 1990s: empiricism
– 2010s: rationalism (empiricist methodology is burdensome, again)
(Callouts: Plays well at Machine Translation conferences; Grandparents and grandchildren have a natural alliance…)
Aspect | Rationalism | Empiricism
Well-known advocates | Chomsky, Minsky | Shannon, Skinner, Firth, Harris
Model | Competence Model | Noisy Channel Model
Contexts of Interest | Phrase-Structure | N-Grams
Goals | All and Only; Explanatory; Theoretical | Minimize Prediction Error (Entropy); Descriptive; Applied
Linguistic Generalizations | Agreement & Wh-movement | Collocations & Word Associations
Parsing Strategies | Principle-Based, CKY (Chart), ATNs, Unification | Forward-Backward (HMMs), Inside-outside (PCFGs)
Applications | Understanding: Who did what to whom | Recognition: Noisy Channel Applications
Covering all the Bases
It is hard to make predictions (especially about the future)
• When will we see the last non-statistical paper?
– 2010?
• Revival of rationalism:
– 2010?
The answer to any question: 6 years!
Outline
• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…
Rising tide of data lifts all boats
No matter what happens, it’s goin’ be great!
Rising Tide of Data Lifts All Boats
If you have a lot of data, then you don’t need a lot of methodology
• 1985: “There is no data like more data”
– Fighting words uttered by radical fringe elements (Mercer at Arden House)
• 1993 Workshop on Very Large Corpora
– Perfect timing: Just before the web
– Couldn’t help but succeed
– Fate
• 1995: The Web changes everything
• All you need is data (magic sauce)
– No linguistics
– No artificial intelligence (representation)
– No machine learning
– No statistics
– No error analysis
“It never pays to think until you’ve run out of data” – Eric Brill
(Quoted out of context)
Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001)
• Moore’s Law Constant: Data Collection Rates → Improvement Rates
• No consistently best learner
• More data is better data!
• Fire everybody and spend the money on data
Borrowed Slide: Jelinek (LREC)
Benefit of Data
LIMSI: Lamel (2002) – Broadcast News
[Figure: WER vs. hours of training data; Supervised: transcripts; Lightly supervised: closed captions]
The rising tide of data will lift all boats! TREC Question Answering & Google:
What is the highest point on Earth?
The rising tide of data will lift all boats! Acquiring Lexical Resources from Data:
Dictionaries, Ontologies, WordNets, Language Models, etc.
http://labs1.google.com/sets
• England: France, Germany, Italy, Ireland, Spain, Scotland, Belgium, Canada, Austria, Australia
• Japan: China, India, Indonesia, Malaysia, Korea, Taiwan, Thailand, Singapore, Australia, Bangladesh
• Cat: Dog, Horse, Fish, Bird, Rabbit, Cattle, Rat, Livestock, Mouse, Human
• cat: more, ls, rm, mv, cd, cp, mkdir, man, tail, pwd
Rising Tide of Data Lifts All Boats
If you have a lot of data, then you don’t need a lot of methodology
• More data → better results
– TREC Question Answering
• Remarkable performance: Google and not much else
– Norvig (ACL-02) – AskMSR (SIGIR-02)
– Lexical Acquisition
• Google Sets
– We tried similar things
» but with tiny corpora
» which we called large
Applications
What good is word sense disambiguation (WSD)?
– Information Retrieval (IR)
• Salton: Tried hard to find ways to use NLP to help IR
– but failed to find much (if anything)
• Croft: WSD doesn’t help because IR is already using those methods
• Sanderson (next two slides)
• 5 Ian Andersons
– Machine Translation (MT)
• Original motivation for much of the work on WSD
• But IR arguments may apply just as well to MT
What good is POS tagging? Parsing? NLP? Speech?
– Commercial Applications of Natural Language Processing, CACM 1995
• $100M opportunity (worthy of government/industry’s attention)
1. Search (Lexis-Nexis)
2. Word Processing (Microsoft)
• ALPAC Warning: premature commercialization is risky
(Callout: Don’t worry; Be happy)
Sanderson (SIGIR-94)
http://dis.shef.ac.uk/mark/cv/publications/papers/my_papers/SIGIR94.pdf
Not much?
– Introducing ambiguity by pseudo-words doesn’t hurt (much)
– Short queries matter most, but hardest for WSD
[Figure: F vs. Query Length (Words)]
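For readers unfamiliar with the pseudo-word trick, here is a minimal sketch of the general idea as I understand it: conflate unrelated words into one artificial ambiguous token, apply the mapping to both documents and queries, and compare retrieval scores before and after. The function names, the grouping scheme and the toy vocabulary are assumptions for illustration, not Sanderson's actual setup.

```python
import random


def make_pseudo_words(vocabulary, group_size=2, seed=0):
    """Randomly group distinct words and map each word to a shared pseudo-word.

    Each group of `group_size` words becomes one artificially ambiguous
    token, e.g. "banana/kalashnikov". Leftover words stay unmapped.
    """
    words = sorted(set(vocabulary))
    random.Random(seed).shuffle(words)
    mapping = {}
    for i in range(0, len(words) - group_size + 1, group_size):
        group = words[i:i + group_size]
        pseudo = "/".join(group)
        for word in group:
            mapping[word] = pseudo
    return mapping


def add_ambiguity(tokens, mapping):
    """Replace every mapped token by its pseudo-word; leave others alone."""
    return [mapping.get(token, token) for token in tokens]


vocab = ["banana", "kalashnikov", "river", "money"]
mapping = make_pseudo_words(vocab)
print(add_ambiguity(["money", "in", "the", "river"], mapping))
```

Because the original word is known, the "correct sense" of every pseudo-word occurrence comes for free, which makes it possible to dial disambiguation accuracy up or down and watch what happens to retrieval.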
5 Ian Andersons
• Could WSD help IR?
• Answer: no
Sanderson (SIGIR-94)
http://dis.shef.ac.uk/mark/cv/publications/papers/my_papers/SIGIR94.pdf
Soft WSD?
• Resolving ambiguity badly is worse than not resolving at all
– 75% accurate WSD degrades performance
– 90% accurate WSD: breakeven point
[Figure: F vs. Query Length (Words)]
An example of Error Analysis/Representation
Some Promising Suggestions
(Generate lots of conference papers, but may not support the field)
• Two Languages are Better than One
– For many classic hard NLP problems
• Word Sense Disambiguation (WSD)
• PP-attachment
• Conjunction
• Predicate-argument relationships
• Japanese and Chinese Word breaking
– Parallel corpora → plenty of annotated (labeled) testing and training data
– Don’t need unsupervised magic (data >> magic)
• Demonstrate that NLP is good for something
– Statistical methods (IR & WSD) focus on bags of nouns,
• Ignoring verbs, adjectives, predicates, intensifiers, etc.
– Hypothesis: Ignored because perceptrons can’t model XOR
– Task: classify “comments” into “good,” “bad” and “neutral”
• Lots of terms associated with just one category
• Some associated with two
– Depending on argument
• Good & Bad, but not neutral: Mickey Mouse, Rinky Dink
– Bad: Mickey Mouse(us)
– Good: Mickey Mouse(them)
– Current IR/WSD methods don’t capture predicate-argument relationships
(Callout: Senseval++)
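The XOR hypothesis above can be made concrete with a toy sketch. Everything here is assumed for illustration: the two binary features (is the term belittling, like "Mickey Mouse", or praising; is the argument "them" or "us"), the praising examples, and the plain perceptron. With bag-style features the good/bad labels form an XOR pattern that no linear model can fit; adding a single term-and-argument conjunction feature, a crude stand-in for a predicate-argument relationship, makes the data separable.

```python
def score(w, x):
    """Linear score with w[0] as the bias weight."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def train_perceptron(examples, epochs=1000):
    """Plain perceptron; examples are (features, label) with label +1 or -1."""
    w = [0] * (len(examples[0][0]) + 1)
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            if y * score(w, x) <= 0:          # misclassified (or on the boundary)
                w[0] += y
                for i, xi in enumerate(x):
                    w[i + 1] += y * xi
                mistakes += 1
        if mistakes == 0:                      # converged
            break
    return w


def accuracy(w, examples):
    correct = sum((score(w, x) > 0) == (y > 0) for x, y in examples)
    return correct / len(examples)


# Features: (belittling_term, about_them). Label: +1 good for us, -1 bad for us.
bag = [
    ((0, 0), +1),   # praise(us)        -> good   (hypothetical example)
    ((0, 1), -1),   # praise(them)      -> bad    (hypothetical example)
    ((1, 0), -1),   # Mickey Mouse(us)   -> bad
    ((1, 1), +1),   # Mickey Mouse(them) -> good
]
# Same data plus one conjunction feature: term AND argument.
conj = [((t, a, t * a), y) for (t, a), y in bag]

print("bag of features: ", accuracy(train_perceptron(bag), bag))
print("with conjunction:", accuracy(train_perceptron(conj), conj))
```

The bag-of-features model can never get all four cases right, however long it trains; the version with the conjunction feature can.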
Supervision >> Magic > Baseline
English Lexical Sample (fine-grained scoring)
[Figure: Precision and Recall by system (0% to 90%), grouped as Supervised, Unsupervised and Baseline; callouts: Magic, Supervision, Bragging Rights]
http://www.sle.sharp.co.uk/senseval2/Results/all_graphs.xls
Breakdown by Systems & Words
• Spelling correction task
– Golding & Schabes (1996)
• Some methods work better on some words
– and other methods work better on other words
• Should break down Senseval results by both systems and words
• Discover opportunities for hybrids across systems
• Error analysis
– POS distinctions (easy)
– Local context (trigrams)
– Larger contexts (IR)
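A minimal sketch of the breakdown suggested above, using made-up accuracies (the system names, words and numbers are hypothetical, not Senseval or Golding & Schabes results). It finds the best system for each word and compares the best single system with an oracle hybrid that picks the best system per word; the gap is the headroom a hybrid could aim for.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def best_system_per_word(scores):
    """scores: {system: {word: accuracy}} -> {word: (best_system, accuracy)}."""
    best = {}
    for system, by_word in scores.items():
        for word, acc in by_word.items():
            if word not in best or acc > best[word][1]:
                best[word] = (system, acc)
    return best


def hybrid_headroom(scores):
    """Return (best single-system mean, oracle per-word hybrid mean)."""
    best_single = max(mean(by_word.values()) for by_word in scores.values())
    oracle = mean(acc for _, acc in best_system_per_word(scores).values())
    return best_single, oracle


# Hypothetical per-word accuracies for two hypothetical systems.
scores = {
    "trigrams": {"peace/piece": 0.90, "desert/dessert": 0.60},
    "bayes":    {"peace/piece": 0.70, "desert/dessert": 0.90},
}

for word, (system, acc) in sorted(best_system_per_word(scores).items()):
    print(f"{word}: {system} ({acc:.2f})")
print("best single vs. oracle hybrid:", hybrid_headroom(scores))
```

A single averaged score per system hides exactly this structure, which is why the slide argues for reporting by system and by word.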
Goals of Shared Evaluations
• Benchmarking:
– How hard are various problems?
– What makes problems easier or harder?
– Rate of progress?
• Shared learnings
– Compare and contrast
– What works and what doesn’t?
– Error analysis
• Marketing & Sales
– Scores going up and up → Funding goes up and up
– Rising tide lifts all boats
• Not bragging rights:
– Mirror, mirror on the wall, who’s the smartest of them all…
[Figure: English Lexical Sample (fine-grained scoring); Precision and Recall by system (0% to 90%), grouped as Unsupervised, Supervised and Baseline]
Outline
• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…
According to unnamed sources: Speech Winter → Language Winter; Dot Boom → Dot Bust
Early Warning Signs for Future Kuhn Crisis
• Complacency (don’t worry; be happy)
– Too little dissent: students aren’t rebelling against their teachers
– I get uncomfortable when
• There is so much agreement on what to do and so much optimism
• And so few worries and so little dissent/controversy.
• Mindless Metrics
– Whatever you measure, you get…
– Scores go up and up and up, but are we really doing better?
• According to the scores, parsing is doing well without words,
• But you can’t solve classic problems (PPs) without words!
• Burdensome Methodology → Exclusiveness
– Can’t play (in speech) unless you work in a big lab
• Following Speech off a Cliff
– Empirical methods: Speech → Language
– Speech Winter → Language Winter (Dot Boom → Dot Bust)
– What goes up, (usually) comes down…
• Senseval feels the need to demonstrate applications of their stuff (and maybe there aren’t any)
(Callouts: Been great, but…; Campbell (ACL-04): Rules >> ML)
Sample of 20 Survey Questions
(Strong Emphasis on Applications)
• When will
– More than 50% of new PCs have dictation on them, either at purchase or shortly after.
– Most telephone Interactive Voice Response (IVR) systems accept speech input.
– Automatic airline reservation by voice over the telephone is the norm.
– TV closed-captioning (subtitling) is automatic and pervasive.
– Telephones are answered by an intelligent answering machine that converses with the calling party to determine the nature and priority of the call.
– Public proceedings (e.g., courts, public inquiries, parliament, etc.) are transcribed automatically.
• Two surveys of ASRU attendees: 1997 & 2003
2003 Responses ≈ 1997 Responses + 6 Years
(6 years of hard work → No progress)
Top Ten Metrics of Success
(Risky to Promise Apps and Fail to Deliver)
1. Value Creation (Reality)
2. Stock Prices (Belief)
3. Startup Companies Raise Venture Capital (Excitement)
4. Prototype Applications (Plausibility)
5. Grand-Students (Survive the Test of Time)
6. Students Get Jobs
7. Students Finish PhD Theses
8. Citations
9. Conference Registrations
10. Publications (Quantity)
(Callouts: Speech; Search; Senseval wants to be here; We are here)
Wrong Apps?
• New Priorities
– Increase demand for space >> Data entry
• Old Priorities
– Dictation app dates back to days of dictation machines
– Speech recognition has not displaced typing
• Speech recognition has improved
• But typing skills have improved even more
– My son will learn typing in 1st grade
– Secretaries rarely take dictation
– Dictation machines are history
• My son may never see one
• Museums have slide rules and steam trains
– But dictation machines?
• New Killer Apps
– Search >> Dictation
• Speech Google!
– Data mining
Speech Data Mining & Call Centers:
An Intelligence Bonanza
• Some companies are collecting information with technology designed to monitor incoming calls for service quality.
• Last summer, Continental Airlines Inc. installed software from Witness Systems Inc. to monitor the 5,200 agents in its four reservation centers.
• But the Houston airline quickly realized that the system, which records customer phone calls and information on the responding agent's computer screen, also was an intelligence bonanza, says André Harris, reservations training and quality-assurance director.
Speech Data Mining
• Label calls as success or failure based on some subsequent outcome (sale/no sale)
• Extract features from speech
• Find patterns of features that can be used to predict outcomes
• Hypotheses:
– Customer: “I’m not interested” → no sale
– Agent: “I just want to tell you…” → no sale
Inter-ocular effect (hits you between the eyes); Don’t need a statistician to know which way the wind is blowing
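The three steps above (label, extract, find patterns) can be sketched with nothing more than counting. The transcripts and outcomes below are invented for illustration, the phrase features are just the two hypotheses from the slide, and real input would be recognizer output rather than clean text.

```python
from collections import Counter


def no_sale_rates(calls, phrases):
    """calls: list of (transcript, outcome), outcome is 'sale' or 'no sale'.

    Returns {phrase: fraction of calls containing it that ended in no sale}
    for every phrase that occurs at least once.
    """
    seen, failed = Counter(), Counter()
    for transcript, outcome in calls:
        text = transcript.lower()
        for phrase in phrases:
            if phrase in text:
                seen[phrase] += 1
                if outcome == "no sale":
                    failed[phrase] += 1
    return {p: failed[p] / seen[p] for p in phrases if seen[p]}


# Invented calls, labeled by the subsequent outcome.
calls = [
    ("Customer: I'm not interested, thanks", "no sale"),
    ("Agent: I just want to tell you about our offer", "no sale"),
    ("Customer: Yes, please book it", "sale"),
    ("Customer: I'm not interested right now", "no sale"),
    ("Agent: I just want to tell you the fare. Customer: OK, book it", "sale"),
]
phrases = ["i'm not interested", "i just want to tell you"]

baseline = sum(outcome == "no sale" for _, outcome in calls) / len(calls)
print("baseline no-sale rate:", baseline)
for phrase, rate in no_sale_rates(calls, phrases).items():
    print(f"{phrase!r}: {rate:.2f}")
```

If an effect is inter-ocular, a table like this is enough to see it; a phrase whose rate sits near the baseline is not telling you anything.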
Ways for Conferences to Fail
• Incrementalism/Burdensome Methodology (Lesson from 1950s)
– We do research for fun and profit (Arno Penzias)
– Fun and/or Profit >> By-the-Book Correctness
• Arrogance, Mindless Metrics, etc.
• Control
– Too much control
• Excessive Exclusiveness (mutual admiration society/old-boy network)
• Change (serendipity) is essential: New and Different → Fun and Excitement
• Growth and prosperity depend on new talent (students) & new topics
• Can’t afford to keep doing what we already know how to do
– Too little control
• Stay on msg: It’s data, stupid! (Our msg ≠ ACL’s)
• Set Inappropriate Expectations
– Promise too little (Rarely a problem, especially with thesis proposals)
• Senseval feels the need to become more applied
– Promise too much: Promise Applications and Fail to Deliver
– Success/Catastrophe (Rarely a problem, except for March of Dimes)
• What if we actually achieved all our goals?
Ways for Conferences to Succeed
• I wish I knew…
• Fate (can’t fail)
– Rising Tide of Data Lifts All Boats
• Luck/timing: WVLC-93 was just before the Web
• Sales & Marketing
– Evaluation, Evaluation, Evaluation
– In retrospect, 1993 WVLC worked wonderfully
– Distinguished us from mainstream
– Offered excitement and hope for future
• Especially appealing to students (growth opportunity)
• Strategic Vision
Borrowed Slide: Jelinek (LREC)
Great Strategy → Success
Great Challenge: Annotating Data
• Produce annotated data with minimal supervision
• Active learning
– Identify reliable labels
– Identify best candidates for annotation
• Co-training
• Bootstrap (project) resources from one application to another
Self-organizing “Magic” ≠ Error Analysis
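The two active-learning steps listed above (identify reliable labels, identify best candidates for annotation) might look like the following sketch. It is not taken from the borrowed slide: it assumes some existing model already supplies class probabilities for each unlabeled item, treats high-confidence predictions as reliable labels, and uses entropy-based uncertainty sampling to pick annotation candidates. The function names, the 0.95 threshold and the toy predictions are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def split_for_annotation(predictions, budget, confidence=0.95):
    """Separate confident predictions from items worth sending to annotators.

    predictions: {item_id: [class probabilities]} from some existing model.
    budget: how many items human annotators can label this round.
    Returns (reliable, to_annotate):
      reliable    -- {item_id: predicted class index} where the model is confident
      to_annotate -- the `budget` most uncertain remaining items, most uncertain first
    """
    reliable = {
        item: probs.index(max(probs))
        for item, probs in predictions.items()
        if max(probs) >= confidence
    }
    remaining = [item for item in predictions if item not in reliable]
    remaining.sort(key=lambda item: entropy(predictions[item]), reverse=True)
    return reliable, remaining[:budget]


# Toy, made-up model outputs for four unlabeled items.
predictions = {
    "a": [0.99, 0.01],
    "b": [0.50, 0.50],
    "c": [0.70, 0.30],
    "d": [0.90, 0.10],
}
reliable, to_annotate = split_for_annotation(predictions, budget=2)
print(reliable)     # confident auto-labels
print(to_annotate)  # items to send to annotators
```

Retraining on the newly annotated items and repeating the split is the usual loop; the error-analysis callout is a reminder to inspect the "reliable" labels rather than trust them blindly.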
Grand Challenges
ftp://ftp.cordis.lu/pub/ist/docs/istag040319-draftnotesofthemeeting.pdf
Roadmaps: Structure of a Strategy
(not the union of what we are all doing)
• Goals
– Example: Replace keyboard with microphone
– Exciting (memorable) sound bite
– Broad grand challenge that we can work toward but never solve
• Metrics
– Examples:
• WER: word error rate
• Time to perform task
– Easy to measure
• Milestones
– Should be no question if it has been accomplished
– Example: reduce WER on task x by y% by time t
• Accomplishments v. Activities
– Accomplishments are good
– Activity is not a substitute for accomplishments
– Milestones look forward whereas accomplishments look backward
• Small is beautiful
– Quantity is not a good thing
– Awareness
– 1-slide version
• if successful, you get maybe 3 more slides
• Size of container
– Goal: 1-3
– Metrics: 3
– Milestones: a dozen
• Mostly for next year: Q1-4
• Plus some for years 2, 5, 10 & 20
– Accomplishments: a dozen
• Broad applicability & illustrative
– Don’t cover everything
– Highlight stuff that
• Applies to multiple groups
• Forward-Looking / Exciting
• Serendipity is good!
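The WER metric and the milestone template from this slide ("reduce WER on task x by y% by time t") can be made concrete with a short sketch. This is an illustration, not part of the original slides: the function names are hypothetical, WER is computed as word-level edit distance divided by reference length, and "y%" is assumed to mean a relative reduction against a baseline.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def milestone_met(baseline_wer, current_wer, target_reduction_pct):
    """True if WER fell by at least target_reduction_pct, relative to baseline.

    Assumption: "reduce WER by y%" is read as a relative reduction.
    """
    reduction_pct = (baseline_wer - current_wer) / baseline_wer * 100
    return reduction_pct >= target_reduction_pct


# Toy example: one substitution and one deletion against a 6-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))
print(milestone_met(baseline_wer=0.4, current_wer=0.2, target_reduction_pct=25))
```

A yes/no function like `milestone_met` is what the slide asks of a milestone: there should be no question whether it has been accomplished.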
Grand Challenges
Goals: 1. The multilingual companion 2. Life log
Goal: Produce NLP apps that improve the way people communicate with one another
Goal: Reduce barriers to entry
€€€
Apps & Techniques
Evaluation
Resources
Summary: What Worked and What Didn’t?
What’s the right answer? There’ll be a quiz at the end of the decade…
• Data
– Stay on msg: It is the data, stupid!
• WVLC (Very Large) >> EMNLP (Empirical Methods)
• If you have a lot of data,
– Then you don’t need a lot of methodology
• Rising Tide of Data Lifts All Boats
• Methodology
– Empiricism means different things to different people
1. Machine Learning (Self-organizing Methods)
2. Exploratory Data Analysis (EDA)
3. Corpus-Based Lexicography
– Lots of papers on 1
• EMNLP-2004 theme (error analysis): 2
• Senseval grew out of 3
Substance: Recommended if…
Magic: Recommended if…
Promise: Recommended if…
Short term ≠ Long term
Lonely
Backup
Speech Language
• Been great so far,
– But too much of a good thing…
• Take the good
Fire
• Fuel
– Infrastructure: Shared datasets and lexical resources
• WordNet, LDC, the Web
– Organizers
• Walker & Zampolli
– Funding
• DARPA (Charles Wayne), EU…
• Sparks
– Exciting Applications (The Web)
– Grand Challenges
– Leaders: Jelinek, Mercer, Miller, Kucera & Francis, Leech, Sinclair, Tukey, Liberman…
• Hi Ken,
• Rada probably has more to add, but obviously we would like to hear something about WSD or word senses. We are currently trying to move Senseval to include application-specific evaluations (eg within MT or IR, or in specialized domains) and to more general semantic analysis of text (eg frames or subcats). Something to inspire people in this direction would be great.
• Phil.
Organizational Innovations
(Radical → Mainstream)
• Late Submission Deadline
– Immediately after ACL notifications
• ACL was rejecting good papers for bad reasons
– Short review cycles → Freshness
Innovation
Checks & Balances
• Invest in the Future: Encourage Innovation
– Chair (Energetic, Promising, Source of new ideas)
– Co-chair (Established, Knows how it has been done)
• Inclusiveness:
– Thankless Chores → Marketing Carrots (Maximize # of reviewers)
– Balance program committee, reviewers (and hopefully submissions, acceptances and registrations):
• 1/3 stability, 1/3 promising, 1/3 outreach
• Diversity: experience, gender, geography, topic
– Hold conferences in Europe, Asia & America
• Huge potential market in Asia: 4 out of 5 jumbo jets
– Maintain 20-25% acceptance rate → Parallel Sessions & Posters
• Avoid incremental papers
– Average grades (low grade dominates) → Advocate + Second