It is the best of times (and the worst of times)
It is the best of times
(and the worst of times)
Kenneth Church
Microsoft
church@microsoft.com
Responsibility; Attribute Dangerous Positions to Others
Interesting & Controversial
Wow!
(What a difference a decade makes)
• Empiricism has come of age
– Radical Fringe → Mainstream
• 1993: Workshop on Very Large Corpora (WVLC)
– Intended to be a 1-time event
– But so successful that it evolved into a series of EMNLP conferences
• EMNLP-2004 received so many submissions that the program committee had to be expanded at the last minute
– Success/Catastrophe
[Figure: % Statistical Papers (0% to 100%) by ACL Meeting, 1985 to 2005; labels: Lonely, Preaching to Choir, Bob Moore, Fred Jelinek]
July 25, 2004 EMNLP-2004 & Senseval-2004 2
The Structure of Scientific
Revolutions (1962) – Kuhn (p.10)
• Paradigms
– Examples from Physics
• Aristotle’s Physica
• Ptolemy’s Almagest
• Newton’s Principia and Optics
• Franklin’s Electricity
• Lavoisier’s Chemistry
• Lyell’s Geology
• Two characteristics:
1. Sufficiently unprecedented to attract an enduring group of
adherents from competing modes of scientific activity
2. Simultaneously, sufficiently open-ended to leave all sorts of
problems for the redefined group of practitioners to resolve
Organizational Innovations
(Radical → Mainstream)
• Late Submission Deadline
– Immediately after ACL notifications
• ACL was rejecting good papers for bad reasons (Innovation)
– Short review cycles (Freshness)
• Invest in the Future: Encourage Innovation
– Chair (Energetic, Promising, Source of new ideas)
– Co-chair (Established, Knows how it has been done)
• Avoid incremental papers
– Reviewers prefer boring papers over radical ones
– Reviewers do what reviewers do; chairs correction
• Inclusiveness: Diversity → Growth (Sales)
– Thankless chores → Marketing carrots
– 1/3 promising, 1/3 stability, 1/3 outreach (Checks & Balances)
– Hold conferences in Europe, Asia & America
Short term ≠ Long term
What Worked and What Didn’t?
Data
Methodology
• Stay on msg: It is data, stupid!
– WVLC (Very Large) >> EMNLP (Empirical Methods)
– If you have a lot of data,
• Then you don’t need a lot of methodology
• Empiricism means diff things to diff people
1. Machine Learning (Self-organizing Methods)
2. Exploratory Data Analysis (EDA)
3. Corpus-Based Lexicography
• Lots of papers on 1
– EMNLP-2004 theme (error analysis) → 2
– Senseval grew out of 3
Kucera & Francis gave great invited talk (but they couldn’t follow 2 submitted talks)
Word Sense Disambiguation (WSD) History
• Bar-Hillel (1960):
– Abandoned Machine Translation (MT)
– Couldn’t see how to make progress on WSD (pen)
– Can’t translate without disambiguating
• bank (money) → banque
• bank (river) → banc
• 1990s
– Parallel text ≈ Labeled corpus for supervised training and testing
– Isn’t it great the translators have WSD labeled all this data for us!
• Yarowsky:
– Parallel corpus → encyclopedia + thesaurus
– Bilingual ≠ Monolingual
• interest
• wear
– ML: Co-training
• Supervised → Unsupervised
• Lexicography: Hector
– Joint collaboration: Oxford University Press & DEC
– flagging → flogging
• Senseval
A Road Rarely Taken:
Tukey’s Exploratory Data Analysis (EDA)
• Linear Regression
– Standard practice:
• Plug data into off-the-shelf package
• Publish (if “significant”)
– Better:
• Check for outliers
• Bowed residuals
– Evidence of a positive or negative derivative
• Deviations from assumptions (normality)
– Fanout
• Slocum’s Thesis (1981)
– “Proof” that CKY takes linear time
[Figure: two plots of Time (0 to 50000) against Sentence Length (0 to 30); label: No Result]
Standard texts (e.g., Aho)… consider … worst case… This assumption clearly fails to apply to natural language… Our experiments have shown that average-case time performance… is approximately linear (p. 102)
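The "bowed residuals" check on the slide above can be sketched in a few lines. This is only an illustration under stated assumptions: the timings are made up (not Slocum's data), and the names `fit_line`, `residuals` and `bow` are hypothetical helpers, not part of any package mentioned in the talk.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(xs, ys):
    """Observed value minus the fitted line, one residual per point."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def bow(xs, ys):
    """Mean residual of the outer thirds minus mean residual of the middle third.

    Near zero when a straight line is adequate; clearly positive when the
    data curve upward, clearly negative when they curve downward.
    """
    res = [r for _, r in sorted(zip(xs, residuals(xs, ys)))]
    k = len(res) // 3
    outer = res[:k] + res[-k:]
    middle = res[k:-k]
    return sum(outer) / len(outer) - sum(middle) / len(middle)


# Hypothetical parse times for sentence lengths 1..30 (assumed, not measured).
lengths = list(range(1, 31))
linear_time = [1000.0 * n for n in lengths]
cubic_time = [2.0 * n ** 3 for n in lengths]

bow_linear = bow(lengths, linear_time)
bow_cubic = bow(lengths, cubic_time)
print(f"bow (linear data) = {bow_linear:.3f}")
print(f"bow (cubic data)  = {bow_cubic:.1f}")
```

A regression package may well report a "significant" slope for both data sets; the residual check is what separates them.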
Many Machine Learning (ML) Techniques (SVMs,
Perceptrons) are Similar to (Logistic) Regression;
Rarely see EDA (Robust Statistical) Methods in ML
The Elements of Statistical Learning
– Hastie, Tibshirani, Friedman
(2001), p 380
Historical Context
• 1950s:
– Rigorous methodology
• Information theory
• Behaviorism
• Unfulfilled unrealistic expectations
– ALPAC report
– Whither Speech Recognition?
• 1970s:
– Let it all hang out
• Artificial Intelligence
• Cognitive Psychology
• 1990s:
– Revival of empiricism
[Figure: % Statistical Papers (0% to 100%) by ACL Meeting, 1985 to 2005; labels: Empiricists feel lonely, Rationalists feel lonely, Kuhn Crisis (twice), Bob Moore, Fred Jelinek]
Borrowed Slide: Jelinek (LREC)
“Whither Speech Recognition?”
Pierce, JASA 1969 (Bell Labs exec; also ALPAC chair)
…ASR is attractive to money. The attraction is perhaps
similar to the attraction of schemes for turning water
into gasoline, extracting gold from the sea, or going
to the moon.
Most recognizers behave not like scientists, but like
mad inventors or untrustworthy engineers.
…performance will continue to be very limited unless
the recognizing device understands what is being
said with something of the facility of a native speaker
(that is, better than a foreigner fluent in the language)
Any application of the foregoing discussion to work in
the general area of pattern recognition is left as an
exercise for the reader.
ALPAC (1966): the (in)famous report
John Hutchins
• The best known event in the history of MT is …
– Automatic Language Processing Advisory Committee (ALPAC)
• Its effect was to bring to an end the substantial funding
of MT research in US for some twenty years.
– More significantly was the clear message to the general public
and the rest of the scientific community that MT was hopeless.
– For years afterwards, an interest in MT was something to keep
quiet about; it was almost shameful.
– To this day, the 'failure' of MT is still repeated by many as an
indisputable fact.
• The impact of ALPAC is undeniable
– While the fame or notoriety of ALPAC is familiar,
– What the report actually said is now becoming less familiar and
often forgotten or misunderstood…
ALPAC Recommendations
The committee recommends expenditures in two distinct areas
Theory
• Computational linguistics as part of linguistics
– Studies of parsing, generation… including experiments in translation…
– Linguistics should be supported as science,
• and should not be judged by any immediate or foreseeable contribution to practical translation
Practice
• Improvement of translation:
1. practical methods for evaluation of translations;
2. means for speeding up the human translation process;
3. evaluation of quality and cost of various sources of translations;
4. investigation of the utilization of translations, to guard against production of translations that are never read;
5. study of delays in the over-all translation process, and means for eliminating them, both in journals and in individual items;
6. evaluation of the relative speed and cost of various sorts of machine-aided translation;
7. adaptation of existing mechanized editing and production processes in translation;
8. the over-all translation process; and
9. production of adequate reference works for the translator, including the adaptation of glossaries that now exist primarily for automatic dictionary look-up in machine translation
Best of Times Outline
• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…
Where have we been and where are we going?
Moore’s Law: Ideal Answer
Moores: Bob ≠ Gordon ≠ Roger
Borrowed Slide
Audrey Le (NIST)
[Figure: Error Rate against Date (15 years)]
Moore’s Law Time Constant:
• 10x improvement per decade
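As a worked example of the time constant above: if the 10x-per-decade improvement is assumed to accrue at a constant exponential rate (the slide does not say so), it can be restated per year. The function names are illustrative.

```python
import math


def annual_factor(total_factor: float, years: float) -> float:
    """Yearly improvement factor that compounds to total_factor over `years`."""
    return total_factor ** (1.0 / years)


def halving_time(total_factor: float, years: float) -> float:
    """Years for the error rate to halve at that constant rate."""
    return years * math.log(2.0) / math.log(total_factor)


per_year = annual_factor(10.0, 10.0)           # 10x per decade
yearly_error_reduction = 1.0 - 1.0 / per_year  # fraction of errors removed each year
years_to_halve = halving_time(10.0, 10.0)

print(f"{per_year:.3f}x per year")
print(f"{yearly_error_reduction:.1%} fewer errors each year")
print(f"error rate halves every {years_to_halve:.2f} years")
```

Under that assumption, 10x per decade is roughly a 21% relative error reduction per year, or a halving about every three years.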
Charles Wayne’s Challenge:
Demonstrate Consistent Progress Over Time
(Managing Expectations)
• Controversial in 1980s
– But not in 1990s
– Though, grumbling
• Benefits
1. Agreement on what to do
2. Limits endless discussion
3. Helps sell the field
• Manage expectations
• Fund raising
• Risks (similar to benefits)
1. All our eggs are in one
basket (lack of diversity)
2. Not enough discussion
• Hard to change course
3. Methodology Burden
Hockey Stick Business Case
[Figure: $ against time t; Last Year (2003), This Year (2004), Next Year (2005)]
Where have we been and where are we going?
Consistent Progress over Time: Extrapolation/Prediction is Applicable
Manage Expectations: Extrapolation/Prediction is Not Applicable
[Figure: $ against time t, 2003 to 2005]
When will we see the last non-statistical paper? 2010?
[Figure: % Statistical Papers (0% to 100%) by ACL Meeting, 1985 to 2005; labels: Bob Moore, Fred Jelinek]
Top Ten Metrics of Success
1. Value Creation (Reality)
2. Stock Prices (Belief)
3. Startup Companies Raise Venture Capital (Excitement)
4. Prototype Applications (Plausibility)
5. Grand-Students (Survive the Test of Time)
6. Students Get Good Jobs
7. Students Finish PhD Theses
8. Citations
9. Conference Registrations
10. Publications (Quantity)
Callouts beside the list: Search (at the top), Speech (item 2), Senseval wants to be here (items 4 to 6), We are here (items 6 to 8)
Outline
• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…
Best of Times (Not!)
Been there; Done that
It has been claimed that
Recent progress made possible by Empiricism
Progress (or Oscillating Fads)?
• 1950s: Empiricism was at its peak
– Dominating a broad set of fields
• Ranging from psychology (Behaviorism)
• To electrical engineering (Information Theory)
– Psycholinguistics: Word frequency norms (correlated with reaction time, errors)
• Word association norms (priming): bread and butter, doctor / nurse
– Linguistics/psycholinguistics: focus on distribution (correlate of meaning)
• Firth: “You shall know a word by the company it keeps”
• Collocations: Strong tea v. powerful computers
• 1970s: Rationalism was at its peak
– with Chomsky’s criticism of ngrams in Syntactic Structures (1957)
– and Minsky and Papert’s criticism of neural networks in Perceptrons (1969).
• 1990s: Revival of Empiricism
– Availability of massive amounts of data (popular arg, even before the web)
• “More data is better data”
• Quantity >> Quality (balance)
– Pragmatic focus:
• What can we do with all this data?
• Better to do something than nothing at all
– Empirical methods (and focus on evaluation): Speech → Language
• 2010s: Revival of Rationalism (?)
• Periodic signals are continuous
• Support extrapolation/prediction
• Progress? Consistent progress?
Consistent progress?
Extrapolation/Prediction: Applicable?
Has the pendulum swung too far?
Speech → Language
• What happened between TMI-1992 and TMI-2002 (if anything)?
• Have empirical methods become too popular?
– Has too much happened since TMI-1992?
• I worry that the pendulum has swung so far that
– We are no longer training students for the possibility
• that the pendulum might swing the other way
• We ought to be preparing students with a broad education including:
– Statistics and Machine Learning
– as well as Linguistic Theory
• History repeats itself: Mark Twain; bad idea then and still a bad idea now
– 1950s: empiricism
– 1970s: rationalism (empiricist methodology became too burdensome)
– 1990s: empiricism
– 2010s: rationalism (empiricist methodology is burdensome, again)
Has the pendulum swung too far?
Speech → Language
• What happened between TMI-1992 and TMI-2002 (if anything)?
• Have empirical methods become too popular?
– Has too much happened since TMI-1992?
• I worry that the pendulum has swung so far that
– We are no longer training students for the possibility
• that the pendulum might swing the other way
• We ought to be preparing students with a broad education including:
– Statistics and Machine Learning
– as well as Linguistic Theory
• History repeats itself:
– 1950s: empiricism
– 1970s: rationalism (empiricist methodology became too burdensome)
– 1990s: empiricism
– 2010s: rationalism (empiricist methodology is burdensome, again)
Plays well at Machine Translation conferences
Grandparents and grandchildren have a natural alliance…
(row) | Rationalism | Empiricism
Well-known advocates | Chomsky, Minsky | Shannon, Skinner, Firth, Harris
Model | Competence Model | Noisy Channel Model
Contexts of Interest | Phrase-Structure | N-Grams
Goals | All and Only; Explanatory; Theoretical | Minimize Prediction Error (Entropy); Descriptive; Applied
Linguistic Generalizations | Agreement & Wh-movement | Collocations & Word Associations
Parsing Strategies | Principle-Based, CKY (Chart), ATNs, Unification | Forward-Backward (HMMs), Inside-outside (PCFGs)
Applications | Understanding: Who did what to whom | Recognition: Noisy Channel Applications
Covering all the Bases
It is hard to make predictions (especially about the future)
• When will we see the last non-statistical paper?
– 2010?
• Revival of rationalism:
– 2010?
The answer to any question: 6 years!
Outline
• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…
Rising tide of data lifts all boats
No matter what happens, it’s goin’ be great!
Rising Tide of Data Lifts All Boats
If you have a lot of data, then you don’t need a lot of methodology
• 1985: “There is no data like more data”
– Fighting words uttered by radical fringe elements (Mercer at
Arden House)
• 1993 Workshop on Very Large Corpora
– Perfect timing: Just before the web
– Couldn’t help but succeed
– Fate
• 1995: The Web changes everything
• All you need is data (magic sauce)
– No linguistics
– No artificial intelligence (representation)
– No machine learning
– No statistics
– No error analysis
“It never pays to think until you’ve run out of data” – Eric Brill
Quoted out of context
Moore’s Law Constant: Data Collection Rates → Improvement Rates
Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001)
No consistently best learner
More data is better data!
Fire everybody and spend the money on data
Borrowed Slide: Jelinek (LREC)
Benefit of Data
LIMSI: Lamel (2002) – Broadcast News
[Figure: WER against hours; Supervised: transcripts; Lightly supervised: closed captions]
The rising tide of data will lift all boats!
TREC Question Answering & Google:
What is the highest point on Earth?
The rising tide of data will lift all boats!
Acquiring Lexical Resources from Data:
Dictionaries, Ontologies, WordNets, Language Models, etc.
http://labs1.google.com/sets
England Japan Cat cat
France China Dog more
Germany India Horse ls
Italy Indonesia Fish rm
Ireland Malaysia Bird mv
Spain Korea Rabbit cd
Scotland Taiwan Cattle cp
Belgium Thailand Rat mkdir
Canada Singapore Livestock man
Austria Australia Mouse tail
Australia Bangladesh Human pwd
Rising Tide of Data Lifts All Boats
If you have a lot of data, then you don’t need a lot of methodology
• More data → better results
– TREC Question Answering
• Remarkable performance: Google
and not much else
– Norvig (ACL-02)
– AskMSR (SIGIR-02)
– Lexical Acquisition
• Google Sets
– We tried similar things
» but with tiny corpora
» which we called large
Applications
(Don’t worry; Be happy)
• What good is word sense disambiguation (WSD)?
– Information Retrieval (IR)
5 Ian Andersons
• Salton: Tried hard to find ways to use NLP to help IR
– but failed to find much (if anything)
• Croft: WSD doesn’t help because IR is already using those
methods
• Sanderson (next two slides)
– Machine Translation (MT)
• Original motivation for much of the work on WSD
• But IR arguments may apply just as well to MT
• What good is POS tagging? Parsing? NLP? Speech?
• Commercial Applications of Natural Language
Processing, CACM 1995
– $100M opportunity (worthy of government/industry’s attention)
1. Search (Lexis-Nexis)
2. Word Processing (Microsoft) ALPAC
• Warning: premature commercialization is risky
Sanderson (SIGIR-94)
http://dis.shef.ac.uk/mark/cv/publications/papers/my_papers/SIGIR94.pdf
• Could WSD help IR?
• Answer: no (Not much?)
– Introducing ambiguity by pseudo-words doesn’t hurt (much)
[Figure: F against Query Length (Words); label: 5 Ian Andersons]
Short queries matter most, but hardest for WSD
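A minimal sketch of the pseudo-word idea: conflate unrelated words into one artificial token so that the token is ambiguous by construction, then run the retrieval experiment on the altered text. Pairing words by sorted order, the function names and the toy vocabulary are assumptions made for illustration; this is not a reproduction of Sanderson's procedure.

```python
def make_pseudo_words(vocabulary, size=2):
    """Map each word to a pseudo-word built from a group of `size` words.

    Words are grouped in sorted order for reproducibility; leftover words
    that do not fill a complete group are left unmapped.
    """
    words = sorted(vocabulary)
    mapping = {}
    for start in range(0, len(words) - size + 1, size):
        group = words[start:start + size]
        pseudo = "/".join(group)
        for word in group:
            mapping[word] = pseudo
    return mapping


def add_ambiguity(tokens, mapping):
    """Replace every mapped token by its pseudo-word; keep the rest as is."""
    return [mapping.get(token, token) for token in tokens]


mapping = make_pseudo_words({"bank", "river", "money", "interest"})
ambiguous = add_ambiguity(["river", "bank", "money", "loan"], mapping)
print(mapping)
print(ambiguous)
```

Because the original word behind each pseudo-word is known, the same trick yields a labeled test set for a disambiguator at no annotation cost.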
Sanderson (SIGIR-94)
http://dis.shef.ac.uk/mark/cv/publications/papers/my_papers/SIGIR94.pdf
Soft WSD?
• Resolving ambiguity badly is worse than not resolving at all
– 75% accurate WSD degrades performance
– 90% accurate WSD: breakeven point
[Figure: F against Query Length (Words)]
Some Promising Suggestions
(Generate lots of conference papers, but may not support the field)
• Two Languages are Better than One
– For many classic hard NLP problems
• Word Sense Disambiguation (WSD)
• PP-attachment
• Conjunction
• Predicate-argument relationships
• Japanese and Chinese Word breaking
– Parallel corpora → plenty of annotated (labeled) testing and training data
– Don’t need unsupervised magic (data >> magic)
• Demonstrate that NLP is good for something
An example of Error Analysis/Representation
– Statistical methods (IR & WSD) focus on bags of nouns,
• Ignoring verbs, adjectives, predicates, intensifiers, etc.
– Hypothesis: Ignored because perceptrons can’t model XOR
– Task: classify “comments” into “good,” “bad” and “neutral”
• Lots of terms associated with just one category
• Some associated with two
– Depending on argument
• Good & Bad, but not neutral: Mickey Mouse, Rinky Dink
– Bad: Mickey Mouse(us)
– Good: Mickey Mouse(them)
– Current IR/WSD methods don’t capture predicate-argument relationships
Senseval++
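The XOR hypothesis above can be illustrated with a toy perceptron. The encoding is an assumption made for this sketch: one feature says whether the term is derogatory ("Mickey Mouse"), the other whether its argument is "them" rather than "us", and a comment counts as good when the two agree (derogatory about them, or complimentary about us). The complimentary cases are not on the slide; they are added here to complete the XOR pattern.

```python
def predict(weights, bias, features):
    """Threshold unit: 1 if the weighted sum is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_perceptron(examples, epochs=100):
    """Classic perceptron updates over a fixed number of passes."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            if error:
                weights = [w + error * x for w, x in zip(weights, features)]
                bias += error
    return weights, bias


def accuracy(examples, weights, bias):
    hits = sum(predict(weights, bias, f) == y for f, y in examples)
    return hits / len(examples)


# (term is derogatory, argument is "them") -> 1 if good for us, else 0.
bag_of_words = [((0, 0), 1), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
# Same data plus one predicate-argument feature: term AND argument together.
with_pairing = [((t, a, t * a), y) for (t, a), y in bag_of_words]

plain_acc = accuracy(bag_of_words, *train_perceptron(bag_of_words))
paired_acc = accuracy(with_pairing, *train_perceptron(with_pairing))
print(f"independent features: {plain_acc:.2f}")
print(f"with term-argument feature: {paired_acc:.2f}")
```

No linear threshold unit can get all four independent-feature cases right, while a single term-argument conjunction makes the problem separable, which is one way to read the slide's point about predicate-argument relationships.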
Supervision >> Magic > Baseline
Bragging Rights
[Figure: English Lexical Sample (fine-grained scoring); precision and recall (0% to 90%) for each system, grouped as Unsupervised (Magic), Supervised (Supervision) and Baseline]
http://www.sle.sharp.co.uk/senseval2/Results/all_graphs.xls
Baseline
Breakdown by
Systems & Words
• Spelling correction task
– Golding & Schabes (1996)
• Some methods work
better on some words
– and other methods work
better on other words
• Should break down Senseval results by both systems and words
• Discover opportunities for
hybrids across systems
• Error analysis
– POS distinctions (easy)
– Local context (trigrams)
– Larger contexts (IR)
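A breakdown by systems and words might look like the sketch below. The systems, confusion sets and accuracies are invented for illustration (they are not Golding & Schabes' or Senseval's numbers), and the hybrid is an oracle that picks the best system per word on the same data, so it is an upper bound rather than a result.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def best_system_per_word(scores):
    """For each word, the name of the system with the highest accuracy."""
    words = next(iter(scores.values())).keys()
    return {word: max(scores, key=lambda name: scores[name][word]) for word in words}


def hybrid_accuracy(scores):
    """Mean accuracy if each word were routed to its best system (oracle)."""
    best = best_system_per_word(scores)
    return mean(scores[name][word] for word, name in best.items())


# Hypothetical accuracy per system and per word (assumed values).
scores = {
    "system_A": {"their/there": 0.95, "peace/piece": 0.80, "than/then": 0.90},
    "system_B": {"their/there": 0.85, "peace/piece": 0.92, "than/then": 0.88},
}

system_means = {name: mean(by_word.values()) for name, by_word in scores.items()}
best = best_system_per_word(scores)
hybrid = hybrid_accuracy(scores)
print(system_means)
print(best)
print(f"oracle hybrid: {hybrid:.3f}")
```

In this toy table both systems have the same overall score, so a single-number ranking says nothing, while the per-word view shows where a hybrid could gain.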
Goals of Shared Evaluations
• Benchmarking:
– How hard are various problems?
– What makes problems easier or harder?
– Rate of progress?
• Shared learnings
– What works and what doesn’t?
– Compare and contrast
– Error analysis
– Rising tide lifts all boats
• Marketing & Sales
– Scores going up and up
Funding goes up and up
• Not bragging rights:
– Mirror, mirror on the wall, who’s the smartest of them all…
[Figure: English Lexical Sample (fine-grained scoring); precision and recall (0% to 90%) for each system, grouped as Unsupervised, Supervised and Baseline]
Outline
• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…
According to unnamed sources:
Speech Winter → Language Winter
Dot Boom → Dot Bust
Early Warning Signs for Future Kuhn Crisis
• Senseval feels the need to demonstrate applications of their stuff (and maybe there aren’t any)
• Complacency (don’t worry; be happy)
– Too little dissent: students aren’t rebelling against their teachers
– I get uncomfortable when
• There is so much agreement on what to do and so much optimism
• And so few worries and so little dissent/controversy.
Campbell (ACL-04): Rules >> ML
• Mindless Metrics
– Whatever you measure, you get…
– Scores go up and up and up, but are we really doing better?
• According to the scores, parsing is doing well without words,
• But you can’t solve classic problems (PPs) without words!
• Burdensome Methodology → Exclusiveness
– Can’t play (in speech) unless you work in a big lab
• Following Speech off a Cliff
– Empirical methods: Speech → Language (Been great, but…)
– Speech Winter → Language Winter (Dot Boom → Dot Bust)
– What goes up, (usually) comes down…
Sample of 20 Survey Questions
(Strong Emphasis on Applications)
• When will
– More than 50% of new PCs have dictation on them, either at
purchase or shortly after.
– Most telephone Interactive Voice Response (IVR) systems
accept speech input.
– Automatic airline reservation by voice over the telephone is the
norm.
– TV closed-captioning (subtitling) is automatic and pervasive.
– Telephones are answered by an intelligent answering machine
that converses with the calling party to determine the nature and
priority of the call.
– Public proceedings (e.g., courts, public inquiries, parliament,
etc.) are transcribed automatically.
• Two surveys of ASRU attendees: 1997 & 2003
2003 Responses ≈ 1997 Responses + 6 Years
(6 years of hard work → No progress)
Top Ten Metrics of Success
(Risky to Promise Apps and Fail to Deliver)
1. Value Creation (Reality)
2. Stock Prices (Belief)
3. Startup Companies Raise Venture Capital (Excitement)
4. Prototype Applications (Plausibility)
5. Grand-Students (Survive the Test of Time)
6. Students Get Jobs
7. Students Finish PhD Theses
8. Citations
9. Conference Registrations
10. Publications (Quantity)
Callouts beside the list: Search (at the top), Speech (item 2), Senseval wants to be here (items 4 to 6), We are here (items 6 to 8)
Wrong Apps?
• New Priorities
– Increase demand for space >> Data entry
• New Killer Apps
– Search >> Dictation
• Speech Google!
– Data mining
• Old Priorities
– Dictation app dates back to days of dictation machines
– Speech recognition has not displaced typing
• Speech recognition has improved
• But typing skills have improved even more
– My son will learn typing in 1st grade
– Secretaries rarely take dictation
– Dictation machines are history
• My son may never see one
• Museums have slide rules and steam trains
– But dictation machines?
Speech Data Mining
& Call Centers:
An Intelligence Bonanza
• Some companies are collecting
information with technology
designed to monitor incoming calls
for service quality.
• Last summer, Continental Airlines
Inc. installed software from
Witness Systems Inc. to monitor
the 5,200 agents in its four
reservation centers.
• But the Houston airline quickly
realized that the system, which
records customer phone calls and
information on the responding
agent's computer screen, also was
an intelligence bonanza, says
André Harris, reservations training
and quality-assurance director.
Speech Data Mining
• Label calls as success or failure based on
some subsequent outcome (sale/no sale)
• Extract features from speech
• Find patterns of features that can be used
to predict outcomes
• Hypotheses:
– Customer: “I’m not interested” → no sale
– Agent: “I just want to tell you…” → no sale
Inter-ocular effect (hits you between the eyes);
Don’t need a statistician to know which way the wind is blowing
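The label, extract and predict steps above can be sketched on transcripts. Everything here is assumed for illustration: the calls are invented, the "features" are plain substring matches on text rather than anything extracted from audio, and `sale_rate_by_phrase` is a hypothetical helper.

```python
def sale_rate_by_phrase(calls, phrases):
    """For each phrase, the fraction of calls containing it that ended in a sale.

    `calls` is a list of (transcript, sold) pairs; phrases that never occur
    are left out of the result.
    """
    rates = {}
    for phrase in phrases:
        outcomes = [sold for transcript, sold in calls if phrase in transcript.lower()]
        if outcomes:
            rates[phrase] = sum(outcomes) / len(outcomes)
    return rates


# Invented calls, labeled by the subsequent outcome (True = sale).
calls = [
    ("Customer: I'm not interested, thanks.", False),
    ("Agent: I just want to tell you about our offer.", False),
    ("Customer: That sounds good, sign me up.", True),
    ("Customer: I'm not interested right now.", False),
    ("Agent: Your seat is confirmed.", True),
]
phrases = ["i'm not interested", "i just want to tell you", "sign me up"]

rates = sale_rate_by_phrase(calls, phrases)
for phrase, rate in rates.items():
    print(f"{phrase!r}: sale rate {rate:.0%}")
```

With patterns this strong, a table of rates is all the analysis needed, which is the inter-ocular point made on the slide.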
Ways for Conferences to Fail
• Incrementalism/Burdensome Methodology (Lesson from 1950s)
– We do research for fun and profit – Arno Penzias
– Fun and/or Profit >> By-the-Book Correctness
• Arrogance, Mindless Metrics, etc.
• Control
– Too much control
• Excessive Exclusiveness (mutual admiration society/old-boy network)
• Change (serendipity) is essential: New and Different Fun and Excitement
• Growth and prosperity depends on new talent (students) & new topics
• Can’t afford to keep doing what we already know how to do
– Too little control
• Stay on msg: It’s data, stupid! (Our msg ≠ ACL’s)
• Set Inappropriate Expectations
– Promise too little (rarely a problem, especially with thesis proposals)
• Senseval feels the need to become more applied
– Promise too much: Promise Applications and Fail to Deliver
– Success/Catastrophe (rarely a problem, except for March of Dimes)
• What if we actually achieved all our goals?
Ways for Conferences to Succeed
• I wish I knew…
• Fate (can’t fail)
– Rising Tide of Data Lifts All Boats
• Luck/timing: WVLC-93 was just before Web
• Sales & Marketing
– Evaluation, Evaluation, Evaluation
• Strategic Vision
– In retrospect, 1993 WVLC worked wonderfully
– Distinguished us from mainstream
– Offered excitement and hope for future
• Especially appealing to students (growth opportunity)
Borrowed Slide: Jelinek (LREC)
Great Strategy → Success
Great Challenge: Annotating Data
• Produce annotated data with minimal supervision
Self-organizing “Magic” ≠ Error Analysis
• Active learning
– Identify reliable labels
– Identify best candidates for annotation
• Co-training
• Bootstrap (project) resources from one
application to another
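One common reading of the two active-learning bullets is uncertainty sampling: keep the labels the model is confident about, and send the items it is least sure of to human annotators. The sketch below assumes that reading. The probabilities, the 0.9 confidence threshold, the annotation budget, and the name `split_for_annotation` are all hypothetical and are not taken from Jelinek's slide.

```python
def split_for_annotation(probs, confident=0.9, budget=2):
    """Split items into reliably auto-labeled ones and the best candidates for human annotation.

    probs maps an item id to a (hypothetical) model probability of the positive label.
    """
    # Reliable labels: the model's top class has probability >= confident.
    reliable = {
        item: (p >= 0.5)
        for item, p in probs.items()
        if max(p, 1 - p) >= confident
    }
    # Remaining items, most uncertain first (probability closest to 0.5).
    uncertain = sorted(
        (item for item in probs if item not in reliable),
        key=lambda item: abs(probs[item] - 0.5),
    )
    # Only as many items as the annotation budget allows go to annotators.
    return reliable, uncertain[:budget]


# Hypothetical model outputs for five unlabeled items.
probs = {"a": 0.97, "b": 0.52, "c": 0.08, "d": 0.61, "e": 0.45}
reliable, to_annotate = split_for_annotation(probs)
print("auto-labeled:", reliable)
print("send to annotators:", to_annotate)
```

The threshold trades annotation cost against label noise: raising `confident` sends more items to humans, while lowering it accepts more machine labels unchecked.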
Grand Challenges
ftp://ftp.cordis.lu/pub/ist/docs/istag040319-draftnotesofthemeeting.pdf
Roadmaps: Structure of a Strategy
(not the union of what we are all doing)
• Goals
– Example: Replace keyboard with microphone
– Exciting (memorable) sound bite
– Broad grand challenge that we can work toward but never solve
• Metrics
– Examples:
• WER: word error rate
• Time to perform task
– Easy to measure
• Milestones
– Should be no question if it has been accomplished
– Example: reduce WER on task x by y% by time t
• Accomplishments v. Activities
– Accomplishments are good
– Activity is not a substitute for accomplishments
– Milestones look forward whereas accomplishments look backward
• Small is beautiful
– Quantity is not a good thing
– Awareness
– 1-slide version
• if successful, you get maybe 3 more slides
• Size of container
– Goal: 1-3
– Metrics: 3
– Milestones: a dozen
• Mostly for next year: Q1-4
• Plus some for years 2, 5, 10 & 20
– Accomplishments: a dozen
• Broad applicability & illustrative
– Don’t cover everything
– Highlight stuff that
• Applies to multiple groups
• Forward-Looking / Exciting
• Serendipity is good!
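The WER metric and the milestone "reduce WER on task x by y%" are both easy to measure, which is the point of the slide. A minimal sketch follows, using the standard definition of word error rate (word-level edit distance divided by reference length). The function names and example sentences are illustrative assumptions, and real evaluations normally apply text normalization that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, current: float) -> float:
    """Percentage by which `current` improves on `baseline` (the y in 'reduce WER by y%')."""
    return (baseline - current) / baseline * 100


wer = word_error_rate(
    "replace the keyboard with a microphone",
    "replace keyboard with the microphone",
)
print(f"WER: {wer:.3f}")
print(f"relative reduction from 0.20 to 0.15: {relative_reduction(0.20, 0.15):.1f}%")
```

A milestone stated this way leaves no question about whether it has been met: compute WER on the agreed task at time t and compare the relative reduction with y.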
Grand Challenges
Goals:
1. The multilingual companion
2. Life log
Goal: Produce NLP apps that improve the way people communicate with one another
Goal: Reduce barriers to entry €€€
Diagram labels: Apps & Resources, Techniques, Evaluation
Summary: What Worked and What Didn’t?
• Data
– Stay on msg: It is the data, stupid!
• WVLC (Very Large) >> EMNLP (Empirical Methods)
• If you have a lot of data,
– Then you don’t need a lot of methodology
• Rising Tide of Data Lifts All Boats
• Methodology
– Empiricism means different things to different people
1. Machine Learning (Self-organizing Methods)
2. Exploratory Data Analysis (EDA)
3. Corpus-Based Lexicography
– Lots of papers on 1
• EMNLP-2004 theme (error analysis): 2
• Senseval grew out of 3
What’s the right answer? There’ll be a quiz at the end of the decade…
Substance: Recommended if…
Magic: Recommended if…
Promise: Recommended if…
Short term ≠ Long term
Lonely
Backup
Speech Language
• Been great so far,
– But too much of a good thing…
• Take the good
Fire
• Fuel
– Infrastructure: Shared datasets and lexical resources
• WordNet, LDC, the Web
– Organizers
• Walker & Zampolli
– Funding
• DARPA (Charles Wayne), EU…
• Sparks
– Exciting Applications (The Web)
– Grand Challenges
– Leaders: Jelinek, Mercer, Miller, Kucera & Francis,
Leech, Sinclair, Tukey, Liberman…
• Hi Ken,
• Rada probably has more to add, but obviously we would
like to hear something about WSD or word senses. We
are currently trying to move Senseval to include
application-specific evaluations (eg within MT or IR, or in
specialized domains) and to more general semantic
analysis of text (eg frames or subcats). Something to
inspire people in this direction would be great.
• Phil.
Organizational Innovations
(Radical → Mainstream)
• Late Submission Deadline
– Immediately after ACL notifications
• ACL was rejecting good papers for bad reasons → Innovation
– Short review cycles → Freshness
• Invest in the Future: Encourage Innovation
– Chair (Energetic, Promising, Source of new ideas)
– Co-chair (Established, Knows how it has been done)
– Checks & Balances
• Inclusiveness:
– Thankless Chores → Marketing Carrots (Maximize # of reviewers)
– Balance program committee, reviewers (and hopefully submissions, acceptances and registrations):
• 1/3 stability, 1/3 promising, 1/3 outreach
• Diversity: experience, gender, geography, topic
– Hold conferences in Europe, Asia & America
• Huge potential market in Asia: 4 out of 5 jumbo jets
– Maintain 20-25% acceptance rate → Parallel Sessions & Posters
• Avoid incremental papers
– Average grades (low grade dominates) → Advocate + Second