Computational Linguistics at OSU
Chris Brew
Linguistics, Cognitive Science and CSE
The Ohio State University
Who am I?
Chris Brew, Associate Professor
Full-time in NLP since about 1984.
B.Sc Chemistry (Bristol)
Masters and Ph.D (Sussex)
NLP done in a Psychology department!
Research positions at Sussex, Edinburgh and in
industry (Sharp)
Faculty in Linguistics at Ohio State since 2000
Joint appointment in CSE
What I’ve done
Parsing and Dialogue
Machine Translation (teaching class now)
XML and corpus annotation
Learning word meanings from large datasets
Sound/Meaning relations
Other stuff…
Linguistics
Linguistics is the scientific study of language
and communication.
Linguists run experiments, do surveys, build
simulations, do proofs.
Linguistics at OSU is:
In the top 10 nationally
Diverse and open-minded
Strengths of Linguistics at OSU
Syntax, Semantics, Pragmatics
Phonetics: the study of how people make and
perceive the sounds of language
Psycholinguistics: the study of how people
process sounds, words, sentences, intonation
Sociolinguistics: the study of how society and
social situations change the way we speak.
Computational Linguistics and NLP
Computational Linguistics at OSU
3 faculty members and 20 students based in
Linguistics (Oxley Hall)
Detmar Meurers (Parsing, Corpus Annotation, Computer-
aided Language Learning)
Chris Brew (Statistical NLP)
Michael White (Natural Language Generation)
Close ties with Drs. Byron and Fosler-Lussier in
CSE.
We are willing and able to advise or co-advise on
research, and have projects that cross the
departmental boundaries.
Computational Linguistics
Data Intensive Linguistics: using large
datasets to answer questions about language
How do children learn language?
How do technical terms get their meanings?
Why do people have so little difficulty
understanding what others are saying?
How are words stored in the brain?
Computational Linguistics
Machine understanding: building machines
that read, write, converse using natural
language.
Several well-known subtasks
Tokenization: splitting text into tokens
Parsing: building syntax trees
Building meaning representations (MR)
Generating language from MR
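To make the first subtask concrete, here is a minimal tokenization sketch. The regular expression and the function name `tokenize` are illustrative assumptions, not a tool used in the group. It splits on whitespace, keeps simple contractions together, and breaks off punctuation, which already shows where ambiguity creeps in: the period after "Dr" is treated the same as the sentence-final period.

```python
import re

# Assumed pattern: a run of word characters, optionally followed by an
# apostrophe and more word characters, OR any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Dr. Smith isn't here."))
# ['Dr', '.', 'Smith', "isn't", 'here', '.']
```

A robust tokenizer would need to decide whether "Dr." is one token or two, which is the kind of decision machine-learned taggers are often used for.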
Computational Linguistics
NLP: building systems that do useful or
interesting things with language
Summarization
Machine Translation
Question Answering
Document Understanding
Relation to CSE
Challenging problems in working with large
datasets.
Document classification is large along three
dimensions
Large number of available predictive features (10^4
different words in typical collections)
Many instances (1000s or millions of sentences)
Many possible outputs (e.g. classify against the 100s of
labels in the DMOZ hierarchy)
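The first of these dimensions can be seen in a toy bag-of-words encoding, sketched below under simple assumptions (lowercasing, whitespace splitting, a hypothetical helper named `bag_of_words`). Every distinct word in the collection becomes one feature, so the feature count grows with the vocabulary and the instance count grows with the number of documents.

```python
from collections import Counter

def bag_of_words(documents):
    """Map each document to a vector of word counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    # One feature per distinct word seen anywhere in the collection.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["the cat sat", "the dog sat down"])
print(vocabulary)  # ['cat', 'dog', 'down', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

With a realistic collection the vectors are very long and mostly zero, which is why sparse representations are normally used instead of plain lists.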
Relation to CSE
Consumer of CS tools
Tokenization, Parsing
Could use lex and yacc (javacc/antlr), but beware
ambiguity
Many special purpose parsers, taggers, chunkers that
use machine learning to achieve robustness
Machine understanding
AI-complete
Prolog and other PL innovations grew out of NL research
Why the world cares
1700 biology papers per day. Nobody can keep
up. UNDERSTAND/SUMMARIZE
Ad placement in search engines. Perhaps you can
spot a search for flights to Paris, place a successful
sidebar ad for expensive and elegant evening wear.
INTENT
Automated essay grading. CLASSIFICATION
Too many emails to monitor. Spooks can’t keep up.
Especially in Arabic
There is demand…
Develop language-independent algorithms,
techniques, and methodologies to support rapid
development of the basic resources … for any
arbitrary language with a written form. Corpus-based
unsupervised and lightly-supervised methods are
acceptable, as are lightweight elicitation
methodologies from untrained native speakers or
other generally available (in the US) informants.
Research on English and Foreign Language
EXploitation (REFLEX), Broad Agency
Announcement (BAA) 04-01-FH, 15 March
2004
Current work
NSF Career project
Key idea: dimensionality reduction for linguistic
data.
Hypothesis: neighborhood structure is more
important and cognitively salient than (for
example) preserving detail of long-distance
relationships
Compare: min-cut, LLE, SNE, LSI
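One way to make "neighborhood structure" measurable is to ask how many of each point's k nearest neighbors survive a reduction to fewer dimensions. The sketch below is an illustrative assumption, not the project's actual evaluation: the function names and the toy coordinates are made up, and the "reduction" is simply dropping a coordinate.

```python
import math

def nearest_neighbors(points, k):
    """For each point, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, point in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors

def neighborhood_preservation(high, low, k):
    """Fraction of k-nearest-neighbor relations kept by the low-dim points."""
    before = nearest_neighbors(high, k)
    after = nearest_neighbors(low, k)
    kept = sum(len(a & b) for a, b in zip(before, after))
    return kept / (k * len(high))

# Toy data (hypothetical): two well-separated pairs of points in 3-D.
high = [(0, 0, 0), (1, 0, 5), (10, 0, 0), (11, 0, 5)]
keep_first_two = [(x, y) for x, y, z in high]  # drops the third coordinate
keep_third_only = [(z,) for x, y, z in high]   # keeps only the third one

print(neighborhood_preservation(high, keep_first_two, k=1))   # 1.0
print(neighborhood_preservation(high, keep_third_only, k=1))  # 0.0
```

A score like this could be computed for the output of any of the methods being compared, since it only needs the original and the reduced coordinates.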
Paul Davis
Statistical Machine Translation
Is there a simple and flexible architecture for
Statistical MT?
Why: current systems are all built on an IBM design.
They all mess up
They all mess up in much the same way
Alternatives are needed.
Graduated 2002: now at Motorola Research
Martin Jansche
Learning String-to-String Transductions
(mostly for text-to-speech)
Bucks -> /b u k z/
Why: People were doing lots of this, but the
theory, the evaluation criteria and the quality
of the resulting systems left much to be
desired.
Graduated 2003: now at Columbia Center for
Machine Learning as research faculty
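For a transduction such as Bucks -> /b u k z/, one common way to score a system's output is the edit distance between the predicted and the reference phoneme sequences, normalized by the reference length. The sketch below shows that metric only; it is not claimed to be the evaluation criterion developed in the thesis, and the function names are assumptions. The reference sequence must be non-empty.

```python
def edit_distance(predicted, reference):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(reference) + 1))
    for i, p in enumerate(predicted, start=1):
        current = [i]
        for j, r in enumerate(reference, start=1):
            substitution = previous[j - 1] + (p != r)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def phoneme_error_rate(predicted, reference):
    """Edit distance divided by the number of reference phonemes."""
    return edit_distance(predicted, reference) / len(reference)

reference = "b u k z".split()
predicted = "b u k s".split()  # hypothetical system output with one wrong phoneme
print(phoneme_error_rate(predicted, reference))  # 0.25
```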
Nathan Vaillette
Formally verified string-to-string transductions.
Rule: aa -> b
Input aaacaa. What is the output?
bbcb ?
bacb ?
abcb ?
Why: rules like these are used a lot, but there is no
convincing account of exactly what they mean.
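The three candidate outputs correspond to three plausible readings of the same rule, which the sketch below makes explicit. These functions are an informal illustration with assumed names; they are not the formally verified model. Left-to-right and right-to-left application rewrite non-overlapping matches in scan order, while simultaneous application rewrites every match found in the original input, including overlapping ones.

```python
def rewrite_left_to_right(text, lhs, rhs):
    """Scan from the left, rewriting each non-overlapping match of lhs."""
    assert lhs, "the left-hand side of the rule must be non-empty"
    output = []
    i = 0
    while i < len(text):
        if text.startswith(lhs, i):
            output.append(rhs)
            i += len(lhs)
        else:
            output.append(text[i])
            i += 1
    return "".join(output)

def rewrite_right_to_left(text, lhs, rhs):
    """Scan from the right by reversing everything and scanning from the left."""
    return rewrite_left_to_right(text[::-1], lhs[::-1], rhs[::-1])[::-1]

def rewrite_simultaneous(text, lhs, rhs):
    """Rewrite every match in the original text, overlapping matches included."""
    assert lhs, "the left-hand side of the rule must be non-empty"
    output = []
    covered_until = 0  # end of the rightmost match seen so far
    for i in range(len(text)):
        if text.startswith(lhs, i):
            output.append(rhs)
            covered_until = max(covered_until, i + len(lhs))
        elif i >= covered_until:
            output.append(text[i])  # character not inside any match
    return "".join(output)

print(rewrite_simultaneous("aaacaa", "aa", "b"))   # bbcb
print(rewrite_left_to_right("aaacaa", "aa", "b"))  # bacb
print(rewrite_right_to_left("aaacaa", "aa", "b"))  # abcb
```

All three are defensible, so a rule notation that does not say which one is intended leaves the output underdetermined.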
… Nathan Vaillette
Used technology from hardware verification
(!) to build and implement a formal model of
the string rewriting process.
First ever implementation of this widely used
component for which the specification is clear
and the correspondence between
specification and implementation is provably
correct.
Graduated 2003. Now teaching AI at
Hampshire College
Sabine Schulte im Walde
Inducing German Verb Classes from Corpus
Data.
Why: build better dictionaries automatically
Why: difficult large dataset
Technology: k-means, spectral clustering
Graduated 2003 from University of Stuttgart.
Language Technology Manager with Duden
dictionaries, then research staff at University of
Saarbrücken
Kyuchul Yoon
Grapheme to Phoneme conversion for
Korean
Why: words of foreign origin need special
treatment, existing machine learning
approaches are too knowledge-free
Graduated 2005. Now at Pusan University
Anna Feldman
Using Czech language resources to
bootstrap resources for Russian
Why: Czech and Russian are supposed to be
related, but can we use this fact technologically?
Yes. Works, but not perfectly.
Same thing, for Spanish and Portuguese
Anton Rytting
Computational and experimental studies of
spoken language, emphasis on word
segmentation strategies that might be useful
to infants
Why: infants should be able to learn any
language.
Medical Informatics (very new)
Collaboration with John Pestian, Cincinnati
Children's Hospital Medical Center
Why: doctors provide discharge summaries
(i.e. text), we want information (mundanely:
ICD-9 terms as billing codes)
How: neural networks, careful encoding of
domain knowledge. Tuning of ICD-9 to
include/exclude terms that do/don't occur in
radiology summaries
What I’d like to do more of
Very large scale work
Unsupervised and lightly supervised learning
Cute applications of machine learning
Distributed and parallel NLP
What am I looking for?
People who can take an idea about learning from
data and turn it into a Master’s thesis. Especially
people who have side expertise in an application
area, such as medicine, biology, business, lion-
taming.
Might have funding for the right person, though
Linguistics Ph.D students take precedence.
People with very good communication and
programming skills who could collaborate with a
Linguistics student to make something better than
either could alone. Cognitive Science summer
fellowships.
Interesting new problems that can be learned from
data.