Learning for NLP

Computational Linguistics at OSU



Chris Brew

Linguistics, Cognitive Science and CSE

The Ohio State University

Who am I?



 Chris Brew, Associate Professor

 Full-time in NLP since about 1984.

 B.Sc Chemistry (Bristol)

 Masters and Ph.D (Sussex)

 NLP done in a Psychology department!

 Research positions at Sussex, Edinburgh and in industry (Sharp)

 Faculty in Linguistics at Ohio State since 2000

 Joint appointment in CSE

What I’ve done



 Parsing and Dialogue

 Machine Translation (teaching class now)

 XML and corpus annotation

 Learning word meanings from large datasets

 Sound/Meaning relations

 Other stuff…

Linguistics



 Linguistics is the scientific study of language and communication.

 Linguists run experiments, do surveys, build simulations, do proofs.

 Linguistics at OSU is:

 In the top 10 nationally

 Diverse and open-minded

Strengths of Linguistics at OSU



 Syntax, Semantics, Pragmatics

 Phonetics: the study of how people make and perceive the sounds of language

 Psycholinguistics: the study of how people process sounds, words, sentences, intonation

 Sociolinguistics: the study of how society and social situations change the way we speak.

 Computational Linguistics and NLP

Computational Linguistics at OSU



 3 faculty members and 20 students based in Linguistics (Oxley Hall)

 Detmar Meurers (Parsing, Corpus Annotation, Computer-aided Language Learning)

 Chris Brew (Statistical NLP)

 Michael White (Natural Language Generation)

 Close ties with Drs. Byron and Fosler-Lussier in CSE.

 We are willing and able to advise or co-advise on research, and have projects that cross the departmental boundaries.

Computational Linguistics



 Data Intensive Linguistics: using large datasets to answer questions about language

 How do children learn language?

 How do technical terms get their meanings?

 Why do people have so little difficulty understanding what others are saying?

 How are words stored in the brain?

Computational Linguistics



 Machine understanding: building machines that read, write, converse using natural language.

 Several well-known subtasks

 Tokenization: splitting text into words and other tokens

 Parsing: building syntax trees

 Building meaning representations (MR)

 Generating language from MR

Computational Linguistics



 NLP: building systems that do useful or interesting things with language

 Summarization

 Machine Translation

 Question Answering

 Document Understanding

Relation to CSE



 Challenging problems in working with large datasets.

 Document classification is large along three dimensions

 Large number of available predictive features (10^4 different words in typical collections)

 Many instances (1000s or millions of sentences)

 Many possible outputs (e.g. classify against the 100s of labels in the DMOZ hierarchy)
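
The first two dimensions can be made concrete with a bag-of-words count. The following is a minimal sketch, assuming whitespace tokenization and a made-up three-document collection; the function name `bag_of_words` and the sample sentences are illustrative, not taken from the talk.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


# Hypothetical three-document collection, for illustration only.
docs = [
    "the parser builds a syntax tree",
    "the tagger labels each word",
    "the parser and the tagger share a lexicon",
]

# One sparse feature vector per document (instance).
vectors = [bag_of_words(d) for d in docs]

# Every distinct word in the collection is one predictive feature.
vocabulary = sorted(set().union(*vectors))

print(len(docs), "instances")
print(len(vocabulary), "features")
```

Even this toy collection has more features than instances. In a realistic collection the vocabulary grows to the order of 10^4 words, which is why each vector is stored sparsely rather than as a full row.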

Relation to CSE



 Consumer of CS tools

 Tokenization, Parsing

 Could use lex and yacc (javacc/antlr), but beware ambiguity

 Many special purpose parsers, taggers, chunkers that use machine learning to achieve robustness

 Machine understanding

 AI-complete

 Prolog and other PL innovations caused by NL research

Why the world cares



 1700 biology papers per day. Nobody can keep up. (UNDERSTAND/SUMMARIZE)

 Ad placement in search engines. Perhaps you can spot a search for flights to Paris and place a successful sidebar ad for expensive and elegant evening wear. (INTENT)

 Automated essay grading. (CLASSIFICATION)

 Too many emails to monitor. Spooks can’t keep up.

 Especially in Arabic

There is demand…



 Develop language-independent algorithms, techniques, and methodologies to support rapid development of the basic resources … for any arbitrary language with a written form. Corpus-based unsupervised and lightly-supervised methods are acceptable, as are lightweight elicitation methodologies from untrained native speakers or other generally available (in the US) informants.

Research on English and Foreign Language EXploitation (REFLEX), Broad Agency Announcement (BAA), BAA 04-01-FH, 15 March 2004

Current work



 NSF Career project

 Key idea: dimensionality reduction for linguistic data.

 Hypothesis: neighborhood structure is more important and cognitively salient than (for example) preserving detail of long-distance relationships

 Compare: min-cut, LLE, SNE, LSI

Paul Davis



 Statistical Machine Translation

 Is there a simple and flexible architecture for Statistical MT?

 Why: current systems are all built on an IBM design.

 they all mess up

 they all mess up in much the same way

 Alternatives are needed.

 Graduated 2002: now at Motorola Research

Martin Jansche



 Learning String-to-String Transductions (mostly for text-to-speech)

 Bucks -> /b u k z/

 Why: People were doing lots of this, but the theory, the evaluation criteria and the quality of the resulting systems left much to be desired.

 Graduated 2003: now at Columbia Center for Machine Learning as research faculty

Nathan Vaillette



 Formally verified string-to-string transductions.

 Rule: aa -> b

 Input aaacaa. What is the output?

 bbcb ?

 bacb ?

 abcb ?

 Why: rules like these are used a lot, but there is no convincing account of exactly what they mean.
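
Each of the three candidate outputs follows from a different, equally plausible reading of the rule. The sketch below is illustrative only: the three application strategies and their function names are assumptions made here to show the ambiguity, not the formal model developed in the thesis.

```python
def rewrite_left_to_right(s, lhs, rhs):
    """Scan from the left; replace each match and skip past it.

    Assumes lhs is a non-empty string.
    """
    out, i = [], 0
    while i < len(s):
        if s.startswith(lhs, i):
            out.append(rhs)
            i += len(lhs)
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def rewrite_right_to_left(s, lhs, rhs):
    """Same scan, but starting from the right end of the string."""
    return rewrite_left_to_right(s[::-1], lhs[::-1], rhs[::-1])[::-1]


def rewrite_simultaneous(s, lhs, rhs):
    """Emit rhs once per match position, overlapping matches included;
    copy only the characters that no match covers."""
    starts = [i for i in range(len(s) - len(lhs) + 1) if s.startswith(lhs, i)]
    covered = {j for i in starts for j in range(i, i + len(lhs))}
    out = []
    for i, ch in enumerate(s):
        if i in starts:
            out.append(rhs)
        if i not in covered:
            out.append(ch)
    return "".join(out)


print(rewrite_simultaneous("aaacaa", "aa", "b"))   # bbcb
print(rewrite_left_to_right("aaacaa", "aa", "b"))  # bacb
print(rewrite_right_to_left("aaacaa", "aa", "b"))  # abcb
```

The rule text alone does not say which of these behaviours is intended, which is the gap a precise specification has to close.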

… Nathan Vaillette



 Used technology from hardware verification (!) to build and implement a formal model of the string rewriting process.

 First ever implementation of this widely used component for which the specification is clear and the correspondence between specification and implementation is provably correct.

 Graduated 2003: now teaching AI at Hampshire College

Sabine Schulte im Walde



 Inducing German Verb Classes from Corpus Data.

 Why: build better dictionaries automatically

 Why: difficult large dataset

 Technology: k-means, spectral clustering

 Graduated 2003 from University of Stuttgart. Language Technology Manager with Duden dictionaries, then research staff at University of Saarbrücken
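
The k-means step named above can be sketched in a few lines. This is a toy illustration, assuming each verb is described by made-up proportions of intransitive, transitive and ditransitive uses; the feature set, the numbers and the naive initialisation are assumptions made here, not the thesis setup.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) on equal-length tuples of floats."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment


# Hypothetical frame proportions: (intransitive, transitive, ditransitive).
verbs = {
    "schlafen": (0.90, 0.10, 0.00),
    "geben": (0.05, 0.25, 0.70),
    "laufen": (0.85, 0.15, 0.00),
    "schenken": (0.00, 0.30, 0.70),
}

labels = kmeans(list(verbs.values()), k=2)
for verb, label in zip(verbs, labels):
    print(verb, label)
```

With these invented numbers the two mostly intransitive verbs fall in one cluster and the two mostly ditransitive verbs in the other. Real corpus data is far larger and noisier, which is part of what makes the problem hard.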

Kyuchul Yoon



 Grapheme to Phoneme conversion for Korean

 Why: words of foreign origin need special treatment, existing machine learning approaches are too knowledge-free

 Graduated 2005: now at Pusan University

Anna Feldman



 Using Czech language resources to bootstrap resources for Russian

 Why: Czech and Russian are supposed to be related, but can we use this fact technologically?

 Yes. Works, but not perfectly.

 Same thing, for Spanish and Portuguese

Anton Rytting



 Computational and experimental studies of spoken language, emphasis on word segmentation strategies that might be useful to infants

 Why: infants should be able to learn any language.

Medical Informatics (very new)



 Collaboration with John Pestian, Cincinnati Children's Hospital Medical Center

 Why: doctors provide discharge summaries (i.e. text), we want information (mundanely: ICD-9 terms as billing codes)

 How: neural networks, careful encoding of domain knowledge. Tuning of ICD-9 to include/exclude terms that do/don't occur in radiology summaries

What I’d like to do more of



 Very large scale work

 Unsupervised and lightly supervised learning

 Cute applications of machine learning

 Distributed and parallel NLP

What am I looking for?



 People who can take an idea about learning from data and turn it into a Master’s thesis. Especially people who have side expertise in an application area, such as medicine, biology, business, lion-taming.

 Might have funding for the right person, though Linguistics Ph.D students take precedence.

 People with very good communication and programming skills who could collaborate with a Linguistics student to make something better than either could alone. Cognitive Science summer fellowships.

 Interesting new problems that can be learned from data.


