    CS224N Section 3: Project, Corpora




      Shrey Gupta
   January 28, 2011



        (Thanks to Bill MacCartney, Helen Kwong
         and Pi-Chuan Chang for these materials)
    Agenda
• Go through administrative details regarding the
  final project
• Presentations by research groups
• Resources for final project
    Final Project
• Proposal due in 2 weeks – Wed. 2/9
• Other details
  • Please read the final project guide
  • Projects from previous years:
    http://nlp.stanford.edu/courses/cs224n/
  • Proposal - Intended as a sanity check and to make sure
    that the topic is relevant to the course.
  • 34% of your grade
  • Team size: 1-3 member(s)
  • Reports and code due on 3/9 (late days allowed)
  • Project presentations on 3/17
    Project Ideas
• Topics from Syllabus
• Ideas listed in the project guide
• Papers from NLP conferences:
  http://www.cs.rochester.edu/~tetreaul/conferences.html
• Collaboration with research groups at Stanford
• Something you are really interested in!
Presentations by research groups
    Topics
• Relation Extraction in the Knowledge Base
  Population (KBP) context
• BioNLP Event Extraction
• Predicting U.S. Elections with Twitter
• Litigation Analysis - Outcome Prediction, Field
  Classification, Attorney Recommendation, Entity
  Resolution
• Document classification to identify outbreak-
  related web content
Resources
    Corpora
• Corpora@Stanford
  • http://www.stanford.edu/dept/linguistics/corpora/
  • Some are on AFS (/afs/ir/data/linguistic-data/); some
    are available on DVDs/CDs in the Linguistics Department
• LDC (Linguistic Data Consortium)
  • http://www.ldc.upenn.edu/Catalog/
• Links to many resources
  • http://nlp.stanford.edu/links/statnlp.html
    Treebanks
• Most widely used: Penn Treebank
   • There's PTB2 and PTB3. Use PTB3, i.e. Treebank-3
   • Contains:
     • 50,000 sentences (1,000,000 words) of WSJ text from
       1989
     • 30,000 sentences (400,000 words) of Brown corpus
   • Parsed WSJ trees:
     • /afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/
• BLLIP: like PTB, WSJ text, but 30m words, parsed
  automatically by Charniak
• Switchboard: telephone conversations
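
The parsed WSJ files store each sentence as a bracketed tree. Below is a minimal sketch of reading one such tree in Python, assuming the usual Lisp-style bracketing with an optional unlabeled outer pair of parentheses. The sample sentence and all function names are illustrative and not taken from the corpus.

```python
import re
from collections import deque

def tokenize(text):
    # Split into "(", ")" and any other run of non-space, non-paren characters.
    return deque(re.findall(r"\(|\)|[^\s()]+", text))

def parse_tree(tokens):
    """Parse one bracketed tree into (label, children); leaves are plain strings."""
    token = tokens.popleft()
    if token != "(":
        return token  # a word
    # The outer wrapper in treebank files may carry no label at all.
    label = "" if tokens[0] in ("(", ")") else tokens.popleft()
    children = []
    while tokens[0] != ")":
        children.append(parse_tree(tokens))
    tokens.popleft()  # consume the closing ")"
    return (label, children)

def tagged_words(tree):
    """Collect (word, POS tag) pairs from the preterminal nodes."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    return [pair for child in children for pair in tagged_words(child)]

# Illustrative sentence, not copied from the treebank.
sample = "( (S (NP (DT The) (NN cat)) (VP (VBD sat))) )"
tree = parse_tree(tokenize(sample))
print(tagged_words(tree))
```

The same traversal also yields the POS tags, which is one way to get a tagged corpus out of a treebank.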
    Parsed corpora in other languages
• Penn Arabic Treebank Corpus
  • 734 stories (140,000 words)
• Penn Chinese Treebank Corpus
  • 50,000 sentences
• German (newspaper text):
  • NEGRA
    • http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/
  • TIGER
    • http://www.ims.uni-stuttgart.de/projekte/TIGER/
  • Tueba-D/Z
    • http://www.sfs.uni-tuebingen.de/en/tuebadz.shtml
    Part-of-speech tagged corpora
• POS tags from treebanks
• British National Corpus (BNC)
   • 100m words
   • wide sample of British English: newspapers, books,
     letters
   • http://www.natcorp.ox.ac.uk/
    Named Entity Recognition (NER)
• Message Understanding Conference (MUC)
  • We have MUC-6 and MUC-7
  • Example: /afs/ir/data/linguistic-data/MUC_7/muc_7/data/training.ne.eng.keys.980205
• CoNLL shared tasks: Language-Independent
  Named Entity Recognition (I), (II)
  • 2002: http://www.cnts.ua.ac.be/conll2002/ner/
  • 2003: http://www.cnts.ua.ac.be/conll2003/ner/
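
The CoNLL NER data is distributed as plain text with one token per line and a blank line between sentences. Below is a rough sketch of reading it, assuming the whitespace-separated column layout of the CoNLL-2003 English data (word, POS tag, chunk tag, NE tag, with the NE tag last). The sample lines and function names are illustrative; check the column order of the files you actually download.

```python
def read_conll(text):
    """Return a list of sentences, each a list of (word, ne_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):
            continue  # document separator line, carries no token
        cols = line.split()
        current.append((cols[0], cols[-1]))  # first column: word, last: NE tag
    if current:
        sentences.append(current)
    return sentences

def extract_entities(sentence):
    """Group consecutive tokens that share an entity type into (text, type) spans."""
    entities, words, etype = [], [], None
    for word, tag in sentence:
        prefix, _, kind = tag.partition("-")
        # Close the open span on "O", on an explicit B- tag, or on a type change.
        if tag == "O" or prefix == "B" or kind != etype:
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [], None
        if tag != "O":
            words.append(word)
            etype = kind
    if words:
        entities.append((" ".join(words), etype))
    return entities

# Illustrative sample in the assumed four-column layout.
sample = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
"""
sentences = read_conll(sample)
print(extract_entities(sentences[0]))
```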
    Anaphora resolution
• Data: MUC-6 and MUC-7
• Example: Pam went home because she felt sick.
  • Demo: http://lingpipe-demos.com:8080/lingpipe-demos/coref_en_news_muc6/textInput.html
• Unsolved problem
  • Harder example:
     • We gave the bananas to the monkeys because they were
       hungry.
     • We gave the bananas to the monkeys because they were
       ripe.
   Semantics
• WordNet
  • Website: http://wordnet.princeton.edu/
  • Browse online:
    http://wordnetweb.princeton.edu/perl/webwn
  • 150,000 nouns, verbs, adjectives, adverbs
  • Groups words into “synsets” with short, general
    definitions, and records various relations between
    synsets, e.g. hypernym (kind-of) hierarchy.
  • Neat visual interface:
    http://www.visualthesaurus.com/?vt
  • Problems with WordNet:
    • fine-grained senses
    • sense ordering sometimes funny (see "airline")
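
To make the hypernym (kind-of) hierarchy concrete, here is a toy sketch in Python. The links are hand-written and heavily simplified for illustration; they are not real WordNet data and this is not the WordNet API. In particular, real synsets can have more than one hypernym, while this sketch assumes exactly one parent per word.

```python
# Toy kind-of links (child -> parent), hand-written for illustration only.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
    "animal": "entity",
}

def hypernym_path(word):
    """Follow kind-of links upward until a word with no parent is reached."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def lowest_common_hypernym(a, b):
    """Return the first ancestor of b that is also on a's path, or None."""
    ancestors_of_a = set(hypernym_path(a))
    for node in hypernym_path(b):
        if node in ancestors_of_a:
            return node
    return None

print(hypernym_path("dog"))
print(lowest_common_hypernym("dog", "cat"))
```

Path-based word similarity measures are built on this kind of traversal, which is also where overly fine-grained senses start to hurt.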
    Semantic Role Labeling
• Detection of semantic arguments associated
  with each verb in a sentence
• Example: “I [agent] sold you [patient] a book
  [theme]”
• CoNLL shared task 2004, 2005
   • http://www.lsi.upc.es/~srlconll/
• PropBank
   • Adds predicate-argument relations to PTB syntax trees
• FrameNet: http://framenet.icsi.berkeley.edu/
• Demo from UIUC:
  http://l2r.cs.uiuc.edu/~cogcomp/srl-demo.php
    More corpora for specific tasks
• Word Sense Disambiguation (WSD)
   • Senseval: http://www.senseval.org/
• Question Answering
   • e.g. "What film introduced Jar Jar Binks?"
   • TREC competition, Question Answering track
     • http://trec.nist.gov/data/qamain.html
• Textual Entailment
   • Recognizing Textual Entailment (RTE) challenges:
     http://pascallin.ecs.soton.ac.uk/Challenges/RTE/
• Events, temporal relations
   • TimeBank corpus:
     http://timeml.org/site/timebank/browser_1.2/
• Topic Detection and Tracking
  • Given documents, separate into different topics
  • http://projects.ldc.upenn.edu/TDT/
    Speech & Dialogue
• Speech
  • BNC: 10m words
• Dialogue
  • Switchboard corpus
    • Conversations of two speakers recorded over the phone
    • Transcriptions of their speech, with speakers labeled
    • Example:
      http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.readme.html#txt
    Email/Spam
• Enron corpus
  • /afs/ir/data/linguistic-data/Enron-Email-Corpus/maildir/skilling-j/
  • Annotated subsets (for NER):
    http://www.cs.cmu.edu/~einat/datasets.html
• TREC Spam track
  • http://trec.nist.gov/data/spam.html
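
For the email corpora, Python's standard `email` package can split a raw message into headers and body. The sketch below assumes each file in the Enron maildir holds one plain-text message with standard headers; the sample message and the `summarize` helper are made up for illustration.

```python
from email import message_from_string

# Made-up message in the assumed header/body layout; not from the corpus.
sample = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers

Please see the attached report.
"""

def summarize(raw):
    """Parse one raw message and return its main headers and body text."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),  # a string for non-multipart mail
    }

info = summarize(sample)
print(info["subject"])
```

To process a real folder, read each file's contents as text and pass them to `summarize`.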
    Tools
• Many links to tools on the StatNLP page
   • http://nlp.stanford.edu/links/statnlp.html
• Parsers
   • Stanford Parser (English, Chinese, German and Arabic)
     • http://nlp.stanford.edu/software/lex-parser.shtml
     • Online parser: http://josie.stanford.edu:8080/parser/
   • Collins' parser, Charniak's parser, MiniPar, etc.
   • http://nlp.stanford.edu/fsnlp/probparse/
• POS taggers
• Named entity recognizers
• Language modeling toolkits
    Machine learning tools
• Stanford classifier
   • conditional loglinear (aka maximum entropy) model
   • http://nlp.stanford.edu/software/classifier.shtml
• Weka
   • Java library containing (nearly) every machine learning
     algorithm: Naive Bayes, perceptron, decision tree,
     MaxEnt, SVM, etc.
   • http://www.cs.waikato.ac.nz/ml/weka/
• Mallet
   • Java; useful for statistical NLP, document classification,
     clustering, topic modeling, information extraction…
   • http://mallet.cs.umass.edu/
    Thank You!
• Any questions?

				