The 360 million word BYU Corpus of American English (1990-2007)

       Mark Davies
 Brigham Young University
  www.americancorpus.org
        Why have a new corpus of
          American English?
   Already have the web (e.g. via Google), but:
     – No POS tagging, lemmatization, etc.
     – Very difficult to determine genre and date
   American National Corpus
     – Only 22 million words; hasn’t been updated in the last
       two years
     – Not very balanced in terms of genres and sources
       (1/2 million words of fiction, 2 magazines, 1
       newspaper, 2 academic journals)
   Need something comparable to the British National
    Corpus (BNC)
    BYU Corpus of American English
      (www.americancorpus.org)
   360+ million words
   From nearly 150,000 texts
   20 million words each year from 1990-2007
   Will be updated twice a year; unique linguistic
    history
   Tagged by CLAWS (same tagger as for the BNC)
   Uses the same architecture and interface as BYU-
    BNC, TIME Corpus, Corpus del Español, Corpus
    do Português, etc. (See corpus.byu.edu)
   For each year (and therefore overall, as well),
    evenly divided between Spoken, Fiction, Popular
    Magazines, Newspapers, and Academic Journals
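
The size figures above can be checked with a little arithmetic. The sketch below assumes exactly 20 million words per year and an exact five-way split, which the slide states only approximately; the variable names are illustrative.

```python
# Rough size check, assuming exactly 20 million words per year and an
# exactly even split across the five genres (both are approximations).
WORDS_PER_YEAR = 20_000_000
YEARS = range(1990, 2008)  # 1990-2007 inclusive
GENRES = ["Spoken", "Fiction", "Popular Magazines", "Newspapers", "Academic Journals"]

total_words = WORDS_PER_YEAR * len(YEARS)
per_genre_per_year = WORDS_PER_YEAR // len(GENRES)
per_genre_total = total_words // len(GENRES)

print(f"{len(YEARS)} years, {total_words:,} words in total")
print(f"{per_genre_per_year:,} words per genre per year")
print(f"{per_genre_total:,} words per genre overall")
```

Under these assumptions the corpus comes to 360 million words, with about 4 million words per genre per year and about 72 million words per genre overall.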
        Composition of the corpus
   Spoken: (76+ million words) Transcripts of unscripted conversation from more than
    150 different TV and radio programs (examples: All Things Considered (NPR),
    Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes
    (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.).

   Fiction: (70 million words) Short stories and plays from literary magazines, children’s
    magazines, popular magazines, first chapters of first edition books 1990-present, and
    movie scripts.

   Popular Magazines: (78+ million words) Nearly 100 different magazines, with a good
    mix (overall, and by year) between specific domains (news, health, home and
    gardening, women, financial, religion, sports, etc.). A few examples are Time, Men’s
    Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, Sports
    Illustrated, etc.

   Newspapers: (73+ million words) Ten newspapers from across the US, including:
    USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle,
    etc. In most cases, there is a good mix between different sections of the newspaper,
    such as local news, opinion, sports, financial, etc.

   Academic Journals: (73+ million words) Nearly 100 different peer-reviewed journals.
    These were selected to cover the entire range of the Library of Congress classification
    system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world
    history), L (education), T (technology), etc.), both overall and by number of words
    per year.
Interface: same as for our version of the BNC and for the TIME Corpus (100 million words)
Charts: five main genres + time: perfect storm
KWIC: Normal display
KWIC: Expanded display and full source information
Charts: five main genres + time: [end] up [vvg]
Charts: click to see matching strings in any genre or time period: [end] up [vvg]
Charts: sub-genres (bling)
Frequency of each matching string in each genre and time period: [vv*] * ground
Collocates (up to 10 words L / R): [nn*] collocates of chip.[nn*]
Collocates: sort by MI score / view by genre: [vvg] collocates of [feel] like
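
For readers unfamiliar with MI (mutual information) scores, here is a minimal sketch of the textbook pointwise mutual information calculation for a node word and one collocate. The counts are invented for illustration, and the function name is hypothetical; the corpus interface may well use a variant of this formula (for example, one adjusted for the width of the collocation window).

```python
import math

def mutual_information(pair_count, node_count, collocate_count, corpus_size):
    """Pointwise MI: log2 of the observed co-occurrence count over the
    count expected if the two words were statistically independent."""
    expected = node_count * collocate_count / corpus_size
    return math.log2(pair_count / expected)

# Hypothetical counts, not taken from the corpus.
score = mutual_information(
    pair_count=200,
    node_count=10_000,
    collocate_count=5_000,
    corpus_size=360_000_000,
)
print(f"MI = {score:.2f}")
```

A high score means the pair occurs together far more often than chance would predict, which is why sorting by MI tends to surface distinctive collocates rather than merely frequent ones.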
Sections: frequency by genre and time period
Sections: frequency by genre and time period (compared to second section)
Sections: frequency by genre and time period: Verbs in Magazine: Sports
Sections: frequency by genre and time period: Magazine: Sports vs Magazines
Sections: by time period: *dom in 2000s vs 1990s
Sections: by time period: [vvi] in 2007 vs 1990s (neologisms)
Sections: frequency by genre and time period: we [vv*] that in ACADEMIC
Sections: frequency by genre and time period: collocates of chair (ACAD / FIC)
Word comparisons
Word comparisons: utter vs sheer + [NN*]
Word comparisons: Democrats vs Republicans + [AJ*]
Synonyms (“a thesaurus on steroids”): by genre and time periods: 60,000+ entries
Synonyms: [[=clean]].[v*] the [n*]
Synonyms: [=strong] in ACADEMIC vs MAGAZINES
Customized word lists (semantic, syntactic, etc)
Customized wordlists: [[davies:clothes]].[nn*] NEAR [davies:colors]
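
Several of the queries on the preceding slides mix lemmas, part-of-speech tags, and wildcards, as in [end] up [vvg] and [vv*] * ground. The sketch below illustrates one way such a query could be matched against tagged text. It is not the corpus's actual implementation: the token format, the small tag list, and the rule for telling a lemma slot from a tag slot are all assumptions made for the example, and word-level wildcards such as *dom, synonym slots, and NEAR are not covered.

```python
from fnmatch import fnmatchcase

# Hypothetical subset of CLAWS-style tags; the real tagset is much larger.
KNOWN_TAGS = {"vv0", "vvd", "vvg", "vvi", "vvn", "vvz", "nn1", "nn2", "jj", "rp", "at", "ii"}

def slot_matches(slot, token):
    """Test one query slot against one (word, lemma, pos) token."""
    word, lemma, pos = token
    if slot == "*":                       # bare asterisk: any single word
        return True
    if slot.startswith("[") and slot.endswith("]"):
        inner = slot[1:-1].lower()
        # Assumption: a bracketed slot is a tag pattern if it ends in "*"
        # or names a known tag; otherwise it is a lemma.
        if inner.endswith("*") or inner in KNOWN_TAGS:
            return fnmatchcase(pos, inner)
        return lemma == inner
    return word.lower() == slot.lower()   # plain slot: exact word form

def find_matches(query, tokens):
    """Return every token sequence that matches the query slot by slot."""
    slots = query.split()
    hits = []
    for start in range(len(tokens) - len(slots) + 1):
        window = tokens[start:start + len(slots)]
        if all(slot_matches(s, t) for s, t in zip(slots, window)):
            hits.append(" ".join(word for word, _, _ in window))
    return hits

# Invented sample sentence, tagged by hand for the example.
tagged = [
    ("They", "they", "pphs2"), ("ended", "end", "vvd"), ("up", "up", "rp"),
    ("paying", "pay", "vvg"), ("and", "and", "cc"), ("hit", "hit", "vvd"),
    ("the", "the", "at"), ("ground", "ground", "nn1"),
]
print(find_matches("[end] up [vvg]", tagged))   # ['ended up paying']
print(find_matches("[vv*] * ground", tagged))   # ['hit the ground']
```

Because [end] is matched against the lemma, the inflected form "ended" is found, which is the practical benefit of lemmatization over plain string search.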
    BYU Corpus of American English
   Offers an extremely wide range of queries
   Only large corpus of contemporary American
    English
   Only American corpus with texts from a wide
    variety of genres and sources

   Almost four times as big as the BNC, more recent,
    and will be updated
   Largest, most diverse, publicly available corpus
    of any language
   Freely available!

				