Corpus linguistics an introduction by dSNAuo


Corpus linguistics: an introduction

ENG 447
Key points
   Basic notions
   Historical development: two competing approaches
   Types of corpus
   Exploiting a corpus
   Resources
Basic notions
  Corpus: A collection of naturally occurring
 language text, chosen to characterise a state
 or variety of language (Sinclair)

  A collection of linguistic data, either written
 text or a transcription of recorded data,
 which can be used as a starting-point of
 linguistic description or as a means of
 verifying hypotheses about a language
 (Dictionary of linguistics and phonetics)
What is a corpus?
  Large body of evidence typically composed
 of attested language use (McEnery)

  Usually a corpus is in machine-readable
 format and is ideally viewable and
 analysable through (a single) software
 package

  The word corpus comes from the Latin word for
 'body'; the plural is corpora
“If it happens once, you don't know
anything. If it happens twice, it
suggests further investigation. If it
happens three or more times, then you
have something to write about!”
History
   The history has to be split into two periods:
  before Chomsky and after Chomsky

   Before Chomsky, methods similar to the
  ones in corpus linguistics were used
  (empiricism)

  http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus1/1fra1.htm
Early corpus linguistics
   Before Chomsky
   Computers were not available so it was
  difficult to analyse large collections of text
   Studies of child language using diaries kept
  by parents
   Spelling conventions in a German corpus of
  11 million words
   Foreign language pedagogy
Early corpus linguistics (II)
   All the work of early corpus linguistics was
  underpinned by two fundamental, yet flawed
  assumptions:
      The sentences of a natural language are finite.
      The sentences of a natural language can be
       collected and enumerated.
   Most linguists saw the corpus as the only
  source of linguistic evidence in the formation
  of linguistic theories
Chomsky
    Between 1957 and 1965 Chomsky changed
   the direction of linguistics from empiricism
   towards rationalism

“Any natural corpus will be skewed. Some sentences won’t occur
   because they are obvious, others because they are false, still
   others because they are impolite. The corpus, if natural, will be
   so wildly skewed that the description would be no more than a
   mere list” (Chomsky, 1962)


    Introspection came to be used instead of corpus data
Problems with introspection
   Naturally occurring data is observable
  and verifiable by everyone.
   Introspective data is artificial.
   Human beings have only the vaguest
  notion of the frequency of a construct
  or a word.
The revival of corpus linguistics

   Research in corpus linguistics was
  continued in small centres
   Hardware still imposed restrictions;
  real development only started in
  the 1980s
   Fields like computational linguistics
  were not interested in using corpora
Fillmore’s description of the two
approaches
 The corpus linguist: "He has all the
 primary facts that he needs, in the form
 of a corpus of approximately one zillion
 running words, and he sees his job as
 that of deriving secondary facts from
 his primary facts. At the moment, he is
 busy determining the relative
 frequencies of the eleven parts of
 speech as the first word of a sentence
 versus the second word of a sentence.”
The "armchair" (introspective) linguist:
"He sits in a deep soft armchair, with
his eyes closed and his hands clasped
behind his head. Once in a while he
opens his eyes, sits up abruptly
shouting, ‘Wow, what a neat fact!’,
grabs his pencil, and writes something
down… having come still no closer to
knowing what language is really like."
Goals of corpus linguistics
   Chomskyan linguistics               Corpus linguistics
   ‘Langue’ (competence)               ‘Parole’ (performance)
   Ideal speaker/hearer                Complexity/variation
   Language = innate mental faculty    Language = social phenomenon
   Intuitive evidence                  Empirical evidence
   Universals                          Differences
   Grammar                             Meaning
Types of corpora

 Written vs Spoken
 General vs Specialised
     e.g. ESP, Learner corpora
 Monolingual vs Multilingual
     e.g. Parallel, Comparable
 Synchronic vs Diachronic; Monitor
 Annotated vs Unannotated
Written corpora

                       Brown                         LOB

 Time of compilation   1960s                         1970s

 Compiled at           Brown University (US)         Lancaster, Oslo, Bergen

 Language variety      Written American English      Written British English

 Size                  1 million words (500 texts of 2000 words each)

 Design                Balanced corpora; 15 genres of text, incl. press reportage,
                       editorials, reviews, religion, government documents,
                       reports, biographies, scientific writing, fiction
Specialised corpora

                      CSPAE                               CHILDES

Time of compilation   1990s                               Since 1980s

Compiled at / by      Michael Barlow                      Project started at Carnegie Mellon
                      (Rice Univ)                         Univ; contributors worldwide
Language variety      Spoken professional American        20 languages, incl.: E.Asian,
                      English                             Germanic, Romance, Slavic…;
                                                          mainly conversational data;
Size                  2 million words (tagged)            c. 20 million words (growing)

Design                Transcripts from professional       “Child language data exchange
                      settings (meetings, conferences…)   system”, offering transcripts of
                      by 400 speakers;                    monolingual and bilingual children’s
                      academia (1 M) politics (1 M wds)   language (language acquisition data)
Other examples of available corpora
                       COMPILED AT           LANGUAGE                       SIZE                       DESIGN

First generation major corpora
Brown Corpus           Brown Univ, US        Written American English       1 million                  15 genres of text: press reportage, religion, fiction…
(1960s)                                                                     (tagged)

Second generation mega corpora
Bank of English        COBUILD,              Written / spoken English       450 million – year 2002    Monitor corpus; mostly written: newspapers, books;
(since 1991)           Birmingham Univ                                      (tagged)                   spoken: conversations, broadcasts, interviews...
International Corpus   UCL, London           Written / spoken British       1 million                  One of 15 projects worldwide preparing different
of English                                   Engl.                          (grammatically parsed)     national / regional varieties of English; 200 written,
[ICE-GB] (1990s)                                                                                       300 spoken texts, various genres

Specialised corpora
Corpus of Spoken       Rice Univ, US         Spoken American English        2 million                  Transcripts from professional settings (meetings,
Professional                                                                (tagged)                   press conferences) by approximately 400 speakers,
American Engl.                                                                                         centred on activities tied to academics and politics
[CSPAE] (1990s)

Learner corpora
International Corpus   Louvain Centre for    Engl. writing by learners      Over 2 million             Essay writing by advanced learners of English as a
of Learner English     English Corpus        from 19 mother tongue                                     foreign language
[ICLE] (Since          Linguistics, Belg.    backgrounds, incl. Chi.
1990s)

Non-English monolingual corpora
HK Cantonese           Dept Speech &         HK Cantonese                   170,000 characters         Spontaneous speech recorded from phone-in radio
Adult Corpus           Hearing Sci’s, HKU                                                              programs and forums, by 69 speakers
[HKCAC] (2000)

Multilingual / Parallel corpora
International          CRATER project        French, English and            1 million tokens in each   Trilingual parallel corpus from telecommunications
Telecommunications     (Corpus Resources     Spanish                        language                   domain; aligned at sentence level
Corpus [ITU /          & Terminology                                        (tagged)
CRATER] (1995)         Extraction) Lanc U.
Ways to exploit a corpus
  Word (token) / type frequency lists
  N-grams
  Concordances
  Collocations/colligations
  Specially designed programs (especially
  when the corpus is annotated)
Frequency lists
   list the words which appear in a corpus
  together with their frequencies
   they provide a survey of the corpus
   a frequency list becomes more
  meaningful when compared with other
  lists
   they remove each word from its context
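
A minimal sketch of how a frequency list can be built, assuming a deliberately simple tokenisation rule (lowercase alphabetic strings). The sample sentence and the helper names `tokenize` and `frequency_list` are illustrative assumptions, not part of any particular corpus tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplistic rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first; ties in alphabetical order."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented sample text, used only to illustrate the idea.
sample = "The cat sat on the mat. The dog sat on the cat."
tokens = tokenize(sample)
freq = frequency_list(tokens)

n_tokens = len(tokens)  # running words (tokens)
n_types = len(freq)     # distinct word forms (types)

for word, count in freq:
    print(f"{count:>3}  {word}")
print(f"tokens: {n_tokens}, types: {n_types}")
```

The list shows which words dominate the text, but, as noted above, each word is cut off from its context. Comparing two such lists (for example from two registers) is where they become informative.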
N-grams
  groups of N words which appear in
 sequence in the text
  they are presented using frequency
 lists
  a good way to identify expressions which
 recur in, or are specific to, a corpus
  they provide limited context for the words
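
A short sketch of n-gram extraction, assuming the text has already been split into tokens. The function name `ngrams` and the sample phrase are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented sample, pre-tokenised by splitting on whitespace.
tokens = "on the stroke of midnight and on the stroke of noon".split()

trigrams = ngrams(tokens, 3)
trigram_counts = Counter(trigrams)

# Present the n-grams as a frequency list, most frequent first.
for gram, count in trigram_counts.most_common():
    print(count, " ".join(gram))
```

Recurring n-grams such as "on the stroke" surface at the top of the list, which is how fixed expressions are usually spotted.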
Concordances
   show words in the contexts in which
  they appear
   usually obtained using special programs
  which allow the concordance lines to be
  sorted and manipulated
   KWIC (Key Word In Context) is the
  most common format
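
A minimal sketch of a KWIC concordancer, assuming whitespace tokenisation and a context of a fixed number of words on each side. The names `kwic` and `format_kwic` and the sample sentence are illustrative assumptions; dedicated concordance programs offer far more (sorting, regular-expression search, annotation-aware queries).

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for each hit of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def format_kwic(lines):
    """Right-align the left context so that the keywords line up in a column."""
    if not lines:
        return ""
    pad = max(len(left) for left, _, _ in lines)
    return "\n".join(f"{left:>{pad}}   {kw}   {right}" for left, kw, right in lines)


# Invented sample sentence, used only to illustrate the layout.
text = "Promptly on the stroke of six the gong sounded and on the stroke of noon it stopped"
tokens = text.split()
lines = kwic(tokens, "stroke", width=3)
print(format_kwic(lines))
```

Aligning the keyword in a central column is what makes repeated patterns to its left and right visible at a glance.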
Example of concordance output (from MonoConc)
  Langue - Parole
  Vertical axis (reading down the lines): Langue, paradigmatic
  Horizontal axis (reading along each line): Parole, syntagmatic

     famous boots. On   the   stroke   of   full time the
    Stoke the lead on   the   stroke   of   half-time with a goal
  Smith sin-binned on   the   stroke   of   half-time, added a
clinched their win on   the   stroke   of   lunch after resuming
chase by declaring on   the   stroke   of   lunch. <p> With a lead
  expectant crowd, on   the   stroke   of   midday. The bird
  hour began not upon   the   stroke   of   midnight but upon the
 of midnight but upon   the   stroke   of   noon. There was,
booked in advance. On   the   stroke   of   seven, a gong summons
          Promptly on   the   stroke   of   six o'clock, the chooks
    from Edinburgh on   the   stroke   of   the Millennium.
Collocations
   collocation = the occurrence of two or more
  words within a short space of each other in
  text
   the collocates are extracted using a window
  to the left and right of a specified word
   can be used to further analyse the context of
  a word
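
A rough sketch of window-based collocate extraction, assuming a pre-tokenised text and a symmetric window around the node word. The function name `collocates`, the window size and the sample sentence are assumptions for illustration. This counts raw co-occurrence only; association measures are not computed here.

```python
from collections import Counter


def collocates(tokens, node, window=2):
    """Count the words found within `window` positions left or right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            context = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(context)
    return counts


# Invented sample text, pre-tokenised by splitting on whitespace.
tokens = ("strong tea and strong coffee but powerful tea is rare "
          "and strong tea is common").split()

tea_collocates = collocates(tokens, "tea", window=1)
for word, count in tea_collocates.most_common():
    print(count, word)
```

A wider window picks up looser associations at the cost of more noise; the choice of window size is a design decision that depends on the question being asked.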

http://www.sketchengine.co.uk/sampler/
What can we do with a corpus?
   Two broad approaches

   Corpus-based approaches: hypotheses
  are checked against a corpus
   Corpus-driven approaches: hypotheses
  are drawn from the corpus
Testing a new hypothesis
   Step 1: Posit a hypothesis
      e.g. "-le is a separate morpheme for the concept of future."
   Step 2: Concordance
      Find all occurrences of “le” in the wordforms of the corpus.
   Step 3: Test hypothesis
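
The search step in this workflow can be sketched as follows, assuming the corpus is available as a list of wordforms. The function name `find_wordforms` is hypothetical and the wordforms are invented placeholder strings, not data from any real language; the point is only to show hits being collected and grouped so that the hypothesis can then be checked against them.

```python
from collections import defaultdict


def find_wordforms(tokens, segment):
    """Group the wordforms containing `segment` by where the segment occurs."""
    hits = defaultdict(list)
    for position, token in enumerate(tokens):
        if segment in token:
            kind = "final" if token.endswith(segment) else "non-final"
            hits[kind].append((position, token))
    return dict(hits)


# Invented placeholder wordforms (not real language data).
tokens = ["mitale", "kole", "lema", "suna", "patale", "alemo"]
hits = find_wordforms(tokens, "le")

for kind, forms in hits.items():
    print(kind, [form for _, form in forms])
```

Each hit would then be inspected in its context (for example in a concordance) to see whether the word-final cases really pattern with future meaning.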
[Figure: Distribution of subject personal pronouns across registers.
 x-axis: Register (Conv, Fict, News, Acad); y-axis: Frequency per million
 words (0 to 140,000); series: I, you, he/she, it, we, they]
[Figure: Comparison of individual modal and semi-modal verbs in conversation
 and academic prose (based on LGSWE, Table 6.6, p. 489).
 x-axis: individual modal and semi-modal verbs; y-axis: Frequency per million
 words (0 to 6,000); series: Conversation, Academic prose]
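
The charts in this section report frequency per million words, which makes counts from corpora of different sizes comparable. A small sketch of that normalisation follows; the function name `per_million` and the counts are invented for illustration and are not taken from LGSWE or any other source.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


# Invented figures: the same word counted in two sub-corpora of different sizes.
conversation = per_million(250, 2_000_000)
academic = per_million(90, 500_000)

print(f"conversation: {conversation:.1f} per million words")
print(f"academic:     {academic:.1f} per million words")
```

Although the raw count is higher in the larger sub-corpus here, the normalised rate is higher in the smaller one, which is why raw counts should not be compared directly.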
Fields where corpora are used
   Lexicography (designing dictionaries)
   Language studies (relations between
  languages, differences between genres,
  evolution of the language)
   Computational linguistics (training and
  testing methods)
   Language teaching (learner corpora)
   Cultural studies, psycholinguistics
Web as a corpus
   The Web can be a very useful source of
  texts
   The Web is very helpful for languages
  other than English
   Quite often there is no control over the
  language which is investigated,
  so filtering (if possible) is
  necessary
Existing corpora
  Brown Corpus / LOB Corpus
  Bank of English
  Wall Street Journal, Penn Treebank, BNC, ANC, ICE,
  WBE, Reuters Corpus
  Canadian Hansard: English-French parallel corpus
  York-Helsinki Parsed Corpus of Old English Poetry
  TIGER corpus: German
  CORIS/CODIS: contemporary written Italian
  MULTEXT: 1984 and The Republic in many languages
References
  Karin Aijmer and Bengt Altenberg (1991)
  English corpus linguistics, Longman
  Douglas Biber, Susan Conrad and Randi Reppen
  (1998) Corpus linguistics, Cambridge
  University Press
  Graeme D. Kennedy (1998) An introduction to
  corpus linguistics, Longman
  Tony McEnery and Andrew Wilson (1996)
  Corpus linguistics, Edinburgh University Press
Online resources
Corpus Linguistics Online (语料库语言学在线)
   http://www.corpus4u.com/
Corpus Linguistics and English Language Education and Teaching (语料库语言学与英语教育教学)
   http://sfs.scnu.edu.cn/corpus4u/

ConCapp
   http://panda.nhce.edu.cn/corpus4u/tools/concord/wconcord.rar
BNC: http://sara.natcorp.ox.ac.uk/lookup.html
CLEC: http://www.clal.org.cn/corpus/ChiSearchEngine.aspx
CHILDES: http://childes.psy.cmu.edu/

								