TUJ Tokyo

                    Learner corpus research - hands on
                              Tom Cobb
       Didactique des langues / éducation
         Université du Québec à Montréal

   Saturday, October 31
   8:15am - 10:15am
   Dr. Cobb will provide a "crash course" in
    carrying out research using learner corpora
    and small teacher or researcher built corpora
    generally. He will lead a walk-through of a
    study he has conducted using corpus data
    and address the work that had to be done and
    issues to be resolved at each stage of the
    study, offering a behind-the-scenes look at
    how corpus research is carried out. In addition
    he will display some new and accessible
    online tools for corpus work, hoping to
    encourage instructors or researchers from
    other areas to get some hands-on experience
    in the learner corpus paradigm.
   Dr. Cobb will provide a [1] "crash course" in
    carrying out [1a] research using learner corpora
    and [1b] small teacher or researcher built
    corpora generally. He will lead a [2] walk-
    through of a study he has conducted using
    corpus data and [2a] address the work that had
    to be done and [2b] issues to be resolved at
    each stage of the study, offering a behind-the-
    scenes look at how corpus research is carried
    out. In addition he will display some [3] new and
    accessible online tools for corpus work, hoping
    to [4] encourage instructors or researchers
    from other areas to get some hands-on
    experience in the learner corpus paradigm.
 crash course

       research using learner corpora
       or other small corpora
   walk-through of a study
       address the work that had to be done
       issues to be resolved at each stage
   display online tools for corpus work
   encourage hands-on experience
   + a bit of context
At 10.15 you will know…
        What a corpus is
        Why corpus research is important
        What it has contributed to applied linguistics
        The uses it can have for researchers
                               … for instructors
        How to build a corpus
        Choice points in building a corpus
                    … interpreting a corpus
        Some tools of corpus analysis
        How to do a learner corpus study
        Results from some published studies
        The future of learner corpus studies
Corpora – what
are they?

What is a corpus?
        A large collection of language in use,
                 Not only large
                 Not necessarily so large

        Assembled systematically, according to
         explicit criteria
                 of representativeness

        How large?
            Depends on the goal

Goals and sizes
        Linguistics goal - to represent
         entire language
              • 100 million wds still under-represents
                common collocations
        Pedagogical goal – Ss meet
         common words, structures
              • 1-million-words gives 10 hits for
                frequent words
        Applied linguistics goal – trace
         an acquisition feature
              • 1-200,000 words is common
Sub-Goals and sizes
         Pedagogical goal – Ss meet
          common grammar and vocab
               Grammar – 1 million is adequate
                   – All structures get many hits
               Lexis
                • Basic vocab
                   – 1 million gives 10 hits @ 2k level
                • Main collocations
                   – 1 million gives the main ones
                          Torrential rain?
                • “Raining cats and dogs”?
                    – 1 billion gives 5 hits
                • Identify specialist lexis
                    – 200,000 may be enough
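
The size figures above can be turned into rough arithmetic. This is a minimal sketch that assumes hit counts scale linearly with corpus size (an idealization; real items cluster by genre and topic). The function name is hypothetical; the input figures are the ones on the slide.

```python
def expected_hits(hits_seen, corpus_size, target_size):
    """Scale an observed hit count linearly to a corpus of a different size."""
    return hits_seen * target_size / corpus_size

# A 2k-level word with 10 hits per 1 million words, in a 200,000-word corpus
print(expected_hits(10, 1_000_000, 200_000))

# "Raining cats and dogs": 5 hits per 1 billion words, in a 100-million-word corpus
print(expected_hits(5, 1_000_000_000, 100_000_000))
```

On these assumptions a 200,000-word corpus yields about two hits for a 2k-level word, and a 100-million-word corpus will usually miss the idiom altogether.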
A growth industry
   Brown 1970………………..1,000,000 wds

   BNC 1994 .……………… 100,000,000 wds

   Cambridge Int’l 2002....1,000,000,000 wds

   Plus ANC, Bank of English, Cancode …

Design / composition e.g., Brown (1970s)

Page from Lextutor

What does a corpus represent?
     A language as a whole
           • BNC
     Or a part
           • Cancode oral, MICASE academic
     Or of an individual
           • Jack London’s collected works
     Or a group of individuals
             –Class of ESL learners
How do we read a corpus?
        Cannot read it naturally
                –Defeats the goal
        Needs the help of a search
            concordance
            index

            frequency list

            many others
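
As a rough illustration of the frequency list named above, here is a minimal Python sketch. The tokenizer rule and the sample sentence are assumptions made for the example; online tools such as Lextutor may tokenize and count differently.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return alphabetic word tokens (internal apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def frequency_list(text, top=None):
    """Return (word, count) pairs, most frequent first; ties are broken alphabetically."""
    counts = Counter(tokenize(text))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if top is None else ranked[:top]

sample = "The cat sat on the mat. The dog sat too."
print(frequency_list(sample, top=3))
```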



Corpora – why
do we need them?

Why do we need corpora?

     A.   Corpus work is sexy

     B.   We have computers –
          let’s use them

     C.   Linguistic intuitions are
          unreliable
Linguistic intuitions are
notoriously unreliable

         Demo 1: Do you think however
          is more common in spoken or
          in written language?

           By   how much? (3 to 1… etc)

http://www.lextutor.ca/range/range_corpus/
   Demo 2: What are the main senses
    of back and which is most common?

         • By what factor?

   http://www.lextutor.ca/concordancers/concord_e.html
   Demo 3: Can you rank order these
    roughly by frequency

    0 - 2k
    3k - 5k
    6k - 10k

http://www.lextutor.ca/freq/train/
Try one? http://www.lextutor.ca/freq/train/
But not always

        Demo 4: Which do you think is
         more common, man and woman,
            or woman and man?

          Factor of 10:1, 5:1, 2:1?

        Go Live
        http://www.lextutor.ca/concordancers/concord_e.html

Many linguistic intuitions
are unreliable
Implicit patterns are
extremely slow to extract
from input
            N. Ellis, J. Hulstijn

… because of the severe
limitations on what we can
see and remember
            … unaided
Scientific instrumentation
- a brief history

     Not only linguistic
     intuitions are problematic

For every
many possible

Stand outside on a
starry evening,
what does it look
   The role of the computer in modern science is well
    known. In disciplines like physics and biology, the
    computer's ability to store and process inhumanly
    large amounts of information has disclosed patterns
    and regularities in nature beyond the limits of normal
    human experience. Similarly in language study,
    computer analysis of large texts reveals facts about
    language that are not limited to what people can
    experience, remember, or intuit. In the natural
    sciences, however, the computer merely continues
    the extension of the human sensorium that began
    200 years ago with the telescope and microscope.
    But language study did not have its telescope or
    microscope. The computer is its first analytical
    tool, making feasible for the first time a truly
    empirical science of language.
             – Cobb 1999
   Before the computer, linguists could only
    study small samples of language at a time
    because of their limitations of their powers of
    observation and their memories. Even
    scholars who relentlessly collected instances
    of usage all their lives only had a few
    examples of any particular pattern, and there
    was no way of telling what they had missed.

        Sinclair,   2003, p. ix

Early corpora
   Dr Johnson
   A Dictionary of the English Language
       Longman 1755
   Based on quotations from literature
    copied onto many slips of paper

But using literature has some problems
        - Old and recent lit conflated
        - Is literature truly representative of

            life’s typical situations?
        - Is its lexis «un peu recherché»?
120 years later
- James Murray, OED 1879 – REAL LANGUAGE examples sent in by post
   - Oxford City Post Office sets up a special sub-branch for OED
Most sciences -
supplemented by
technologies from 15th century
         BIOLOGY..……….microscope
         ASTRONOMY..…..telescope
         NAVIGATION.……astrolabe
         etc
Language study – late 20th
century –
….machine readable corpora
Thus the “corpus

           Dictionaries
           Grammars
           Courses
           Studies

Of particular

Corpus – successes

Fabled Core of English
    is close to disclosure
      Main lexis + coverage
          2000 wd families = 80%, Carrol et al 76
      Main collocations in BNC-speech
          84 HF collocations belong in 1k list, Shin & Nation 2007
      Main phrasal verbs –
          25 Ph vbs = 1/3 of all ph vbs in BNC, Gardner & Davies, 2007
      Main morphologies
        Bauer & Nation, 1993

      Main stress patterns (Murphy & Kandil)

          Cf. All this coming together at the same time as
           the human genome, also a corpus project
Ancient prescriptivism
   is close to defeated in
   language pedagogy
      Except one debate remains
          Corpus-based v. corpus-informed

      Corpus based
          If it’s in the corpus times X, it’s OK
            X   to be defined
      Corpus informed
          Corpus information is one source of
Numerous errors are now
corrected (in principle)
      Definitions no longer harder than the
       defined word
      Simple present no longer automatically
       the first verb tense taught
      Written language no longer the model for
       spoken language
      Status of multi-word units reinstated
      Grammar no longer taught …
          via unknown lexis
          as unconnected to lexis
      Grammar as connected to lexis?
      Let’s see what this could mean
          + practice “reading concordances”

      Get out “borders on”
               • (From Sinclair http://www.twc.it/)

        What is the pattern?
        What does it mean?

             Can   we call this ``word
   User extract

 became is more than just a way of life – it BORDERS on a religion. But there is of the laws
ut there is of the laws of the sea sometimes BORDERS on arrogance. Not only should the interna
the international collaboration is great and BORDERS on cartel like behaviour. who say using t
behaviour. who say using the extremist label BORDERS on demagoguery and will only serve Yugosl
ly serve Yugoslavia. What is occurring there BORDERS on genocide. No country or society Carele
ciety Careless but losing two in the one day BORDERS on incompetence. Now Charlie Turkey, the
 Charlie Turkey, the only NATO country which BORDERS on Iraq, is playing a key role in Her mas
a key role in Her mastery of the short story BORDERS on perfection. kate saunders country’s st
aunders country’s stagnant growth, which now BORDERS on recession. Here again, the challenge l
ain, the challenge looms ugly when recession BORDERS on slump. Everybody is on edge, The autho
the case_0 of maxim ‘The collector’s passion BORDERS on the chaos of memories.’ before staged
, although and an easy going demeanour which BORDERS on the charismatic, it’s hardly popular m
 Kosovo, a professional solicitousness which BORDERS on the dangerous edge of savings accounts
l Asian clash. He said: ‘The hostility there BORDERS on the dangerous.’ Black players and – a
he sky, a then Claire makes a statement that BORDERS on the downright cocky. When I ask The li
mories.’ before staged protests at these two BORDERS on the east and west of their speaking to
t there is the Sierra Madre” as he dubs them BORDERS on the eccentric. Mountain lions courses
ain lions courses and opportunities, that it BORDERS on the embarrassing. This the straight, b
e. He portrays has a streak of bravery which BORDERS on the foolish. She has delicate to buy.
ause the amount of work he is required to do BORDERS on the incredible. In the case_0 of maxi
rous edge of savings accounts versus shares, BORDERS on the irresponsible. an independent Bosn
is private His love for all things maritime BORDERS on the obsessional. He is truly Not surpri
, four even_0 harbour a passion for DIY that BORDERS on the obsessive. But there is the Sierra
body is on edge, The author, a lifelong fan, BORDERS on the obsessive. He portrays has a strea
en I ask The linear intensity of their songs BORDERS on the paranoid and, although and an easy
 Wander into the The atmosphere of paranoia BORDERS on the pathological. The sky, a then Clair
g. This the straight, but his winning effort BORDERS on the sensational because the amount of
 his own most dangerous regions on Earth. It BORDERS on the Serbian province of Kosovo, a prof
elicate to buy. A family with three children BORDERS on the socially acceptable, four even_0
f their speaking to troops in Xinjian which BORDERS on the Soviet Central Asian clash. He said
players and – and to performing them sort of BORDERS on the surreal. He had his own most dange
He is truly Not surprisingly, the atmosphere BORDERS on the surreal. Wander into the The
 hardly popular music. In some cases_1, this BORDERS on wholesale plagiarism. That’s * ______
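
A concordance like the extract above can be approximated with a short keyword-in-context (KWIC) routine. This is a sketch only, not the code behind the tool that produced the extract: the function name, the context width, and the choice to sort on the right-hand context are assumptions. Sorting to the right is what lines up "borders on + abstract noun" against "borders on + place name".

```python
import re

def kwic(text, phrase, width=30):
    """Return keyword-in-context lines for each hit of phrase, sorted by right context."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    rows = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        rows.append((left, m.group(0).upper(), right))
    rows.sort(key=lambda row: row[2].lower())
    # Right-align the left context so the keyword forms a column
    return [f"{left:>{width}}{kw}{right}" for left, kw, right in rows]

demo = ("Her mastery of the short story borders on perfection. "
        "Turkey borders on Iraq. What is occurring there borders on genocide.")
for line in kwic(demo, "borders on", width=25):
    print(line)
```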
Corpus – failures

And yet…
    “The corpus-driven revolution in applied
      linguistics continues apace, and along
      with it the paradox that as corpora
      change the face of applied linguistics
      (most dictionaries, grammars, and
      course books now claim to be corpus
      based) it is largely without the
      participation of practitioners. Only a few
      teachers or researchers have ever built
      a corpus or delved through
      concordance lines.”
                - Cobb 2008, review of CBLS
Stalled enterprise (-McCarthy, 2008)
       Teachers and researchers need to
         become producers, not just consumers,
         of corpus research
         To evaluate “corpus based” claims
               Often vocab but not grammar is CB, etc
               What kind of corpus?

            To effectively lobby to get their CB
             needs met
               e.g. Gram+lex of specific domains

            To develop their own CB materials
                  Who still uses a course book?

            To build their own corpora for action
             research projects
Stumbling blocks
     Some intimidation remains attached to corpus
     It is not universally appreciated in SLA
        - Widdowson
     Computer stuff looks daunting
       - Seems more linguistics than applied

     There are some fairly clear reasons to do this
       and simple ways to get started

…   The classic corpora are not easy-access

    -   Despite long lists on the Web
        -   Even McCarthy’s Cancode is 100%
            unavailable to researchers
             -   Ref Tribble review of O’Keeffe et al

    -   Especially in languages other than English
        -   Lextutor users’ requests for German =>

        <= [1] Band together (CECL)
        -  [2] Make your own =>
DIY corpus – why?


Why bother – Google is a
     Ref – Robb

          v. corpus
Classic case, breadth v. depth
     Web-as-corpus gives massive volume

     Even smallish DIY corpus gives
        Better quality search
           Families, starts with, ends with
        Easier access to detail & context
        Better exposure to pattern

        + you can make your own, target your own needs
           Material for learners
           Material from learners
DIY corpus – how?

Build your own - HOW
        Many texts on the Web
            E.g., http://www.lextutor.ca/bookbox/
            Question of selection replaces question
             of access

        Must be or become text files
          (whatever.txt)               "dot txt"
            Whether you want a one-big-file corpus
                 Or several-small-files corpus

Only plain .TXT files make corpora
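
For the one-big-file option, the merge step could be scripted along these lines. It is a sketch under assumptions: the folder layout, the output name corpus.txt, and the demo file names are hypothetical, and the source texts are taken to be UTF-8 plain text already.

```python
import tempfile
from pathlib import Path

def build_one_file_corpus(folder, out_name="corpus.txt"):
    """Join every .txt file in folder into one corpus file; return (path, word count)."""
    folder = Path(folder)
    out_path = folder / out_name
    parts = []
    for path in sorted(folder.glob("*.txt")):
        if path.name == out_name:
            continue  # do not fold an earlier merged corpus back into itself
        parts.append(path.read_text(encoding="utf-8", errors="replace").strip())
    text = "\n\n".join(parts)
    out_path.write_text(text, encoding="utf-8")
    return out_path, len(text.split())

# Demo in a throwaway folder holding two small hypothetical text files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "text_a.txt").write_text("one two three", encoding="utf-8")
    Path(tmp, "text_b.txt").write_text("four five", encoding="utf-8")
    out_path, n_words = build_one_file_corpus(tmp)
    print(out_path.name, n_words)
```

Keeping the small files alongside the merged one leaves the several-small-files option open for tools that compare texts with each other.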

One big file: a) Insert


One big file: b) Upload


DIY corpus for
learning materials

Using CB tools to select /
develop learning materials?
     Using news texts?
       Check first against CB frequency lists

     Pre-teaching vocab?
       Find the CB keywords

     Writing tests?
       Check it contains gram+lex the S’s
        have actually seen

     Teaching a speaking course?
       Check models are speech not writing
Build corpus as learning
     For some purpose

     Must make some sampling sense

       EG one London – all London

       All course materials

       Corpus of graded readers

Learning materials
– multi-file corpus

Learning materials – one-file corpus


DIY for research

1. Written

Learner text more and more available
  - Collect & investigate because it is there?

Some typical purposes
  - determine needs
  - check progress
  - Cf. active vs. passive ability
  - explore for experimental hypothesis

  Choose topic carefully
      Does topic suggest just one verb tense?
      Cf capital punishment vs. my holiday
         Very different language demands
Models of LCs
  Learners vs. NSs
  Ls vs. Ls –
     Snapshot or Longitudinal (same Ls at diff times)
         Or diff Ls at diff stages in learning ≅ longitudinal
  Belz (04, citing Cobb 03) 4 LC variables should be
  1.  type of learner (e.g., FL vs. SL),
  2.  stage of learner
  3.  text type/purpose/register/conditions,
  4.  and the availability of a similar corpus of native
      speaker data

NS data must be
     Best example is UCL’s Locness

     Louvain Corpus of Native Speaker Essays
       149,574 words of argumentative essays
        written by American university students
       18,826 words of literary-mixed essays written
        by American university students
       59,568 words of argumentative and literary
        essays written by British university students
       60,209 words of British A-level
        argumentative essays.
Issues in LC
       Tag or not?
       Spell check or not, or at what point?
       One file or many?

     BIG ISSUE - Granger 2004, p. 124
     What kind of data is a LC?
       “LC typically fall into the category of natural or
       open-ended data” while “SLA researchers
       tend to prefer [1] introspective or [2]
       experimental/elicited data…”

     V BIG ISSUE -
       Is this paradigm an instance of Bley-Vroman’s
       (1983) “comparative fallacy”?
Once made, flat or tagged?
     Pros of flat corpus
        If for learning materials, = what learners face
                • THEY must make sense of data
                • Tagged does it for them
       Easier to make, you can have more
       Search inputs require some work, trial + error

     Pros of tagged corpus
        Precise comparisons are possible
              Especially for N-N compounds and errors

         But learner data poses special problems
            Tags are needed for error analysis
                • VP + ADV + D OBJ, etc
              Yet learner data confuses taggers
Error tagger (UCL Err Extractor – Granger 02)
specific-purpose, known-target tagging
- Unlikely to confuse tagger, but a ton of work

Here’s a set of studies I’m
working on
     LC study typically begins with a practical problem
           Theoretical conundrums? not so much

     E.g., this problem:
        Montreal learners
        Eight years ESL
        At 18 many switch to English-language
        With insufficient vocabulary for advanced
          study in English
           Fully competent only at 1k
Big question
Input: What lexis are these kids getting in

Do their NNS teachers have enough vocab
 themselves to get kids over the 1k-hump?

    Run Vocab size test on Ts
      Nation’s new 14k – lextutor.ca/tests/

    Get small exploration corpus of their production
      “How could the TESL program be improved?”
          Argumentative + opinion

    Get similar sized NS corpus
      LOCNESS, A-Levels, UK
         “An invention that has changed how we live”

    Compare for structure and lexis
      Quantity (frequency) and quality
       Focus on lexis 2k+
  Look at TESLProg.txt in your handout
     as demo mini-corpus

Writing task was this
   5-minute in-class writing exercise
           Peter Elbow, keep writing idea
   Discursive topic
           How could UQAM’s new TESL program be improved?
   Homework:
           - identify your main point
           - focus + elaborate for Web publication
   Each paper gets three rounds of feedback
    Comparison text from Locness (ex 1)
Computers have become a huge part of our lives in both the areas of work and
education. But are they such a good thing?

When calculators came along a drop in ability of students for mental arithmetic was
obvious and now they are used for the simplest calculations. The computer could do
 the same thing. Computers encourage laziness in the general public, why work out
something yourself when the computer can do it for you. This is very time saving and
efficient but it is causing people to forget basic ideas. For instance, spelling is no
longer as important as it was you can simply use a "spellcheck" to correct your English,
which is absurd.

For the youth of today computers offer links around the world and millions of facts and
figures. This could be argued to be educational. However, this is killing the imagination
of children and they spend hours sat at a keyboard tapping away in the doom and gloom
of the house. They should be out enjoying themselves and gaining experiences for
themselves instead of reading about them on a flat screen.

It is said that you can meet people through computers and have `relationships'. I find this
preposterous and people are losing the ability to communicate and form relationships.
 Comparison corpus from Locness (2)

Computers may be the future but what part will man have in this future. There
will be no need for people to go to school as they could be taught at home,
people would hardly ever talk and the only career available would be for
computer programmers.

I agree that computers are helpful but people should not live through their
computers and be so reliant on them. They should read books and live more in
order to regain their lost imagination and sense of adventure. Also, in schools
I feel that work should be done mainly by hand and calculators and computers
should only be used minimally in mathematics in order to stop the production
of computer addicts and again have normal people.

 More lexis? Less? A little? A lot?
                                 http://www.lextutor.ca/vp/bnc/
Which analysis software?

Basic structure snapshot
(Qc corpus)

Lexis comparison


     NNS corpus
     (Quebec TESL trainees)

          155 post-1k word families/3356 tokens

     NS corpus
     (UK A-Levels essay)

        269 post-1k word families/3630 tokens
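
Because the two corpora differ slightly in size, the raw counts are easier to compare per 100 tokens. The sketch below does that with the figures above, and adds a crude off-list profiler. It is not Vocabprofile: the profiler counts word types rather than word families, and TINY_1K is a hypothetical stand-in for a real first-thousand list.

```python
import re

def tokenize(text):
    """Lowercase the text and return alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def post_list_profile(text, base_list):
    """Return (token count, set of word types that are not on base_list)."""
    tokens = tokenize(text)
    return len(tokens), {t for t in tokens if t not in base_list}

def per_100_tokens(families, tokens):
    """Normalize a family count to a rate per 100 running words."""
    return round(100 * families / tokens, 2)

print("NNS:", per_100_tokens(155, 3356))  # Quebec TESL trainees
print("NS: ", per_100_tokens(269, 3630))  # UK A-level essays

# Hypothetical mini word list, for illustration only
TINY_1K = {"i", "find", "this", "and", "people", "are", "the"}
print(post_list_profile("I find this preposterous", TINY_1K))
```

On these figures the NS writers use post-1k families at roughly one and a half times the NNS rate.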

     But that’s not all
        Split up corpus
           Look at individuals
Almost all post-2ks are used by one writer only
     Interesting peripheral differences for
     another study
               correct but unelaborated
               heavy on the short end,
                 light on the long end
               Low proportion of noun-noun

     Vocab - Heavy reliance on 1k vocab
            Low Post-1k
               Items used by one person

       Yet good recognition scores at 3k+ levels
             Known words are not getting used
                Unlikely to get used in classroom
2. Oral
   production corpus

Let’s learn more about the previous study:

  Follow trainees into their classrooms

  Does the predicted pattern occur?
     If new words appear, are they recycled?

     *See Horst’s Teacher Talk Corpus study
       in a forthcoming RIFL (2011)

  (Note: Different subjects – here we are
    establishing tools & method)

18 hrs of NS-T classroom talk

 Looks like rich lexical input…

   Post-1k words (learning zone)
      1570 families
      900 appear in one class-hour only
            Inc 300 one TIME only

   «Recyclage» is not happening
      Now add this to the NNS data
            Few post-1k used in own writing
       The problem starts to make sense
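
The recycling counts above (families overall, families confined to one class-hour, families occurring once only) can be sketched as follows. Assumptions: each class-hour is a separate transcript string, word types stand in for word families, and base_list is a hypothetical first-thousand list. The same routine would serve for the per-writer question in the written study if each "session" were one writer's text.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and return alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def recycling_report(sessions, base_list):
    """Map each off-list word to (number of sessions it occurs in, total occurrences)."""
    seen_in = defaultdict(set)
    totals = defaultdict(int)
    for i, transcript in enumerate(sessions):
        for tok in tokenize(transcript):
            if tok not in base_list:
                seen_in[tok].add(i)
                totals[tok] += 1
    return {w: (len(seen_in[w]), totals[w]) for w in totals}

def summarize(report):
    """Return (off-list words, words in one session only, words occurring once only)."""
    one_session = sum(1 for s, n in report.values() if s == 1)
    one_time = sum(1 for s, n in report.values() if n == 1)
    return len(report), one_session, one_time

hours = ["we saw a glacier and a glacier", "the volcano erupted", "a glacier again"]
base = {"we", "saw", "a", "and", "the", "again"}
print(summarize(recycling_report(hours, base)))
```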

 Or, Alert’s 108,000 wds,

Went, saw
http://www.lextutor.ca/concordancers/concord_e.html
3. Goal

Let’s work through a
published study
     Ovtcharov & Cobb 2006 (in French)

     Situation: Ottawa
     Civil service promotions depend on success
        in L2 oral interview

     Pass/fail evaluated globally

     “A well developed vocabulary” is one of the
       stated criteria
        But what is it?
           The usual soft focus
Needed for the study
     1. Corpus of transcribed oral interviews
          Both passes, fails, & borderlines
             24 of each, 25-35 minutes
               100s of hours work

     2. French version of Vocabprofile
        Lemmatized large-corpus based, k-leveled
          frequency lists?
           Miraculously appear in c. 2001
               See Cobb & Horst, 2004

     3. Usable NS reference corpus
          Provided by Beeching, 2001
            French oral interviews in USA

     Identifiable difference at 2k
         Strong difference at 3k+MHL (off-list)
     (Assuming replication)

     One less failure-to-communicate in the
      vastness of high-stakes language

     The instructional design process has a
       place to begin

      Corpus research is a fairly simple,
       bean-counting type of research

      That can solve complex problems in
        language learning & teaching, both

              What do these people need to learn?
              Can examiners’ impressions be
              E.g., Piecing together the portrait of advanced
                interlanguage (Cobb 2003)
Course tie-up

At 10.15 you now know…
        What a corpus is
        Why it is important
        What insights it has yielded in applied linguistics
        The uses it can have for researchers
                            … for instructors
        How to build a corpus
        Choice points in building a corpus
        Some tools of corpus analysis
        How to do a learner corpus study
        The results of some published learner
         corpus studies
        The future of learner corpus studies
The Future

Where do we go from here?
     Corpus research carries on shining the
      light into dark corners
          - 2007-2009 work from Dee Gardner, Stuart Webb

     Some increase in corpus awareness
       - Teacher training programs
       - MA methods courses

     Collaboration reduces labour
          - CECL, the Locness reference corpus
          - Promise of automatic corpus
            comparisons at Calper Gold

     Dev. world can play as tools go online
If we have time…
     The final challenge
       to the utility of frequency lists

     As already seen
       We are closing in on the Core of English
           This includes a smaller than
           expected group of true homonyms

     No corpus tool-kit so far deals with these
           E.g. a Vocabprofile analysis does not
            distinguish bank and bank
Go live
http://www.lextutor.ca/concordancers/text_concord
This PPT at

References list at
