The British National Corpus:
where did we go wrong?

Lou Burnard
Oxford University Computing Services
What is the BNC?
§   100 million words of modern British English
§   produced by a consortium of dictionary
    publishers and academic researchers
    Ÿ OUP, Longman, Chambers
    Ÿ Oxford, Lancaster, British Library
§   funded as pre-competitive resource by DTI/
    SERC under JFIT 1990-1994
Where did we go wrong?
  §   (if we did)
  §   or, The Benefit of Hindsight
  §   or, If I'd known then what I know now...
  §   or, Wisdom After the Event
  §   And, Where Do We Go From Here?
Production of the BNC
  §   took three years (at least)
  §   cost GBP 1.6 million (at least)
  §   came about through an unusual coincidence
      of interests amongst:
      Ÿ Lexicographical publishers
      Ÿ Government (DTI)
      Ÿ Science and Engineering Research Council (SERC)
The Neotenous Nineties
  §   WinWord or WP5? the choice is yours
  §   On your desk ... a 386 with 50 Mb
      disk space (just about enough to run
      Windows 3)
  §   In your lab ... a VAX or a Sparc for serious
  §   On the WWW (maybe) ... Mosaic for X
Intellectual currents
   §   corpus linguistics
       Ÿ the LOB school
       Ÿ the Birmingham school
       Ÿ the LDC view
   §   text encoding theory
   §   language engineering
   §   the JFIT mentality, or Reconciling Town
       and Gown
Stated Project Goals
  §   A synchronic (1990-4) corpus of samples
      both spoken and written from the full range
      of British English language production
  §   of non-opportunistic design, for generic
  §   with word class annotation
  §   and contextual information
Actual (?) project goals
  §   Better ELT dictionaries
      Ÿ authoritative
      Ÿ both speech and writing
  §   A model for European corpus work
      Ÿ design, and encoding
      Ÿ Industrial-academic co-operation
  §   A REALLY BIG corpus
  §   industrial scale text production system
  §   compromises in design and execution
  §   IPR and profitability

       The BNC looks back to Brown and LOB in
       its design and markup, and forward to the
       Web in its scope and indeterminacy
The BNC “sausage machine”
  §   Selection, clearance, and capture
      Ÿ Written (OUP/Chambers)
      Ÿ Spoken (Longman)
  §   Enrichment and encoding
      Ÿ Initial CDIF conversion and validation (OUCS)
      Ÿ Word class annotation (UCREL)
  §   Header generation and final validation
  §   Documentation, distribution, maintenance
Task groups
  §   permissions
  §   selection, design criteria
  §   encoding and markup
  §   enrichment and annotation
  §   retrieval software
Through-put (million words/quarter)
  §   desire to test annotation scheme
  §   requirement to meet deliverables
      Ÿ slipping goal posts
      Ÿ quantity above quality
  §   … an interesting learning experience for
      both sides!
BNC Selection Criteria
 §   Written selection criteria
     Ÿ predefined proportions of
        • different media (books, newspapers, unpublished…)
        • different domains (informative, entertaining…)
     Ÿ maximum sample size 45000 words
     Ÿ all texts incomplete
 §   Spoken selection criteria
     Ÿ context-governed
     Ÿ demographically-sampled
Word tagging
  §   word-pos pair
  §   white space problems
  §   validation problems
  <s n=00011>
   <w AT0>The <w NP0>Queen<w POS>'s
   <w AJ0>real <w NN1>annus horribilis
   <w VVD>began <w PRP>
   <w NN0>Sunday<c PUN>.</s>
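
The word-POS pairing and the white space problem can be made concrete with a small sketch. It assumes the unclosed `<w TAG>` / `<c TAG>` notation shown in the sample, uses a shortened version of that sample, and the names `TOKEN` and `parse_tokens` are illustrative only, not part of any BNC tool.

```python
import re

# One <w TAG> or <c TAG> element plus the text that follows it,
# up to the next tag. Elements are unclosed, as in the sample.
TOKEN = re.compile(r"<(w|c) ([A-Z0-9-]+)>([^<]*)")

def parse_tokens(sgml):
    """Return (text, pos, trailing_space) triples for each <w>/<c> element."""
    tokens = []
    for _kind, pos, raw in TOKEN.findall(sgml):
        text = raw.strip()
        # White space sits inside the element, so record whether any
        # followed the word: it is the only clue to the original spacing.
        trailing_space = raw != raw.rstrip()
        tokens.append((text, pos, trailing_space))
    return tokens

# Shortened version of the sample sentence (assumption: straight apostrophe).
sample = (
    "<s n=00011>"
    "<w AT0>The <w NP0>Queen<w POS>'s "
    "<w AJ0>real <w NN1>annus horribilis "
    "<w VVD>began<c PUN>.</s>"
)

tokens = parse_tokens(sample)
for text, pos, _ in tokens:
    print(f"{pos}\t{text}")

# Rebuild the running text from the recorded spacing.
rebuilt = "".join(text + (" " if space else "") for text, _, space in tokens)
print(rebuilt)
```

Note that "annus horribilis" comes back as a single NN1 token containing a space, so splitting the text on white space would not give the same word count as counting `<w>` elements.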
Sample written text
 <text complete=Y decls='CN000 HN001 QN000 SN000'>
 <div1 complete=Y org=SEQ>
 <head type=MAIN>
 <s n=001>
 <w AT0>No <w CRD>1 </head>
 <head r=it type=SUB>
 <s n=002><w AVQ>How <w NN1>beer <w VBZ>is
 <w AJ0-VVN>brewed </head>
 <p><s n=003>
 <w NN1>Beer <w VVZ>seems <w DT0>such
 <w AT0>a <w AJ0>simple <w NN1>drink <w CJT>that
 <w PNP>we <w VVB>tend <w TO0>to <w VVI>take
 <w PNP>it <w CJS-PRP>for <w VVD-VVN>granted
 <c PUN>.
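
Three words in this sample carry hyphenated tags (AJ0-VVN, CJS-PRP, VVD-VVN), read here as cases where the tagger left two candidate word classes open. The sketch below lists such words from the sample; the function name `ambiguous_words` is an assumption made for illustration.

```python
import re

# A <w TAG> element and the text up to the next tag.
WORD = re.compile(r"<w ([A-Z0-9-]+)>([^<]*)")

def ambiguous_words(sgml):
    """Return (word, [candidate tags]) for every word with a hyphenated tag."""
    found = []
    for tag, raw in WORD.findall(sgml):
        candidates = tag.split("-")
        if len(candidates) > 1:
            found.append((raw.strip(), candidates))
    return found

# Lines copied from the written sample.
sample = """
<s n=002><w AVQ>How <w NN1>beer <w VBZ>is
<w AJ0-VVN>brewed </head>
<p><s n=003>
<w NN1>Beer <w VVZ>seems <w DT0>such
<w AT0>a <w AJ0>simple <w NN1>drink <w CJT>that
<w PNP>we <w VVB>tend <w TO0>to <w VVI>take
<w PNP>it <w CJS-PRP>for <w VVD-VVN>granted
<c PUN>.
"""

total_words = len(WORD.findall(sample))
unresolved = ambiguous_words(sample)
for word, candidates in unresolved:
    print(word, "could be", " or ".join(candidates))
print(f"{len(unresolved)} of {total_words} words left undecided")
```

A count like this, run over a whole text, gives a rough measure of how much indeterminacy the annotation leaves for the user to resolve.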
Transcription practice
  §   Regionalised typists
  §   Markup makes explicit
      Ÿ changes of speaker and overlap
      Ÿ words as perceived by transcriber
      Ÿ plus indications of false starts, truncation, uncertainty
      Ÿ some performance features e.g. pausing, stage
        directions etc.
      Ÿ speaker details where available (always for
        respondents, sometimes for others)
Sample spoken text
 <u who=PS04Y>
 <s n=01296><w ITJ>Mm <pause> <w ITJ>yes <pause dur=7>
 <w PNP>I <w VVD>told <w NP0>Paul <pause>
 <w CJT>that <w PNP>he <w VM0>can <w VVI>bring
 <w AT0>a <w NN1>lady <w AVP>up <pause> <w PRP>at
 <w NN1>Christmas-time<c PUN>.</u>
 <u who=PS04U>
 <s n=01297><w VBZ>Is <w PNP>he <w XX0>not
 <w VVG>going <w AV0>home <w AV0>then<c PUN>?</u>
 <u who=PS04Y>
 <s n=01298><w ITJ>No <pause dur=8> <w CJC>and
 <w UNC>erm <pause dur=7> <w PNP>I<w VBB>'m
 <w VVG>leaving <w AT0>a <w NN1>turkey <w PRP>in
 <w AT0>the <w NN1>freezer<c PUN>
 <s n=01299><w NP0>Paul <w VBZ>is <w AV0>quite
 <w AJ0>good <w PRP>at <w NN1-VVG>cooking <pause>
 <w AJ0>standard <w NN1>cooking<c PUN>.</u>
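
As a rough illustration of what this markup makes explicit, the sketch below pulls speaker changes, words, and pauses out of the first two utterances of the sample. It assumes every `<u>` is closed with `</u>` as shown; the function name `summarise` and the dictionary keys are hypothetical, and no unit is assumed for the `dur` values.

```python
import re

UTTERANCE = re.compile(r"<u who=(\w+)>(.*?)</u>", re.DOTALL)
WORD = re.compile(r"<w [A-Z0-9-]+>([^<]*)")
PAUSE = re.compile(r"<pause(?: dur=(\d+))?>")

def summarise(sgml):
    """Return one record per utterance: speaker, words, and pause counts."""
    turns = []
    for who, body in UTTERANCE.findall(sgml):
        words = [w.strip() for w in WORD.findall(body)]
        pauses = PAUSE.findall(body)          # "" where no dur= was given
        timed = [int(d) for d in pauses if d]
        turns.append({
            "who": who,
            "words": words,
            "timed_pauses": timed,
            "untimed_pauses": len(pauses) - len(timed),
        })
    return turns

# First two utterances copied from the spoken sample.
sample = """
<u who=PS04Y>
<s n=01296><w ITJ>Mm <pause> <w ITJ>yes <pause dur=7>
<w PNP>I <w VVD>told <w NP0>Paul <pause>
<w CJT>that <w PNP>he <w VM0>can <w VVI>bring
<w AT0>a <w NN1>lady <w AVP>up <pause> <w PRP>at
<w NN1>Christmas-time<c PUN>.</u>
<u who=PS04U>
<s n=01297><w VBZ>Is <w PNP>he <w XX0>not
<w VVG>going <w AV0>home <w AV0>then<c PUN>?</u>
"""

turns = summarise(sample)
for turn in turns:
    print(turn["who"], len(turn["words"]), "words,",
          len(turn["timed_pauses"]) + turn["untimed_pauses"], "pauses")
```

The `who` values are identifiers only; the speaker details themselves (age, sex, class, where recorded) are held elsewhere and looked up by that identifier.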
  §   each text has a TEI header
      Ÿ identification and classification
      Ÿ specific details (e.g. speakers)
      Ÿ housekeeping information
  §   all common data in the corpus header
  §   classification(s) in header pointed to by
      individual texts
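
The pointer arrangement can be sketched as a simple lookup: a text names the shared declarations it uses (as in the `decls` attribute of the written sample), and the corpus header supplies them. Everything in `CORPUS_HEADER` below is a placeholder invented for illustration; only the four identifiers come from the sample, and `resolve_decls` is a hypothetical helper.

```python
import re

# Hypothetical stand-in for the corpus header: identifier -> shared declaration.
# The identifiers come from the written sample; the values are placeholders.
CORPUS_HEADER = {
    "CN000": "placeholder declaration for CN000",
    "HN001": "placeholder declaration for HN001",
    "QN000": "placeholder declaration for QN000",
    "SN000": "placeholder declaration for SN000",
}

def resolve_decls(text_tag, header):
    """Look up every identifier listed in a <text decls='...'> tag."""
    match = re.search(r"decls='([^']*)'", text_tag)
    ids = match.group(1).split() if match else []
    missing = [i for i in ids if i not in header]
    if missing:
        # A dangling pointer is a validation failure, not something to skip.
        raise KeyError(f"unresolved header pointers: {missing}")
    return {i: header[i] for i in ids}

tag = "<text complete=Y decls='CN000 HN001 QN000 SN000'>"
resolved = resolve_decls(tag, CORPUS_HEADER)
for ident, declaration in resolved.items():
    print(ident, "->", declaration)
```

Keeping common data in one place means a correction to a declaration is made once, but it also means every text must be checked for pointers that no longer resolve.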
Text classifications
   §   spoken texts
       Ÿ age, sex, class (of respondent)
       Ÿ domain, region, type
   §   written texts
       Ÿ author age, sex, type
       Ÿ audience, circulation, status
       Ÿ medium, domain
   §   Intention was to improve coverage, not
In retrospect…
  §   Some classifications were poorly defined
      and only partially populated
      Ÿ Domain or text-type?
      Ÿ Dating
         • date of copy? first publication?
      Ÿ Author age
         • when?
      Ÿ Author ethnic origin, domicile
That famous BNC balance
Written Domains
Spoken domains
  §   BNC end-user licence
      Ÿ commercial exploitation of the corpus is
      Ÿ commercial exploitation of derived works is
      Ÿ OUCS is sole agent for licensing, reporting to
  §   Original restriction to EU has been lifted
Distribution methods
  §   100 million words is (still) a lot of data
  §   IPR agreements imply not-for-profit
      Ÿ (which has its downsides too)
  §   The options are...
      Ÿ install it yourself
      Ÿ online access
      Ÿ the sampler
Install it yourself (version 1)
   §   You need...
       Ÿ £220 for a licence and 3 CDs
       Ÿ £2000 for a Unix box with min 6 Gb disk
       Ÿ some Unix expertise
   §   You get...
       Ÿ access to the whole corpus
       Ÿ using the tools of your choice
       Ÿ configurable for a local network
   §   Version 2 will be delivered to run “standalone” on a suitably
       configured PC
    BNC Online service
§   You need...
    Ÿ access to the Internet
§   You get...
    Ÿ free (but limited) access using any web browser
    Ÿ free (temporary) access using SARA (PC only)
    Ÿ for an annual fee, SARA plus documentation

Accesses per month
The BNC Sampler
  §   You need...
      Ÿ $50 for a CD
      Ÿ A PC with a CD drive and (preferably) 90 Mb
        disk space
  §   You get...
      Ÿ 2% sample, half written, half spoken
      Ÿ four different search engines
      Ÿ documentation
  §   Available at this conference, at a special price !!
The BNC World Edition (aka
  §   has IPR clearance for world usage (we lose
      about 50 texts)
  §   extensive set of revisions and corrections
  §   catching up with the standards
  §   accompanied with new enhanced version of

          … and it’s nearly ready (honest)
Error correction issues
  §   Nothing can be added
  §   Catching up with the standards
      Ÿ CDIF … TEI … EAGLES… CES …
      Ÿ headers are now in TEI-conformant XML
  §   Indeterminacy of any transcription
      Ÿ On the scale of the BNC, especially
  §   If seven maids with seven mops…
Error Corrections in BNC2
  §   POS correction
      Ÿ Systematic
         • uses improved rules derived from BNC Sampler
         • significantly reduced error rate and indeterminacy
  §   Major production errors fixed
      Ÿ Semi-systematic
         •   duplicate texts
         •   wrongly labelled texts
         •   participant details
         •   classification errors and lacunae
  §   Typos remain... and will do so!
The BNC as an Open Corpus
  §   We chose SGML to encourage
      development of other tools
  §   This is coming more slowly than we
      expected, e.g. the Sampler
  §   But people still think the BNC and SARA
      are the same thing
New features in SARA
  §   POS code searches
  §   Collocation searches
  §   Subcorpora
  §   Lemmatization rules
  §   Usable with any TEI conformant corpus
What lessons have we learned?
  §   know your audience
  §   technological blind spots
  §   missed opportunities
Know your audience
  §   Everyone knows you should research the market
      Ÿ small, specialist research community, lexicographers
  §   The actual market is immense:
      Ÿ language learners
      Ÿ applied linguists
      Ÿ cultural historians
  §   and technically unsophisticated
      Ÿ hence often misled or disappointed
Technological blind spots
  §   we didn't expect the XML revolution!
         • so we wasted time in format conversion and
  §   we didn't foresee PCs with 8 Gb disks and
      sound cards!
         • so we didn’t try to get rights to the audio
         • and we focussed efforts on developing a
           client/server application
Missed opportunities: the R-word
  §   Original design talks of Representativeness
  §   This shifted to the idea of the BNC as a
      "fonds" : a source of specialist corpora
  §   This implies
      Ÿ a clearer and agreed taxonomy of text types
      Ÿ better access facilities for subcorpora
Missed opportunities: watching the river flow
  §   The BNC as a monitor corpus
  §   Diachronic sampling
      Ÿ But this implies a constant ability to fund and
  §   How long will we want to study the
      language of the nineties?
  §   Will the web provide?
