How specialized are specialized corpora?
Behavioral evaluation of corpus representativeness for Maltese




               Jerid Francom (Wake Forest University)
               Adam Ussishkin (University of Arizona)
                 Amy LaCross (University of Arizona)

     19 May 2010: O7 (Evaluation of Methodologies), 14.45-15.05
           LREC 2010, Mediterranean Conference Center
                          Valletta, Malta
Acknowledgements
 Generous contribution of data to this project by
 Dr. Albert Gatt (Univ. of Malta)
 Statistical expertise from Jeff Berry (Univ. of
 Arizona)
 Funding from the United States National
 Science Foundation (BCS-0715500) to Adam
 Ussishkin




Goals
 Issue
 For many languages, the quality of available textual data is
 less than ideal for corpus creation in the light of standard
 sampling practices.

 Propose
 Behavioral data can provide a valuable metric to evaluate
 corpus resources otherwise considered ‘specialized’.

 Case
 PsyCoL Maltese Lexical Corpus

 Contribute
 Novel, cross-discipline metric for evaluating the quality of
 language resources




Sparse coverage
 Most of the world’s 5,000-7,000 languages have no
 corpus resources

  Efforts to fill the gap often exploit the
  availability of language data on the web

     An Crúbadán project,
    446 languages
    (Scannell, 2007)

     McEnery et al. (2006),
    survey of recent work



Sparse coverage
Low-density languages
(Borin, 2009)
Languages for which
resources exist, but in
limited quantity/quality
  Limited access to print
  and/or electronic data

   Available primary data
  may be less-than-
  representative

      Weakens assurance
     that results from low-
     density language
     resources are credible

Corpus representativeness
  What is a ‘representative
 corpus’?

    An externally valid sample of
   language use

       A sample that
      approximates what the
      language is.

          Full range of structural
         types (language units)

  What are the characteristics of
 such a sample?

      Genre/register

      Modality
An issue for low-density languages

  Standard practice to achieve
  representativeness
    Apply rigorous sampling methods
    Collect large amounts of data
  Problematic for low-density languages: a
  representativeness bottleneck
    Lack large amounts of data
    Available data is often limited in register,
    modality, etc.
      Corpus resources are typically specialized
Assessing representativeness
  How do we know whether we have a
  ‘representative’ sample?
      We don’t, in an absolute sense.
      Faith in survey sampling practices
      Casting the net far and wide
  Can we be assured we don’t have a
  representative sample?
      Not exactly.
  •   It is logically possible that smaller, less diverse
      samples are externally valid for linguistic units
      that appear in the collection.
Proposal
 Need for an external metric.
 Current proposal suggests findings from
 behavioral experimentation can provide a
 valuable metric to evaluate corpus resources.
   Exploit the correlation between derived
   frequency counts and elicited behavioral
   reactions
     Behavioral data and adjusted frequency
     (Gries 2008; 2009)

   Of particular importance for specialized corpora


Behavioral findings
  Well-known robust effects for relative frequency
  in language processing
    Word naming RTs (e.g., Forster & Chambers, 1973)
    Lexical decision RTs (e.g., Carroll & White, 1973)
    Sentence reading RTs (e.g., MacDonald, 1994)
    Word familiarity ratings (e.g., Gernsbacher 1984)
  Log frequency is a good predictor of behavior.




Approach
 Evaluating corpus representativeness through
 behavioral assessment
 1. Derive frequency counts from a specialized
   corpus
 2. Elicit behavioral response of participants from
   target population
 3. Assess correlation strength: how well do
   behavioral responses correlate with corpus
   measures?




Case study and predictions
  Case study

    Calculate: log frequency of subset of items in a
    Maltese lexical corpus

    Measure: subjective word familiarity ratings of native
    speakers of Maltese
    Assess: relative distribution of the measures

  Prediction

    Congruence between relative distributions indicates a
    representative sample of the language

    Mismatches underscore potential sampling issues


The specialized corpus
  PsyCoL Maltese Lexical Corpus (PMLC)
  (Francom, Ussishkin, and Woudstra, 2009)
  http://psycol.sbs.arizona.edu/resources/
    Online Maltese newspapers, 1998-1999 and 2005-2007
    PsyCoL lab (59.8%) and Dr. Albert Gatt (40.2%)
      3,323,325 total tokens (53,000 unique)
      Type/token ratio of 1.6%
  Typical for low-density languages
    A large corpus, yet still relatively small (cf. British
    National Corpus, 100+ million; Corpus of
    Contemporary American English, 400+ million)
    Limited in register, modality
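
The 1.6% figure above corresponds to the number of unique forms divided by the total token count. A minimal sketch of that arithmetic, using the two numbers given on the slide (the variable names are illustrative, and 53,000 is the rounded figure as reported):

```python
# Figures as reported on the slide for the PMLC.
total_tokens = 3_323_325
unique_types = 53_000  # rounded figure from the slide

# Unique forms divided by running tokens, expressed as a percentage.
ratio_percent = 100 * unique_types / total_tokens
print(f"{ratio_percent:.1f}%")
```

Low values of this ratio are expected in any corpus of this size, since frequent words are repeated many times.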
Linguistic variable to quantify
Because there is little previous quantitative
research on Maltese, the empirical focus of this
investigation was narrowed to:
  Semitic-origin verbs/binyanim (also known as
  forms)
    Semitic-origin verbs in Maltese conform to the
    classical Semitic binyan system (categories
    based on morphosyntactic and phonological
    properties)

Question: How does frequency as measured in
our corpus correlate with behavior?
Can the binyan categories be exploited to
provide correlations?
Maltese binyanim
Binyan  Function                                    Prosodic shape  Example
1       basic active (transitive or intransitive)   CVCVC           kiser ‘he broke’
2       intensive of 1, transitive of 1             CVCCVC          kisser ‘he smashed’
3       transitive of 1                             CV:CVC          bi:rek ‘he blessed’
5       passive of 2, reflexive of 2                tCVCCVC         tkisser ‘it got smashed’
6       passive of 2, reflexive of 3                tCV:CVC         tki:teb ‘he corresponded’
7       passive of 1, reflexive of 1                nCVCVC          nkiser ‘it got broken’
8       passive of 1, reflexive of 1                CtVCVC          ftakar ‘he remembered’
9       inchoative, acquisition of a quality        CCV:C           hma:r ‘he blushed’
10      originally inchoative                       stVCCVC         stenbah ‘to wake’


A behavioral task: word familiarity
•   We devised three tests to measure corpus
    representativeness
•   Each test measured a different aspect of our
    corpus counts and our behavioral task.
•   The behavioral task involved native speakers of
    Maltese, who gave subjective word familiarity
    ratings for all Semitic-origin Maltese verbs taken
    from Aquilina (2000); n=1536.
        Ratings used a scale from very unfamiliar to very familiar
        Familiarity ratings have been shown to be a reliable predictor
        of lexical processing (Connine et al. 1990)


Word familiarity experiment
  Participants
    107 native speakers of Maltese
  Task
    Subjective word familiarity task, online




Measuring frequency in the corpus
•   We then used the PMLC to calculate word
    frequency measures for the same set of verbs.
•   Using regular expression-enabled searching,
    we counted token frequency for all verbs
    occurring in the PMLC (n=447).
•   Frequency was then log-transformed.
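
The counting step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `log_token_frequencies`, the simple `\w+` tokenization, the exact-form matching, and the add-one adjustment before taking the log are all hypothetical choices, not the authors' actual search patterns or encoding. The sample text is a toy token sequence, not PMLC data.

```python
import math
import re
from collections import Counter


def log_token_frequencies(text, verbs):
    """Count exact whole-word matches per verb and return log10(count + 1).

    The +1 is an assumption made here so that unattested verbs map to 0.0.
    """
    # \w is Unicode-aware in Python 3, so letters such as ħ are kept intact.
    tokens = Counter(re.findall(r"\w+", text.lower()))
    return {verb: math.log10(tokens[verb.lower()] + 1) for verb in verbs}


# Toy token sequence for illustration only.
sample = "kiser kisser kiser tkisser nkiser"
freqs = log_token_frequencies(sample, ["kiser", "kisser", "ftakar"])
for verb, value in freqs.items():
    print(verb, round(value, 3))
```

A real search would also need patterns for inflected forms of each verb, which this sketch does not attempt.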




Three tests
•   Next, we conducted three distinct statistical
    analyses to assess correlation between these
    corpus measures and the results of our word
    familiarity experiment
    •   1. Statistical regression between corpus log
        frequency and behavioral data.
    •   2. Binned items by frequency to determine
        whether any correlation is found.
    •   3. Binned items by binyan to determine whether
        any correlation is found.


1. Statistical regression
•   We found a weak correlation (r=.14). At best,
    this shows a trend toward correlation; it
    suggests that familiarity ratings are unlikely
    to predict word frequency.
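
For readers unfamiliar with the statistic, a correlation coefficient of this kind can be computed as sketched below. This assumes a plain Pearson r over paired values; the function name `pearson_r` and the toy numbers are hypothetical, and the sketch does not reproduce the regression reported here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy values for illustration only (not data from the study).
log_freq = [0.0, 0.5, 1.0, 1.5, 2.0]
familiarity = [3.0, 4.5, 4.0, 5.5, 6.0]
r = pearson_r(log_freq, familiarity)
print(round(r, 2))
```

Values of r close to 0, such as the .14 reported above, indicate that one measure explains very little of the variance in the other.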




2. Binning by frequency
•       Binning into two bands shows a correlation:




    •   Binning into three bands also shows a
        correlation:




2. Binning by frequency
•   An LMER analysis of each binning (2 groups
    and 3 groups) shows significance:
    •   All contrasts for two-bin intervals (High/Low=4.2,
        t=2.0) and three-bin intervals (High/Mid=7.1, t=3.9;
        Mid/Low=7.0, t=2.2) were significant.

    •   These results support the hypothesis that
        behavior and corpus measures are correlated.
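
The binning idea can be illustrated with a short sketch. It assumes equal-sized bands formed by ranking items on log frequency, and it only compares mean ratings per band; it is not the LMER analysis reported above. The function name `mean_rating_by_band` and the toy data are hypothetical.

```python
from statistics import mean


def mean_rating_by_band(items, n_bands):
    """items: list of (log_frequency, familiarity) pairs.

    Returns the mean familiarity per band, ordered from the lowest
    to the highest frequency band.
    """
    if n_bands < 1 or len(items) < n_bands:
        raise ValueError("need at least one item per band")
    ranked = sorted(items, key=lambda item: item[0])
    means = []
    for band in range(n_bands):
        # Integer arithmetic splits the ranked list into near-equal slices.
        start = band * len(ranked) // n_bands
        end = (band + 1) * len(ranked) // n_bands
        means.append(mean(rating for _, rating in ranked[start:end]))
    return means


# Toy values for illustration only (not data from the study).
items = [(0.0, 3.0), (0.3, 4.0), (0.9, 4.5), (1.2, 5.0), (1.8, 6.0), (2.1, 6.5)]
print(mean_rating_by_band(items, 2))
print(mean_rating_by_band(items, 3))
```

If ratings rise from the low band to the high band, the two measures pattern together at the group level even when the item-level correlation is weak.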




3. Binning by binyan
•   Earlier and ongoing work (Frost et al. 1997,
    1998, 2000; Ussishkin et al. in progress) shows
    binyan effects in Hebrew in both visual and
    auditory modalities, so Maltese could be
    expected to show similar effects.
•   Our goal here is to measure whether verbs,
    when grouped by binyan, show a correlation
    between word frequency measures and word
    familiarity ratings.




3. Binning by binyan
•   Only binyanim 1, 2, 5, 7 were analyzed;
    binyanim 3, 6, 8, 9, and 10 were not included in
    the analyses because they are so sparsely
    populated:




3. Binning by binyan
•   Word frequency results: significant contrasts
    found between Binyanim 7 and 2 (β=.54, t=6.0);
    and between Binyanim 7 and 5 (β=1.15, t=-2.2).
•   Word familiarity results: no significant contrasts
    found.




    [Figure: two panels, Binyan by word frequency and Binyan by word familiarity]
General assessment
•   The results show that verb frequency
    distributions in the PMLC pattern to some
    degree with the psychological representations
    of native speakers (the representative
    population)
•   On the surface, this suggests the PMLC is on the
    right track, but it also underscores the specialized
    nature of the corpus
•   However, a response bias in the
    word familiarity task may play
    a part in the mismatches
    •   Ceiling effect may have contributed
        to lower correlation scores
General assessment
•   Reasons to be optimistic about the
    verb distributions in the PMLC:

    •   Distribution of verb count/
        frequency (Zipf, 1949)




    •   Distribution of word length/
        frequency (Li, 1992)



•   Both measures trend as expected for
    representative samples
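
One rough way to check the Zipf (1949) expectation is to regress log frequency on log rank; a slope near -1 is the usual rule of thumb. The sketch below assumes a plain list of counts and an ordinary least-squares fit. The function name `zipf_slope` and the idealised toy counts are hypothetical, and this is not the authors' analysis.

```python
import math


def zipf_slope(counts):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two non-zero counts")
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Idealised toy counts in which frequency is proportional to 1 / rank.
toy_counts = [1000 / rank for rank in range(1, 51)]
slope = zipf_slope(toy_counts)
print(round(slope, 3))
```

Real counts will deviate from the idealised slope, especially in the low-frequency tail.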



Conclusion
•   Novel methodology: direct comparison between
    corpus resource and behavior.
•   Builds on a robust effect from
    psycholinguistics (frequency of linguistic units
    predicts behavior).
•   We proposed that the reverse also holds: behavior
    can be used to assess frequency counts, which provides
    a way to validate low-density language (LDL) resources.
•   This approach encourages cross-discipline
    endeavors for resource development and
    theoretical investigation.


•   Thank you very much!
•   Grazzi ħafna!



