
Assessing writing ability
Using anchor essays to enhance reliability



Hiske Feenstra
Cito, The Netherlands
AEA Europe 2010
Outline


• Research project
• Construction of rating scales with anchors
• Evaluation of rating scales with anchors
  –   Research questions & hypotheses
  –   Method
  –   Analyses
  –   First results
• Discussion



Research project – context


• Periodical Assessment of Educational
  Level in The Netherlands

  –   Primary school: grades 5 and 8
  –   Teaching
  –   Learning
  –   Trends




Research project – assessing writing

Assessment of writing in National Assessment:

• Analytic procedure
   – list of questions on essay
   – different set of questions per task
   – six aspects implicitly scored


• Inter-rater reliability for some aspects is rather
  low (±.70)
• No research on differences between grades
  (known group validity)
• Essays not rated as a whole
Research project – assessing writing

To improve reliability:
• True score vs. observed score

• Sources of unwanted variance
  –   Personal characteristics candidate
  –   Personal characteristics rater
  –   Task
  –   Marking scheme


• Goal: adjusting marking scheme to improve
  reliability
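
The true score versus observed score distinction can be written out with the standard classical test theory decomposition. The slide only names the two terms, so the symbols below (X for the observed score, T for the true score, E for error, assumed uncorrelated with T) are notation chosen for illustration:

```latex
X = T + E, \qquad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

Under this reading, each source of unwanted variance listed above contributes to the error variance, so reducing the share that comes from the marking scheme raises the reliability coefficient.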
Research project – rating scale

• Anchor essay = benchmark

[Figure: scale of writing ability running from - to +, with anchor essays and their characteristics marked along it]
Research questions & hypotheses

  Is the rating scale with anchors a useful addition
  to the current analytic rating procedure?
  Hypothesis: the rating scale is a useful addition to the
  current analytic rating procedure.

• To what extent does the adjusted rating scheme
  affect inter-rater reliability compared to the original
  scheme?
  It is expected that reliability will be affected
  positively.

• To what extent does the adjusted rating scheme add
  proof of validity compared to the original scheme?
  It is expected that the adjusted rating scheme will
  add proof of validity.
 Selecting writing tasks

• Selection of writing tasks

             A - Tigors & Giraks   B - Pookie   C - Yummy
text genre   story                 leaflet      letter
text goal    narrative             directive    argumentative


• Aspects to be assessed:
   – Content: task requirements
   – Structure: composition, layout
   – Correctness: syntax, spelling, punctuation

Collecting essays


      5 schools



      4 grades per school (5 to 8 = 8 to 12 yrs)



      >550 pupils



      In total: >1350 essays to rate
Construction of rating scales

• Choosing 3 anchor essays per scale, at -1 sd, the mean and +1 sd

• In total: 3 x 3 rating scales (1 per task, per aspect)
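
One way to picture the choice of three anchors is to pick, for each scale, the essays whose score lies closest to one standard deviation below the mean, the mean, and one standard deviation above it. The slides do not spell out the selection procedure, so this is only a sketch: the function name `pick_anchors`, the essay ids and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def pick_anchors(avg_scores):
    """Return the essay ids closest to mean - 1 sd, mean and mean + 1 sd.

    avg_scores maps a (hypothetical) essay id to its average rating
    on one task and one aspect.
    """
    values = list(avg_scores.values())
    m, sd = mean(values), pstdev(values)  # population sd is an assumption
    targets = {"-1 sd": m - sd, "mean": m, "+1 sd": m + sd}

    anchors = {}
    for label, target in targets.items():
        # Consider only essays that have not been chosen as an anchor yet.
        candidates = [e for e in avg_scores if e not in anchors.values()]
        anchors[label] = min(candidates, key=lambda e: abs(avg_scores[e] - target))
    return anchors


# Illustrative data only, not taken from the study.
example = {"e1": 1.0, "e2": 2.0, "e3": 3.0, "e4": 4.0, "e5": 5.0}
anchors = pick_anchors(example)
```

Running the same function once per task and per aspect would yield the 3 x 3 scales mentioned above.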
Construction of rating scales

• 4 expert raters rated 40 essays per task, per aspect
   – Average essays as benchmarks



  Agreement        A           B           C
  Content         .93         .88          .85
  Structure       .88         .80          .87
  Correctness     .89         .75          .80
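
The slides do not state which agreement coefficient underlies these figures. As an illustration only, the sketch below computes Cronbach's alpha with the raters treated as items, which is one common way to summarise agreement among several raters; the function name and the example data are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of rows, one row per essay,
    each row holding one score per rater."""
    k = len(ratings[0])  # number of raters
    # Variance of each rater's scores across essays.
    rater_vars = [pvariance([row[j] for row in ratings]) for j in range(k)]
    # Variance of the summed score per essay.
    total_var = pvariance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(rater_vars) / total_var)


# Illustrative data only: four raters who agree perfectly on three essays.
perfect = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]
alpha = cronbach_alpha(perfect)
```

With perfectly agreeing raters the coefficient reaches its maximum of 1; disagreement between raters pulls it down.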




Poster




Method – design

CONDITION 1                                    (legend: 1st, 2nd, 3rd)

       A - T&G            B - Pookie         C - Yummy
rater  A1 A2 A3 A4 A5 A6  B1 B2 B3 B4 B5 B6  C1 C2 C3 C4 C5 C6  essays
  1    50                       50                 50           150
  2       50                 50                       50        150
  3          50           50                             50     150
  4             50                 50                       50  150
  5                50                 50     50                 150
  6                   50                 50  50                 150
  7                   50  50                    50              150
  8                50        50                       50        150
  9             50        50                             50     150
 10          50                    50        50                 150
 11       50                          50                    50  150
 12    50                                50     50              150
 13    50                       50                 50           150
        3  2  2  2  2  2   3  2  2  2  2  2   3  2  2  2  2  2  ratings
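
The totals in the design can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the table (13 raters, three sets of 50 essays each, and the "ratings" row); the variable names are chosen here for illustration.

```python
# Number of raters per essay set, copied from the "ratings" row of the design.
ratings_per_set = {
    "A": [3, 2, 2, 2, 2, 2],
    "B": [3, 2, 2, 2, 2, 2],
    "C": [3, 2, 2, 2, 2, 2],
}
RATERS = 13
SETS_PER_RATER = 3    # one set per task
ESSAYS_PER_SET = 50

# Every rater-set assignment must show up once in the "ratings" row.
total_assignments = sum(sum(counts) for counts in ratings_per_set.values())
assert total_assignments == RATERS * SETS_PER_RATER

# Workload per rater, matching the "essays" column.
essays_per_rater = SETS_PER_RATER * ESSAYS_PER_SET
```

The check confirms that every essay set is rated by at least two raters and that each rater handles 150 essays.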
First results – agreement

Content
task      old    new    scale
A         .84    .78    .79
B         .86    .86    .79
C         .87    .87    .87
overall   .85    .84    .82

Structure
task      old    new    scale
A         .78    .80    .82
B         .78    .79    .80
C         .72    .84    .84
overall   .76    .81*   .82
* significant (p .008)

Correctness
task      old    new    scale
A         .75    .79    .82
B         .80    .75    .82
C         .74    .77    .82
overall   .76    .77    .82
   First results – evaluation

  Usefulness of the rating scale

                                          
   New                 1            8         2
   Old       1         2            8         4

“Good to have some reference while rating.”
“Overall score sometimes differs from score on items.”
“Short summaries proved very useful.”
Discussion


• Agreement is reasonably high (new and old)
   – Task effect?


• In total: little effect on reliability
   – Only useful for assessment of structure?


• Rating scales (single items) provide high
  agreement
• Raters consider anchors useful while rating
   – Useful in classroom assessment?
       Assessing writing ability
Using anchor essays to enhance reliability



         hiske.feenstra@cito.nl




Future research


• Evaluation of items
   – IRT


• Validity (known group)
   – Assessing writing development known group


• Automated essay scoring
   – Linguistic quality
   – Using syntax parser


Discussion
Items with low agreement:
  Structure
  – The student uses connectives.                  Y/N
  – Relations between sentences are clear.         Y/N

  Content
  – The story contains surprising elements.
  – The story has diverse timing (e.g. leaps in time).

  Correctness (style)
  - The writer has made an effort to provide a readable text
    (readability, writing technique, style).


Research project – rating scales

Quality           Holistic scale         Analytic scale
Reliability       ± acceptable           + higher

Construct         ± assumes all aspects  + assesses each
validity          develop at same rate   aspect separately

Information       -- single score        + more information

Practicality      + fast and easy        -- time-consuming;
                                         expensive

Authenticity      + natural process      ± less natural

 Weigle (2002), Knoch (2009)

