

          Designing and Implementing School-Level
             Assessments with District Input

   John Sabatini, Kelly Bruce, and Srinivasa (Pavan) Pillarisetti
                   Educational Testing Service

This research is funded in part by a grant from the Institute of Education Sciences
(R305G04065). Any opinions expressed in this article are those of the authors and not
necessarily those of Educational Testing Service. Email:
     I would like to acknowledge:
 Strategic Educational Research Partnership (SERP) Institute

 BPS Design Team (esp. David Francis)

 Boston University (Gloria Waters and David

 Harvard University (Catherine Snow, Sky Marrietta, Claire White, Joshua Lawrence)

 Boston Public Schools

 Brockton Public Schools
                      Design Team

     Design Team Assessment Subgroup -- Drs. David
       Francis, Univ. of Houston; Gloria Waters, Boston
       University; and John Sabatini, ETS

     Charge -- advise SERP and BPS on ways in which the
       assessment of reading could be made more efficient
       and productive in the district.

                   Needs Assessment

    In initial Design Team meetings, we learned that district
       leaders had made significant investments in:

       Reading intervention products
       Teacher professional development to support literacy
       State test results
       Students took many tests (mostly mandated)

                   Problem Definition
    However, there were no consistent reading/literacy instruments for
      Determining the nature or severity of reading problems
      Identifying the prevalence and profiles of struggling readers
      Receiving timely results

    Consequences:
      Inefficient placement of students into interventions
      Weak/insensitive measures of the effectiveness of
       interventions that target subskills

    Short-term goals
     Build a battery of screening/diagnostic
      assessments for school-wide use
     Estimate prevalence and nature of student
      reading difficulties

    Long-term goals
     Replace other redundant assessments
      Triage students for specialized testing, thus
       reducing the total time spent on assessment
     Use instruments to evaluate intervention programs

     The challenge was to build a battery that:
    a) screened for reading difficulties across a wide
       range of skills, from decoding through vocabulary;
    b) had acceptable psychometric properties;
    c) was compact (i.e., could be administered in about
       one 40-50 minute session);
    d) could feasibly be implemented school-wide;
    e) provided rapid turnaround of score reports (i.e., within 2-3
       days); and
    f) was useful at multiple stakeholder levels (e.g., teacher,
       school, district).
    Computerized delivery and scoring could make it
       feasible to meet nearly all of the above design
       constraints, and was viewed as desirable by BPS.
               Rationale & Background
    Theoretical perspective grounded in componential
      approaches to reading assessment (Cain, Bryant, & Oakhill,
      2004; Oakhill, Cain, & Bryant, 2003; Perfetti, Landi, & Oakhill, 2005).

    Although skilled, proficient readers are characterized by
       the integrative, interactive nature of processing
       during any reading task, there is nonetheless
       evidence for subcomponent skills.

    Component reading measures can be used as
        indicators of skill profiles of struggling readers,
        adding value over and above the types of off-the-shelf
         comprehension tests the district was using (Sabatini, 2009).

              Background Literature
    As a general principle, tests were designed to align with
      empirical research on struggling readers' difficulties
      and effective instructional programs (e.g., NRC, 1998; NICHD);

    as well as with cognitive and linguistic theories of the
      skills underlying reading development and difficulty
      (e.g., Kintsch, 2000; Perfetti, Landi, & Oakhill, 2005; Perfetti, Van Dyke, & Hart,
      2001; Rayner et al., 2007; Vellutino, Tunmer, Jaccard, & Chen, 2007).

               Typical Test Design Steps

     Step 1: Construct definition

     Step 2: Design Specifications/Test Blueprint

     Step 3: Test construction

     Step 4: Conduct pilot

     Step 5: Conduct field trial

     Step 6: Go operational
                    ‘Use-inspired’ approach
     Step 1: Define an assessment problem and information
        need/criteria for success

     Step 2: Get research assessment team(s) to ‘volunteer’ to
        commit time and resources to problem (in return for data)

     Step 3: Cobble together funding to accomplish initial aims
        (e.g., SERP foundation support; researcher grants)

     Step 4: Get district approval; find some schools willing (and
        able) to work with you on pilot implementation

     Step 5: Design/adapt items; conduct pilot/field studies;
        analyze data, report back to district and SERP

     Step 6: Rinse and repeat as necessary.
               ‘Use-inspired’ process

     In sum:
          The process is variable and complex
          It involves multiple, iterative pilots

     Each pilot was designed to investigate different research
       and practical questions, ultimately moving the team
       towards an assessment solution that met the needs of
       both district and research stakeholders.

             Pilot 1, June 2006 – September 2006
     Participants: two middle schools, Summer 2006; follow-up
       September 2006.

     Questions:
     1. How prevalent were basic reading skill difficulties -- basic
        decoding, word recognition, and reading fluency?
     2. Can we implement this without schools and districts

     Answers:
     1. Yes, at least in these schools, significant numbers of students
        showed basic reading skill difficulties.
     2. Yes. [We have a great team.]

     Conclusion: So far so good, let’s try again.
                           Pilot 2, June 2007
     Participants: three middle and three high schools. CORE battery;
       a random half took ETS or BU subtests.

     Questions:
     1. Confirm the basic reading difficulties finding with externally valid tests.
     2. Begin exploring the relationship of subtests to external test criteria.

     Answers:
     1. Substantial numbers of students showed word reading difficulties on
        the TOWRE (Torgesen, Wagner, & Rashotte, 1999) and on both BU and ETS tests.
     2. Moderate to strong correlations with MCAS and other external criteria.

     Conclusion: the evidence supported the directions chosen by the
       intervention design teams to develop vocabulary and basic skills
       programs; but how to reduce the battery?

           Pilot 3, September 2007 - December 2007
     Participants: two middle and two high schools from the previous
       pilot; two new middle schools. CORE battery; a random half took
       ETS or BU subtests.

     Questions:
     1. Feasible scoring: test multiple-choice vs. oral response measures.
     2. How best to combine measures into a feasible, parsimonious
        mixture that spanned the range of reading skills?

     Answers:
     1. Multiple choice can work.
     2. Indeterminate; the total battery was too long, but there was no
        clear path for reducing it.

     Conclusion: Rinse and repeat.
                   Pilot 4, Fall and Spring 2008
     Participants: two middle schools, with a follow-up at one school in
       the Spring. Six-subtest battery.

     Questions:
     1. Improve psychometric and scale qualities of subtests.
     2. Gather evidence of added value in subtests over total scores.

     Answers:
     1. Reliability and other test properties showed improvement:
        - cross-grade performance levels were in predicted ranges;
        - sentence and comprehension tests needed improvement.
     2. Evidence that subscores were contributing added value over and
        above total scores (Sabatini, Bruce, & Sinharay, 2009).

     Conclusion: Given the success of the battery so far, it seems
       appropriate to implement a larger-scale trial. [Repeat]
                              Pilot 5, Fall 2009
     Participants: field test with over 4000 6th-8th graders (Form 6) and 500 4th-
        5th graders (Form 4, which was new). The forms shared 50% of their content.

     Questions:
     1. Improve item and form psychometrics.
     2. Build scales linked to previous-year MCAS scores.
     3. Refine the score reporting.
     4. Pilot versions designed for grades 4 and 5.

     Answers:
     1. Reliability and other test properties showed improvement.
     2. Created a scale for each subtest, aligned with the MCAS Warning, Needs
        Improvement, and Proficient levels.
     3. Presented with SERP in a district meeting so that individual schools could
        use the data to plan for future literacy needs.
     4. Initial data promising, but needs further work.

     Conclusion: We now have a scaled test that is functional for operational needs
       at the 6-8 range.
Summary: Pilot Site Information

                      Pilot 1   Pilot 2   Pilot 3   Pilot 4   Pilot 5
Number of schools         3         6         6         2        12
Number of students      373       573       960       785      4908
Grades                  6-7      6-11       6-9       6-8       4-8
           Summary: Reliability Estimates
                                               Cronbach’s Alphas (Number of Items)

Subtest                             Pilot 1      Pilot 2    Pilot 3     Pilot 4      Pilot 5

Real Word Recognition               .93 (40)    .87 (20)
Pseudoword Reading                  .89 (20)
Pseudohomophone Judgment            .84 (56)
Lexical Decision                                                       .88 (52)      .91 (50)
Semantic Similarity                                        .70 (28)    .82 (35)      .88 (38)
Morphological Awareness             .81 (18)    .65 (10)   .82 (24)    .89 (30)      .91 (32)
Sentence Processing                                                    .73 (25)      .82 (26)
Efficiency of Basic Reading Comp.                          .93 (25)    .86 (33)      .91 (36)
Reading Comprehension                                                  .75 (22)      .78 (21)
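The alpha estimates above are standard internal-consistency coefficients. For readers unfamiliar with the statistic, a minimal sketch of the computation (the data below are an illustrative toy example, not pilot data):

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for k items.

    item_scores: list of k lists, each holding one item's scores
    across all examinees (0/1 for dichotomous items).
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(item_scores)
    # Total score for each examinee (sum across items).
    totals = [sum(person) for person in zip(*item_scores)]
    item_var = sum(pvariance(item) for item in item_scores)
    total_var = pvariance(totals)
    return (k / (k - 1)) * (1 - item_var / total_var)


# Toy example: 4 dichotomous items, 5 examinees.
items = [
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
]
print(round(cronbach_alpha(items), 2))  # → 0.79
```

Note that alpha depends on the number of items, which is why each cell in the table reports the item count alongside the coefficient.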
         Challenges and Lessons Learned
     •    Designing for multiple purposes and

     •    Adapting to the fits and starts of district- and
          school-level decision-making

     •    Sharing actionable results with stakeholders

     •    Technological infrastructure of schools and districts

     •    Collaborating with other research groups
     Contact Information

       John Sabatini

                   Typical Test Design Steps
     Step 1: Construct definition
      defines the target population,
      the content and constructs to be measured, and
      the inferences or claims that test scores are intended to
        support.

     Step 2: Design specifications
      a specification process that includes
      defining a test blueprint,
      test administration and scoring logistics, and
      constraints.

     Step 3: Test construction
      generate and review items,
      develop test forms, and
      draft administration and scoring guidelines.
                   Typical Test Design Steps
     Step 4: Conduct pilots
      assess basic administration and scoring assumptions and
        revise accordingly;
      identify poorly performing items, which are then revised or
        discarded.

     Step 5: Conduct field trial
      sample the target population
      statistical analysis/psychometrics
      scales, norming, equating (as needed)
      validity studies

     Step 6: Go operational
      that is, tests are administered (or sold) for use under test
        conditions such that score reports are used to inform
        educational decisions.

                Background Literature
     Assessment of component skills is useful in screening
       struggling readers who may have failed to acquire
       efficient fundamental skills in the elementary school years
        – measures of fluency and word reading efficiency are
          common in research and classrooms across grade levels (e.g.,
           Deno & Marston, 2006; Wayman et al., 2007).

     Reading component proficiency is typically
       characterized by increasingly automatic and efficient processing
        – important in the middle grades and beyond in handling the
          increasing quantity and complexity of texts (ACT Inc., 2009; Adlof, Catts,
           & Little, 2006; Jenkins et al., 2003; Kuhn et al., 2010; Rayner et al., 2003; Torgesen, Wagner, &
           Rashotte, 1999).

