                             Dept of Mathematical Sciences
                King Fahd University of Petroleum and Minerals

Presenter                       Dr. Mohammad H. Omar,
                         Dept of Mathematical Sciences, KFUPM

Title             The Role of Statistics in Developing Standardized
                               Examinations in the US
Audience               All members of the KFUPM community are cordially invited
Date                               Tuesday, Apr 19, 2005
Time                                      12:30 PM
Location                    Building 5, Smart Classroom # 201

  Prior to joining KFUPM, the presenter spent more than 6 years working in
  educational testing organizations in the United States and attended many
  conferences on educational assessment issues organized by associations such as
  the National Council on Measurement in Education and the American Educational
  Research Association. In this presentation, the presenter plans to share with
  the audience the role of statistical indices and procedures in informing the
  process of developing standardized examinations (such as the ACT, SAT, TOEFL,
  and state-mandated exams). Definitions of standardized and non-standardized
  examinations will be provided in this talk. Issues related to item-level and
  examination-level test construction and analysis, and how statistics help
  address some of these issues, will also be discussed.

                     Tea and Coffee will be provided
 The Role of Statistics in Developing
Standardized Examinations in the United States

   Mohammad Hafidz Omar, Ph.D.
         April 19, 2005
Map of Talk
• What is a standardized test?
• Why standardize tests?
• Who builds standardized tests in the United
  States?
• Steps to building a standardized test
• Test questions & some statistics used to
  describe them
• Statistics used for describing exam scores
• Research studies in educational testing that use
  advanced statistical procedures
               What is a “standardized test”?
• A standardized test: a test for which the
  conditions of administration and the
  scoring procedures are designed to be the
  same in all uses of the test
  – Conditions of administration:
      • 1) physical test setting
      • 2) directions for examinees
      • 3) test materials
      • 4) administration time
  – Scoring procedures:
      • 1) derivation of scores
      • 2) transformation of raw scores
      Why standardize tests?

• Statistical reason:
  – Reduction of unwanted variations in
     • Administration conditions
     • Scoring practices
• Practical reason:
  – Appeal to many test users
  – Same treatment and conditions for all
    students taking the tests (fairness)
Who builds standardized tests in the
         United States?
• Testing Organizations
   –   Educational Testing Service (ETS)
   –   American College Testing (ACT)
   –   National Board of Medical Examiners (NBME)
   –   Iowa Testing Programs (ITP)
   –   Center for Educational Testing and Evaluation (CETE)
• State Department of Education
   – New Mexico State Department of Education
        • Build tests themselves or
        • Contract out job to testing organizations
• Large School Districts
   – Wichita Public School Districts
a) Administration conditions

• Design-of-experiments concept of controlling
  for extraneous (nuisance) factors
• Apply the same treatment conditions for
  all test takers
    • 1) physical test setting (group vs individual testing, etc)
    • 2) directions for examinees
    • 3) test materials
    • 4) administration time
b) Scoring Procedures
• Same scoring process
  – Scoring rubric for open-ended items
• Same score units and same measurements for
  – Raw test scores (X)
  – Scale Scores
• Same Transformation of Raw Scores
  – Raw scores (X) → Equating process → Scale scores h(X) (see the sketch below)
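
As a rough illustration of the raw-to-scale transformation above, the sketch below applies mean-sigma linear equating, one of several possible equating methods; the scores and form names are invented for illustration.

```python
import numpy as np

def linear_equate(x_scores, y_scores, x):
    """Mean-sigma linear equating: map a raw score x on form X onto the
    raw-score metric of reference form Y (assumes comparable examinee groups)."""
    mu_x, sd_x = np.mean(x_scores), np.std(x_scores, ddof=1)
    mu_y, sd_y = np.mean(y_scores), np.std(y_scores, ddof=1)
    return mu_y + (sd_y / sd_x) * (x - mu_x)

# Hypothetical raw scores observed on two forms of the same exam
form_x = np.array([12, 15, 18, 20, 22, 25, 27, 30])
form_y = np.array([10, 14, 16, 19, 21, 24, 26, 29])
print(linear_equate(form_x, form_y, x=20))  # raw 20 on X expressed on Y's metric
```
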
    Overview of Typical Standardized
     Examination building Process
• Costly process
• Important Quality control procedures
    at each phase
•   Process takes time (months to years)
     1) Creating Test specifications
     2) Fresh Item Development
     3) Field-Test Development
     4) Operational (Live) Test Development
    1) Creating Test specifications
• Purpose:
   – To operationalize the intended purpose of testing
• A team of content experts and stakeholders
   – discuss the specifications vs the intended purpose
• Serves as a guideline to building examinations
   – How many items should be written in each content/skill area?
   – Which content/skill area is more important than others?
• A 2-way table of specifications typically contains
   – content areas (domains) versus
   – learning objectives
   – with a % of importance associated with each cell (see the blueprint sketch below)
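
A minimal sketch of how such a blueprint drives item writing; the content areas, objectives, percentages, and test length below are invented for illustration only.

```python
# Hypothetical 2-way table of specifications: (content area, learning objective)
# mapped to the percentage of the exam allotted to that cell.
blueprint = {
    ("Algebra",    "Recall"):      10,
    ("Algebra",    "Application"): 20,
    ("Geometry",   "Recall"):      15,
    ("Geometry",   "Application"): 25,
    ("Statistics", "Recall"):      10,
    ("Statistics", "Application"): 20,
}

total_items = 60  # assumed planned test length

# Translate each cell's percentage into an item-writing target.
for (content, objective), pct in blueprint.items():
    n_items = round(total_items * pct / 100)
    print(f"{content:<11} {objective:<12} {pct:>3}% -> write {n_items} items")
```
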
     2) Fresh Item Development
• Purpose:
    – building quality items to meet test specifications
• Writing Items to meet Test Specifications
    – Q: Minimum # of items to write?
    – Which cell will need to have more items?
    – Item Review (Content & Bias Review)
• Design of Experiment stage
    – Design of Test (easy items first, then mixture – increase motivation)
    – Design of Testing event (what time of year, sample, etc)
• Data Collection stage:
    – Pilot-testing of Items
    – Scoring of items & PT exams
• Analyses Stage:
    – analyzing Test Items
• Data Interpretation & decision-making stage:
    – Item Review with aid of item statistics
        • Content Review
        • Bias review
    – Quality control step: (1) keep good-quality items, (2) revise items with minor problems
      & re-pilot, or (3) scrap bad items (a decision-rule sketch follows below)
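
A minimal sketch of the keep / revise / scrap decision rule; the p-value and point-biserial cut-offs are illustrative assumptions, not standards taken from the talk.

```python
def qc_decision(p_value, pt_biserial):
    """Classify a piloted item as keep, revise & re-pilot, or scrap.
    The thresholds below are assumed for illustration only."""
    if 0.30 <= p_value <= 0.90 and pt_biserial >= 0.25:
        return "keep"
    if pt_biserial >= 0.10:          # weak but possibly salvageable after revision
        return "revise & re-pilot"
    return "scrap"

# Hypothetical pilot-test statistics for three items
pilot_stats = {"Q1": (0.72, 0.41), "Q2": (0.95, 0.15), "Q3": (0.40, 0.02)}
for item, (p, r) in pilot_stats.items():
    print(item, qc_decision(p, r))
```
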
      3) Field-Test Development
• Purpose:
   – building quality exam scales to measure the construct (structure) of
     the test as intended by the test specifications
• Design of Experiment stage
   – Designing Field-Test Booklets to meet Specifications
        • Use good items only from previous stage (items with known descriptive
          statistics)
   – Design of Testing event
• Data collection:
   – Field-Testing of Test booklets
   – Scoring of items and FT Exams
• Analyses
   – analyzing Examination Booklets (for scale reliability and validity)
• Interpreting results: Item & Test Review
   – Do the tests meet the minimum statistical requirements (e.g. rxx’ > 0.90)?
   – If not, what can be done differently?
                4) Operational (Live)
                   Test Development
• Purpose:
   – To measure student abilities as intended by the purpose of the test
• Design of Experiment stage
   – Design of Operational Test
       • Use only good FT items and FT item sets
       • Assembling Operational Exam Booklets
   – Design of Pilot Tests (e.g. some state mandated programs)
       • New & Some of the revised items
   – Design of Field Test (e.g. GRE experimental section)
        • Good items that have been piloted before
       • How many sections? How many students per section?
   – Design of additional research studies
       • e.g. Different forms of the test (Paper-&-pencil vs computer version)
   – Design of Testing events
• Data Collection:
   – First Operational Testing of Students with Final version of examinations
   – Scoring of items and Exams
• Analyses of Operational Examinations
• Research studies to establish Reporting scales
      Different types of exam items
• Machine-scorable formats
  – Multiple-choice Questions
  – True-False
  – Multiple true-false
  – Multiple-mark questions (Pomplun & Omar, 1997) –
    aka multiple-answer multiple-choice questions
  – Likert-type items (agree/disagree continuum)
• Manual (Human) scoring formats
  – Short answers
  – Open-ended test items
     • Requires a scoring rubric to score papers
        Statistical considerations in
         Examination construction
• Overall design of tests
    – to achieve reliable (consistent) and valid results
• Designing testing events
    – to collect reliable and valid data (correct pilot sample, correct
      time of the year, etc)
    – e.g. SAT: Spring/Summer student population
• Appropriate & Correct Statistical analyses of
    examination data
•   Quality Control of test items and exams
     Analyses & Interpretation:
 Descriptive statistics for distractors
          (Distractor Analysis)
• Applies to multiple-choice, true-false, and
  multiple true-false formats only
• Statistics:
  – Proportion endorsing each distractor
  – Informs the exam authors which distractor(s)
     • are not functioning, or
     • are counter-intuitively more attractive than the
       intended right answer (high-ability examinees choosing a
       wrong answer) – see the sketch below
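
A minimal distractor-analysis sketch: the proportion endorsing each option, plus the mean total score of the endorsers, flags non-functioning distractors and distractors that attract high-ability examinees. The responses and scores below are invented.

```python
import numpy as np

# Hypothetical responses of 8 examinees to one multiple-choice item (key = "B"),
# paired with their total test scores.
responses    = np.array(["B", "C", "B", "A", "B", "C", "D", "B"])
total_scores = np.array([38, 22, 41, 17, 35, 30, 12, 44])
key = "B"

for option in ["A", "B", "C", "D"]:
    chose = responses == option
    prop = chose.mean()                                   # proportion endorsing
    mean_total = total_scores[chose].mean() if chose.any() else float("nan")
    tag = "(key)" if option == key else ""
    print(f"{option}: endorsed by {prop:.2f}, mean total score {mean_total:5.1f} {tag}")
# A distractor whose endorsers average a higher total score than the key's
# endorsers is a candidate for content review.
```
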
          Analyses and Interpretation:
                     Item-Level Statistics
• Difficulty of Items
    – Statistics:
        • Proportion correct {p-value} – mc, t/f, m-t/f, mm, short answer
        • Item mean – mm, open-ended items
    – Describes how difficult an item is
• Discrimination
    – Statistics:
        • Discrimination index: difference in p-value between high- and low-scoring examinee groups
             – An index describing sensitivity to instruction
        • item-total correlations: correlation of item (dichotomously or
          polychotomously scored) with the total score
             – pt-biserials: correlation between total score & dichotomous (right/wrong)
               item being examined
             – Biserials: same as pt-biserials except that the dichotomous item is now
               assumed to come from a normal distribution of student ability in
               responding to item
             – Polyserials: same as biserials except that the item is polychotomously
               scored
    – Describes how an item relates to (and thus contributes to) the total score
      (computational sketch below)
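
A minimal sketch of these item-level statistics on invented data; the 27% upper/lower grouping used for the discrimination index is a common convention assumed here.

```python
import numpy as np

# Hypothetical 0/1 scores of 10 examinees on one item, plus their total test scores.
item   = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
totals = np.array([34, 18, 40, 29, 15, 37, 22, 45, 31, 20])

# Difficulty: proportion answering correctly (the item "p-value").
p_value = item.mean()

# Discrimination index: p-value in the top 27% of examinees minus the bottom 27%.
k = max(1, round(0.27 * len(totals)))
order = np.argsort(totals)
d_index = item[order[-k:]].mean() - item[order[:k]].mean()

# Point-biserial: Pearson correlation between the 0/1 item score and the total score.
pt_biserial = np.corrcoef(item, totals)[0, 1]

print(f"p-value = {p_value:.2f}, D = {d_index:.2f}, r_pbis = {pt_biserial:.2f}")
```
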
Examination-Level Statistics
• Overall Difficulty of Exams/Scale
    – Statistics: Test mean, Average item Difficulty
• Overall Dispersion of Exam/Scale scores
    – Statistics: Test variability – standard deviation, variance, range, etc
• Test Speededness
    – Statistics: 1) percent of students attempting the last few questions;
                  2) percentage of examinees finishing the test within the allotted time period
    – Not a speeded test: the percentage is more than 95%
• Consistency of the Scale/Exam Scores
    – Statistics:
        • Scale Reliability Indices
              – KR-20: for dichotomously scored items
              – Coefficient alpha: for dichotomously and polychotomously scored items
                (computed in the sketch below)
        • Standard error of Measurement Indices
• Validity Measures of Scale/Exam Scores
    – Intercorrelation matrix
        • High Correlation with similar measures
        • Low correlation with dissimilar measures
    – Structural analyses (Factor analyses, etc)
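
A minimal sketch of the consistency indices above: coefficient alpha (KR-20 for 0/1 items) and the standard error of measurement, computed on an invented examinee-by-item matrix.

```python
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's coefficient alpha for an (examinees x items) score matrix;
    for dichotomous 0/1 items this reduces to KR-20."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical 0/1 responses of 6 examinees to 5 items
X = np.array([[1, 1, 1, 0, 1],
              [0, 1, 0, 0, 1],
              [1, 1, 1, 1, 1],
              [0, 0, 1, 0, 0],
              [1, 0, 1, 1, 1],
              [0, 1, 0, 1, 0]])

alpha = coefficient_alpha(X)
sem = X.sum(axis=1).std(ddof=1) * np.sqrt(1 - alpha)   # standard error of measurement
print(f"alpha = {alpha:.2f}, SEM = {sem:.2f}")
```
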
Statistical procedures describing
Validity of Examination scores for its
intended use
• Does the exam, as experienced by the students, match the authors’
  exam specifications?
   – Construct validity: Analyses on exam structures (Intercorrelation
     matrix, Factor analyses, etc)
       • Can the exam measure the intended learning factors (constructs)?
       • Answer: with Factor analyses (Data Reduction method)
   – Predictive validity: predictive power of exam scores for
     explaining important variables
       • e.g. Can exam scores explain (or predict) success in college?
       • Regression Analyses
   – Differential Item Functioning: statistical bias in test items
        • Are test items fair for all subgroups (female, Hispanic, Black, etc.)
          of examinees taking the test?
        • Mantel-Haenszel chi-squared statistic (see the sketch below)
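
A minimal sketch of the Mantel-Haenszel DIF statistic (continuity-corrected form) computed from 2x2 tables stratified by total-score level; the counts below are invented.

```python
import numpy as np
from scipy.stats import chi2

def mantel_haenszel_chi2(tables):
    """Mantel-Haenszel chi-square over K score strata.
    Each table is ((ref_correct, ref_wrong), (focal_correct, focal_wrong))."""
    A = E = V = 0.0
    for (a, b), (c, d) in tables:
        n_ref, n_foc = a + b, c + d
        m1, m0 = a + c, b + d               # stratum totals: correct / incorrect
        T = n_ref + n_foc
        A += a                              # observed correct in reference group
        E += n_ref * m1 / T                 # expected count under no DIF
        V += n_ref * n_foc * m1 * m0 / (T**2 * (T - 1))
    stat = (abs(A - E) - 0.5) ** 2 / V      # continuity-corrected statistic
    return stat, chi2.sf(stat, df=1)        # compare to chi-square with 1 df

# Hypothetical tables for one item at three total-score levels
tables = [((30, 10), (20, 15)),
          ((40,  5), (28, 10)),
          ((25, 20), (15, 22))]
print(mantel_haenszel_chi2(tables))
```
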
Some research areas in Educational
Testing that involve further
statistical analyses
• Reliability Theory
   – How consistent is a set of examination scores? A signal-to-(signal + noise)
     ratio, σ²_T / (σ²_T + σ²_E), in educational measurement
• Generalizability Theory
   – Describing & Controlling for more than 1 source of error variance
• Differential Item Functioning
   – Pair-wise differences (F vs M, B vs W) in student performance on test items
   – Type I error rate control (many items & comparisons → inflated
     false-detection rates) issue
Some research areas in Educational
Testing that involve further statistical
analyses (continued)
• Test Equating
   – Two or more forms of the exam: Are they interchangeable?
   – If scores on form X are regressed on scores from form Y, will the scores
     from either test edition be interchangeable? Different regression lines
     result in each direction, so equating, rather than regression, is used.
• Item Response Theory
   – Theory relating students’ unobserved ability with their responses to test items
   – Probability of responding correctly to test items for each level of ability
     (item characteristic curves) – see the ICC sketch below
   – Can put items (not just whole tests) on the same common scale.
• Vertical Scaling
   – How does student performance from different school grade groups
     compare with one another?
   – Are their means increasing rapidly, slowly, etc?
   – Are their variances constant, increasing, or decreasing?
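
A minimal sketch of an item characteristic curve under the two-parameter logistic (2PL) IRT model, one common model choice; the item parameters below are invented.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: probability of a correct response at
    ability theta, given item discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item: moderate discrimination (a = 1.2), average difficulty (b = 0)
theta_grid = np.linspace(-3, 3, 7)
print(np.round(icc_2pl(theta_grid, a=1.2, b=0.0), 3))
```
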
Some research areas in Educational
Testing that involve further
statistical analyses (continued)
• Item Banking
   – Are the same items from different administrations significantly different
     in their statistical properties?
   – Need Item Response Theory to calibrate all items so that there’s one
     common scale.
   – Advantage: Can easily build test forms with similar test difficulty.
• Computerized Test
   – Are score results taken on computers interchangeable with those on
     paper-and-pencil editions?
   – Is the measure of student performance free from, or tainted by, their level
     of computer anxiety?
• Computer Adaptive Testing
   – increase measurement precision (test information function) by allowing
     students to take only items that are at their own ability level (see the
     selection sketch below).
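
A minimal sketch of maximum-information item selection, the simplest adaptive rule, using the 2PL information function; the item bank and ability estimate are invented.

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

def pick_next_item(theta_hat, item_bank, administered):
    """Return the unadministered item with maximum information at the
    current ability estimate theta_hat."""
    best_id, best_info = None, -1.0
    for item_id, (a, b) in item_bank.items():
        if item_id in administered:
            continue
        info = info_2pl(theta_hat, a, b)
        if info > best_info:
            best_id, best_info = item_id, info
    return best_id

# Hypothetical calibrated item bank: item_id -> (discrimination a, difficulty b)
bank = {"Q1": (0.8, -1.0), "Q2": (1.4, 0.0), "Q3": (1.1, 1.2), "Q4": (1.6, 0.3)}
print(pick_next_item(theta_hat=0.2, item_bank=bank, administered={"Q1"}))
```
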
