# SomeStats for Exams


```
Seminar
Dept of Mathematical Sciences
King Fahd University of Petroleum and Minerals


Title:     The Role of Statistics in Developing Standardized Examinations in the US
Audience:  All members of the KFUPM community are cordially invited
Date:      Tuesday, Apr 19, 2005
Time:      12:30 PM
Location:  Building 5, Smart Classroom # 201

Abstract
Prior to joining KFUPM, the presenter spent more than 6 years working in
educational testing organizations in the United States and attended many
conferences on educational assessment organized by associations such as
the National Council on Measurement in Education and the American
Educational Research Association. In this presentation, the presenter will
share with the audience the role of statistical indices and procedures in
informing the development of standardized examinations (such as the ACT,
SAT, TOEFL, and state-mandated exams). Definitions of standardized and
non-standardized examinations will be provided. The talk will also discuss
issues in item-level and examination-level test construction and analysis,
and how statistics help address some of them.

Tea and Coffee will be provided

Role of Statistics in Developing Standardized Examinations in the US

by
April 19, 2005

Map of Talk
• What is a standardized test?
• Why standardize tests?
• Who builds standardized tests in the United States?
• Steps to building a standardized test
• Test questions and some statistics used to describe them
• Statistics used for describing exam scores
• Research studies in educational testing that use statistics

What is a “standardized examination”?
• A standardized test: a test in which the administration conditions and
scoring procedures are designed to be the same in all uses of the test
– Administration conditions:
• 1) physical test setting
• 2) directions for examinees
• 3) test materials
– Scoring procedures:
• 1) derivation of scores
• 2) transformation of raw scores

Why standardize tests?

• Statistical reason:
– Reduction of unwanted variations in
• Scoring practices
• Practical reason:
– Appeal to many test users
– Same treatment and conditions for all
students taking the tests (fairness)

Who builds standardized tests in the United States?
• Testing Organizations
–   Educational Testing Service (ETS)
–   American College Testing (ACT)
–   National Board of Medical Examiners (NBME)
–   Iowa Testing Programs (ITP)
–   Center for Educational Testing and Evaluation (CETE)
• State Departments of Education
– New Mexico State Department of Education
• Build tests themselves or
• Contract out job to testing organizations
• Large School Districts
– Wichita Public School Districts

• Design-of-experiments concept: control for extraneous factors
• Apply the same treatment conditions for
all test takers
• 1) physical test setting (group vs individual testing, etc)
• 2) directions for examinees
• 3) test materials
b) Scoring Procedures
• Same scoring process
– Scoring rubric for open-ended items
• Same score units and same measurements for
everybody
– Raw test scores (X)
– Scale Scores
• Same Transformation of Raw Scores
– Raw (X) → Equating process → Scale Scores h(X)

Overview of Typical Standardized Examination Building Process
• Costly process
• Important quality control procedures at each phase
• Process takes time (months to years)
1) Creating Test specifications
2) Fresh Item Development
3) Field-Test Development
4) Operational (Live) Test Development

1) Creating Test Specifications
• Purpose:
– To operationalize the intended purpose of testing
• A team of content experts and stakeholders
– discuss the specifications vs the intended purpose
• Serves as a guideline to building examinations
– How many items should be written in each content/skill
category?
– Which Content/skill area is more important than others?
• 2-way table of specifications typically contains
– content areas (domains) versus
– learning objectives
– with a percentage weight (importance) assigned to each cell

2) Fresh Item Development
• Purpose:
– building quality items to meet test specifications
• Writing Items to meet Test Specifications
– Q: Minimum # of items to write?
– Which cell will need to have more items?
– Item Review (Content & Bias Review)
• Design of Experiment stage
– Design of Test (easy items first, then mixture – increase motivation)
– Design of Testing event (what time of year, sample, etc)
• Data Collection stage:
– Pilot-testing of Items
– Scoring of items & PT exams
• Analyses Stage:
– analyzing Test Items
• Data Interpretation & decision-making stage:
– Item Review with aid of item statistics
• Content Review
• Bias review
– Quality control step: (1) keep good-quality items, (2) revise items with minor problems and re-pilot them, or (3) scrap bad items

3) Field-Test Development
• Purpose:
– building quality exam scales to measure the construct (structure) of
the test as intended by the test specifications
• Design of Experiment stage
– Designing Field-Test Booklets to meet Specifications
• Use good items only from previous stage (items with known descriptive
statistics)
– Design of Testing event
• Data collection:
– Field-Testing of Test booklets
– Scoring of items and FT Exams
• Analyses
– analyzing Examination Booklets (for scale reliability and validity)
• Interpreting results: Item & Test Review
– Do tests meet the minimum statistical requirements (rxx’ > 0.90)?
– If not, what can be done differently?

4) Operational (Live) Test Development
• Purpose:
– To measure student abilities as intended by the purpose of the test
• Design of Experiment stage
– Design of Operational Test
• Use only good FT items and FT item sets
• Assembling Operational Exam Booklets
– Design of Pilot Tests (e.g. some state mandated programs)
• New & Some of the revised items
– Design of Field Test (e.g. GRE experimental section)
• Good items that have been piloted before
• How many sections? How many students per section?
– Design of additional research studies
• e.g. Different forms of the test (Paper-&-pencil vs computer version)
– Design of Testing events
• Data Collection:
– First Operational Testing of Students with Final version of examinations
– Scoring of items and Exams
• Analyses of Operational Examinations
• Research studies to establish Reporting scales

Different types of exam item format
• Machine-scorable formats
– Multiple-choice questions
– True-false
– Multiple true-false
– Multiple-mark questions (Pomplun & Omar, 1997)
– Likert-type items (agree/disagree continuum)
• Manual (human) scoring formats
– Open-ended test items
• Require a scoring rubric to score papers

Statistical considerations in examination construction
• Overall design of tests
– to achieve reliable (consistent) and valid results
• Designing testing events
– to collect reliable and valid data (correct pilot sample, correct time of the year, etc)
– e.g. SAT: difference between the Spring and Summer student populations
• Appropriate and correct statistical analyses of examination data
• Quality control of test items and exams

Analyses & Interpretation: Descriptive statistics for distractors (Distractor Analysis)
• Applies to multiple-choice, true-false, and multiple true-false formats only
• Statistics:
– Proportion endorsing each distractor
– Informs the exam authors which distractor(s)
• are not functioning or
• are, counter-intuitively, more attractive than the correct answer

Analyses and Interpretation: Item-Level Statistics
• Difficulty of Items
– Statistics:
• Proportion correct (p-value): multiple-choice, true-false, multiple true-false, multiple-mark, and short-answer items
• Item mean: multiple-mark and open-ended items
– Describes how difficult an item is
• Discrimination
– Statistics:
• Discrimination index: difference in p-value between high- and low-scoring examinees
– An index describing sensitivity to instruction
• Item-total correlations: correlation of an item (dichotomously or polychotomously scored) with the total score
– Point-biserials: correlation between the total score and the dichotomous (right/wrong) item being examined
– Biserials: same as point-biserials, except that the dichotomous item score is assumed to reflect an underlying normally distributed ability to respond to the item
– Polyserials: same as biserials, except that the item is polychotomously scored
– Describes how an item relates (and thus contributes) to the total score
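
Worked formulas for the item statistics above (a sketch in standard classical-test-theory notation; the symbols and the numbers in the examples are illustrative assumptions, not taken from the slides):
• Proportion correct: $p = (\text{number of examinees answering correctly}) / N$, with $q = 1 - p$
– Hypothetical example: if 30 of 50 examinees answer correctly, $p = 30/50 = 0.60$
• Discrimination index: $D = p_{\text{high}} - p_{\text{low}}$
– Hypothetical example: $p_{\text{high}} = 0.80$ and $p_{\text{low}} = 0.40$ give $D = 0.40$
• Point-biserial: $r_{pb} = \frac{M_1 - M_0}{s_X} \sqrt{p q}$
– $M_1$, $M_0$: mean total score of examinees who answered the item correctly and incorrectly
– $s_X$: standard deviation of the total scores (computed with divisor $N$)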

Examination-Level Statistics
• Overall Difficulty of Exams/Scale
– Statistics: Test mean, Average item Difficulty
• Overall Dispersion of Exam/Scale scores
– Statistics: Test variability – standard deviation, variance, range, etc
• Test Speededness
– Statistics:
• 1) Percentage of examinees attempting the last few questions
• 2) Percentage of examinees finishing the test within the allotted time
– Test is not speeded if the percentage is more than 95%
• Consistency of the Scale/Exam Scores
– Statistics:
• Scale Reliability Indices
– KR20: for dichotomously scored items
– Coefficient alpha: for dichotomously and polychotomously scored items
• Standard error of Measurement Indices
• Validity Measures of Scale/Exam Scores
– Intercorrelation matrix
• High Correlation with similar measures
• Low correlation with dissimilar measures
– Structural analyses (Factor analyses, etc)
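
Worked formulas for the reliability indices above (a sketch in standard notation; the symbols $k$, $p_i$, $q_i$, $\sigma_i^2$, $\sigma_X^2$ and the numbers in the example are assumptions, not taken from the slides):
• KR20: $r_{KR20} = \frac{k}{k-1} \left( 1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^2} \right)$
– $k$: number of items; $p_i$: proportion correct on item $i$; $q_i = 1 - p_i$; $\sigma_X^2$: variance of total scores
• Coefficient alpha: $\alpha = \frac{k}{k-1} \left( 1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2} \right)$
– $\sigma_i^2$: variance of scores on item $i$; for a right/wrong item $\sigma_i^2 = p_i q_i$, so alpha reduces to KR20
• Standard error of measurement: $SEM = \sigma_X \sqrt{1 - r_{xx'}}$
– Hypothetical example: $\sigma_X = 10$ and $r_{xx'} = 0.91$ give $SEM = 10 \sqrt{0.09} = 3$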

Statistical procedures describing the validity of examination scores for their intended use
• Does the exam, as experienced by students, match the authors’ exam specifications?
– Construct validity: analyses of exam structure (intercorrelation matrix, factor analyses, etc)
• Can the exam measure the intended learning factors (constructs)?
• Answer: with factor analyses (a data reduction method)
– Predictive validity: predictive power of exam scores for explaining important variables
• e.g. Can exam scores explain (or predict) success in college?
• Regression analyses
– Differential Item Functioning: statistical bias in test items
• Are test items fair for all subgroups of examinees taking the test (e.g. female, Hispanic, Black examinees)?
• Mantel-Haenszel chi-squared statistics
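
A sketch of the Mantel-Haenszel statistics named above, in commonly used notation (the symbols are assumptions, not taken from the slides). Examinees are matched on total score; at each score level $k$ a 2×2 table counts reference-group examinees answering correctly ($A_k$) and incorrectly ($B_k$), focal-group examinees answering correctly ($C_k$) and incorrectly ($D_k$), with $N_k = A_k + B_k + C_k + D_k$:
• Common odds ratio: $\hat{\alpha}_{MH} = \frac{\sum_k A_k D_k / N_k}{\sum_k B_k C_k / N_k}$
– $\hat{\alpha}_{MH} = 1$ indicates no difference between the matched groups on the item
• Chi-squared statistic (1 degree of freedom, with continuity correction): $\chi^2_{MH} = \frac{\left( \left| \sum_k A_k - \sum_k E(A_k) \right| - 0.5 \right)^2}{\sum_k \mathrm{Var}(A_k)}$
– $E(A_k) = \frac{(A_k + B_k)(A_k + C_k)}{N_k}$
– $\mathrm{Var}(A_k) = \frac{(A_k + B_k)(C_k + D_k)(A_k + C_k)(B_k + D_k)}{N_k^2 (N_k - 1)}$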

Some research areas in educational testing that involve further statistical analyses
• Reliability Theory
– How consistent is a set of examination scores? Reliability is the ratio of signal to signal + noise, $\sigma_T^2 / (\sigma_T^2 + \sigma_E^2)$ (true-score variance over true-score plus error variance), in educational measurement
• Generalizability Theory
– Describing and controlling for more than one source of error variance
• Differential Item Functioning
– Pair-wise differences (F vs M, B vs W) in student performance on items
– Type I error rate control issue: many items and comparisons → inflated false detection rates

Some research areas in educational testing that involve further statistical analyses (continued)
• Test Equating
– Two or more forms of the exam: are they interchangeable?
– If scores on form X are regressed on scores from form Y, will the scores from either test edition be interchangeable? Note that the regression of X on Y and the regression of Y on X are different functions
• Item Response Theory
– Theory relating students’ unobserved ability to their responses to items
– Probability of responding correctly to a test item at each level of ability (item characteristic curves)
– Can put items (not tests) on the same common scale
• Vertical Scaling
– How does student performance in different school grade groups compare?
– Are their means increasing rapidly, slowly, etc?
– Are their variances constant, increasing, or decreasing?
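
An illustration of the item characteristic curve mentioned under Item Response Theory above. The slides do not name a model, so the two-parameter logistic form below is an assumption, chosen only because it is a common example; the symbols $\theta$, $a_i$, $b_i$ are likewise not from the slides:
• $P_i(\theta) = \frac{1}{1 + \exp\left( -a_i (\theta - b_i) \right)}$
– $\theta$: examinee ability; $b_i$: difficulty of item $i$; $a_i$: discrimination of item $i$
– At $\theta = b_i$ the probability of a correct response is $0.5$; a larger $a_i$ gives a steeper curve around that point
– Because $\theta$ and $b_i$ are expressed on the same scale, items calibrated this way can be compared across test forms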

Some research areas in educational testing that involve further statistical analyses (continued)
• Item Banking
– Are the same items from different administrations significantly different
in their statistical properties?
– Need Item Response Theory to calibrate all items so that there’s one
common scale.
– Advantage: Can easily build test forms with similar test difficulty.
• Computerized Tests
– Are scores from tests taken on computers interchangeable with those from paper-and-pencil editions? (e.g. http://ftp.ets.org/pub/gre/002.pdf)
– Is the measure of student performance free from, or tainted by, the students’ level of computer anxiety?