Kirk Allen, Andrea D. Stone, Maria Cohenour,
Teri Reed Rhoads, Teri J. Murphy, Robert A. Terry
The University of Oklahoma, Norman, Oklahoma
Overall Goals

How do we demonstrate that graduates have an ability to design and conduct experiments, as well as to analyze and interpret data, in combination with the additional program criteria of applying and understanding statistics?

• Develop a multiple choice test which attempts to answer this question: the Statistics Concept Inventory (SCI)
• Evaluate the reliability and validity of the SCI according to standards of test analysis
• Disseminate the SCI to other universities and departments

Background

According to ABET’s EC 2000 criteria, 16 of 24 engineering disciplines directly or indirectly mention probability and statistics within their accreditation criteria.

The Force Concept Inventory (FCI) in physics has been instrumental in improving educational methods.

Other concept inventories are being developed in many engineering disciplines.

Publications

All publications are available on the website: http://coecs.ou.edu/sci

“The Statistics Concept Inventory”: ARTIST Roundtable Conference, August 4, 2004.
“The Statistics Concept Inventory: Developing a Valid and Reliable Instrument”: ASEE 2004 Conference, Salt Lake City.
“The Statistics Concept Inventory: A Pilot Study”: FIE 2003 Conference, Colorado.
“Progress on Concept Inventory Assessment Tools”: Panel Session, FIE 2003 Conference.

Results

Average scores are typically in the low to mid-40% range for pre-tests and around 50% for post-tests. A summary from Fall 2003 is shown below. Other semesters are similar.

Course         Mean Pre   Mean Post   Gain
Engr           42.2%      44.1%       +1.9%
Math #1        43.4%      --          --
Math #2        49.7%      49.1%       -0.6%
Outside #1     43.1%      49.5%       +6.4%
Outside #2a    48.1%      52.4%       +4.3%
Outside #2b    46.1%      49.9%       +3.8%

The lack of large gains is similar to early findings on the Force Concept Inventory for classes which used traditional lecture format for teaching.
Validity

Content Validity (very important)
• Faculty survey at OU rated the importance of statistics topics. An on-line survey for respondents outside OU is to be conducted soon.
• Searched statistics textbooks and journals for common topics and misconceptions.
• Focus groups helped identify more misconceptions and questions where students use test-taking tricks.

Construct Validity (very important)
• Factor analysis suggests that the sub-topics of the SCI are Descriptive, Inferential, Probability, and Graphical.

Concurrent Validity
• Course grades are used as a concurrent measure of the SCI post-test validity.
• Generally, the SCI has been valid by this measure for Engr courses but not for Math courses.
• SCI pre-test scores have little value in determining final course grades.
• No long-term predictive validity available.

Reliability

The most common measure of reliability is coefficient alpha.
• Above 0.80 is an accepted standard for a reliable test
• Some sources may consider above 0.60 reliable for classroom tests

The largest testing effort thus far was Fall 2003. Data from six introductory statistics courses at three four-year universities are shown below.

Course (Fall 03)   Pre-Test alpha   Post-Test alpha
Engr               0.6863           0.7496
Math #1            0.7122           --
Math #2            0.6715           0.7232
External #1        0.7025           0.7314
External #2a       0.5709           0.6452
External #2b       0.6648           0.5843

Reliability numbers from other semesters are very similar to the above chart.

Sample Question #1

Which would be more likely to have 70% boys born on a given day: A small rural hospital or a large urban hospital?
a) Rural
b) Urban
c) Equally likely
d) Both are extremely unlikely

             Pre #1   Post #1   Pre #2   Post #2   Pre #3   Post #3
Choice a%    36%      17%       32%      32%       26%      23%
  (change)            (-19%)             (none)             (-3%)
Choice b%    6%       7%        5%       3%        9%       10%
Choice c%    43%      69%       45%      45%       51%      63%
Choice d%    15%      7%        18%      19%       14%      3%

Results from 3 classes, percent of students choosing each letter (Spring 2004). A is the correct answer. Change in percent correct is provided in parentheses.

Misconception: students do not realize the importance of sample size.

The discrimination index on the post-test is 0.44, 0.27, and 0.50 for the three classes. The question could therefore be considered psychometrically “good” while also demonstrating the lack of knowledge gain.
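The two item-analysis statistics quoted on this poster, coefficient alpha (Reliability) and the discrimination index (sample questions), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the poster does not say which discrimination index was used, so the upper-lower group version with a 27% cut is assumed here, and `toy_scores` is a made-up matrix of 0/1 item scores, not SCI data.

```python
from statistics import pvariance

def coefficient_alpha(scores):
    """Cronbach's coefficient alpha for a students-by-items score matrix."""
    k = len(scores[0])  # number of items
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])  # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def discrimination_index(scores, item, fraction=0.27):
    """Upper-lower index (assumed definition): proportion correct on `item`
    in the top-scoring group minus the proportion in the bottom group."""
    ranked = sorted(scores, key=sum)  # students ordered by total score
    n = max(1, round(len(ranked) * fraction))  # size of each extreme group
    low, high = ranked[:n], ranked[-n:]
    p_high = sum(row[item] for row in high) / n
    p_low = sum(row[item] for row in low) / n
    return p_high - p_low

# Hypothetical data: 5 students x 4 items, 1 = correct, 0 = incorrect.
toy_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print("alpha:", coefficient_alpha(toy_scores))
print("discrimination, item 0:", discrimination_index(toy_scores, 0))
```

Alpha is undefined when every student has the same total score (the total variance is zero), so a real analysis would need to guard against that case.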
Sample Question #2

Example of a question where students demonstrate gain on a topic that they definitely will cover but may not have been formally introduced to prior to a statistics class.

A scientist takes a set of 50 measurements. The standard deviation is reported as -2.30. Which of the following must be true?
a) Most of the measurements were negative
b) All of the measurements were less than the mean
c) All of the measurements were negative
d) The standard deviation was calculated incorrectly

             Pre #1   Post #1   Pre #2   Post #2   Pre #3   Post #3
Choice a%    9%       14%       8%       3%        26%      20%
Choice b%    21%      0%        21%      10%       17%      23%
Choice c%    7%       3%        5%       0%        9%       3%
Choice d%    60%      79%       61%      87%       43%      53%
  (change)            (+19%)             (+26%)             (+10%)

Discrimination index: 0.45, 0.27, 0.79

Misconception: students do not understand how standard deviation is calculated (i.e., it can never be negative).

On-Line Test

An on-line testing system was developed in Fall 2004. Some of the features:
• Interface programmed with PHP.
• Data contained in a MySQL database.
• Students are added to the system by an administrator (e.g. instructor or TA) and then receive a password via email.
• Questions are presented in random order to reduce the likelihood of collaboration.
• The SCI contains four topic areas, and instructors have the option of administering only certain areas.

The system has been tested by a small group of students at the end of the Fall semester. More extensive testing is planned for the Spring semester.

Item Response Theory

General idea: For each item, a logistic curve is fit which describes the probability of answering the item correctly as a function of item parameters (difficulty & discrimination) and a student’s latent ability.

From the SCI:
[Figure: Item Characteristic Curve for item XP2A, a slightly difficult item with strong discrimination; a = 2.210, b = 0.746; plotted from -3 to 3]
[Figure: Item Characteristic Curve for item XG5, an easy item with weak discrimination; a = 0.584, b = -1.774; plotted from -3 to 3]
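The logistic curve described under Item Response Theory can be illustrated with a short Python sketch. It assumes the common two-parameter logistic (2PL) form, P(theta) = 1 / (1 + exp(-a(theta - b))), with a as discrimination and b as difficulty; the poster does not state the exact parameterization (some software adds a scaling constant of 1.7), so treat the form as an assumption. The a and b values are the ones reported for items XP2A and XG5; the function name `icc_2pl` is hypothetical.

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct answer under an assumed 2PL model.

    theta: student's latent ability
    a:     item discrimination (slope of the curve at theta = b)
    b:     item difficulty (ability at which the probability is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Item parameters (a, b) as reported on the poster.
ITEMS = {
    "XP2A": (2.210, 0.746),   # slightly difficult, strong discrimination
    "XG5": (0.584, -1.774),   # easy, weak discrimination
}

# Tabulate each curve over the ability range shown on the poster's axes.
for name, (a, b) in ITEMS.items():
    print(name)
    for theta in range(-3, 4):
        print(f"  theta = {theta:+d}  P(correct) = {icc_2pl(theta, a, b):.3f}")
```

Under this form, a student whose ability equals the item difficulty b has a probability of exactly 0.5 of answering correctly, and a larger a makes the curve rise more steeply around that point, which is why XP2A separates students near average ability more sharply than XG5.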