In Linden, K.W. (1985) Designing tools for assessing classroom achievement:
A handbook of materials and exercises for Educational 524.
ITEM ANALYSIS AND EVALUATION PROCEDURES
I. Evaluation of the effectiveness of the items in a test requires answers to the following
questions dealing with distribution of responses, discriminating power of the item, and difficulty
value of the item:
Distribution of Responses:
Did the distractors (incorrect response choices) prove to be attractive to students who did
not know the correct answer? Did the distractors or the stem contain ambiguities?
Discriminating Power of the Item (D):
Did the item discriminate between “good” and “poor” students? That is, did more of the
students obtaining high scores on the test as a whole get the item right than did students
who made low scores on the test as a whole?
Difficulty Value of the Item (P):
Was the item high, low or medium in difficulty? What was the tendency of the test as a
whole with respect to the difficulty of the items?
In the following sections, each of the above questions is discussed specifically, and a
method is recommended for determining the answer to each question. Certain of these questions
are evaluated by means of a study of the pattern of responses; others are described
mathematically by numerical indices.
The methods of item analysis described are applicable to groups as small as 30 cases.
However, with small groups, the discrimination index and the difficulty value will tend to
fluctuate more from one use to another than they will with large groups (approximately 100
cases or more). In order to obtain reasonably stable estimates of discriminating power and
difficulty value from one administration, it is suggested that at least 100 students take the test. If
this is not possible, item data can be cumulated from several groups taking the same test.
II. Distribution of Responses
A. Directions for obtaining the distribution of responses:
1. Place the test papers in order of score from the highest to the lowest.
2. Select the 27% of the papers having the highest scores and the 27% of the papers having
the lowest scores.
3. For each item in turn, tabulate the number of students in the top 27% (designated the
High group, H) and the number of students in the bottom 27% (designated the Low group,
L) choosing each possible response to the item.
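The three steps above can be sketched in Python. This is only an illustration under assumptions that are not in the handbook: each paper is a hypothetical (total score, responses) pair, the group size is 27% of the class rounded to the nearest whole paper, and tied scores are left in whatever order the sort gives them. Omitted and not-reached items could be recorded as responses such as "Omit" and "NR".

```python
from collections import Counter

def split_high_low(papers, fraction=0.27):
    """Return (high, low): the top and bottom `fraction` of papers by total score."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)  # step 1: order by score
    n = max(1, round(len(ranked) * fraction))                  # step 2: size of each group
    return ranked[:n], ranked[-n:]

def tabulate(group, item):
    """Step 3: count how many papers in `group` gave each response to `item`."""
    return Counter(responses[item] for _, responses in group)

# Hypothetical class of 10 papers: (total score, {item number: response}).
papers = [(score, {1: "B" if score >= 6 else "A"}) for score in range(1, 11)]

high, low = split_high_low(papers)
high_counts = tabulate(high, 1)
low_counts = tabulate(low, 1)
```

With 10 papers, each group holds 3 papers; a real analysis would use the larger class sizes recommended earlier.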
B. The tabulation of the following four-choice objective item might look like this:
Item 38: The height of the tide is dependent, in part, upon the position of the
moon in relation to the
(c) plane of the ecliptic
                          Responses
Item No.   Group  Omit     A    B*     C     D    NR
   38        H            17    15     3     5
             L       3    11     3    12     0     1
(* = correct answer)
Thus, for item number 38, 17 students in the high group chose distractor A; 15 chose the
correct answer, choice B; 3 picked C; and 5 picked D. In the low group, 3 skipped the item
(even though they answered later items); 11 picked choice A; 3 chose the correct answer; 12
chose option C; and none picked D. One student in the low group failed to reach this item (none
of the items that followed were answered).
It would appear from the tabulation above that item 38 discriminated quite well between the
students in the high and low groups, since five times as many students in the high group as in the
low group responded correctly. Distractors A and C operated well, since both drew a fairly large
number of students who did not know the correct answer. Distractor D, however, might bear
study, as several students in the high group chose this answer but none in the low group did. It is
entirely possible that choice D is so worded that the students in the high group read into the
distractor more than was intended, or the question may have two correct answers (as in the case of
this example). Thus, the distribution of responses can often point up ambiguities and
misinterpretations that are not readily apparent on preliminary consideration of the item.
C. Any objective-type item may be analyzed by this method. Completion (fill-in-the-blank)
questions (and short-answer essay questions) may be analyzed by treating each as a two-choice
situation after scoring, as in the example below.
Item 21: In order to measure the specific gravity of battery acid, one would use a
                          Responses
Item No.   Group   Omit  Right  Wrong     NR
   21        H       1     14      5
             L       5      5      9      1
In this example, 14 students in the high group gave an acceptable answer to the question, one
did not try to answer but did answer later questions, and 5 gave unacceptable answers. In the low
group, 5 gave an acceptable answer, 9 gave unacceptable answers, 5 did not attempt the question
though they tried later questions, and one student who did not answer this question also did not
answer later questions.
The analysis suggests that the item was not ambiguous, but it does not indicate either the
quality or the diversity of the acceptable and unacceptable responses. For
such test items, criterion answers must be specified against which student answers are compared.
D. Study of the number of students “not reaching” an item (i.e., attempting neither that item nor
any later item), especially for the items near the end of the test, will provide information
regarding the “speededness” of a test. In a power test, and this includes most achievement
measures, it is desirable to have at least 90% of the students complete the test. As the
“speededness” increases, the premium placed on mental agility increases also.
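The 90% guideline can be checked from the NR tally of the final item. The sketch below assumes that "completing the test" means reaching the last item; the function names and the class of 40 students are hypothetical.

```python
def completed_share(total_students, not_reached_last_item):
    """Share of students who reached the final item of the test."""
    return (total_students - not_reached_last_item) / total_students

def meets_power_test_guideline(total_students, not_reached_last_item, minimum=0.90):
    """True when at least `minimum` of the students completed the test."""
    return completed_share(total_students, not_reached_last_item) >= minimum

# Hypothetical class of 40: 3 students did not reach the last item.
share = completed_share(40, 3)             # 37 of 40 students
ok = meets_power_test_guideline(40, 3)
too_speeded = not meets_power_test_guideline(40, 5)
```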
III. Discriminating Power of an Item (D)
A. The question of how well an item allows for the discrimination between “good” and
“poor” students is very important to the test constructor. If it can be assumed that the test as a
whole is a suitable measure of the abilities it was designed to measure, then it may be
assumed that the total test score reflects the students’ abilities. If these assumptions are
valid, the performance of each item may then be evaluated in light of the students’ performance
on the total test.
B. A number of methods have been designed for determining the discriminating
power of items. One method that is simple and provides a very useful index is
that proposed by Robert L. Ebel. This method defines the discrimination index (D) as equal
to the percent of the high group responding correctly to the item minus the percent of the low
group answering the item correctly. Thus, D = U/n - L/n, which reduces to D = (U - L)/n, where
U is the number of rights in the high group, L is the number of rights in the low group, and n is
the number of students in either the high or the low group. That is, take the difference
between the number of rights in the high group and the number of rights in the low group,
divide by the total number in either the high or low group, carry calculations to two decimal
places, and obtain the index. For Item 38, this becomes:
D = (15 - 3)/30 = .40
For Item 21, the discriminating power is:
D = (14 - 5)/20 = .45
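The two worked values can be reproduced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the group sizes of 30 and 20 are assumptions taken from the low-group row totals of the two tabulations (3 + 11 + 3 + 12 + 0 + 1 and 5 + 5 + 9 + 1).

```python
def discrimination_index(high_right, low_right, group_size):
    """D = (U - L) / n, carried to two decimal places."""
    return round((high_right - low_right) / group_size, 2)

# Multiple-choice item: 15 rights in the high group, 3 in the low group.
d_multiple_choice = discrimination_index(15, 3, 30)
# Completion item: 14 acceptable answers in the high group, 5 in the low group.
d_completion = discrimination_index(14, 5, 20)
```

An item that the low group answers correctly more often than the high group yields a negative D.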
C. The underlying logic for the discrimination index is as follows: The optimal item would
be one on which all students in the high group obtained the correct answer while none in the
low group responded correctly. For example, suppose the high group and low group each
contained 20 students. The ideal item would discriminate each of the 20 individuals in the
high group from each of the 20 individuals in the low group, thus making 20 x 20 or 400
discriminations. The discrimination index for an item represents what percentage of the
maximum possible discriminations was achieved by that item.
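One way to read this logic, offered here as an assumption because the handbook does not spell out the counting, is in terms of net discriminations: pair every high-group student with every low-group student, count the pairs the item separates in the right direction (high right, low wrong), and subtract the pairs it separates in the wrong direction. The sketch below counts the pairs by brute force for a 20-and-20 example with 14 and 5 rights; the function name is hypothetical.

```python
from itertools import product

def net_discrimination_share(high_correct, low_correct):
    """Net share of high/low pairs the item separates in the right direction.

    Each argument is a list of booleans, one per student (True = item right).
    """
    right_direction = wrong_direction = 0
    for high_ok, low_ok in product(high_correct, low_correct):
        if high_ok and not low_ok:
            right_direction += 1      # high student right, low student wrong
        elif low_ok and not high_ok:
            wrong_direction += 1      # low student right, high student wrong
    max_pairs = len(high_correct) * len(low_correct)
    return (right_direction - wrong_direction) / max_pairs

high = [True] * 14 + [False] * 6   # 14 of 20 right in the high group
low = [True] * 5 + [False] * 15    # 5 of 20 right in the low group
share = net_discrimination_share(high, low)

ideal = net_discrimination_share([True] * 20, [False] * 20)
```

Under this reading the pair count gives (210 - 30)/400, the same value as (14 - 5)/20, and the ideal item reaches all 400 discriminations.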
D. The size of the index of discrimination that would be considered “good” will vary
according to the purpose of the test, the range of ability within the group, the size of the
sample, and the complexity of the material. It is generally thought that, for most achievement
tests, indices that are negative, or that range from 0.00 to .20, are low; those from .20 to .40
are average; and those above .40 are highly discriminating. Obviously, the minimum and
maximum values of the discrimination index are -1.00 and +1.00, respectively.
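These rough bands can be written as a small lookup. The handbook lists .20 and .40 in two adjacent ranges, so the sketch assumes that an index of exactly .20 or .40 counts as "average"; that boundary rule and the function name are illustrative choices, not part of the source.

```python
def describe_discrimination(d):
    """Label a discrimination index using the handbook's rough bands."""
    if not -1.0 <= d <= 1.0:
        raise ValueError("D must lie between -1.00 and +1.00")
    if d < 0.20:
        return "low"        # negative values and values below .20
    if d <= 0.40:
        return "average"    # assumption: .20 and .40 themselves are 'average'
    return "high"           # above .40

labels = [describe_discrimination(d) for d in (-0.10, 0.20, 0.40, 0.45)]
```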
IV. Difficulty Value of an Item (P)
A. Knowledge of the difficulty of an item is of great value in the construction of further tests.
Teachers frequently would like to use some easy, some average and some difficult items in
constructing a test in order to provide opportunity for both good and poor students to
demonstrate what they know and understand.
B. One of the most usable, and yet simple, indices of difficulty is the percentage of students
in the total group who answered the item correctly. A close approximation of this figure may
be obtained quite readily by considering the responses of only the highest 27% of the total
group and the lowest 27% of the total group; thus rather than involving the entire group,
slightly more than half the group is concerned.
C. The difficulty index is the mean of the percent of the high group and the percent of the low
group responding correctly to the item. This may be expressed quite simply by the formula:
P = (H + L)/(2N), where
H = the number of students in the high group responding correctly,
L = the number of students in the low group answering correctly, and
N = the number of cases involved in 27% of the total group (that is, the number in either the
high or the low group). The minimum and maximum values of this index are .00 and +1.00,
respectively.
a. For Item 38, the difficulty index becomes P = (15 + 3)/60 = .30; i.e., about 30 percent of the
students taking the test responded correctly to the item.
b. For Item 21, P = (14 + 5)/40 = .48; thus, about 48 percent of the students taking the test
answered the item correctly.
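The two worked difficulty values can be reproduced as follows. This is a sketch under stated assumptions: the function name is hypothetical, the group sizes of 30 and 20 come from the row totals of the two tabulations, and Python's decimal module is used so that 19/40 = .475 rounds half up to .48 as in the text (the built-in round() on a binary float would give .47).

```python
from decimal import Decimal, ROUND_HALF_UP

def difficulty_index(high_right, low_right, group_size):
    """P = (H + L) / (2N), rounded half up to two decimal places."""
    p = Decimal(high_right + low_right) / Decimal(2 * group_size)
    return p.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Multiple-choice item: 15 and 3 rights, groups of 30.
p_multiple_choice = difficulty_index(15, 3, 30)
# Completion item: 14 and 5 rights, groups of 20.
p_completion = difficulty_index(14, 5, 20)
```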
D. If the purpose of the test is to differentiate among the students, it is obviously useless to
include items that all students answer correctly or that all students miss, because these items
make no differentiation. In general, items in the middle range of difficulty, with indices
between .50 and .60 as the optimum, provide the most differentiation among students.
E. For some purposes, however, it is desirable to have most students get all, or nearly all, of
the items right. This is true whenever it is desirable to learn how many students have reached
a given level of proficiency in any particular area. Thus, most of the items on a test of this
nature would demonstrate difficulty values of .75 or higher; i.e., 75 percent or more of the
group would respond correctly to the items.
F. When the purpose for testing is to identify outstanding students, the test items should
demonstrate indices of less than .25; i.e., most of the items on the test should be so difficult
that fewer than 25 percent of the students respond correctly to each item.
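The three purposes in D through F can be compared against a set of difficulty values. In this sketch the purpose labels, the function name, and the four P values are hypothetical; the cut-offs (.50 to .60, .75 or higher, below .25) are the ones stated above.

```python
# Difficulty bands keyed by testing purpose (labels are illustrative).
BANDS = {
    "differentiate": lambda p: 0.50 <= p <= 0.60,   # section D: optimum range
    "mastery": lambda p: p >= 0.75,                 # section E: proficiency check
    "identify_outstanding": lambda p: p < 0.25,     # section F: very hard items
}

def share_in_band(p_values, purpose):
    """Share of items whose difficulty index suits the stated purpose."""
    fits = [BANDS[purpose](p) for p in p_values]
    return sum(fits) / len(fits)

# Hypothetical difficulty indices for a four-item quiz.
p_values = [0.80, 0.90, 0.75, 0.60]
mastery_share = share_in_band(p_values, "mastery")
differentiate_share = share_in_band(p_values, "differentiate")
```

A test built for one purpose would be expected to show most of its items in the matching band.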
V. The index of discrimination, index of difficulty, and distribution of responses are important
statistical tools for the teacher. Their ultimate value, however, depends solely upon their
interpretation by the test builder and the test user.
Test:                         High Group =
Date:                         Low Group =
ITEM   Hi   L   P   D   (this set of columns is repeated four times across the page)
Item Analysis Summary
P\D   ++++   +++   ++   +   Non   Neg   Total