PSY 513 – Lecture 1
Reliability
Characteristics of a psychological test or measuring procedure
1. Reliability – The extent to which a test or measuring procedure yields the same score for the
same person from one administration to the next.
2. Validity – The extent to which scores on a test correlate with some valued criterion: other
measures of the same construct, measures of different constructs, or performance on a job or
task.
3. Reading level
4. Face validity – Extent to which test appears to measure what it measures.
5. Content validity – Extent to which test content corresponds to content of what it is designed
to measure or predict.
6. Cost
What makes a good test?
A good psychological test is reliable, valid, has a reading level appropriate to the intended
population, has acceptable face and content validity, and is cheap.
PSY 513: Lecture 1: Reliability - 1 11/23/11
Scoring a psychological test
Most tests have multiple items. The test score is usually the sum or average of responses to the multiple items.
If the test is one of knowledge, the sum is of the number of correct responses.
If the test is a personality test, the sum or average is of numerically coded responses, e.g., 1s, 2s, . . . 5s.
Sometimes subtest scores are computed and the overall score will be the sum of scores on subtests.
Occasionally, the overall score will be the result of performance on some task, such as holding a stylus on a
revolving disk, as in the Pursuit Rotor task or moving pegs from holes in one board to holes in another, as in
the Pegboard dexterity task.
Invariably the result of the “measurement” of a characteristic using a psychological test is a number – the
person’s score on that test, just as the result of measurement of weight is a number – the score on the face of the
bathroom scale.
Reliability
Working Definition: The extent to which a test or measuring procedure yields the same score for the same
person from one administration to the next in instances when the person has not changed from one time to the
next.
Consider the following hypothetical measurements of IQ
           Highly Reliable Test           Test with Low Reliability
Person   IQ at Time 1   IQ at Time 2     IQ at Time 1   IQ at Time 2
1        112            111              112            105
2        140            141              140            128
3         85             86               85             92
4        106            108              106            100
5        108            107              108            116
6         95             93               95            105
7        117            118              117            110
8        120            121              120            126
9        135            134              135            130
High reliability: Persons' scores will be about the same from measurement to measurement.
Low reliability: Persons' scores will be different from measurement to measurement.
Note that there is no claim that these IQ scores are the “correct” values for Persons 1-9. That is, this is not
about whether or not they are valid or accurate measures. It’s just about whether whatever measures we have
are the same from one time to the next.
Why do we care about reliability?
More on this later, but for now think about your bathroom scale and the number it gives you from day to day.
Which would you prefer: a number that varied considerably from day to day, or a number that, assuming you
haven’t changed, was about the same from day to day?
Classical Test Theory: Model of an observed score
Key concepts: (True score and error of measurement are not actually observable; only the observed score is.)
Observed score. The score of a person on the measuring instrument.
True score. The actual amount of the characteristic possessed by an individual.
It is assumed to be unchanged from measurement to measurement (within reason).
Error of measurement. An addition to or subtraction from the true score which is random and unique to
the person and time of measurement.
In Classical Test Theory, the observed score is the sum of true score and the error of measurement.
Symbolically: Observed Score = True Score + Error of Measurement.
Xj = T + Ej where j represents the measurement time.
Note that T is not subscripted because it is assumed to be constant across times of measurement.
It is assumed that if there were no error of measurement the observed score would equal the true score. But,
typically error of measurement causes the observed score to be different from the true score.
So, For a person, Observed Score at time 1 = True Score + Measurement Error at time 1.
Observed Score at time 2 = True Score + Measurement Error at time 2.
Note again that the true score is assumed to remain constant across measurements.
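The model can be tried out with a short simulation. This is a minimal sketch, assuming normally distributed errors with mean zero (the notes say only that errors are random); the function name `observe`, the true score of 110, and the error standard deviation of 3 are all hypothetical.

```python
import random

def observe(true_score, error_sd, rng):
    """Return one observed score X_j = T + E_j with a fresh random error E_j."""
    error = rng.gauss(0.0, error_sd)  # assumption: errors are normal with mean 0
    return true_score + error

rng = random.Random(513)  # fixed seed so the run is repeatable
T = 110                   # hypothetical true score, constant across times

x1 = observe(T, error_sd=3.0, rng=rng)  # observed score at time 1
x2 = observe(T, error_sd=3.0, rng=rng)  # observed score at time 2
print(round(x1, 1), round(x2, 1))       # two observed scores built on the same T

# With no error of measurement, the observed score equals the true score.
no_error = observe(T, error_sd=0.0, rng=rng)
print(no_error == T)
```

The true score is passed in unchanged for both calls, which mirrors the assumption that T is constant across times of measurement.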
Conceptualizing reliability
Two possibilities, both requiring measurement at two points in time.
1. Conceptualizing reliability as differences between scores from one time to another.
This is the conceptualization that follows naturally from the Classical Test Theory notions above.
Consider just the differences (Time 1 minus Time 2) between measures.
                  Highly Reliable Test                       Test with Low Reliability
Person   IQ at Time 1   IQ at Time 2   Difference   IQ at Time 1   IQ at Time 2   Difference
1        112            111              1          112            109              3
2        140            140              0          140            128             12
3         85             86             -1           85             92             -7
4        106            108             -2          106            100              6
5        108            107              1          108            116             -8
6         95             93              2           95            105            -10
7        117            118             -1          117            110              7
8        120            120              0          120            123             -3
9        135            135              0          135            130              5
[Figure: the distributions of differences for the two tests, each plotted on an axis running from -12 to 12.]
A measure of variability of the differences could be used as a summary of reliability.
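One such measure is the Standard Error of Measurement, abbreviated SEM.
The SEM is the standard deviation of the errors of measurement. It can be estimated from the difference scores
obtained from two applications of the same test: if the errors at the two times are independent and equally
variable, the standard deviation of the differences is √2 times the SEM.
The smaller the SEM, the more reliable the test.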
Advantages
1) This conceptualization naturally stems from the Classical Test Theory framework – it is the
variability of the Es in the Xj = T + Ej formulation.
2) So it’s easy to understand.
Problems
1) It's a golf score: smaller is better. Some nongolfers have trouble with such measures.
2) The SEM depends on the response scale. Tests with a 1-7 scale will have larger SEMs than tests
that use a 1-5 scale, even though the test items might be identical.
3) It requires that the test be given twice, with no memory of the first test when participants take the
2nd test, a situation that’s hard to create.
It is useful, however, for assessing how much one could expect a person’s score to vary from one time to another.
For example: You miss the cutoff for a program by 10 points. If the SEM is 40, then you have a good chance
of exceeding the cutoff next time you take the test. If the SEM is 2, then your chances of exceeding the cutoff
by taking the test again are much smaller.
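The variability summary described above can be computed directly from the table of differences. The following is a minimal sketch using the hypothetical IQ scores from that table; it uses the sample standard deviation (n - 1 denominator), which is an assumption, since the notes do not say which denominator is intended. The function name `sd_of_differences` is made up for illustration.

```python
from statistics import stdev

# Scores copied from the table of differences above.
time1 = [112, 140, 85, 106, 108, 95, 117, 120, 135]
high_time2 = [111, 140, 86, 108, 107, 93, 118, 120, 135]  # highly reliable test
low_time2 = [109, 128, 92, 100, 116, 105, 110, 123, 130]  # test with low reliability

def sd_of_differences(first, second):
    """Sample standard deviation of the Time 1 minus Time 2 difference scores."""
    diffs = [a - b for a, b in zip(first, second)]
    return stdev(diffs)

sd_high = sd_of_differences(time1, high_time2)
sd_low = sd_of_differences(time1, low_time2)
print(round(sd_high, 2), round(sd_low, 2))  # smaller value = more reliable test
```

Because the result is expressed in IQ points, it inherits the scale of the test, which is the response-scale problem noted in the list above.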
2. Conceptualizing reliability as the correlation between measurements at two time periods.
This conceptualization is based on the fact that if the differences in values of scores on two successive
measurements are small, then the correlation between those two sets of scores will be high and positive.
[Figure: scatterplot of Score at Time 2 against Score at Time 1 showing the correlation between the two
administrations of a highly reliable test (variables HIGHREL1 and HIGHREL2).]
[Figure: scatterplot of Score at Time 2 against Score at Time 1 showing the correlation between the two
administrations of a test with low reliability (variables LOWREL1 and LOWREL2).]
If the measurements are identical from time 1 to time 2, r = 1.
If there is no correspondence between measures at the two time periods, r = 0.
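To make the correlation view concrete, here is a minimal sketch that computes the Pearson correlation for the hypothetical IQ scores in the table of differences shown earlier. The helper `pearson_r` is written out by hand so that the block needs nothing beyond the standard library; it is an illustration, not the procedure any particular package uses.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

time1 = [112, 140, 85, 106, 108, 95, 117, 120, 135]
high_time2 = [111, 140, 86, 108, 107, 93, 118, 120, 135]  # highly reliable test
low_time2 = [109, 128, 92, 100, 116, 105, 110, 123, 130]  # test with low reliability

r_high = pearson_r(time1, high_time2)
r_low = pearson_r(time1, low_time2)
r_same = pearson_r(time1, time1)  # identical measurements at both times
print(round(r_high, 3), round(r_low, 3), round(r_same, 3))
```

The smaller differences of the highly reliable test show up as the larger correlation, and identical measurements give a correlation of 1 (up to floating-point rounding).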
Advantages of using the correlation between two administrations as a measure of reliability
1) It’s a bowling score: bigger r means higher reliability.
2) It is relatively independent of response scale: items scored on a 1-5 scale are about as reliable as the
same items scored on a 1-7 scale.
3) The correlation is a standardized measure. As a reliability coefficient it ordinarily falls between 0 and 1,
so it’s easy to conceptualize reliability in an absolute sense: close to 1 is good; close to 0 is bad.
Disadvantages
1) Relationship to Classical Test Theory requires some thought.
2) Assessment requires two administrations.
Conclusion
Most common measures of reliability are based on the conception of reliability as the correlation between
successive measures.
Definition of reliability
The reliability of a test is the correlation between the population of values of the test at time 1 and the
population of values at time 2 assuming constant true scores and no carryover between the two
measurements.
Symbolized as Population rXX' or simply as rXX'. This is pronounced “r sub X, X-prime”.
Issues associated with the definition
The definition of reliability refers to a situation that most likely is not realizable in practice.
1) If the population is large, vague, or infinite, then it will be impossible to access all the
members of the population.
2) The assumption of no carry-over from Time 1 to Time 2 is very difficult to realize in
practice, since people remember how they performed or responded on tests. For this reason, it is
usually (though not always) not feasible in practice to test people twice to measure reliability.
The bottom line is that the true reliability of a test is a quantity that we’ll never actually know. What we will
know is one or more estimates of reliability.
You’ll hear people speak about “the reliability of the test”. You should remember that they should say, “the
estimate of the reliability of the test”.
As an aside, recent research has emphasized that persons determine the reliability estimates.
Persons who are inconsistent will yield lower estimates of reliability than those who are consistent.
I’ll use the phrase “true reliability” or “population reliability” to refer to the population value.
I’ll try to remember to use “estimate of reliability” when referring to one of the estimates.
Some facts about reliability from Classical Test Theory
1. Variance of Observed scores = Variance of True scores + Variance of Errors of Measurement
$\sigma^2_X = \sigma^2_T + \sigma^2_E$
2. True reliability = Variance of True scores / Variance of Observed scores.
$r_{XX'} = \sigma^2_T / \sigma^2_X$
Neither of these is of particular use in practice, though. They’re presented here for completeness.
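For readers who want to see how the second fact connects to the definition of reliability as a correlation, here is a short derivation sketch. It assumes, in addition to what is stated above, that the errors $E_1$ and $E_2$ are uncorrelated with the true score and with each other, and that the error variance is the same at both times; these are the usual Classical Test Theory assumptions, but they are not spelled out in these notes.

```latex
\begin{aligned}
X_1 &= T + E_1, \qquad X_2 = T + E_2 \\
\operatorname{Cov}(X_1, X_2) &= \operatorname{Cov}(T + E_1,\; T + E_2) = \sigma^2_T \\
\sigma^2_{X_1} &= \sigma^2_{X_2} = \sigma^2_T + \sigma^2_E = \sigma^2_X \\
r_{XX'} &= \frac{\operatorname{Cov}(X_1, X_2)}{\sigma_{X_1}\,\sigma_{X_2}}
        = \frac{\sigma^2_T}{\sigma^2_X}
        = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
\end{aligned}
```

Under those assumptions, the larger the error variance is relative to the true-score variance, the lower the reliability.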
Estimates of Reliability
As said above, we never know the true reliability of a test. So we have to get by with estimates of reliability.
Test-retest estimate
Operational Definition
1. Give the test to a normative group.
2. Minimize memory/carryover from the first administration.
3. Give the test again to the same people.
4. Compute the correlation between scores on the two administrations.
Most straightforward: fits nicely with the conceptual definition of true reliability.
Disadvantages
Requires two administrations of the test, which takes more time.
May be inflated by memory/carryover from the first administration to the second.
Advantages
Has good “face” validity.
For performance tests, the test-retest method may be the only feasible method.
For single-item scores, may be the only feasible method.
Parallel Forms estimate
Operational Definition
1. Develop two equivalent forms of the test. Should have same mean and variance.
2. Give both forms to the normative group.
3. Compute the correlation between paired scores.
Note the contrast with test-retest: there, the same test is administered a 2nd time; here, an equivalent form of
the test is administered instead.
Note that this definition has introduced a new notion – the notion that an equivalent form can “stand in” for the
original test when computing the correlation that is the estimate of reliability.
If we give the same test twice, we can be reasonably sure it’s the same test on the second administration as it
was on the first.
But giving an equivalent form of the test requires a leap of faith – that the 2nd form is interchangeable with the
original form on that second administration.
The notion of alternative measures being used instead of repeated administration of the same measure has
implications for other estimates of reliability to be considered shortly.
The key to the success of the parallel forms method is that the two forms be equivalent. Equal means and
variances are a primary way of ensuring that equivalence.
Advantages
Don’t have to worry about memory/carryover between two administrations.
Having two forms that can be used interchangeably may be useful in practice.
Disadvantages
Takes more time to develop two forms than it does one.
It may not be possible to develop alternative, equivalent forms.
A low reliability estimate, i.e., low r between forms, has two interpretations
1. Low reliability.
2. Forms are not equivalent.
Split-half estimate
“Halving your test and using it two.”
Operational Definition
1. Identify two equivalent halves of the test.
2. Give the test once.
3. Score the halves separately, so that you have two scores for each person – score on 1st half and on 2nd half.
4. Compute the correlation between the 1st Half and 2nd Half scores. Call that correlation rH1,H2.
5. Plug the correlation into the following Spearman-Brown Prophecy Formula
$$\text{Split-half reliability estimate} = \frac{2 \, r_{H1,H2}}{1 + r_{H1,H2}}$$
Note that the higher the correlation between the two halves, the larger the estimated reliability.
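Step 5 can be sketched in a few lines. The function name `split_half_reliability` and the three example correlations are hypothetical; only the formula itself comes from the notes.

```python
def split_half_reliability(r_halves):
    """Spearman-Brown estimate of whole-test reliability from the correlation
    between the two half-test scores."""
    return 2 * r_halves / (1 + r_halves)

# Hypothetical half-test correlations, from modest to high.
for r in (0.50, 0.70, 0.90):
    print(r, round(split_half_reliability(r), 3))
```

In each case the estimate for the whole test is larger than the correlation between the halves, since the whole test is twice as long as either half.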
An Internal Consistency Estimate
The split-half method is the simplest example of what are called internal consistency estimates of reliability.
The estimate relies on the consistency (correlation) of the two halves, both of which are internal to the test.
The greater the consistency – correlation - of the two halves, the higher the reliability.
Advantages
1. It allows you to estimate reliability in a single setting.
2. Very computerizable. The program that scores the whole test can be programmed to score the two halves and
compute a reliability estimate at the same time.
Disadvantages
1. Requires equivalent halves. This may be hard to achieve.
2. A low reliability estimate may be the result of either 1) low reliability or 2) nonequivalence of the halves.
3. Test may not be splittable.
4. Different halving techniques give different estimates of reliability.
Cronbach’s coefficient alpha estimate
Coefficient alpha takes the notion introduced by the split-half technique to its logical conclusion.
Logic
The split-half uses the consistency of two halves to estimate the reliability of the whole – the sum of the two
halves.
But it’s surely the case that the particular halves chosen will affect the estimate of reliability. Some will lead to
lower estimates. Other possible halves might lead to larger estimates of reliability.
So, the logic goes, why not look at all possible halves; compute a reliability estimate for each possible split;
then average all those reliability estimates.
Coefficient alpha essentially does this, although it is not directly based on halving the test.
Instead, alpha is based on splitting the test into as many pieces as you can, usually into as many items as there
are on the test, and computing the correlation between them.
Operational Definition of Standardized Alpha
1. Identify as many equivalent pieces of the test as possible. Let K be the number of pieces identified.
2. Compute the correlations between all possible pairs of pieces.
3. Compute the mean (arithmetic average) of the correlations. Call it r-bar. (r for correlation; bar for mean)
4. Plug K and r-bar into the following formula
$$\text{Standardized alpha} = \alpha = \frac{K \, \bar{r}}{1 + (K-1) \, \bar{r}}$$
where $\bar{r}$ is r-bar, the mean of the correlations.
Relationship to split-half reliability
Coefficient alpha is simply an extension of split-half reliability to more than two pieces.
Note that if K = 2, then there is only one correlation – the correlation between the two halves.
So r-bar is simply rH1,H2.
And the formula for alpha reduces to 2*rH1,H2 / (1 + rH1,H2). This is the split-half formula.
“Regular” alpha vs. Standardized alpha
There is another formula, based on variances of the pieces and covariances between them, that is typically
computed and reported. If you see alpha reported, it will likely be the variance-based version.
I presented the standardized version here because 1) its formula is easier to follow than the variance-based
formula and 2) its value is typically within .02 of the variance-based formula.
SPSS reports both.
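The variance-based formula is not given in these notes. The sketch below uses the commonly cited form, alpha = K/(K-1) * (1 - sum of item variances / variance of total scores), which is an assumption about which version is meant. The function name `cronbach_alpha` and the six rows of responses are hypothetical.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Variance-based alpha. rows holds one list of item responses per person."""
    k = len(rows[0])                                 # number of items (pieces)
    item_columns = list(zip(*rows))                  # one tuple of responses per item
    sum_item_var = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in rows]) # variance of total scores
    return k / (k - 1) * (1 - sum_item_var / total_var)

# Hypothetical responses of six people to four items scored 1 to 5.
rows = [
    [3, 4, 3, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 1],
    [3, 2, 3, 3],
    [4, 5, 4, 3],
    [4, 4, 3, 2],
]
alpha = cronbach_alpha(rows)
print(round(alpha, 3))
```

Six people is far too few for a real reliability estimate; the point is only to show which quantities enter the formula.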
Hand Computation Of
Standardized Coefficient Alpha
Suppose a measure of job satisfaction has four items.
Q1: I'M HAPPY ON MY JOB.
Q2: I LOOK FORWARD TO GOING TO WORK EACH DAY.
Q3: I HAVE FRIENDLY RELATIONSHIPS WITH MY COWORKERS.
Q4: MY JOB PAYS WELL.
Suppose I gave this "job satisfaction" instrument to a group of 100 employees. Each person indicated
extent of agreement with each item on a scale of 1 to 5. Total score, i.e., observed amount of job satisfaction, is
either the sum of the responses to the four items or the mean of the four items.
The data matrix might look like the following:
                                Two different expressions
                                of scale scores
PERSON   Q1    Q2    Q3    Q4    TOTAL   MEAN
1        3     4     3     3     13      3.25
2        5     4     5     5     19      4.75
3        1     2     1     1      5      1.25
4        3     2     3     3     11      2.75
5        4     5     4     3     16      4.00
6        4     4     3     2     13      3.25
etc      etc   etc   etc   etc   etc     etc
Suppose the correlations between the items were as follows:
      Q1    Q2    Q3    Q4
Q1    1
Q2    .4    1
Q3    .5    .4    1
Q4    .3    .4    .5    1
Obviously, each item correlated perfectly with itself, so the 1's on the diagonal will not be used in
computation of alpha.
The average of the interitem correlations is
r-bar = (.4 + .5 + .3 + .4 + .4 + .5) / 6 = 2.5 / 6 = .417
Standardized coefficient alpha is
$$\alpha = \frac{\text{No. items} \times \bar{r}}{1 + (\text{No. items} - 1) \times \bar{r}} = \frac{4 \times .417}{1 + (4-1) \times .417} = \frac{1.668}{1 + 1.251} = \frac{1.668}{2.251} = .74$$
Notes:
1. Alpha is merely a re-expression of the correlations between the items. The more highly the items are
intercorrelated, the larger the value of alpha.
2. Alpha can be increased by adding items, as long as the average of the interitem correlations does not
decrease. So any test can be made more reliable by adding relevant items - items which correlate with the other
items.
3. Just as was the split-half reliability estimate, alpha depends on the consistency (correlations) of the pieces of
the test, all of which are internal to (part of) the test.
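The hand computation, and Note 2 about adding items, can be checked with a short sketch. The correlations are the ones in the matrix above; the function name `standardized_alpha` and the eight-item comparison are hypothetical additions for illustration.

```python
def standardized_alpha(k, r_bar):
    """Standardized alpha from the number of pieces k and the mean correlation r_bar."""
    return k * r_bar / (1 + (k - 1) * r_bar)

# Below-diagonal entries of the inter-item correlation matrix above.
inter_item = {
    ("Q2", "Q1"): .4,
    ("Q3", "Q1"): .5, ("Q3", "Q2"): .4,
    ("Q4", "Q1"): .3, ("Q4", "Q2"): .4, ("Q4", "Q3"): .5,
}
r_bar = sum(inter_item.values()) / len(inter_item)

alpha_4_items = standardized_alpha(4, r_bar)
# Note 2: doubling the number of items while holding r-bar constant (hypothetical).
alpha_8_items = standardized_alpha(8, r_bar)
print(round(r_bar, 3), round(alpha_4_items, 2), round(alpha_8_items, 2))
```

The four-item value agrees with the hand computation, and the eight-item value is larger even though the items are no more highly intercorrelated.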
The SPSS RELIABILITY PROCEDURE
Example Data: Items of a Job Satisfaction Scale. 60 respondents. 1=Dissatisfied; 7=Satisfied.
Q27 Q32 Q35 Q37 Q43 Q45 Q50 OVSAT
1.00 5.00 2.00 2.00 1.00 1.00 2.00 2.00
1.00 7.00 6.00 4.00 6.00 2.00 6.00 4.57
7.00 7.00 1.00 7.00 7.00 6.00 7.00 6.00
4.00 6.00 6.00 6.00 6.00 6.00 6.00 5.71
1.00 6.00 5.00 2.00 1.00 1.00 3.00 2.71
3.00 3.00 7.00 6.00 7.00 1.00 6.00 4.71
6.00 7.00 7.00 6.00 6.00 6.00 7.00 6.43
2.00 7.00 3.00 3.00 3.00 1.00 3.00 3.14
6.00 6.00 7.00 6.00 6.00 6.00 6.00 6.14
4.00 6.00 5.00 4.00 4.00 3.00 3.00 4.14
1.00 3.00 6.00 5.00 5.00 6.00 5.00 4.43
1.00 5.00 1.00 1.00 1.00 1.00 1.00 1.57
1.00 5.00 1.00 1.00 1.00 5.00 1.00 2.14
1.00 7.00 2.00 2.00 3.00 3.00 3.00 3.00
7.00 7.00 6.00 7.00 7.00 7.00 7.00 6.86
6.00 4.00 4.00 7.00 6.00 7.00 7.00 5.86
7.00 7.00 7.00 7.00 5.00 7.00 7.00 6.71
7.00 7.00 4.00 4.00 7.00 7.00 6.00 6.00
6.00 5.00 7.00 7.00 6.00 5.00 6.00 6.00
7.00 5.00 5.00 6.00 6.00 2.00 9.00 5.17
3.00 6.00 6.00 3.00 5.00 5.00 5.00 4.71
3.00 7.00 6.00 7.00 4.00 3.00 7.00 5.29
6.00 6.00 7.00 7.00 7.00 6.00 7.00 6.57
3.00 7.00 7.00 7.00 6.00 1.00 7.00 5.43
5.00 7.00 6.00 6.00 7.00 6.00 6.00 6.14
4.00 6.00 6.00 6.00 6.00 3.00 6.00 5.29
5.00 5.00 6.00 5.00 5.00 1.00 5.00 4.57
3.00 6.00 2.00 5.00 6.00 6.00 5.00 4.71
4.00 4.00 2.00 3.00 3.00 2.00 2.00 2.86
7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00
5.00 6.00 6.00 4.00 7.00 6.00 6.00 5.71
7.00 6.00 4.00 7.00 7.00 5.00 7.00 6.14
4.00 5.00 4.00 5.00 7.00 5.00 7.00 5.29
3.00 7.00 7.00 7.00 6.00 6.00 7.00 6.14
6.00 6.00 6.00 6.00 5.00 5.00 6.00 5.71
4.00 5.00 7.00 4.00 6.00 4.00 7.00 5.29
7.00 7.00 6.00 7.00 7.00 6.00 7.00 6.71
6.00 5.00 2.00 7.00 6.00 6.00 7.00 5.57
3.00 6.00 7.00 5.00 3.00 7.00 6.00 5.29
6.00 6.00 7.00 7.00 6.00 6.00 7.00 6.43
6.00 4.00 5.00 7.00 6.00 6.00 6.00 5.71
4.00 4.00 4.00 6.00 4.00 1.00 2.00 3.57
5.00 5.00 6.00 6.00 7.00 5.00 6.00 5.71
4.00 6.00 6.00 6.00 6.00 6.00 6.00 5.71
5.00 6.00 6.00 6.00 6.00 6.00 6.00 5.86
2.00 2.00 2.00 2.00 2.00 2.00 3.00 2.14
5.00 6.00 6.00 5.00 5.00 6.00 6.00 5.57
2.00 6.00 6.00 5.00 3.00 5.00 6.00 4.71
5.00 6.00 2.00 5.00 5.00 6.00 4.00 4.71
5.00 6.00 7.00 6.00 6.00 7.00 7.00 6.29
1.00 6.00 6.00 2.00 5.00 1.00 5.00 3.71
5.00 6.00 7.00 6.00 6.00 3.00 7.00 5.71
6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00
7.00 7.00 7.00 7.00 7.00 6.00 7.00 6.86
7.00 1.00 6.00 7.00 3.00 5.00 6.00 5.00
4.00 6.00 6.00 5.00 5.00 6.00 6.00 5.43
1.00 5.00 5.00 5.00 1.00 2.00 5.00 3.43
1.00 6.00 5.00 3.00 5.00 5.00 3.00 4.00
7.00 7.00 7.00 7.00 7.00 5.00 7.00 6.71
4.00 6.00 7.00 7.00 7.00 5.00 7.00 6.14
Analyze -> Scale -> Reliability Analysis …
Alpha and standardized alpha should be approximately equal.
All items should have approximately equal standard deviations. Item 32 is suspect here. In general, items with
small standard deviations will tend to suppress reliability.
Look for items with small or negative correlations with the other items. They'll be the most likely candidates
for exclusion from the scale. Item 32’s correlations have been highlighted.
Use this column to identify items whose removal would result in an increase in scale reliability, such as
Item 32.
I’ve reproduced the display of alpha for the whole scale to make it easier to use the values in the rightmost
column above.
Reliability Example
Tests with Right/Wrong Answers
The example below illustrates how reliability analysis would be performed on a multiple choice test in which there was a right and
wrong answer to each item.
I chose to enter the raw responses to the items into SPSS from within a Syntax Window.
The DATA LIST command tells SPSS the names of the variables (q1, q2, . . ., q36) and where each is located within a line (columns
1-36). For this example, q36 was an essay question and was not included in the reliability analysis.
The values represent responses marked by test takers as follows:
1=a 2=b 3=c 4=d 9=no answer provided.
DATA LIST /q1 to q36 1-36.
BEGIN DATA.
333331112113322114241221114421423122
311333432212311114341422321112133224
323333431212311424242411331222413225
333333441212321421242411322921423223
313333441232311121241411324121423225
323333141212311121242412321221423225
111411431212213434342421111222433225
211321413212333314142413224121443124
333311112313122412341411222122423221
332333133212311414142411224222423220
323333431212311414242441321221433223
213332131212311134142211221221433225
313333441312322114122411214222333325
323333431212314121242411324221422223
331332412212312114241111311422413221
333133413232311124142214131121433224
313333431212311414242411221121423225
313321432313341324342431311221433224
312321431333212424232111223221433223
323332131213311414242411321421433224
323333441212311124242411321221423225
313333431212311123242411321221421225
331333431212311121242411214312423293
113333412313313121242241221321423324
END DATA.
The following commands "score" each response and put the score for each question into a new variable.
RECODE q1 (3=1) (ELSE=0) INTO q1score.
RECODE q2 (2=1) (ELSE=0) INTO q2score.
RECODE q3 (3=1) (ELSE=0) INTO q3score.
RECODE q4 (3=1) (ELSE=0) INTO q4score.
RECODE q5 (3=1) (ELSE=0) INTO q5score.
RECODE q6 (3=1) (ELSE=0) INTO q6score.
RECODE q7 (4=1) (ELSE=0) INTO q7score.
RECODE q8 (3=1) (ELSE=0) INTO q8score.
RECODE q9 (1=1) (ELSE=0) INTO q9score.
* This question had two correct answers.
RECODE q10 (2,3=1) (ELSE=0) INTO q10score.
RECODE q11 (1=1) (ELSE=0) INTO q11score.
RECODE q12 (2=1) (ELSE=0) INTO q12score.
RECODE q13 (3=1) (ELSE=0) INTO q13score.
RECODE q14 (1=1) (ELSE=0) INTO q14score.
RECODE q15 (1=1) (ELSE=0) INTO q15score.
RECODE q16 (1=1) (ELSE=0) INTO q16score.
RECODE q17 (2=1) (ELSE=0) INTO q17score.
RECODE q18 (1=1) (ELSE=0) INTO q18score.
RECODE q19 (2=1) (ELSE=0) INTO q19score.
RECODE q20 (4=1) (ELSE=0) INTO q20score.
RECODE q21 (2=1) (ELSE=0) INTO q21score.
RECODE q22 (4=1) (ELSE=0) INTO q22score.
RECODE q23 (1=1) (ELSE=0) INTO q23score.
RECODE q24 (1=1) (ELSE=0) INTO q24score.
* This question had two correct answers.
RECODE q25 (2,3=1) (ELSE=0) INTO q25score.
RECODE q26 (2=1) (ELSE=0) INTO q26score.
RECODE q27 (1=1) (ELSE=0) INTO q27score.
RECODE q28 (2=1) (ELSE=0) INTO q28score.
RECODE q29 (2=1) (ELSE=0) INTO q29score.
RECODE q30 (1=1) (ELSE=0) INTO q30score.
RECODE q31 (4=1) (ELSE=0) INTO q31score.
RECODE q32 (2=1) (ELSE=0) INTO q32score.
RECODE q33 (3=1) (ELSE=0) INTO q33score.
RECODE q34 (2=1) (ELSE=0) INTO q34score.
RECODE q35 (2=1) (ELSE=0) INTO q35score.
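For readers who do not use SPSS, the same scoring logic can be sketched in Python. This is an illustration only, not part of the original analysis: the answer key is transcribed from the RECODE commands above, and the names `KEY` and `score_responses` are made up. The two sample lines are the first and third data lines from the DATA LIST example.

```python
# Answer key transcribed from the RECODE commands above. Each set holds every
# response counted as correct; q10 and q25 each accept two answers.
KEY = [
    {3}, {2}, {3}, {3}, {3}, {3}, {4}, {3}, {1}, {2, 3},
    {1}, {2}, {3}, {1}, {1}, {1}, {2}, {1}, {2}, {4},
    {2}, {4}, {1}, {1}, {2, 3}, {2}, {1}, {2}, {2}, {1},
    {4}, {2}, {3}, {2}, {2},
]

def score_responses(line):
    """Return the 35 item scores (1 = correct, 0 = otherwise) for one data line.

    zip stops after the 35 keyed items, so the 36th column (the essay
    question) is ignored, as in the reliability analysis.
    """
    return [1 if int(ch) in correct else 0 for ch, correct in zip(line, KEY)]

first_person = score_responses("333331112113322114241221114421423122")
third_person = score_responses("323333431212311424242411331222413225")
print(sum(first_person), sum(third_person))  # total scores for the two people
```

A response of 9 (no answer provided) never matches the key, so it is scored 0, just as the ELSE=0 clause does.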
The following is a list of the newly created "score" variables. The columns, from left to right, are the
variables Q1SCORE through Q35SCORE, followed by TOTSCORE.
1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 1 1 1 1 1 0 1 16
1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1 1 0 0 0 0 0 1 1 1 21
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 30
1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 29
1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 29
1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 32
0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1 0 0 1 1 1 0 1 0 1 1 1 18
0 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 1 0 1 17
1 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 1 1 1 1 1 17
1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 25
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 30
0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 26
1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 1 0 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 21
1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 32
1 0 0 1 1 0 1 0 0 1 1 1 1 1 0 1 0 0 1 1 0 0 1 1 1 0 1 0 1 0 1 0 1 1 1 21
1 0 1 0 1 1 1 0 0 1 0 1 1 1 1 1 1 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 1 22
1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 30
1 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 1 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 23
1 0 0 1 0 0 1 1 1 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 21
1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 27
1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 33
1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 32
1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 27
0 0 1 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 0 1 25
The RELIABILITY procedure was invoked with the following syntax command. Obviously, it can also be
invoked from a pull down menu:
Note that the variables which are assessed are the 1/0 "score" variables, not the original responses.
RELIABILITY
/VARIABLES=q1score q2score q3score q4score q5score q6score q7score q8score
q9score q10score q11score q12score q13score q14score q15score q16score
q17score q18score q19score q20score q21score q22score q23score q24score
q25score q26score q27score q28score q29score q30score q31score q32score
q33score q34score q35score
/FORMAT=NOLABELS
/SCALE(ALPHA)=ALL/MODEL=ALPHA
/STATISTICS=DESCRIPTIVE SCALE
/SUMMARY=TOTAL CORR .
Reliability output from a previous version of SPSS.
****** Method 2 (covariance matrix) will be used for this analysis ******
R E L I A B I L I T Y A N A L Y S I S - S C A L E (A L P H A)
Mean Std Dev Cases
1. Q1SCORE .8333 .3807 24.0
2. Q2SCORE .2500 .4423 24.0
3. Q3SCORE .7083 .4643 24.0
4. Q4SCORE .9167 .2823 24.0
5. Q5SCORE .7917 .4149 24.0
6. Q6SCORE .6250 .4945 24.0
7. Q7SCORE .7500 .4423 24.0
8. Q8SCORE .5417 .5090 24.0
9. Q9SCORE .6250 .4945 24.0
10. Q10SCORE .9583 .2041 24.0
11. Q11SCORE .8750 .3378 24.0
12. Q12SCORE .7500 .4423 24.0
13. Q13SCORE .8750 .3378 24.0
14. Q14SCORE .7500 .4423 24.0
15. Q15SCORE .6250 .4945 24.0
16. Q16SCORE .5417 .5090 24.0
17. Q17SCORE .5000 .5108 24.0
18. Q18SCORE .2500 .4423 24.0
19. Q19SCORE .6250 .4945 24.0
20. Q20SCORE .9167 .2823 24.0
21. Q21SCORE .7917 .4149 24.0
22. Q22SCORE .7500 .4423 24.0
23. Q23SCORE .7500 .4423 24.0
24. Q24SCORE .8333 .3807 24.0
25. Q25SCORE .8750 .3378 24.0
26. Q26SCORE .6667 .4815 24.0
27. Q27SCORE .5833 .5036 24.0
28. Q28SCORE .5000 .5108 24.0
29. Q29SCORE .9167 .2823 24.0
30. Q30SCORE .6667 .4815 24.0
31. Q31SCORE .9167 .2823 24.0
32. Q32SCORE .5000 .5108 24.0
33. Q33SCORE .9167 .2823 24.0
34. Q34SCORE .8333 .3807 24.0
35. Q35SCORE .9583 .2041 24.0
* * * Warning * * * Determinant of matrix is zero
Statistics based on inverse matrix for scale ALPHA
are meaningless and printed as .
(This message will be printed whenever the number of variables exceeds the number of persons. Alpha is not
affected.)
R E L I A B I L I T Y A N A L Y S I S - S C A L E (A L P H A)
N of Cases = 24.0
Statistics for    Mean       Variance   Std Dev   N of Variables
Scale             25.1667    28.7536    5.3622    35
Inter-item
Correlations      Mean     Minimum   Maximum   Range     Max/Min    Variance
                  .0972    -.3780    .7977     1.1757    -2.1106    .0454
R E L I A B I L I T Y A N A L Y S I S - S C A L E (A L P H A)
Item-total Statistics
Scale Scale Corrected
Mean Variance Item- Squared Alpha
if Item if Item Total Multiple if Item
Deleted Deleted Correlation Correlation Deleted
Q1SCORE 24.3333 27.6232 .2463 . .8050
Q2SCORE 24.9167 26.0797 .5486 . .7937
Q3SCORE 24.4583 26.6938 .3844 . .7999
Q4SCORE 24.2500 27.9348 .2477 . .8051
Q5SCORE 24.3750 26.3315 .5285 . .7950
Q6SCORE 24.5417 25.4764 .6075 . .7901
Q7SCORE 24.4167 28.2536 .0647 . .8119
Q8SCORE 24.6250 27.7228 .1440 . .8101
Q9SCORE 24.5417 25.5634 .5890 . .7909
Q10SCORE 24.2083 27.9982 .3304 . .8042
Q11SCORE 24.2917 28.5634 .0211 . .8113
Q12SCORE 24.4167 27.0362 .3308 . .8021
Q13SCORE 24.2917 27.1721 .4166 . .8001
Q14SCORE 24.4167 26.5145 .4486 . .7976
Q15SCORE 24.5417 25.6504 .5707 . .7917
Q16SCORE 24.6250 28.1576 .0624 . .8135
Q17SCORE 24.6667 26.1449 .4495 . .7969
Q18SCORE 24.9167 26.9493 .3503 . .8013
Q19SCORE 24.5417 25.8243 .5342 . .7933
Q20SCORE 24.2500 28.1087 .1888 . .8065
Q21SCORE 24.3750 27.0272 .3604 . .8011
Q22SCORE 24.4167 27.2101 .2921 . .8035
Q23SCORE 24.4167 27.3841 .2536 . .8050
Q24SCORE 24.3333 28.1449 .1148 . .8092
Q25SCORE 24.2917 27.1721 .4166 . .8001
Q26SCORE 24.5000 26.9565 .3130 . .8028
Q27SCORE 24.5833 27.4710 .1949 . .8079
Q28SCORE 24.6667 27.1884 .2449 . .8058
Q29SCORE 24.2500 28.6304 .0144 . .8106
Q30SCORE 24.5000 27.1304 .2774 . .8042
Q31SCORE 24.2500 28.1087 .1888 . .8065
Q32SCORE 24.6667 26.8406 .3122 . .8029
Q33SCORE 24.2500 30.0217 -.4356 . .8208
Q34SCORE 24.3333 27.0145 .4028 . .7999
Q35SCORE 24.2083 28.9547 -.1105 . .8117
R E L I A B I L I T Y A N A L Y S I S - S C A L E (A L P H A)
Reliability Coefficients 35 items
Alpha = .8080 Standardized item alpha = .7902
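As a check on the output, the standardized item alpha can be reproduced approximately from two numbers printed earlier in the same output: the number of items (35) and the mean inter-item correlation (.0972). This is a worked sketch using the standardized alpha formula from these notes; the small discrepancy in the last digit presumably reflects rounding of the mean correlation in the printout.

```latex
\alpha_{\text{standardized}}
  = \frac{K \, \bar{r}}{1 + (K-1) \, \bar{r}}
  = \frac{35 \times .0972}{1 + 34 \times .0972}
  = \frac{3.402}{4.3048}
  \approx .790
```

So even a mean inter-item correlation below .10 yields a standardized alpha near .79 when there are 35 items.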
The logic behind coefficient alpha
Coefficient alpha is based on the premise that originated with the use of the Spearman-Brown
split-half estimate: If different parts of a test correlate highly with each other, then the test as a
whole would be likely to correlate highly with itself on a second administration.
Factors affecting estimates of reliability
There are at least three major factors that will affect the relationship of a reliability estimate to the true
reliability of the test in the population in which the test will be used.
Let’s call the sample of persons upon whom the reliability estimate is based the reliability sample.
1. Variability of the reliability sample relative to variability of the population in which the instrument will
be used.
If the reliability sample is too homogeneous, the reliability estimate will be too small.
On the assumption that you want to report as high a reliability coefficient as possible, this suggests that
you should make the sample from whom you obtain the estimate of reliability as heterogeneous as
possible.
2. Errors of measurement specific to the reliability sample.
Suppose the test requires the reading level of a college graduate, but the reliability sample includes a
variety of persons, some of whom are not in college.
This means that some of the people won’t understand some of the items and will guess.
Guessing is represented in Classical Test Theory by large errors of measurement.
So test characteristics that cause large errors of measurement, such as an inappropriate reading level or
poor wording, reduce reliability and estimates of reliability.
3. Consistency of the people making up the reliability sample.
The specific people making up the sample may contribute to the errors of measurement referred to in 2
above.
Some people are more careless (?) inconsistent (?) than others. If the reliability sample is composed of a
bunch of careless respondents, the reliability estimates will be smaller than if the reliability sample were
composed of consistent responders.
We split a sample into two groups based on the variability of their responses to items within the same
Big Five dimension. Here are the reliability values for the two groups . . .
Group Extraversion Agreeableness Conscientiousness Stability Intellect
Consistent .92 .83 .84 .90 .85
Inconsistent .85 .69 .79 .83 .76
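The three factors above can be sketched with the Classical Test Theory definition of reliability as the ratio of true-score variance to observed-score variance. The sketch is illustrative only. The variance values are hypothetical and are not taken from the Big Five data in the table.

```python
def reliability(true_var, error_var):
    """Reliability under Classical Test Theory: var(T) / (var(T) + var(E))."""
    return true_var / (true_var + error_var)


# Hypothetical variances, chosen only to show the direction of each effect.
heterogeneous = reliability(true_var=100, error_var=25)  # wide spread of true scores
homogeneous = reliability(true_var=25, error_var=25)     # factor 1: restricted spread
careless = reliability(true_var=100, error_var=100)      # factors 2 and 3: larger errors

print(heterogeneous, homogeneous, careless)
```

Shrinking the true-score variance (a homogeneous sample) and inflating the error variance (guessing or careless responding) both pull the ratio down.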
The Reliability Ceiling: Why be concerned with reliability?
Goal of research: To find relationships (correlations) between independent and dependent variables.
If we find significant correlations, our work is lauded, published, rewarded.
If we don’t find significant correlations, our work is round-filed.
So, most of the time we want large correlations between the measures we administer.
Of all the tests out there, with which test will your test correlate most highly?
The answer is that a test will correlate most highly with itself.
And reliability is the extent to which a test correlates with itself.
So if reliability is low, that means that a test doesn’t even correlate highly with itself.
That being the case, how could we expect it to correlate highly with any other test?
And the answer is: we couldn’t. If a test doesn’t correlate highly with itself, it won’t correlate highly with any
other test.
The fact that the reliability of a test limits its ability to correlate with other tests is called the reliability ceiling
associated with a test.
Reliability Ceiling Formula.
Suppose X is the independent variable and Y is the dependent variable in the relationship being tested.
Let rXX’ and rYY’ be the true reliabilities of X and Y, respectively.
Let rtX,tY be the correlation between the True scores on the X dimension and the True scores on the Y dimension.
Then rXY ≤ rtX,tY * sqrt(rXX’ * rYY’)
The correlation between observed X and Y scores will be smaller than the correlation between True X and True Y
by a factor equal to the square root of the product of the two reliabilities. Unless both reliabilities are 1
(or the true correlation is 0), this means that the observed correlation will be smaller in magnitude than the
true correlation.
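A worked example may make the ceiling concrete. The sketch below simply evaluates the right-hand side of the formula. The function name and the reliability values are hypothetical.

```python
from math import sqrt


def reliability_ceiling(r_true, rxx, ryy):
    """Upper bound on the observed correlation: r_true * sqrt(rxx * ryy)."""
    return r_true * sqrt(rxx * ryy)


# Hypothetical reliabilities of .81 for X and .64 for Y.
# Even if the true scores correlated perfectly, the observed r could not exceed this:
print(round(reliability_ceiling(1.0, 0.81, 0.64), 2))
# With a true-score correlation of .50, the ceiling is lower still:
print(round(reliability_ceiling(0.5, 0.81, 0.64), 2))
```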
Reasons for nonsignificant correlations between independent and dependent variables.
We’ve now covered three reasons for failure to achieve statistically significant correlations.
1. The correlation between true scores is 0, i.e., X and Y are not related to each other.
This means that our theory which predicts a relationship between X and Y is wrong.
We must revise our thinking about the X – Y relationship.
From a methodological point of view, this is the only excusable reason for a nonsignificant result.
2. Low power.
There could really be a relationship between True X and True Y, but our sample size is too small for our
statistical test to detect it.
This is inexcusable.
We should always have sufficient sample size to detect the relationship we expect to find.
3. Low reliability.
There could really be a relationship between True X and True Y, but our measures of X and Y are so
unreliable that the observed correlation is not significant.
This is inexcusable.
We should always have measures sufficiently reliable to allow us to detect the relationship we expect to
find.
The above is a good candidate for an essay test question or a couple of multiple choice questions.
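Reasons 2 and 3 interact: lower reliability shrinks the correlation we can expect to observe, which in turn raises the sample size needed to detect it. The sketch below is an assumption-laden illustration, not a method given in this lecture. It uses the common Fisher z approximation for the sample size needed to detect a correlation with a two-tailed test, and it treats the reliability ceiling as the expected observed correlation. The function names and default values are hypothetical.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate n needed to detect a correlation r (two-tailed), via Fisher's z."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)


def n_for_true_correlation(true_r, rxx, ryy, alpha=0.05, power=0.80):
    """Same, after shrinking the true correlation by sqrt(rxx * ryy)."""
    expected_observed_r = true_r * sqrt(rxx * ryy)
    return n_for_correlation(expected_observed_r, alpha, power)


# Hypothetical values: true correlation .30, measured perfectly vs. with reliabilities of .70.
print(n_for_true_correlation(0.30, 1.0, 1.0))
print(n_for_true_correlation(0.30, 0.70, 0.70))
```

Under these assumptions, the unreliable measures require a noticeably larger sample to have the same chance of detecting the same underlying relationship.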
Acceptable Reliability
How high should reliability be?
How tall is tall?
Some very general guidelines
Reliability Range Characterization
0 - .6 Poor
.6 - .7 Marginally Acceptable
.7 - .8 Acceptable
.8 - .9 Good
.9+ Very Good
.95+ Too good.
Introduction to Path Diagrams
Symbols
Observed variables are symbolized by squares or rectangles.
[Figure: a rectangle labeled "Observed Variable" containing example scores 103, 84, 121, 76, ..., 97, 81.]
Theoretical Constructs, also called Latent Variables, are symbolized by circles or ellipses.
[Figure: an ellipse labeled "Latent Variable / Theoretical Construct" containing example values 106, 78, 115, 80, ..., 93, 83.]
Correlations between variables are represented by double-headed arrows.
[Figure: two Observed Variable rectangles joined by a double-headed "Correlation" arrow, with example scores
103, 84, 121, 76, ..., 97, 81 and 101, 90, 128, 72, ..., 93, 80; and two Latent Variable ellipses joined by a
double-headed "Correlation" arrow, with example values 106, 78, 115, 80, ..., 93, 83 and 104, 79, 114, 79, ..., 92, 81.]
"Causal" or "Predictive" relationships between variables are represented by single-headed arrows.
[Figure: four panels, each showing a single-headed "Causal" arrow between a pair of variables: Latent and Observed,
Observed and Latent, Latent and Latent, and Observed and Observed.]
Representation of Classical Test Theory
In Equation form: Observed Scores = True Scores + Errors of Measurement, i.e., Xi = Ti + Ei
[Figure: path diagram in which "Causal" arrows run from T (True Scores) to X (Observed Scores) and from
E (Errors of Measurement) to X.]
That is, every observed score is the sum of a person's true position on the dimension of interest plus whatever
error occurred in the process of measuring. The relationship between T and X is one in which the Observed score
is said to be a reflective indicator of the True amount.
In terms of the labels of the diagrams . . .
[Figure: path diagram in which "Causal" arrows run from True Score (latent variable) and from Error of Measurement
(latent variable) to Observed Score (observed variable). Example scores from the diagram:]
True Score    Observed Score    Error of Measurement
106           103               -3
78            84                +6
115           121               +6
80            76                -4
...           ...               ...
93            97                +4
83            81                -2
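The example scores in the diagram can be checked against the equation Observed = True + Error. The short sketch below uses only the six rows that are legible in the diagram. The variable names are hypothetical.

```python
# Example scores copied from the diagram (rows hidden behind "..." are omitted).
true_scores = [106, 78, 115, 80, 93, 83]
observed_scores = [103, 84, 121, 76, 97, 81]

# Under Classical Test Theory, the error of measurement is observed minus true.
errors = [x - t for x, t in zip(observed_scores, true_scores)]
print(errors)

# Rebuilding each observed score from its two parts recovers the original values.
rebuilt = [t + e for t, e in zip(true_scores, errors)]
print(rebuilt == observed_scores)
```

In practice only the observed scores are available. The true scores and errors are latent, which is why they are drawn as circles or ellipses.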
The Reliability Ceiling considered using Path Notation
SYMBOLICALLY, rXY ≤ rTxTy sqrt(rxx' * ryy')
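[Figure: path diagram. Observed Wonderlic scores and observed GPA’s, each with its own Error of Measurement,
are joined by the observed rXY (what we observe). The latent variables Intelligence and Academic Ability are
joined by the true rXY (what we want).]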
When reliability is high, variance due to errors of measurement is small, so the observed r will be about equal
to the true r.
[Figure: path diagram with small Errors of Measurement. Observed WPT Scores and observed GPA’s are joined by the
observed rXY, which is close to the true r. The latent variables Intelligence and Academic Success are joined by
the true rXY.]
When reliability is low, variance due to errors of measurement is large making the observed r smaller than the
true r.
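[Figure: path diagram with large Errors of Measurement. Observed WPT Scores and observed GPA’s are joined by the
observed rXY, which is less than the true r. The latent variables Intelligence and Academic Success are joined by
the true rXY.]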
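The contrast between the two diagrams can be illustrated with a small simulation. This is a sketch under assumed values, not data from the Wonderlic or GPA example: true scores are generated with a known correlation, errors of measurement are added to reach a chosen reliability, and the observed correlation is then computed. The function names, seed, sample size, and parameter values are all hypothetical.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_observed_r(true_r, rxx, ryy, n=20000, seed=513):
    """Observed correlation when X and Y have reliabilities rxx and ryy."""
    rng = random.Random(seed)
    # True scores have variance 1, so error variance (1 - rel) / rel gives reliability rel.
    ex_sd = sqrt((1 - rxx) / rxx)
    ey_sd = sqrt((1 - ryy) / ryy)
    xs, ys = [], []
    for _ in range(n):
        tx = rng.gauss(0, 1)
        ty = true_r * tx + sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
        xs.append(tx + rng.gauss(0, ex_sd))   # observed X = true X + error
        ys.append(ty + rng.gauss(0, ey_sd))   # observed Y = true Y + error
    return pearson_r(xs, ys)


high = simulate_observed_r(true_r=0.6, rxx=0.95, ryy=0.95)
low = simulate_observed_r(true_r=0.6, rxx=0.50, ryy=0.50)
print(round(high, 2), round(low, 2))
```

With high reliabilities the simulated observed r stays near the true value of .60. With low reliabilities it falls to roughly the value given by the reliability ceiling formula.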