# Reliability

## Chapter 4 – Reliability

1. Observed Scores and True Scores
2. Error
3. How We Deal with Sources of Error:
   A. Domain sampling – test items
   B. Time sampling – test occasions
   C. Internal consistency – traits
4. Reliability in Observational Studies
5. Using Reliability Information
6. What To Do about Low Reliability

Measurement of human ability and knowledge is challenging because:

- ability is not directly observable – we infer ability from behavior
- all behaviors are influenced by many variables, only a few of which matter to us

## Observed Scores

O = T + e

where O = observed score, T = true score, and e = error.

## Reliability – the basics

1. A true score on a test does not change with repeated testing.
2. A true score would be obtained if there were no error of measurement.
3. We assume that errors are random (equally likely to increase or decrease any test result).

## Reliability – the basics

- Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors).
- The mean of many observed scores for one person will be the person's true score.
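This cancellation is easy to see in a small simulation; the true score of 75 and the error spread of 5 are made-up numbers for illustration:

```python
import random

random.seed(42)

TRUE_SCORE = 75.0   # hypothetical true score
ERROR_SD = 5.0      # hypothetical spread of the random error
N_TESTS = 100_000

# each observed score is O = T + e, with e drawn from a zero-mean distribution
mean_observed = sum(
    TRUE_SCORE + random.gauss(0, ERROR_SD) for _ in range(N_TESTS)
) / N_TESTS

# positive and negative errors cancel, so the mean lands very close to 75
```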

## Reliability – the basics

- Example: to measure Sarah's ability to spell English words, we can't ask her to spell every word in the OED, so we ask her to spell a subset of English words.
- Her % correct estimates her true English spelling skill.
- But which words should be in our subset?

## Estimating Sarah's spelling ability…

- Suppose we choose 20 words randomly…
- What if, by chance, we get a lot of very easy words – cat, tree, chair, stand…
- Or, by chance, a lot of very difficult words – desiccate, arteriosclerosis, numismatics?

## Estimating Sarah's spelling ability…

- Sarah's observed score varies as the difficulty of the random sets of words varies.
- But presumably her true score (her actual spelling ability) remains constant.
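The effect of drawing different random word sets can be sketched in a simulation; the word list and per-word difficulties below are invented for illustration:

```python
import random

random.seed(7)

# hypothetical dictionary: each word carries the probability that Sarah
# spells it correctly, standing in for its difficulty
WORDS = [("word%d" % i, random.random()) for i in range(5000)]

def spelling_test(n_words=20):
    """Draw a random sample of words and return Sarah's % correct on it."""
    sample = random.sample(WORDS, n_words)
    correct = sum(random.random() < p for _word, p in sample)
    return 100.0 * correct / n_words

# her true ability never changes, yet the observed scores vary by sample
observed = [spelling_test() for _ in range(10)]
```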

## Reliability – the basics

- Other things can produce error in our measurement.
- E.g., on the first day that we test Sarah she's tired, but on the second day she's rested…
- This would lead to different scores on the two days.

## Estimating Sarah's spelling ability…

- Conclusion: O = T + e, but e1 ≠ e2 ≠ e3 …
- The variation in Sarah's scores is produced by measurement error.
- How can we measure such effects – how can we measure reliability?

## Reliability – the basics

- In what follows, we consider various sources of error in measurement.
- Different ways of measuring reliability are sensitive to different sources of error.

## How do we deal with sources of error?

- Error due to test items → domain sampling error
- Error due to testing occasions → time sampling error
- Error due to testing multiple traits → internal consistency error

## Domain Sampling error

- A knowledge base or skill set containing many items is to be tested – e.g., the chemical properties of foods.
- We can't test the entire set of items, so we select a sample of items. That produces domain sampling error, as in Sarah's spelling test.
- There is a "domain" of knowledge to be tested; a person's score may vary depending upon what is included in or excluded from the test.
- Smaller sets of items may not cover the entire knowledge base; larger sets should do a better job of covering it. As a result, the reliability of a test increases with the number of items on that test.

## Domain Sampling error

- Parallel forms reliability: choose 2 different sets of test items; these 2 sets give you "parallel forms" of the test.
- Across all people tested, if the correlation between scores on the 2 parallel forms is low, then we probably have domain sampling error.
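A sketch of the parallel-forms check, with invented true scores and error sizes; each form adds its own item-sampling error to the same underlying abilities:

```python
import random

random.seed(1)

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# hypothetical true abilities for 50 test-takers
true_scores = [random.gauss(70, 10) for _ in range(50)]

# each parallel form adds its own domain-sampling error
form_a = [t + random.gauss(0, 3) for t in true_scores]
form_b = [t + random.gauss(0, 3) for t in true_scores]

r_parallel = pearson(form_a, form_b)  # high r → little domain sampling error
```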

## Time Sampling error

- Test-retest reliability: a person taking a test might be having a very good or very bad day – due to fatigue, emotional state, preparedness, etc.
- Give the same test repeatedly and check correlations among the scores.
- High correlations indicate stability – less influence of bad or good days.

## Time Sampling error

- The test-retest approach is only useful for traits – characteristics that don't change over time.
- Not all low test-retest correlations imply a weak test; sometimes the characteristic being measured varies with time (as in learning).
- The interval over which the correlation is measured matters. For young children, use a very short period (< 1 month, in general); in general, the interval should not be > 6 months.

## Time Sampling error

- Test-retest advantage: easy to evaluate, using correlation.
- Disadvantages: carryover and practice effects.
- Carryover: the first testing session influences scores on the next session.
- Practice: when the carryover effect involves learning.

## Internal Consistency error

- Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.
- Would you expect much correlation between scores on the two parts?
- No – because the two 'skills' are unrelated.

## Internal Consistency Approach

- A low correlation between scores on 2 halves of a test suggests that the test is tapping two different abilities or traits.
- A good test has high correlations between scores on its two halves.
- But how should we divide the test in two to check that correlation?

## Internal Consistency error

- Three measures: the split-half method, the Kuder-Richardson formula, and Cronbach's alpha.
- All of these assess the extent to which items on a given test measure the same ability or trait.

## Split-half Reliability

- After testing, divide the test items into halves A & B that are scored separately.
- Check for correlation of the results for A with the results for B.
- There are various ways of dividing the test in two – randomly, first half vs. second half, odd-even…

## Split-half Reliability – a problem

- Each half-test is smaller than the whole, and smaller tests have lower reliability (domain sampling error).
- So we shouldn't use the raw split-half reliability to assess reliability for the whole test.

## Split-half Reliability – a problem

- We correct the reliability estimate using the Spearman-Brown formula:

  re = 2rc / (1 + rc)

  where re = estimated reliability for the whole test, and rc = computed reliability (the correlation between scores on the two halves A and B).
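The correction itself is one line of code; the 0.70 half-test correlation below is a made-up example:

```python
def spearman_brown(r_half):
    """Step the half-test correlation up to an estimate for the full-length test."""
    return 2 * r_half / (1 + r_half)

# e.g., if the two halves correlate at 0.70, the whole-test estimate is higher
r_full = spearman_brown(0.70)
```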

## Kuder-Richardson 20

- Kuder & Richardson (1937): an internal-consistency measure that doesn't require arbitrary splitting of the test into 2 halves.
- KR-20 avoids the problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves.
- The formula contains two basic terms:
  1. a measure of all the variance in the whole set of test results.
  2. "item variance" – when items measure the same trait, they co-vary (the same people get them right or wrong). More co-variance = less "item variance".
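The slides describe the two terms without giving the formula; the standard form is KR-20 = (k/(k−1))(1 − Σpq/σ²), where k is the number of items, p and q are each item's pass and fail proportions, and σ² is the variance of total scores. A sketch on a tiny, invented data set:

```python
def kr20(item_matrix):
    """KR-20 for items scored 0/1; rows are people, columns are items."""
    n = len(item_matrix)
    k = len(item_matrix[0])

    # term 1: variance of the whole set of test results (total scores)
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n

    # term 2: summed "item variance" p*(1-p) for each 0/1 item
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n  # proportion passing item j
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)

# hypothetical results: 4 people x 3 items, scored right (1) or wrong (0)
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
reliability = kr20(data)  # 0.75 for this toy data
```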

## Internal Consistency – Cronbach's α

- KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
- Cronbach's α (alpha) generalizes KR-20 to tests with multiple response categories.
- α is a more generally useful measure of internal consistency than KR-20.
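Cronbach's α has the same shape as KR-20 but uses each item's ordinary variance, so it also works for rating scales. A sketch with invented 5-point ratings:

```python
def cronbach_alpha(scores):
    """Cronbach's alpha; rows are people, columns are items (any numeric scale)."""
    n = len(scores)
    k = len(scores[0])

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# hypothetical 5-point ratings: 4 people x 3 items that track each other closely
ratings = [
    [5, 4, 5],
    [4, 4, 4],
    [2, 3, 2],
    [1, 2, 1],
]
alpha = cronbach_alpha(ratings)  # high, because the items co-vary strongly
```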
## Review: How do we deal with sources of error?

| Approach | Measures | Issues |
|---|---|---|
| Test-Retest | Stability of scores | Carryover |
| Parallel Forms | Equivalence & stability | Effort |
| Split-half | Equivalence & internal consistency | Shortened test |
| KR-20 & α | Equivalence & internal consistency | Difficult to calculate |

## Reliability in Observational Studies

- Some psychologists collect data by observing behavior rather than by testing.
- This approach requires time sampling, leading to sampling error.
- Further error is due to observer failures and inter-observer differences.

## Reliability in Observational Studies

- Deal with the possibility of failure in the single-observer situation by having more than 1 observer.
- Deal with inter-observer differences using inter-rater reliability and the kappa statistic.

## Reliability in Observational Studies

- Inter-rater reliability: % agreement between 2 or more observers.
- Problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
- This means that % agreement may over-estimate inter-rater reliability.

## Reliability in Observational Studies

- Kappa statistic (Cohen, 1960): estimates actual inter-rater agreement as a proportion of potential inter-rater agreement after correction for chance.
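A sketch of the chance correction, using two invented raters making a 2-choice judgment:

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters (Cohen, 1960)."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)

    # observed proportion of agreement
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # agreement expected by chance from each rater's base rates
    p_chance = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )

    # actual agreement as a proportion of potential agreement beyond chance
    return (p_observed - p_chance) / (1 - p_chance)

# hypothetical 2-choice ratings from two observers (9 of 10 agree)
rater1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
rater2 = ["yes", "yes", "no", "no", "no",  "no", "yes", "no", "yes", "yes"]
kappa = cohens_kappa(rater1, rater2)  # 0.8: lower than the raw 90% agreement
```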

## Using Reliability Information

- Standard error of measurement (SEM): estimates the extent to which a test score misrepresents a true score.
- SEM = S√(1 − r), where S is the standard deviation of the test scores and r is the reliability of the test.

## Standard Error of Measurement

- We use the SEM to compute a confidence interval for a particular test score.
- The interval is centered on the test score; we have confidence that the true score falls in this interval.
- E.g., 95% of the time the true score will fall within 1.96 SEM either way of the test (observed) score.
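Putting the two formulas together; the SD of 15, reliability of 0.91, and observed score of 100 are invented numbers:

```python
def sem(sd, reliability):
    """Standard error of measurement: S * sqrt(1 - r)."""
    return sd * (1 - reliability) ** 0.5

def confidence_interval_95(observed, sd, reliability):
    """95% CI for the true score, centered on the observed score."""
    margin = 1.96 * sem(sd, reliability)
    return observed - margin, observed + margin

low, high = confidence_interval_95(observed=100, sd=15, reliability=0.91)
# sem = 15 * sqrt(0.09) = 4.5, so the interval is roughly 91.2 to 108.8
```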

## Standard Error of Measurement

- A simple way to think of the SEM: suppose we gave one student the same test over and over, and suppose, too, that no learning took place between tests and the student did not memorize questions.
- The standard deviation of the resulting set of test scores (for this one student) would be the standard error of measurement.

## What to do about low reliability

- Increase the number of items. To find how many you need, use the Spearman-Brown formula.
- Using more items may introduce new sources of error, such as fatigue and boredom.
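The Spearman-Brown formula can be rearranged to answer "how much longer must the test be?"; the 0.70 current and 0.90 desired reliabilities below are made-up targets:

```python
def lengthening_factor(r_current, r_desired):
    """How many times longer the test must be to reach the desired reliability."""
    return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

# hypothetical: a 20-item test with reliability 0.70, aiming for 0.90
factor = lengthening_factor(0.70, 0.90)
items_needed = 20 * factor  # just under 78 items
```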

## What to do about low reliability

- Discriminability analysis: find the correlation between each item and the whole test.
- Delete items with low correlations.
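A sketch of this item screening on invented 0/1 data, correlating each item with the total of the remaining items (excluding the item itself avoids inflating its own correlation):

```python
def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def item_total_correlations(item_matrix):
    """Correlate each item with the summed score of the other items."""
    k = len(item_matrix[0])
    corrs = []
    for j in range(k):
        item = [row[j] for row in item_matrix]
        rest = [sum(row) - row[j] for row in item_matrix]  # total minus item j
        corrs.append(pearson(item, rest))
    return corrs

# hypothetical: items 1 and 2 track the same trait, item 3 is noise
data = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
corrs = item_total_correlations(data)
# item 3's correlation is negative: a candidate for deletion
```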
