Reliability


Chapter 4 – Reliability

1. Observed Scores and True Scores
2. Error
3. How We Deal with Sources of Error:
   A. Domain sampling – test items
   B. Time sampling – test occasions
   C. Internal consistency – traits
4. Reliability in Observational Studies
5. Using Reliability Information
6. What To Do about Low Reliability

Chapter 4 – Reliability

• Measurement of human ability and knowledge is challenging because:
   ◦ ability is not directly observable – we infer ability from behavior
   ◦ all behaviors are influenced by many variables, only a few of which matter to us

Observed Scores

O = T + e

O = Observed score
T = True score
e = error

Reliability – the basics

1. A true score on a test does not change with repeated testing.
2. A true score would be obtained if there were no error of measurement.
3. We assume that errors are random (equally likely to increase or decrease any test result).

Reliability – the basics

• Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors).
• The mean of many observed scores for one person will be the person’s true score.
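A minimal sketch of this idea in Python (the true score of 70 and the error spread of 5 are invented values):

```python
import random

random.seed(1)

TRUE_SCORE = 70   # hypothetical true score
N_TESTS = 10_000  # number of repeated testings

# Each observed score is the true score plus a random error (O = T + e).
# Errors are drawn symmetrically around zero, so they are equally likely
# to raise or lower any single result.
observed = [TRUE_SCORE + random.gauss(0, 5) for _ in range(N_TESTS)]

mean_observed = sum(observed) / N_TESTS
print(f"Mean of {N_TESTS} observed scores: {mean_observed:.2f}")
# Positive and negative errors cancel, so the mean converges on 70.
```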

Reliability – the basics

• Example: to measure Sarah’s spelling ability for English words.
• We can’t ask her to spell every word in the OED, so we ask Sarah to spell a subset of English words.
• Her % correct estimates her true English spelling skill.
• But which words should be in our subset?

Estimating Sarah’s spelling ability…

• Suppose we choose 20 words randomly…
• What if, by chance, we get a lot of very easy words – cat, tree, chair, stand…
• Or, by chance, we get a lot of very difficult words – desiccate, arteriosclerosis, numismatics?

Estimating Sarah’s spelling ability…

• Sarah’s observed score varies as the difficulty of the random sets of words varies.
• But presumably her true score (her actual spelling ability) remains constant.

Reliability – the basics

• Other things can produce error in our measurement.
• E.g., on the first day that we test Sarah she’s tired, but on the second day she’s rested…
• This would lead to different scores on the two days.

Estimating Sarah’s spelling ability…

• Conclusion: O = T + e, but e1 ≠ e2 ≠ e3 …
• The variation in Sarah’s scores is produced by measurement error.
• How can we measure such effects – how can we measure reliability?

Reliability – the basics

• In what follows, we consider various sources of error in measurement.
• Different ways of measuring reliability are sensitive to different sources of error.

How do we deal with sources of error?

• Error due to test items: domain sampling error
• Error due to testing occasions: time sampling error
• Error due to testing multiple traits: internal consistency error

Domain Sampling error

• A knowledge base or skill set containing many items is to be tested.
   ◦ E.g., the chemical properties of foods.
• We can’t test the entire set of items.
   ◦ So we select a sample of items.
   ◦ That produces domain sampling error, as in Sarah’s spelling test.

Domain Sampling error

• There is a “domain” of knowledge to be tested.
• A person’s score may vary depending upon what is included in or excluded from the test.

Domain Sampling error

• Smaller sets of items may not test the entire knowledge base.
• Larger sets of items should do a better job of covering the whole knowledge base.
• As a result, the reliability of a test increases with the number of items on that test.

Domain Sampling error

• Parallel Forms Reliability: choose 2 different sets of test items; these 2 sets give you “parallel forms” of the test.
• Across all people tested, if the correlation between scores on the 2 parallel forms is low, then we probably have domain sampling error.
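As a sketch (the score lists below are invented), parallel-forms reliability is just the correlation between scores on the two forms across the same group of test-takers:

```python
from statistics import correlation  # available in Python 3.10+

# Hypothetical scores for 8 people on two parallel forms of the same test
form_a = [72, 85, 60, 90, 78, 66, 88, 74]
form_b = [70, 88, 58, 93, 75, 70, 85, 71]

# A high correlation suggests the two forms sample the domain equivalently;
# a low one points to domain sampling error.
print(f"Parallel-forms reliability: {correlation(form_a, form_b):.2f}")
```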

Time Sampling error

• Test-retest Reliability:
   ◦ the person taking the test might be having a very good or very bad day – due to fatigue, emotional state, preparedness, etc.
• Give the same test repeatedly & check correlations among scores.
• High correlations indicate stability – less influence of bad or good days.

Time Sampling error

• The test-retest approach is only useful for traits – characteristics that don’t change over time.
• Not all low test-retest correlations imply a weak test.
• Sometimes the characteristic being measured varies with time (as in learning).

Time Sampling error

• The interval over which the correlation is measured matters.
• E.g., for young children, use a very short period (< 1 month, in general).
• In general, the interval should not be > 6 months.

Time Sampling error

• Test-retest approach advantage: easy to evaluate, using correlation.
• Disadvantage: carryover & practice effects.
   ◦ Carryover: the first testing session influences scores on the next session.
   ◦ Practice: when the carryover effect involves learning.

Internal Consistency error

• Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.
• Would you expect much correlation between scores on the two parts?
   ◦ No – because the two ‘skills’ are unrelated.

Internal Consistency Approach

• A low correlation between scores on 2 halves of a test suggests that the test is tapping two different abilities or traits.
• A good test has high correlations between scores on its two halves.
   ◦ But how should we divide the test in two to check that correlation?

Internal Consistency error

• Split-half method
• Kuder-Richardson formula
• Cronbach’s alpha

All of these assess the extent to which items on a given test measure the same ability or trait.

Split-half Reliability

• After testing, divide the test items into halves A & B that are scored separately.
• Check the correlation of results for A with results for B.
• There are various ways of dividing the test into two – randomly, first half vs. second half, odd-even…

Split-half Reliability – a problem

• Each half-test is smaller than the whole.
• Smaller tests have lower reliability (domain sampling error).
• So we shouldn’t use the raw split-half reliability to assess reliability for the whole test.

Split-half Reliability – a problem

• We correct the reliability estimate using the Spearman-Brown formula:

      re = 2rc / (1 + rc)

  re = estimated reliability for the whole test
  rc = computed reliability (the correlation between scores on the two halves A and B)
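A brief sketch of the whole procedure (the response matrix is invented; the odd-even split is one of the division schemes mentioned earlier):

```python
from statistics import correlation  # available in Python 3.10+

# Hypothetical item scores (1 = right, 0 = wrong): 6 people x 10 items
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
]

# Odd-even split: each person gets a score on half A and half B
half_a = [sum(row[0::2]) for row in scores]  # items 1, 3, 5, ...
half_b = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...

rc = correlation(half_a, half_b)  # raw split-half reliability
re = (2 * rc) / (1 + rc)          # Spearman-Brown correction
print(f"raw split-half r = {rc:.2f}, corrected reliability = {re:.2f}")
```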

Kuder-Richardson 20

• Kuder & Richardson (1937): an internal-consistency measure that doesn’t require arbitrary splitting of the test into 2 halves.
• KR-20 avoids the problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves.

Kuder-Richardson 20

• The formula contains two basic terms:
  1. a measure of all the variance in the whole set of test results.
  2. “item variance” – when items measure the same trait, they co-vary (the same people get them right or wrong). More co-variance = less “item variance”.
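A sketch of the computation using the standard KR-20 formula, KR-20 = (k / (k − 1)) × (1 − Σpq / total variance), where p and q are the proportions passing and failing each item (the response matrix is invented):

```python
from statistics import pvariance

# Hypothetical right/wrong (1/0) responses: 6 people x 10 items
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
]
k = len(scores[0])                     # number of items
totals = [sum(row) for row in scores]  # whole-test score per person
total_var = pvariance(totals)          # variance of whole-test scores

# "Item variance": sum of p*q over items, where p = proportion correct
item_var = 0.0
for i in range(k):
    p = sum(row[i] for row in scores) / len(scores)
    item_var += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - item_var / total_var)
print(f"KR-20 = {kr20:.2f}")
```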

Internal Consistency – Cronbach’s α

• KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
• Cronbach’s α (alpha) generalizes KR-20 to tests with multiple response categories.
• α is a more generally useful measure of internal consistency than KR-20.
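A sketch using the standard formula α = (k / (k − 1)) × (1 − Σ item variances / total variance); the 5-point ratings are invented:

```python
from statistics import pvariance

# Hypothetical ratings (1-5 scale): 6 people x 4 items
ratings = [
    [4, 5, 4, 5],
    [2, 3, 3, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [2, 2, 2, 3],
]
k = len(ratings[0])
totals = [sum(row) for row in ratings]

# Sum of per-item variances replaces KR-20's sum of p*q terms
item_vars = sum(pvariance([row[i] for row in ratings]) for i in range(k))

alpha = (k / (k - 1)) * (1 - item_vars / pvariance(totals))
print(f"Cronbach's alpha = {alpha:.2f}")
```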
Review: How do we deal with sources of error?

Approach          Measures                              Issues
Test-Retest       Stability of scores                   Carryover
Parallel Forms    Equivalence & stability               Effort
Split-half        Equivalence & internal consistency    Shortened test
KR-20 & α         Equivalence & internal consistency    Difficult to calculate

Reliability in Observational Studies

• Some psychologists collect data by observing behavior rather than by testing.
• This approach requires time sampling, leading to sampling error.
• Further error arises due to:
   ◦ observer failures
   ◦ inter-observer differences

Reliability in Observational Studies

• Deal with the possibility of failure in the single-observer situation by having more than 1 observer.
• Deal with inter-observer differences using:
   ◦ inter-rater reliability
   ◦ the Kappa statistic

Reliability in Observational Studies

• Inter-rater reliability: % agreement between 2 or more observers.
   ◦ Problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
   ◦ This means that % agreement may over-estimate inter-rater reliability.

Reliability in Observational Studies

• Kappa Statistic (Cohen, 1960): estimates actual inter-rater agreement as a proportion of potential inter-rater agreement, after correction for chance.
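A minimal sketch for two raters making a 2-choice judgment (the judgments are invented): κ = (po − pe) / (1 − pe), where po is the observed agreement and pe is the agreement expected by chance.

```python
from collections import Counter

# Hypothetical 2-choice judgments by two observers over 10 trials
rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
rater2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]
n = len(rater1)

# Observed agreement: proportion of trials where the raters match
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

# Chance agreement: probability both raters pick the same category by luck
c1, c2 = Counter(rater1), Counter(rater2)
p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in c1)

kappa = (p_o - p_e) / (1 - p_e)
print(f"% agreement = {p_o:.0%}, kappa = {kappa:.2f}")
```

Here the raters agree on 80% of trials, but chance alone would produce 52% agreement, so kappa comes out to a more modest 0.58.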

Using Reliability Information

• Standard error of measurement (SEM): estimates the extent to which a test score misrepresents a true score.
• SEM = S × √(1 – r), where S is the standard deviation of the test scores and r is the test’s reliability.

Standard Error of Measurement

• We use the SEM to compute a confidence interval for a particular test score.
• The interval is centered on the test score.
• We have confidence that the true score falls in this interval.
• E.g., 95% of the time, the true score will fall within 1.96 SEM either way of the test (observed) score.
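Putting the last two slides together in a sketch (the score, standard deviation, and reliability values are invented):

```python
import math

observed = 82       # a particular test score
sd = 10             # standard deviation of the test scores
reliability = 0.91  # e.g., a test-retest or alpha estimate

sem = sd * math.sqrt(1 - reliability)  # SEM = S * sqrt(1 - r)

# 95% confidence interval: observed score +/- 1.96 SEM
lo, hi = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.2f}; 95% CI for the true score: [{lo:.1f}, {hi:.1f}]")
```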

Standard Error of Measurement

• A simple way to think of the SEM:
• Suppose we gave one student the same test over and over.
• Suppose, too, that no learning took place between tests and the student did not memorize questions.
• The standard deviation of the resulting set of test scores (for this one student) would be the standard error of measurement.

What to do about low reliability

• Increase the number of items.
• To find how many items you need, use the Spearman-Brown formula (see the sketch below).
• Using more items may introduce new sources of error, such as fatigue and boredom.
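A sketch of the general Spearman-Brown formula used this way (the reliability values are invented): lengthening a test by a factor n changes its reliability to rn = n·r / (1 + (n − 1)·r), and solving for n tells you how much longer the test must be to reach a target reliability.

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability when a test is lengthened by factor n."""
    return (n * r) / (1 + (n - 1) * r)

def length_factor_needed(r_now: float, r_target: float) -> float:
    """Factor by which to lengthen a test to reach a target reliability."""
    return (r_target * (1 - r_now)) / (r_now * (1 - r_target))

r_now, r_target = 0.70, 0.90  # hypothetical current and desired reliability
n = length_factor_needed(r_now, r_target)
print(f"Lengthen the test by a factor of {n:.2f}")            # ~3.86
print(f"Check: predicted r = {spearman_brown(r_now, n):.2f}")  # 0.90
```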

What to do about low reliability

• Discriminability analysis:
   ◦ Find the correlations between each item and the whole test.
   ◦ Delete items with low correlations.
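A sketch of an item-total correlation check (the responses and the 0.2 cutoff are invented):

```python
from statistics import correlation  # available in Python 3.10+

# Hypothetical 1/0 responses: 6 people x 5 items
scores = [
    [1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
]
totals = [sum(row) for row in scores]  # whole-test score per person

for i in range(len(scores[0])):
    item = [row[i] for row in scores]
    r = correlation(item, totals)
    flag = "  <- candidate for deletion" if r < 0.2 else ""
    print(f"item {i + 1}: item-total r = {r:+.2f}{flag}")
```

In practice, the item is often removed from the total before correlating (a “corrected” item-total correlation), which keeps an item from inflating its own r.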