Reliability

• A measure is reliable if it gives the same information every time it is used.
• Reliability is assessed by a number – typically a correlation between two sets of scores.
Reliability

• Measurement of human ability and knowledge is challenging because:
   ability is not directly observable – we infer ability from behavior
   all behaviors are influenced by many variables, only a few of which matter to us
Observed Scores

O = T + e

   O = observed score
   T = true score
   e = error
Reliability – the basics

1. A true score on a test does not change with repeated testing.
2. A true score would be obtained if there were no error of measurement.
3. We assume that errors are random (equally likely to increase or decrease any test result).
Reliability – the basics

• Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors).
• The mean of many observed scores for one person will therefore be the person’s true score (see the sketch below).
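
A minimal Python sketch of this cancellation, assuming a made-up true score of 70 and normally distributed error; averaging many observed scores converges on the true score:

    import random

    TRUE_SCORE = 70  # hypothetical true score for one person

    def observed_score():
        e = random.gauss(0, 5)   # random error: equally likely to be + or -
        return TRUE_SCORE + e    # O = T + e

    scores = [observed_score() for _ in range(10_000)]
    mean_observed = sum(scores) / len(scores)

    print(f"True score:          {TRUE_SCORE}")
    print(f"Mean observed score: {mean_observed:.2f}")  # approaches 70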
Reliability – the basics

• Example: we want to measure Sarah’s spelling ability for English words.
• We can’t ask her to spell every word in the dictionary, so we ask Sarah to spell a subset of English words.
• Her % correct estimates her true English spelling skill.
• But which words should be in our subset?
Estimating Sarah’s spelling ability…

• Suppose we choose 20 words randomly…
• Then, by chance, we may get a lot of very easy words – cat, tree, chair, stand…
• Or, by chance, we may get a lot of very difficult words – desiccate, arteriosclerosis, numismatics.
Estimating Sarah’s spelling ability…

• Sarah’s observed score will vary with the difficulty of the random sets of words we choose (simulated below).
• But presumably her actual spelling ability remains constant.
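
A hedged simulation of this point: the word pool and per-word difficulties below are invented, but the pattern – a constant ability producing varying observed scores across random word samples – is the one described above:

    import random

    random.seed(1)

    # An invented "dictionary" of 1,000 words; each value is the chance
    # that Sarah spells that word correctly (easy words near 1, hard near 0).
    word_difficulty = [random.uniform(0.1, 0.95) for _ in range(1000)]

    def spelling_test(n_words=20):
        sample = random.sample(word_difficulty, n_words)  # random word set
        correct = sum(random.random() < p for p in sample)
        return 100 * correct / n_words                    # percent correct

    for trial in range(5):
        print(f"Test {trial + 1}: {spelling_test():.0f}% correct")
    # Sarah's ability never changes, yet her observed scores bounce around.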
Reliability – the basics

• Other things can also produce error in our measurement.
• E.g., on the first day that we test Sarah she’s tired, but on the second day she’s rested…
Estimating Sarah’s spelling ability…

• Conclusion: O = T + e, but e1 ≠ e2 ≠ e3 …
• The variation in Sarah’s scores is produced by measurement error.
• How can we measure such effects – how can we measure reliability?
Reliability – the basics

• In what follows, we consider various sources of error in measurement.
• Different ways of measuring reliability are sensitive to different sources of error.
How do we deal with sources of error?

• Error due to test items → domain sampling error
Domain Sampling error

• A knowledge base or skill set containing many items is to be tested – e.g., the chemical properties of foods.
• We can’t test the entire set of items, so we sample items. That produces sampling error, as in Sarah’s spelling test.
Domain Sampling error

• Smaller sets of items may not test the entire knowledge base.
• A person’s score may vary depending upon what is included in or excluded from the test.
• Reliability increases with the number of items on a test.
Domain Sampling error

• Parallel Forms Reliability: choose 2 different sets of test items.
• Across all people tested, if the correlation between scores on the 2 sets of words is low, then we probably have domain sampling error (see the sketch below).
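
A minimal sketch of the parallel-forms computation, using invented scores for 8 people on two forms and statistics.correlation (Python 3.10+) for Pearson’s r:

    from statistics import correlation  # Python 3.10+

    # Hypothetical scores for 8 people on two parallel forms of a test.
    form_a = [72, 85, 60, 90, 78, 55, 88, 67]
    form_b = [70, 88, 58, 93, 75, 59, 85, 70]

    r = correlation(form_a, form_b)
    print(f"Parallel-forms reliability: r = {r:.2f}")
    # High r: the two item sets sample the domain equivalently.
    # Low r: probable domain sampling error.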
How do we deal with sources of error?

• Error due to test items → domain sampling error
• Error due to testing occasions → time sampling error
Time Sampling error

• Test-retest Reliability: a person taking a test might be having a very good or very bad day – due to fatigue, emotional state, preparedness, etc.
• So give the same test repeatedly & check the correlations among scores (see the sketch below).
• High correlations indicate stability – less influence of bad or good days.
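
A sketch of the test-retest computation with invented scores for 6 people on three occasions; each pairwise correlation is a test-retest reliability estimate:

    from itertools import combinations
    from statistics import correlation  # Python 3.10+

    # Hypothetical scores for 6 people tested on three occasions.
    occasions = {
        "day 1": [14, 20, 11, 18, 16, 9],
        "day 2": [15, 19, 12, 17, 15, 10],
        "day 3": [13, 21, 10, 19, 16, 8],
    }

    for (name1, s1), (name2, s2) in combinations(occasions.items(), 2):
        print(f"{name1} vs {name2}: r = {correlation(s1, s2):.2f}")
    # Consistently high correlations = stable scores across occasions.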
Time sampling error

• Advantage: easy to evaluate, using correlation.
• Disadvantage: carryover & practice effects.
How do we deal with sources of error?

• Error due to test items → domain sampling error
• Error due to testing occasions → time sampling error
• Error due to testing multiple traits → internal consistency error
Internal consistency approach

• Suppose a test includes both (1) items on social psychology and (2) items requiring mental rotation of abstract visual shapes.
• Would you expect much correlation between scores on the two parts?
   No – because the two ‘skills’ are unrelated.
Internal consistency approach

• A low correlation between scores on the 2 halves of a test suggests that the test is tapping two different abilities or traits.
• In such a case, the two halves of the test give information about two different, uncorrelated traits.
Internal consistency approach

• So we assess internal consistency by dividing the test into 2 halves and computing the correlation between scores on those two halves for the people who took the test.
• But how should we divide the test into halves to check the correlation?
Internal consistency approach

• Split-half method
• Kuder-Richardson formula
• Cronbach’s alpha
• All of these assess the extent to which items on a given test measure the same ability or trait.
Split-half Reliability

• After testing, divide the test items into halves A & B that are scored separately.
• Compute the correlation of results for A with results for B (sketched below).
• There are various ways of dividing the test into two – randomly, first half vs. second half, odd-even…
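
A sketch of an odd-even split on invented 0/1 item scores. One caveat, which the review table later flags as “shortened test”: each half is only half as long as the real test, so the half-test correlation understates the full test’s reliability. The standard Spearman-Brown correction, r_full = 2r / (1 + r), is not named on these slides but is the usual adjustment:

    from statistics import correlation  # Python 3.10+

    # Hypothetical scores (1 = correct, 0 = wrong): 5 people x 6 items.
    responses = [
        [1, 1, 0, 1, 1, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 1, 0, 0, 1, 0],
        [1, 0, 1, 1, 0, 1],
        [0, 0, 0, 1, 0, 0],
    ]

    half_a = [sum(person[0::2]) for person in responses]  # odd items
    half_b = [sum(person[1::2]) for person in responses]  # even items

    r = correlation(half_a, half_b)
    r_full = (2 * r) / (1 + r)  # Spearman-Brown: full-length reliability

    print(f"Half-test correlation:    {r:.2f}")
    print(f"Spearman-Brown corrected: {r_full:.2f}")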
Kuder-Richardson 20

• Kuder & Richardson (1937): an internal-consistency measure that doesn’t require arbitrary splitting of the test into 2 halves.
• KR-20 avoids the problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves (see the sketch below).
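
The standard KR-20 formula is (k / (k − 1)) × (1 − Σ pᵢqᵢ / σ²), where k is the number of items, pᵢ is the proportion of people passing item i, qᵢ = 1 − pᵢ, and σ² is the variance of the total scores. A minimal implementation over invented 0/1 data (population variance is used here; textbooks vary on this convention):

    from statistics import pvariance

    def kr20(responses):
        """KR-20 for dichotomously scored (0/1) items;
        responses = one list of item scores per person."""
        k = len(responses[0])                  # number of items
        n = len(responses)                     # number of people
        totals = [sum(person) for person in responses]
        var_total = pvariance(totals)          # variance of total scores
        pq_sum = 0.0
        for i in range(k):
            p = sum(person[i] for person in responses) / n  # item pass rate
            pq_sum += p * (1 - p)              # p * q for this item
        return (k / (k - 1)) * (1 - pq_sum / var_total)

    # Hypothetical data: 5 people x 6 dichotomous items.
    responses = [
        [1, 1, 0, 1, 1, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 1, 0, 0, 1, 0],
        [1, 0, 1, 1, 0, 1],
        [0, 0, 0, 1, 0, 0],
    ]
    print(f"KR-20 = {kr20(responses):.2f}")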
Internal Consistency – Cronbach’s α

• KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
• Cronbach’s α (alpha) generalizes KR-20 to tests with multiple response categories (sketched below).
• α is a more generally useful measure of internal consistency than KR-20.
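
A minimal Cronbach’s α sketch using the standard formula α = (k / (k − 1)) × (1 − Σ σᵢ² / σ²_total), on invented Likert-style ratings; with 0/1 items this reduces to KR-20:

    from statistics import pvariance

    def cronbach_alpha(responses):
        """Cronbach's alpha; items may use any numeric response scale."""
        k = len(responses[0])                          # number of items
        item_vars = [pvariance([person[i] for person in responses])
                     for i in range(k)]                # per-item variances
        total_var = pvariance([sum(person) for person in responses])
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    # Hypothetical data: 4 people rating 3 Likert items (1-5).
    responses = [
        [4, 5, 4],
        [2, 3, 3],
        [5, 5, 4],
        [1, 2, 2],
    ]
    print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")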
Review: How do we deal with sources of error?

Approach         Measures                             Issues
Test-Retest      Stability of scores                  Carryover
Parallel Forms   Equivalence & stability              Effort
Split-half       Equivalence & internal consistency   Shortened test
KR-20 & α        Equivalence & internal consistency   Difficult to calculate
Reliability in Observational Studies

• Some psychologists collect data by observing behavior rather than by testing.
• This approach requires time sampling, leading to sampling error.
• Further error is due to:
   observer failures
   inter-observer differences
Reliability in Observational Studies

• Deal with the possibility of failure in the single-observer situation by having more than 1 observer.
• Deal with inter-observer differences using:
   inter-rater reliability
   the kappa statistic (sketched below)
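
A minimal Cohen’s kappa sketch over invented behavior codes from two observers; κ = (p_o − p_e) / (1 − p_e) corrects the observed agreement p_o for the agreement p_e expected by chance:

    def cohens_kappa(ratings_1, ratings_2):
        """Cohen's kappa: inter-observer agreement corrected for chance."""
        n = len(ratings_1)
        categories = set(ratings_1) | set(ratings_2)
        # Observed proportion of agreement
        p_o = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
        # Chance agreement from each observer's marginal category rates
        p_e = sum((ratings_1.count(c) / n) * (ratings_2.count(c) / n)
                  for c in categories)
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical codes assigned by two observers to 10 behavior episodes.
    obs_1 = ["play", "fight", "play", "rest", "play",
             "fight", "rest", "play", "rest", "play"]
    obs_2 = ["play", "fight", "rest", "rest", "play",
             "fight", "rest", "play", "play", "play"]
    print(f"kappa = {cohens_kappa(obs_1, obs_2):.2f}")  # about 0.68 here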
Validity

• We distinguish between the validity of a measure of some psychological process or state and the validity of a conclusion.
• Here, we focus on the validity of measures.
• A subsequent lecture will consider the validity of conclusions.
[Flow diagram]

   Theory: A influences B
           ↓
   Prediction: A → B
           ↓
   Operationalization of A = a, B = b
           ↓
   Measurement of b

(We’ll look at the validity of the operationalization and measurement phases today; we’ll consider the validity of the theory and prediction phases in a few weeks.)
Validity

• A measure is valid if it measures what you think it measures.
• We traditionally distinguish between four types of validity:
   face
   content
   construct
   criterion
Four types of validity

• Face – the test appears to measure what it is supposed to measure.
   not formally recognized as a type of validity
Four types of validity

• Face
• Construct – the measure captures the theoretical construct it is supposed to measure.
Four types of validity

• Face
• Construct
• Content – the measure samples the range of behavior covered by the construct.
Four types of validity

•   Face                 • Results relate closely
•   Construct              to those produced by
•   Content                other measures of the
                           same construct.
•   Criterion
                         • Results do not relate
                           to those produced by
                           measures of other
                           constructs
Review (last week & this week)

• We’re not really interested in things that stay the same; we’re interested in variation.
• But only systematic variation, not random variation:
   systematic variation can be explained
   random variation can’t
Quick Review

• Some variation in performance is random and some is systematic.
• The scientist’s tasks are to separate the systematic variation from the random, and then to build models of the systematic variation.
Quick Review

• We choose a measurement scale – preferring ratio or interval scales when we can get them.
• We try to maximize both the reliability and the validity of our measurements using that scale.
Review questions

• Which would you expect to be easier to assess – reliability or validity?
• Why do we have tools and machines to measure some things for us (such as rulers, scales, and money)?
• What are some analogues for rulers and scales, used when we measure psychological constructs?