Reliability by wuyunyi



• In our last class, we began to discuss some of the
  ways in which we can assess the quality of our
  measurements.
• We discussed the concept of reliability (i.e., the
  degree to which measurements are free of random
  error).

  Why reliability alone is not enough

• Understanding the degree to which measurements
  are reliable, however, is not sufficient for evaluating
  their quality.
• In-class scale example
   – Recall that test-retest estimates of reliability tend
     to range between 0 (low reliability) and 1 (high
     reliability).
   – Note: An on-line correlation calculator is available

• In this example, the measurements appear reliable,
  but there is a problem . . .
• Validity reflects the degree to which measurements
  are free of both random error, E, and systematic
  error, S.
• O = T + E + S (observed score = true score + random
  error + systematic error)
• Systematic errors reflect the influence of any non-random
  factor beyond what we’re attempting to measure.

     Validity: Does systematic error accumulate?

• Question: If we sum or average multiple observations
  (i.e., using a multiple indicators approach), how will
  systematic errors influence our estimates of the “true”
  score?

   Validity: Does error accumulate?

• Answer: Unlike random errors, systematic errors do
  not average out.
• Systematic errors exert a constant source of
  influence on measurements. We will always
  overestimate (or underestimate) T if systematic error
  is present.
               O=   T    +   E   +   S
      Obs. 1   12   10       0       +2
      Obs. 2   12   10       0       +2
      Obs. 3   12   10       0       +2
      Obs. 4   12   10       0       +2
      Obs. 5   12   10       0       +2
      Obs. 6   12   10       0       +2
      Obs. 7   12   10       0       +2
     Average   12   10       0       +2

Note: Each measurement is 2 points higher than the
true value of 10. The errors do not average out.
                   O=   T    +   E    +   S
          Obs. 1   12   10        0       +2
          Obs. 2   11   10       -1       +2
          Obs. 3   12   10        0       +2
          Obs. 4   13   10       +1       +2
          Obs. 5   10   10       -2       +2
          Obs. 6   12   10        0       +2
          Obs. 7   14   10       +2       +2
         Average   12   10        0       +2

Note: Even when random error is present, E averages
to 0 but S does not. Thus, we have reliable measures
that have validity problems.
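
The two tables above can be reproduced in a few lines of Python. This is only a minimal sketch: the values of T, E, and S are taken from the second table, and the helper name `observed_scores` is an illustrative choice, not something from the lecture.

```python
from statistics import mean

def observed_scores(true_score, random_errors, systematic_error):
    """Return O = T + E + S for each observation."""
    return [true_score + e + systematic_error for e in random_errors]

T = 10                          # true score (from the tables)
S = 2                           # constant systematic error
E = [0, -1, 0, 1, -2, 0, 2]     # random errors from the second table

O = observed_scores(T, E, S)

print(O)            # [12, 11, 12, 13, 10, 12, 14]
print(mean(E))      # 0   (random error averages out)
print(mean(O))      # 12  (still 2 points above T)
print(mean(O) - T)  # 2   (the systematic error survives averaging)
```

Averaging more observations drives the mean of E toward 0, but the gap between the mean of O and T stays at S no matter how many observations are taken.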
         Validity: Ensuring validity

• What can we do to minimize the impact of systematic
  errors?
• One way to minimize their impact is to use a variety
  of indicators—different sources of information.
• Different kinds of indicators of a latent variable may
  not share the same systematic errors
• If true, then S will behave like random error across
  measurements (but not within measurements)

• As an example, let’s consider the measurement of
  self-esteem.
   – Some methods, such as self-report
     questionnaires, may lead people to over-estimate
     their self-esteem. Most people want to think
     highly of themselves.
   – Other methods, such as clinical ratings by
     trained observers, may lead to under-estimates of
     self-esteem. Clinicians, for example, may be
     prone to assume that people are not as well-off as
     they say they are.
                              O=   T    +   E    +   S
 Method 1: Self-report
                    Obs. 1    13   10       +1       +2
                    Obs. 2    12   10        0       +2
                    Obs. 3    12   10        0       +2
                    Obs. 4    11   10       -1       +2
 Method 2: Clinical ratings
                    Obs. 5    10   10       +2       -2
                    Obs. 6     8   10        0       -2
                    Obs. 7     8   10        0       -2
                    Obs. 8     6   10       -2       -2
                   Average    10   10        0        0

   Note: Method 1 systematically overestimates T whereas
   Method 2 systematically underestimates T. In
   combination, however, those systematic errors cancel
   out.
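
The multi-method table can be checked the same way. The sketch below uses the (E, S) pairs from the table; the variable and function names are illustrative assumptions.

```python
from statistics import mean

T = 10  # true score (from the table)

# (random error E, systematic error S) for each observation, from the table
self_report = [(+1, +2), (0, +2), (0, +2), (-1, +2)]   # Method 1
clinical    = [(+2, -2), (0, -2), (0, -2), (-2, -2)]   # Method 2

def observe(pairs, true_score=T):
    """Return O = T + E + S for each (E, S) pair."""
    return [true_score + e + s for e, s in pairs]

o_self = observe(self_report)   # [13, 12, 12, 11]
o_clin = observe(clinical)      # [10, 8, 8, 6]

print(mean(o_self))             # 12  (overestimates T)
print(mean(o_clin))             # 8   (underestimates T)
print(mean(o_self + o_clin))    # 10  (biases cancel across methods)
```

The cancellation is exact here only because the two biases are equal and opposite and each method contributes the same number of observations. With real data the biases would cancel only partially.
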
               Another example

• One problem with the use of self-report questionnaire
  rating scales is that some people tend to give high (or
  low) answers consistently (i.e., regardless of the
  question being asked).
• This is sometimes referred to as a “yea-saying” or
  “nay-saying” bias.
    1 = strongly disagree | 5 = strongly agree

Item                                      T    S      O
I think I am a worthwhile person.         4    +1     5
I have high self-esteem.                  4    +1     5
I am confident in my ability to meet
challenges in life.                       4    +1     5
My friends and family value me as a
person.                                   4    +1     5
 Average score:                           4    +1     5

In this example, we have someone with relatively high
self-esteem, but this person systematically rates
questions one point higher than he or she should.

    1 = strongly disagree | 5 = strongly agree

Item                                      T    S      O
I think I am a worthwhile person.         4    +1     5
I have high self-esteem.                  4    +1     5
I am NOT confident in my ability to
meet challenges in life.                  2    +1     3
My friends and family DO NOT value
me as a person.                           2    +1     3
 Average score:                           4    +1     4

If we “reverse key” half of the items, the bias averages
out. Responses to reverse keyed items are counted in the
opposite direction.

True scores:     (4 + 4 + [6-2] + [6-2]) / 4 = 4
Observed scores: (5 + 5 + [6-3] + [6-3]) / 4 = 4
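
The reverse-keying arithmetic can be written as a small scoring routine. This is a sketch under the assumptions of the example (a 1 to 5 scale, so a reversed response is 6 minus the raw response); the function names are illustrative.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 5

def reverse_key(response):
    """Flip a response on the 1-5 scale: 1<->5, 2<->4, 3 stays 3."""
    return SCALE_MIN + SCALE_MAX - response

def scale_score(responses, reversed_flags):
    """Average the items after rescoring the reverse keyed ones."""
    scored = [reverse_key(r) if rev else r
              for r, rev in zip(responses, reversed_flags)]
    return mean(scored)

reversed_flags = [False, False, True, True]   # last two items are reverse keyed
true_responses = [4, 4, 2, 2]                 # T column
observed_responses = [5, 5, 3, 3]             # O column (T plus a +1 bias)

print(scale_score(true_responses, reversed_flags))      # 4
print(scale_score(observed_responses, reversed_flags))  # 4

# Without reverse keyed items, the same +1 bias inflates the score:
print(scale_score([5, 5, 5, 5], [False] * 4))           # 5
```

The bias cancels exactly because half of the items are reversed. With an uneven split, part of the bias would remain in the average.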

• To the extent to which a measure has validity, we say
  that it measures what it is supposed to measure
• Question: How do you assess validity?

                      ** Very tough question to answer! **
 Different ways to think about validity

• To the extent that a measure has validity, we can say
  that it measures what it is supposed to measure.

• There are different reasons for measuring
  psychological variables. The precise way in which
  we assess validity depends on the reason that we’re
  taking the measurements in the first place.

• As an example, if one’s goal is to develop a way to
  determine who is at risk for developing
  schizophrenia, one’s goal is prediction.
              Predictive Validity

• We may begin by obtaining a group of people who
  have schizophrenia and a group of people who do
  not.
• Then, we may try to figure out which kinds of
  antecedent variables differentiate the two groups.
Antecedent variable                  Correct classifications

Lost a parent before the age of 10            10%

Parent or grandparent had                     50%
schizophrenia

Mother was cold and aloof to the              15%
person when he or she was a child
               Predictive Validity

• In short, some of these variables appear to be better
  than others at discriminating schizophrenics from
  non-schizophrenics.
• The degree to which a measure can predict what it is
  supposed to predict is called its predictive validity.
• When we are taking measurements for the purpose
  of prediction, we assess validity as the degree to
  which those predictions are accurate or useful.
                               Reality: Schizophrenic
                               No                Yes
Measure: Schizophrenic
                      No       40                10

                      Yes      10                40

  80% ( [40 + 40] / 100) of people were correctly classified
  (50% base rate)

                               Reality: Schizophrenic
                               No                Yes
Measure: Schizophrenic
                      No       10                10

                      Yes      40                40

  50% ( [10 + 40] / 100) of people were correctly classified
  (with a 50% base rate. Yuck.)

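                                       Reality: Schizophrenic
                                       No                   Yes
        Measure: Schizophrenic
                              No       98                   0

                              Yes      1                    1

99% ( [98 + 1] / 100) of people were correctly classified, but note the base rate problem:
only 1 person in 100 is actually schizophrenic, so classifying everyone as "No" would also be 99% accurate.
Cohen’s kappa is used to account for this problem. Kappa in this example is about .66.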
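
The accuracy and kappa figures for the three classification tables can be recomputed with the short sketch below. It assumes each table is laid out with rows = measure (No, Yes) and columns = reality (No, Yes); the row order is inferred from the stated accuracies rather than given in the slides, and the function name is illustrative. Kappa is computed as (observed agreement - chance agreement) / (1 - chance agreement), with chance agreement taken from the row and column totals.

```python
def accuracy_and_kappa(table):
    """table[i][j] = count with measure = i and reality = j (0 = No, 1 = Yes)."""
    n = sum(sum(row) for row in table)
    # Proportion of cases on the diagonal (measure agrees with reality)
    p_observed = (table[0][0] + table[1][1]) / n
    # Agreement expected by chance, given the marginal totals
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    p_chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa

tables = {
    "first":  [[40, 10], [10, 40]],
    "second": [[10, 10], [40, 40]],
    "third":  [[98, 0], [1, 1]],
}

for name, table in tables.items():
    acc, kappa = accuracy_and_kappa(table)
    print(f"{name}: accuracy = {acc:.2f}, kappa = {kappa:.2f}")
# first: accuracy = 0.80, kappa = 0.60
# second: accuracy = 0.50, kappa = 0.00
# third: accuracy = 0.99, kappa = 0.66
```

In the third table the raw accuracy is the highest of the three, yet kappa is only moderately above that of the first table, because most of the 99% agreement is already expected by chance when 99 of 100 people are not schizophrenic.
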
              Construct Validity

• Sometimes we’re not interested in measuring
  something just for “technological” purposes, such as
  prediction.
• We may be interested in measuring a construct in
  order to learn more about it
   – Example: We may be interested in measuring self-
     esteem not because we want to predict something
     with the measure per se, but because we want to
     know how self-esteem develops, whether it
     develops differently for males and females, etc.
              Construct Validity

• Notice that this is much different than what we were
  discussing before. In our schizophrenia example, it
  doesn’t matter whether our measure of schizophrenia
  really measured schizophrenic tendencies per se.
• As long as the measure helps us predict
  schizophrenia well, we don’t really care what it
  measures.
              Construct Validity

• When we are interested in the theoretical construct
  per se, however, the issue of exactly what is being
  measured becomes much more important.
• The general strategy for assessing construct
  validity involves (a) explicating the theoretical
  relations among relevant variables and (b) examining
  the degree to which the measure of the construct
  relates to things that it should and fails to relate to
  things that it should not.
            Nomological Network

• The nomological network represents the interrelations
  among variables involving the construct of interest.

  [Figure: nomological network in which self-esteem is
  positively (+) related to achievement in school and to
  ability to cope.]

    Nomological Network & Validity

• The process of assessing construct validity basically
  involves determining the degree to which our
  measure of the construct behaves in the way
  assumed by the theoretical network in which it is
  embedded.
• If, theoretically, people with high self-esteem should
  be more likely to succeed in school, then our
  measure of self-esteem should be able to predict
  people’s grades in school.
              Construct Validity

• Notice here that establishing construct validity
  involves prediction. The difference between
  prediction in this context and prediction in the
  previous context is that we are no longer trying to
  predict school performance as best as we possibly
  can.
• Our measure of self-esteem should only predict
  performance to the degree to which we would expect
  these two variables to be related theoretically.
             Discriminant Validity

• The measure should also fail to be related to
  variables that, theoretically, are unrelated to
  self-esteem.
• The ability of a measure to fail to predict irrelevant
  variables is referred to as the measure’s discriminant
  validity.

  [Figure: nomological network in which self-esteem is
  positively (+) related to achievement in school and to
  ability to cope, negatively (-) related to distrust, and
  unrelated to liking coffee.]

        Validity: Assessing validity

• Finally, it is useful, but not necessary, for a measure
  to have face validity.
• Face validity: The degree to which a measure
  appears to be measuring what it is supposed to
  measure.
• A questionnaire item designed to measure self-esteem
  that reads “I have high self-esteem” has face
  validity. An item that reads “I like cabbage in my
  Frosted Flakes” does not.
• In the context of prediction, face validity doesn’t
  matter. In the context of construct validity, it matters
  A Final Note on Construct Validity

• The process of establishing construct validity is one
  of the primary enterprises of psychological research.
• When we are measuring the association between two
  variables to assess a measure’s predictive or
  discriminant validity, we are evaluating both (a) the
  quality of the measure and (b) the soundness of the
  nomological network.
• It is not unusual for researchers to refine the
  nomological network as they learn more about how
  various measures are inter-related.
