Validity and reliability of scales

Lecture 3: Reliability and validity of scales
• Reliability:
  –   internal consistency
  –   test-retest
  –   inter- and intra-rater
  –   alternate form
• Validity:
  – content, criterion, and construct validity
  – responsiveness
            Multi-item scales
• Measure constructs without a gold standard
  – e.g., depression, satisfaction, quality of life
• Items are intended to sample the content of
  the underlying construct
• Items summarized in various ways:
  – sum or average of responses to individual items
  – item weighting or other algorithm
  – profiles/sub-scale scores
Example: Reliability and validity of
 a measure of severity of delirium
  Source: McCusker et al, Internat Psychogeriatrics 1998; 10:421-33

• Delirium - acute confusion
• Common in older hospitalized patients
• Diagnosis of delirium is based on the
  following symptoms:
  –   acute onset, fluctuations
  –   inattention, disorganized thinking
  –   altered consciousness, disorientation
  –   memory impairment, perceptual disturbances
  –   psychomotor agitation or retardation
    Requirements of new scale

• Administered by interviewer at bedside
• Not using patient chart (to maintain
  blinding)
• Brief (avoid patient burden)
• Responsive to within-patient changes over
  time


           Delirium Index (DI)
• Assesses severity of 7 symptoms of
  delirium (excl. acute onset, fluctuations,
  sleep disorder):
  –   inattention, disorganized thinking
  –   altered consciousness, disorientation
  –   memory impairment, perceptual disturbances
  –   psychomotor agitation or retardation

    Administration and scoring
• Administered in conjunction with first 5
  questions of Mini-Mental State Exam
  (MMSE)
• Each symptom rated on 4-point scale:
     0 = absent
     1 = mild
     2 = moderate
     3 = severe
• Operational definition of each symptom
                   Scoring
• Score is sum of 7 item scores
• Scoring of symptoms that could not be
  assessed:
  – patient non-responsive: items 1, 2, 4, 5 coded as
    “severe”
  – coding instructions provided for items 3, 6, 7
  – patient refuses: scores for items 1, 2, 4, 5
    replaced by score of item 3
                 Reliability
• Internal consistency
• Test-retest reliability
• Inter-rater and intra-rater reliability




          Internal consistency
• Relevant to additive scales (that sum or
  average items)
• Split-half reliability:
  – correlation between scores on arbitrary half of
    measure with scores on other half
• Coefficient alpha (Cronbach)
  – equivalent to the average split-half reliability
    over all possible ways of dividing the scale
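
As a rough illustration, coefficient alpha is usually computed from the item variances and the variance of the total score rather than by enumerating split halves. The sketch below uses that standard variance formula; the function name and the toy data are made up for the example.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (same items in each row)."""
    k = len(rows[0])                                   # number of items
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])   # variance of sum score
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Items that move together perfectly give the maximum alpha.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
# Two items that agree only partly give a lower value.
partly_consistent = [[0, 1], [1, 0], [2, 3], [3, 2]]

alpha_high = cronbach_alpha(consistent)
alpha_lower = cronbach_alpha(partly_consistent)
```

Because alpha depends on how much the items co-vary, it tends to fall when the total score varies little across respondents.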
      Internal consistency of DI
• Cronbach’s alpha (overall) = 0.74
• After exclusion of perceptual disturbance:
  0.82
• In sub-groups of patients:
  –   delirium and dementia:   0.69, 0.79
  –   delirium alone:          0.67, 0.79
  –   dementia alone:          0.55, 0.59
  –   neither                  0.44, 0.52
 Test-retest reliability (stability)
• Scale is repeated
  – short-term
     • for constructs that fluctuate, 2 weeks often used to
       reduce effects of memory and true change
  – long-term
     • for constructs that should not fluctuate (e.g.,
       personality traits)
• Correlation between 2 scores is computed
• Also important to look at systematic
  increase or decrease in score
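
A small sketch of why both quantities matter: in the invented data below every subject scores 2 points higher at retest, so the correlation is perfect even though the scores have shifted systematically. The function names and the data are illustrative only.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def retest_summary(first, second):
    """Correlation between the two administrations plus the mean change."""
    diffs = [b - a for a, b in zip(first, second)]
    return {"r": pearson_r(first, second), "mean_change": mean(diffs)}


first = [4, 7, 10, 12, 15]
second = [6, 9, 12, 14, 17]   # same ordering, uniformly 2 points higher
summary = retest_summary(first, second)
```

Here `summary["r"]` is 1 while `summary["mean_change"]` is 2, so the correlation alone would hide the systematic increase.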
    Test-retest reliability of DI
• Delirium is marked by fluctuations
• Variability over time is expected




 Mean within-patient standard deviation
 in DI score during 1st week in hospital
 [Figure: mean within-patient SD of DI score (y-axis 0 to 3) by group:
 Delirium+dementia (n=157), Delirium (n=57), Dementia (n=55),
 Neither (n=41)]



  Inter- and intra-rater reliability
Inter-rater reliability
• For scales requiring rater skill, judgment
• 2 or more independent raters of same event
Intra-rater reliability
• Independent rating by same observer of
  same event

Measures of inter- and intra-rater
  reliability: categorical data
• Percent agreement
  – can be used for di- and polychotomous scales
  – limitation: value is affected by prevalence -
    higher if very low or very high prevalence
• Kappa statistic
  – takes chance agreement into account
  – defined as fraction of observed agreement not
    due to chance
              Kappa statistic
Kappa = [p(obs) - p(exp)] / [1 - p(exp)]

p(obs): proportion of observed agreement
p(exp): proportion of agreement expected by chance




                   Example of Computation of Kappa


Agreement between the First and the Second Readings to Identify Atherosclerosis Plaque
in the Left Carotid Bifurcation by B-Mode Ultrasound Examination in the
Atherosclerosis Risk in Communities (ARIC) Study


                                 First reading
Second reading      Plaque       Normal       Total
Plaque                 140           52         192
Normal                  69          725         794
Total                  209          777         986


Observed agreement = (140 + 725)/986 = 0.877

Chance agreement for plaque-plaque cell = (209 x 192)/986 = 40.7

Chance agreement for normal-normal cell = (777 x 794)/986 = 625.7

Total chance agreement = (40.7 + 625.7)/986 = 0.676

Kappa = (0.877 – 0.676)/(1 – 0.676) = 0.62
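
The same computation can be checked with a short Python sketch. The helper name `cohen_kappa` is hypothetical; the counts are the ARIC readings from the table above, with rows for the second reading and columns for the first.

```python
def cohen_kappa(table):
    """table[i][j]: count with second reading i and first reading j.

    Returns (observed agreement, chance agreement, kappa).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    p_obs = sum(table[i][i] for i in range(k)) / n
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Expected agreement: product of the marginal proportions, summed over
    # the diagonal cells.
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return p_obs, p_exp, kappa


aric = [[140, 52],    # second reading: plaque
        [69, 725]]    # second reading: normal
p_obs, p_exp, kappa = cohen_kappa(aric)
```

Rounded, the three values match the hand calculation: 0.877, 0.676, and 0.62.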

       Interpretation of kappa
• Various suggested interpretations
• Example: Fleiss (1981)
    excellent:    0.75 and above
    fair to good: 0.40 - 0.74
    poor:         less than 0.40
• Limitations
  – depends on prevalence (see Szklo & Nieto)
  – do not use as only measure of agreement
Measures of inter- and intra-rater
  reliability: continuous data
• Measures of correlation
  – Correlation graph (scatter diagram)
  – Correlation coefficients
• Measures of pairwise comparison




       Correlation coefficients
• Pearson’s r
  – assesses linear association, not systematic
    differences between 2 sets of observations
  – sensitive to range of values, especially outliers
• Spearman r
  – ordinal or rank order correlation
  – less influenced by outliers
  – doesn’t assess systematic differences
       Correlation coefficients
• Intra-class correlation coefficient (ICC)
  – Estimates the proportion of total measurement
    variability that is due to differences between
    individuals (vs error variance)
  – Equivalent to kappa for continuous variables,
    with the same range of values
  – Reflects true agreement, including systematic
    differences
  – Affected by range of values - if less variation
    between individuals, ICC will be lower
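
A minimal sketch of these properties, assuming a one-way random-effects ICC for single ratings (the slides do not say which ICC form was used). The function name and all ratings are invented for illustration.

```python
from statistics import mean


def icc_oneway(ratings):
    """One-way random-effects ICC for single ratings.

    ratings: one list per subject, each holding the same number k of ratings.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    subj_means = [mean(row) for row in ratings]
    # Between-subject and within-subject mean squares from one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((v - m) ** 2
              for row, m in zip(ratings, subj_means)
              for v in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


identical = [[2, 2], [5, 5], [8, 8], [11, 11]]   # raters agree exactly
shifted = [[2, 5], [5, 8], [8, 11], [11, 14]]    # rater 2 always 3 higher
wide = [[2, 3], [6, 5], [10, 11], [14, 13]]      # 1-point disagreements
narrow = [[5, 6], [7, 6], [6, 7], [8, 7]]        # same disagreements,
                                                 # subjects more alike

icc_identical = icc_oneway(identical)
icc_shifted = icc_oneway(shifted)
icc_wide = icc_oneway(wide)
icc_narrow = icc_oneway(narrow)
```

The shifted ratings are perfectly linearly related, yet their ICC is about 0.74 because the systematic difference counts as disagreement. The narrow-range data give a much lower ICC (about 0.45) than the wide-range data (about 0.98) although the rater disagreements are the same size.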
     Inter-rater reliability of DI
• Intraclass correlation coefficient (ICC):

    n = 26 patients (39 pairs of ratings)

    ICC = 0.98 (SD 0.06)




     Alternate form reliability
• Agreement between alternate forms of same
  instrument:
  – longer vs shorter version
  – alternate method of administration:
     • face-to-face vs telephone
     • subject vs proxy (see Magaziner paper)




                 Validity
• Content and face validity
• Criterion validity: concurrent and predictive
• Construct validity




                   Validity
• Depends on purpose:
  – screening: discrimination
  – outcome of treatment: responsiveness, sensitivity to
    change
  – prognosis: predictive validity




     Content and face validity
• Judgment of “experts” and/or members of
  target population
• Does measure adequately sample domain
  being measured?
• Does it appear to measure what it is
  intended to measure? (eyeball test)


       Content validity of DI
• Based on Confusion Assessment Method
  (CAM)
  – based on accepted diagnostic criteria (DSM)
  – widely used




            Criterion validity
• Criterion (“gold” standard)
• Concurrent criterion validity
  – e.g., screening test vs diagnostic test
• Predictive criterion validity
  – e.g., cancer staging test vs 5-year survival




       Criterion validity of DI
• Correlation between psychiatrist-scored DI
  (based only on patient observation) and
  Delirium Rating Scale (using all available
  information)
  – original scale
  – adjusted scale, omitting 4 items not assessed by
    DI


 Criterion validity of DI: results
• Spearman correlation coefficient (and 95%
  CI) between DI and adjusted DRS (using
  multiple observations):
  – at one point in time:              0.84 (0.75, 0.89)
  – within-subject change over time:   0.71 (0.53, 0.82)



  Delirium severity and survival
• Proportional hazards regression of delirium
  severity in delirium cohort
• Mean of 1st 2 DI scores
• Results
  – significant interaction: DI predicted survival in
    patients with delirium alone, not in those with
    dementia

           Construct validity
• Is the theoretical construct underlying the
  measure valid?
• Development and testing of hypotheses
• Requires multiple data sources and
  investigations:
  – Convergent validity: measure is correlated with
    other measures of similar constructs
  – Discriminant validity: measure is not correlated
    with measures of different constructs
      Construct validity (cont)
• Multitrait-multimethod:
  – Convergent validity: measure is correlated with
    other measures of similar constructs
  – Discriminant validity: measure is not correlated
    with measures of different constructs
• Factorial method:
  – factor analysis or principal components analysis
    to identify underlying dimensions
Spearman correlation coefficients between Delirium
  Index and 3 baseline measures of current status

  [Figure: Spearman correlations (y-axis 0 to 0.5) of the DI with the
  Barthel Index, clinical severity, and physiological severity, shown
  for Delirium+Dementia (n=165) and Delirium (n=57)]

Spearman correlation coefficients between Delirium
   Index and 3 baseline measures of prior status

  [Figure: Spearman correlations (y-axis 0 to 0.6) of the DI with
  IADL, IQCODE, and comorbidity, shown for Delirium+dementia (n=165)
  and Delirium (n=57)]

   Responsiveness of measures
• Ability to detect clinically important change
  over time or differences between treatments
• Requirement of evaluative measures




  Some sources of bias in scales
• “Response sets”
  – Social desirability
  – Acquiescence




              Social desirability
• Tendency to give answers to questions that are
  perceived to be more socially desirable than the
  true answer
• Different from deliberate distortion (“faking
  good”)
• Depends on:
   – Individual characteristics (age, sex, cultural
     background)
   – Specific question
             Social desirability
• Measures of social desirability (SD)
   – SD scales (e.g., Jackson SD scale, Crowne & Marlowe
     SD scale)
   – individual tendency to SD bias
• Prevention
   – phrasing of questions
   – questionnaire mode
   – training of interviewers


     Acquiescent response set
• Tendency to agree with Likert-type
  questions
• Can be prevented by mix of positively and
  negatively-phrased questions, e.g.:
  – My health care is just about perfect
  – There are serious problems with my health care
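
One way to see why mixing item directions helps is to reverse-score the negatively phrased items before summing. The sketch below assumes a 5-point agreement scale (1 = strongly disagree, 5 = strongly agree), which the slide does not specify; the function names are hypothetical.

```python
def reverse_score(response, low=1, high=5):
    """Flip a response on a low..high scale (5 becomes 1, 4 becomes 2, ...)."""
    return low + high - response


def score_balanced_scale(responses, negative_items, low=1, high=5):
    """Sum responses after reverse-scoring the negatively phrased items.

    negative_items: zero-based positions of negatively phrased questions.
    """
    return sum(
        reverse_score(r, low, high) if i in negative_items else r
        for i, r in enumerate(responses)
    )


# Item 0: "My health care is just about perfect" (positive phrasing)
# Item 1: "There are serious problems with my health care" (negative phrasing)
yea_sayer = score_balanced_scale([5, 5], negative_items={1})
satisfied = score_balanced_scale([5, 1], negative_items={1})
```

A respondent who agrees strongly with both statements lands at the midpoint of the 2 to 10 range, so indiscriminate agreement no longer inflates the score, while a consistently satisfied respondent still reaches the maximum.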



 Measurement of Quality of life
           (QoL)
• Definition
  – “individuals’ perception of their position in life in the
    context of the culture and value systems in which they
    live and in relation to their goals, expectations,
    standards, and concerns” (WHOQOL Group, 1995)
• Domains
  – physical, psychological, level of independence, social
    relationships, environment, and
    spirituality/religion/personal beliefs

   Health-related quality of life
            (HRQoL)
• Dimensions of QoL related to health
• Related terms:
  – health status
  – functional status
• Usually includes:
  – physical health/function
  – mental health/function
  – social health/function
 Evaluative HRQoL instruments
• Purpose
  – evaluate within-individual change over time
• Reliability:
  – responsiveness
• Construct validity:
  – correlations of changes in measures during
    period of time, consistent with theoretically
    derived predictions
       Discriminative HRQoL
             instruments
• Purpose
  – evaluate differences between individuals at
    point in time
• Reliability:
  – reproducibility
• Construct validity:
  – correlations between measures at point in time,
    consistent with theoretically derived predictions
    How is HRQoL measured?
• Mode
  – Interviewer
     • face-to-face
     • telephone
  – Self-completed
• Completed by
  – self
  – proxy/surrogate
    Types of HRQoL measures
• Generic (global)
  – Health profiles
  – Utility measures
• Specific




            Generic vs specific
• Generic
  – comparisons across populations and problems
  – robust and generalizable
  – measurement properties better understood
• Disease-specific
  – shorter
  – more relevant and appropriate
  – sensitive to change
             Appropriateness
• Purpose:
  – describe health of population
  – evaluate effects of interventions (change over
    time)
  – compare groups at point in time
  – predict outcomes
• Areas of function covered
• Level of health
• Generic/global or specific

				