					A bunch of stuff
					 you need to know

           Becky and Danny
Why you need to counterbalance:
To avoid order effects – some items may
 influence responses to other items
To avoid fatigue effects – subjects get tired and
 performance on later items suffers
To avoid practice effects – subjects learn how
 to do the task and performance on later items
 improves
2 item counterbalance:

                 Subject 1   Subject 2

   First Item        A           B

   Second Item       B           A
3 item counterbalance:

          Sub 1  Sub 2  Sub 3  Sub 4  Sub 5  Sub 6
    1st     A      B      C      A      B      C
    2nd     B      C      A      C      A      B
    3rd     C      A      B      B      C      A
4 item counterbalance:

             Sub 1  Sub 2  Sub 3  Sub 4
    1st Item   A      C      B      D
    2nd Item   B      A      D      C
    3rd Item   C      D      A      B
    4th Item   D      B      C      A
X > 4 item counterbalance:
1) Create a simple Latin Square
2) Randomize the rows
3) Randomize the columns
X >>>> 4 item counterbalance:
           Randomize items
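The three Latin-square steps above can be sketched in a few lines of Python (a hypothetical helper, not from the slides; the cyclic-square construction is one simple choice):

```python
import random

def latin_square(n):
    # Simple cyclic Latin square: row i presents the items shifted by i.
    return [[(i + j) % n for j in range(n)] for i in range(n)]

def counterbalance(n, seed=None):
    # Steps from the slide: 1) build a Latin square,
    # 2) randomize the rows, 3) randomize the columns.
    rng = random.Random(seed)
    square = latin_square(n)
    rng.shuffle(square)                     # 2) randomize the rows
    cols = list(range(n))
    rng.shuffle(cols)                       # 3) randomize the columns
    return [[row[c] for c in cols] for row in square]

orders = counterbalance(5, seed=42)
# Each row is one subject's presentation order; every item still
# appears exactly once in every row and every column (position).
```

Shuffling rows and columns preserves the Latin-square property, so each item still occupies each serial position equally often across subjects.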
Simpson’s Paradox
        Total   Total
        Admit   Deny

 Men     19      13

Women    13      19
          Simpson’s Paradox
          Group 1      Group 2      Total
        Admit  Deny  Admit  Deny  Admit  Deny

 Men     18     7      1      6    19     13

Women     7     1      6     18    13     19
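The reversal in the tables above can be checked directly from the counts; a minimal sketch:

```python
# (admit, deny) counts from the table above, split by subgroup.
men   = {"group1": (18, 7), "group2": (1, 6)}
women = {"group1": (7, 1),  "group2": (6, 18)}

def rate(admit, deny):
    # Admission rate = admits / total applicants.
    return admit / (admit + deny)

# Within each subgroup, women are admitted at the higher rate...
in_group1 = rate(*women["group1"]) > rate(*men["group1"])   # 0.875 > 0.72
in_group2 = rate(*women["group2"]) > rate(*men["group2"])   # 0.25  > 0.14

# ...but pooling the subgroups reverses the direction.
men_total   = (18 + 1, 7 + 6)     # (19, 13)
women_total = (7 + 6, 1 + 18)     # (13, 19)
pooled = rate(*men_total) > rate(*women_total)              # 0.59 > 0.41
```

The reversal happens because most men applied to the easy-to-enter group and most women to the hard one, so the pooled totals mix unequal base rates.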
Definitions of interactions:
The whole is greater than the sum of its parts
The relationship between the variables is
multiplicative instead of additive
The effectiveness of one intervention is
contingent upon another intervention
Why are interactions important?
1) Null effects can’t get published; a
   significant interaction can solve that
2) Interactions are usually more interesting
   than main effects
3) Like Simpson’s paradox, interactions can
   mask an effect

            No      Yes

 No          0        3

 Yes         5      -20


            No      Yes

 No          0       10
 Yes         5      100


           Male   Female
            10      -10
 Ya Ya
           -10       10
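One way to quantify what these 2×2 tables show: the interaction contrast (a difference of differences) is zero when the two factors combine additively. A sketch using the crossover table above (the second, additive table here is a hypothetical comparison, not from the slides):

```python
def interaction_contrast(cells):
    # (top-left - top-right) - (bottom-left - bottom-right):
    # zero means the two factors combine additively (no interaction).
    (a, b), (c, d) = cells
    return (a - b) - (c - d)

crossover = [(10, -10), (-10, 10)]    # crossover table from the slide
additive  = [(0, 10), (5, 15)]        # hypothetical purely additive table

interaction_contrast(crossover)       # 40: strong crossover interaction
interaction_contrast(additive)        # 0: effects are additive
```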
          Construct Validity
• Is the translation from concept to
  operationalization accurately representing
  the underlying concept?
• Does it measure what you think it measures?
• This is more familiarly called construct
  validity.
   Types of Construct Validity
• Translation validity (Trochim’s term)
  – Face validity
  – Content validity
• Criterion-related validity
  – Predictive validity
  – Concurrent validity
  – Convergent validity
  – Discriminant validity
       Translation validity
• Is the operationalization a good
  reflection of the construct?
• This approach is definitional in nature
  – assumes you have a good detailed
    definition of the construct
  – and you can check the operationalization
    against it.
              Face Validity
• “On its face,” does it seem like a good
  translation of the construct?
  – Weak version: If you read it, does it appear to
    ask questions directed at the concept?
  – Strong version: If experts in that domain
    assess it, they conclude it measures that
    construct.
          Content Validity
• Check the operationalization against the
  relevant content domain for the construct.
• Assumes that a well-defined concept is
  being operationalized, which may not be
  the case.
• For example, a depression measure
  should cover the checklist of depression
  symptoms.
    Criterion-Related Validity
• Check the performance of the operationalization
  against some criterion.
• Content validity differs in that its criterion is
  the construct definition itself -- it is a direct
  comparison.
• In criterion-related validity, a prediction is
  made about how the operationalization will
  perform based on our theory of the construct.
        Predictive Validity
• Assess the operationalization's ability to
  predict something it should theoretically
  be able to predict.
  – A high correlation would provide evidence
    for predictive validity -- it would show that
    our measure can correctly predict
    something that we theoretically think it
    should be able to predict.
         Concurrent Validity
• Assess the operationalization's ability to
  distinguish between groups that it should
  theoretically be able to distinguish between.
• As in any discriminating test, the results are
  more powerful if you are able to show that
  you can discriminate between two groups
  that are very similar.
          Convergent Validity
• Examine the degree to which the
  operationalization is similar to (converges on)
  other operationalizations that it theoretically
  should be similar to.
   – To show the convergent validity of a test of arithmetic
     skills, one might correlate the scores on a test with
     scores on other tests that purport to measure basic math
     ability, where high correlations would be evidence of
     convergent validity.
        Discriminant Validity
• Examine the degree to which the
  operationalization is not similar to (diverges
  from) other operationalizations that it
  theoretically should be not be similar to.
  – To show the discriminant validity of a test of
    arithmetic skills, we might correlate the scores
    on a test with scores on tests of verbal
    ability, where low correlations would be
    evidence of discriminant validity.
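Both patterns reduce to correlations between scores. A sketch with made-up scores for six subjects (the data and cutoffs are illustrative, not from the slides):

```python
from math import sqrt

def pearson(xs, ys):
    # Pearson product-moment correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for six subjects.
arithmetic_new = [3, 5, 2, 8, 7, 4]   # the arithmetic test being validated
arithmetic_old = [4, 6, 2, 9, 6, 5]   # an established math test
verbal         = [7, 2, 6, 3, 8, 2]   # a verbal-ability test

r_convergent   = pearson(arithmetic_new, arithmetic_old)  # high
r_discriminant = pearson(arithmetic_new, verbal)          # near zero
```

A high correlation with the established math test is convergent evidence; a low correlation with the verbal test is discriminant evidence.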
       Threats to Construct Validity
•   From the discussion in Cook and Campbell (Cook, T.D. and Campbell, D.T., Quasi-
    Experimentation: Design and Analysis Issues for Field Settings).

• Inadequate Preoperational Explication of
  Constructs
• Mono-Operation Bias
• Mono-Method Bias
• Interaction of Different Treatments
• Interaction of Testing and Treatment
• Restricted Generalizability Across
  Constructs
• Confounding Constructs and Levels of
  Constructs
 Inadequate Preoperational
   Explication of Constructs
• You didn't do a good enough job of defining
  (operationally) what you mean by the
  construct.
• Avoid by:
  – Thinking through the concepts better
  – Using methods (e.g., concept mapping) to
    articulate your concepts
  – Getting “experts” to critique your
    operationalizations
       Mono-Operation Bias
• Pertains to the independent variable, cause,
  program or treatment in your study not to
  measures or outcomes.
• If you only use a single version of a program
  in a single place at a single point in time, you
  may not be capturing the full breadth of the
  concept of the program.
• Solution: try to implement multiple versions
  of your program.
          Mono-Method Bias
• Refers to your measures or observations.
• With only a single version of a self esteem
  measure, you can't provide much evidence that
  you're really measuring self esteem.
• Solution: try to implement multiple measures of
  key constructs and try to demonstrate (perhaps
  through a pilot or side study) that the measures
  you use behave as you theoretically expect them to.
     Interaction of Different
          Treatments
• Changes in the behaviors of interest may
  not be due to the experimental manipulation,
  but to an interaction of the experimental
  manipulation with other treatments.
    Interaction of Testing and
          Treatment
• Testing or measurement itself may make the
  groups more sensitive or receptive to treatment.
• If it does, then the testing is in effect a part of the
  treatment; it's inseparable from the effect of the
  treatment.
• This is a labeling issue (and, hence, a concern of
  construct validity) because you want to use the
  label “treatment" to refer to the treatment alone,
  but in fact it includes the testing.
   Restricted Generalizability
       Across Constructs
• The "unintended consequences" threat to
  construct validity.
• You do a study and conclude that Treatment
  X is effective. In fact, Treatment X does
  cause a reduction in symptoms, but what you
  failed to anticipate was the drastic negative
  consequences of its side effects.
• When you say that Treatment X is effective,
  you have defined "effective" as only the
  directly targeted symptom.
 Confounding Constructs and
     Levels of Constructs
• If your manipulation does not work, it may
  not be the case that it does not work at all,
  but only at that level.
• For example, peer pressure may not work if
  only 2 people are applying pressure, but
  may work fine if 4 people are applying it.
    The "Social" Threats to
      Construct Validity
• Hypothesis Guessing
• Evaluation Apprehension
• Experimenter Expectancies
        Hypothesis Guessing
• Participants may try to figure out what the study is
  about. They "guess" at what the real purpose of
  the study is.
• They are likely to base their behavior on what they
  guess, not just on your manipulation.
• If change in the DV could be due to how they
  think they are supposed to behave, then the change
  cannot be completely attributed to the treatment.
• It is this labeling issue that makes this a construct
  validity threat.
    Evaluation Apprehension
• Some people may be anxious about being
  evaluated and consequently perform poorly.
• Or, because of wanting to look good (“social
  desirability”), they may try to perform better (e.g.,
  unusual prosocial behavior).
• In both cases, the apprehension becomes
  confounded with the treatment itself and you have
  to be careful about how you label the outcomes.
   Experimenter Expectancies
• The researcher can bias the results of a study in
  countless ways, both consciously and
  unconsciously.
• Sometimes the researcher can communicate what
  the desired outcome for a study might be (and
  participants' desire to "look good" leads them to
  react that way).
• The researcher might look pleased when
  participants give a desired answer.
• If this is what causes the response, it would be
  wrong to label the response as a manipulation
  effect.
              Reliability
• Means "repeatability" or "consistency".
• A measure is considered reliable if it would
  give us the same result over and over again
  (assuming that what we are measuring isn't
  changing).
• There are four general classes of reliability
  estimates, each of which estimates
  reliability in a different way.
         Reliability (continued)
•   Inter-Rater or Inter-Observer Reliability
•   Test-Retest Reliability
•   Parallel-Forms Reliability
•   Internal Consistency Reliability
 Inter-Rater or Inter-Observer
          Reliability
• Used to assess the degree to which different
  raters/observers give consistent estimates of
  the same phenomenon.
• Establish reliability on pilot data or a
  subsample of data and retest often
• For categorical data a χ² can be used, and for
  continuous data a correlation (r) can be calculated.
      Test-Retest Reliability
• Used to assess the consistency of a measure from
  one time to another.
• This approach assumes that there is no substantial
  change in the construct being measured between
  the two occasions.
• The amount of time allowed between measures is
  critical.
• The shorter the time gap, the higher the
  correlation; the longer the time gap, the lower the
  correlation.
    Parallel-Forms Reliability
• Used to assess the consistency of the results of two tests
  constructed in the same way from the same content
  domain.
• Create a large set of questions that address the same
  construct, then randomly divide the questions into two
  sets and administer both instruments to the same sample of
  people.
• The correlation between the two parallel forms is the
  estimate of reliability.
• One major problem with this approach is that you have to
  be able to generate lots of items that reflect the same
  construct.
   Parallel-Forms and Split-Half
• The parallel forms approach is very similar to the
  split-half reliability described below.
• The major difference is that parallel forms are
  constructed so that the two forms can be used
  independent of each other and considered
  equivalent measures.
• With split-half reliability we have an instrument
  that we wish to use as a single measurement
  instrument and only develop randomly split halves
  for purposes of estimating reliability.
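The random split, plus the Spearman-Brown step-up that corrects for correlating half-length tests, can be sketched as (the item data here are hypothetical):

```python
import random
from math import sqrt

def pearson(xs, ys):
    # Pearson correlation between two score lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sqrt(sum((x - mx) ** 2 for x in xs))
                  * sqrt(sum((y - my) ** 2 for y in ys)))

def split_half(items, seed=0):
    # items[i][s] = score of subject s on item i.
    # Randomly split the items into halves, correlate the half scores,
    # then step up with the Spearman-Brown formula.
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    half1, half2 = idx[::2], idx[1::2]
    n_subj = len(items[0])
    s1 = [sum(items[i][s] for i in half1) for s in range(n_subj)]
    s2 = [sum(items[i][s] for i in half2) for s in range(n_subj)]
    r = pearson(s1, s2)
    return 2 * r / (1 + r)   # Spearman-Brown step-up

# Four hypothetical items that all track the same trait.
items = [[1, 2, 3, 4, 5, 6],
         [2, 2, 3, 5, 5, 6],
         [1, 3, 3, 4, 6, 6],
         [2, 3, 4, 4, 5, 7]]
split_half(items)   # close to 1: the two halves agree
```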
       Internal Consistency
• Used to assess the consistency of results across
  items within a test.
• In effect we judge the reliability of the instrument
  by estimating how well the items that reflect the
  same construct yield similar results.
• We are looking at how consistent the results are
  for different items for the same construct within
  the measure.
      Kinds of Internal Reliability
•   Average Inter-item Correlation
•   Average Item-total Correlation
•   Split-Half Reliability
•   Cronbach’s Alpha (α)
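Cronbach's alpha, the last of these, has a short closed form; a sketch with hypothetical item scores for five subjects:

```python
from statistics import pvariance

def cronbach_alpha(items):
    # items[i][s] = score of subject s on item i.
    # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    k = len(items)
    n = len(items[0])
    totals = [sum(item[s] for item in items) for s in range(n)]
    return (k / (k - 1)) * (1 - sum(pvariance(it) for it in items)
                            / pvariance(totals))

# Three hypothetical items answered by five subjects.
items = [[2, 4, 3, 5, 1],
         [3, 5, 3, 4, 2],
         [2, 4, 4, 5, 1]]
cronbach_alpha(items)   # about 0.94: the items hang together well
```

When items covary strongly, the total-score variance dwarfs the summed item variances and alpha approaches 1.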
Gricean Maxims:
Quality: Speaker is assumed to tell the truth
Quantity: Speakers won’t burden hearers with
 already known info. Obvious inferences will
 be made
Relation: Speaker will only talk about things
 relevant to the interaction
Manner: Speakers will be brief, orderly, clear,
 and unambiguous.
Examples of where this breaks down:

Piagetian conservation tasks
Representativeness: The Linda Problem
Dilution effect: nondiagnostic information
Implanted Memories: cooperative vs.
  adversarial sources
Mutual Exclusivity
Examples of where this breaks down:

Framing effects
Inconsistent responses due to pragmatics: the
  part whole problem
Conventional implicatures: “all” vs. “each and
  every”
         Manipulation Checks
Have them.
Lots of them.
Validity and Reliability

       Graduate Methods
          Becky Ray
         Winter, 2003